In deep learning, optimization algorithms are essential for efficient learning and convergence in neural networks. One of the most popular choices is the Adam optimizer. So, what is the Adam optimizer? This blog post explores its inner workings, advantages, and practical tips for effective use.
The Adam optimizer, short for “Adaptive Moment Estimation,” is an iterative optimization algorithm used to minimize the loss function during the training of neural networks. Adam can be looked at as a combination of RMSprop and Stochastic Gradient Descent with momentum. Developed by Diederik P. Kingma and Jimmy Ba in 2014, Adam has become a go-to choice for many machine learning practitioners.
Like RMSprop, it uses the squared gradients to scale the learning rate, and like SGD with momentum, it uses a moving average of the gradient instead of the raw gradient itself. In effect, this combines an adaptive learning rate with gradient smoothing, helping the optimizer move steadily toward a minimum of the loss.
Adam optimizer is like a smart helper for training neural networks. It helps adjust the network’s settings (called parameters) to make it better at its job, like recognizing images or understanding text.
Here are the steps describing how the Adam optimizer works:
Step 1: First, Adam sets up two running averages to keep track of how the network is doing. One is for the average (mean) of how steep the slope is when it's figuring out how to improve (this is called the first moment). The other is for the average of the squared slope, i.e., how large the slope tends to be (this is called the second moment). Both start at zero.
Step 2: During training, Adam checks how steep the slope is by comparing the network's guesses to the correct answers and computing the gradient.
Step 3: Adam then updates the average slope over time. It's like remembering how steep the hill has been in the past.
Step 4: Adam also updates the average of the squared slope over time. This gives a sense of how large the update for each parameter should be.
Step 5: At the beginning, since both averages start at zero, they are biased toward zero and not very accurate. So Adam applies a bias correction to make them more accurate.
Step 6: Finally, Adam uses these corrected averages to adjust the network's settings a little at each step. It's like gently nudging the network in the right direction to improve its performance.
By doing all of this, Adam helps the neural network learn more efficiently and effectively. It’s like having a good coach who guides the network to become better at its task.
The resulting update rule for a weight w (the same rule applies to a bias b) is:

w(t+1) = w(t) - eta * m_hat(t) / (sqrt(v_hat(t)) + epsilon)

where eta (the Greek letter that looks like an n) is the step size, better known as the learning rate (it can depend on the iteration); m_hat and v_hat are the bias-corrected first and second moment estimates; and epsilon is a tiny constant that prevents division by zero. And that's it, that's the update rule for Adam.
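To make this concrete, here is a minimal NumPy sketch of a single Adam update step. The function name adam_step and its defaults (eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8) follow the conventions of the original paper but are otherwise illustrative, not a reference implementation.

import numpy as np

def adam_step(params, grads, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving average of the gradients (first moment)
    m = beta1 * m + (1 - beta1) * grads
    # Moving average of the squared gradients (second moment)
    v = beta2 * v + (1 - beta2) * grads ** 2
    # Bias correction: compensates for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter update: the step is scaled by the second moment estimate
    params = params - eta * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

The same rule is applied independently to every weight and bias in the network, with t counting update steps starting from 1.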
Adam adjusts the learning rates for each parameter individually. It calculates a moving average of the first-order moments (the mean of gradients) and the second-order moments (the uncentered variance of gradients) to scale the learning rates adaptively. This makes it well-suited for problems with sparse gradients or noisy data.
To counteract the initialization bias in the moment estimates (both start at zero), Adam applies a bias correction during the early iterations of training. This helps speed up convergence and stabilizes the training process.
Unlike some optimization algorithms that require storing a history of gradients for each parameter, Adam only needs to maintain two moving averages per parameter. This makes it memory-efficient, especially for large neural networks.
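As a small illustrative calculation (the numbers are hypothetical), the effect of the bias correction can be seen with a constant gradient of 1.0 and beta1 = 0.9: the raw moving average after the first step is only 0.1, while the corrected estimate already equals 1.0.

beta1 = 0.9
g = 1.0   # pretend the gradient is constant at 1.0
m = 0.0   # the first moment starts at zero
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))
# Output: 1 0.1 1.0 / 2 0.19 1.0 / 3 0.271 1.0
# Without the correction, the early estimates are strongly biased toward zero.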
First, let's define a test objective function. We will use a simple two-dimensional function that squares and sums its inputs, and we will restrict the valid inputs to the range -1.0 to 1.0.
The objective() function below implements this function.
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot
def objective(x, y):
    return x**2.0 + y**2.0
# define range for input
range_min, range_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(range_min, range_max, 0.1)
yaxis = arange(range_min, range_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
results = objective(x, y)
figure = pyplot.figure()
axis = figure.add_subplot(111, projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
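Before moving on to Keras, here is a hedged, from-scratch sketch of Adam minimizing this test function. The helper names derivative() and adam_optimize() are illustrative; the gradient of objective(x, y) = x^2 + y^2 is simply (2x, 2y), and the code reuses range_min, range_max, and objective() defined above.

from numpy import asarray, sqrt
from numpy.random import rand, seed

def derivative(x, y):
    # analytical gradient of objective(x, y)
    return asarray([2.0 * x, 2.0 * y])

def adam_optimize(n_iter=60, eta=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
    seed(1)
    # random starting point within the valid input range
    sol = range_min + rand(2) * (range_max - range_min)
    m, v = asarray([0.0, 0.0]), asarray([0.0, 0.0])
    for t in range(1, n_iter + 1):
        g = derivative(sol[0], sol[1])
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g ** 2     # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        sol = sol - eta * m_hat / (sqrt(v_hat) + eps)
        print('>%d f(%s) = %.5f' % (t, sol, objective(sol[0], sol[1])))
    return sol

adam_optimize()

With enough iterations, the solution converges toward the minimum at (0, 0).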
Here's a simplified Python code example demonstrating how to use the Adam optimizer in a neural network training scenario with the Keras deep learning library. In this example, we'll create a simple convolutional network for MNIST image classification and train it with Adam alongside two other optimizers for comparison:
import keras  # Keras deep learning library
from keras.datasets import mnist  # MNIST dataset loader
from keras.models import Sequential  # Sequential model container
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
import numpy as np
# Load the MNIST dataset from Keras
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Print the shape of the training and test data
print(x_train.shape, y_train.shape)
# Reshape the training and test data to 4 dimensions
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
# Define the input shape
input_shape = (28, 28, 1)
# Convert the labels to categorical format
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
# Convert the pixel values to floats between 0 and 1
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalize the pixel values by dividing them by 255
x_train /= 255
x_test /= 255
# Define the batch size and number of classes
batch_size = 60
num_classes = 10
# Define the number of epochs to train the model for
epochs = 10
"""
Builds a CNN model for MNIST digit classification.
Args:
optimizer: The optimizer to use for training the model.
Returns:
A compiled Keras model.
"""
def build_model(optimizer):
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
return model
optimizers=['Adagrad','Adam','SGD']#import csv
histories ={opt:build_model(opt).fit(x_train,y_train,batch_size=batch_size,
epochs=epochs,verbose=1,validation_data=(x_test,y_test)) for opt in optimiers}
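To compare the three optimizers visually, the history objects collected above can be plotted. This sketch assumes the training loop has finished and that a recent Keras/TensorFlow version is in use (older versions record the metric under 'val_acc' instead of 'val_accuracy').

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
for opt, history in histories.items():
    # validation accuracy recorded by Keras at the end of each epoch
    plt.plot(history.history['val_accuracy'], label=opt)
plt.xlabel('Epoch')
plt.ylabel('Validation accuracy')
plt.title('Adagrad vs. Adam vs. SGD on MNIST')
plt.legend()
plt.show()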
The original Adam paper by Kingma and Ba also reported experiments in which Adam converged faster than comparable optimizers such as AdaGrad, RMSprop, and SGD with Nesterov momentum.
Recent research has raised concerns about the generalization capabilities of the Adam optimizer in deep learning, indicating that it may not always converge to optimal solutions, particularly in tasks like image classification on CIFAR datasets. Wilson et al.’s study highlighted that adaptive optimizers like Adam might not generalize as effectively as SGD with momentum across various deep-learning tasks.
Nitish Shirish Keskar and Richard Socher proposed a solution called SWATS, where training begins with Adam but switches to SGD as learning saturates. SWATS has shown promise in achieving competitive generalization performance compared to SGD with momentum, prompting a reevaluation of optimization choices in deep learning.
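SWATS decides the switch point automatically from the optimizer's own statistics, but the underlying idea of changing optimizers partway through training can be sketched in Keras as follows. The five-plus-five epoch split is arbitrary and purely illustrative, and the snippet assumes a Keras version that accepts the learning_rate argument; it is not an implementation of SWATS itself.

# Illustrative only: start training with Adam, then continue with SGD + momentum.
model = build_model('adam')
model.fit(x_train, y_train, batch_size=batch_size, epochs=5,
          verbose=1, validation_data=(x_test, y_test))

# Re-compile with SGD; the learned weights are kept, only the optimizer state resets.
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=batch_size, epochs=5,
          verbose=1, validation_data=(x_test, y_test))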
Adam is one of the most widely used optimization algorithms for deep learning. Its adaptive learning rates, efficiency, and robustness make it a popular choice for training neural networks. As deep learning evolves, optimizers like Adam will remain essential tools. Still, practitioners should stay open to alternative approaches and strategies, such as SGD with momentum or hybrid schemes like SWATS, for achieving optimal model performance.
I want to take a moment to say thank you. Thank you for taking the time to read this blog and for your interest in Adam Optimizer in Neural Networks.
Until next time, stay curious and keep learning!
Q: What does the Adam optimizer do?
A: The Adam optimizer adjusts neural network parameters by combining momentum and adaptive learning rates for efficient training and faster convergence.

Q: Is Adam better than SGD?
A: Adam often outperforms SGD, especially in complex models, due to its adaptive learning rates and faster convergence, though performance may vary by task.

Q: How does Adam differ from plain gradient descent?
A: Adam uses adaptive learning rates and moment estimates, while plain gradient descent employs a fixed learning rate, leading to different convergence behaviors.
Q: What are common problems when using Adam, and how can they be addressed?
A: Adam can be sensitive to its hyperparameters (especially the learning rate) and may generalize worse than SGD with momentum on some tasks. Tuning the learning rate, adding a learning-rate schedule, or switching to SGD later in training (as in SWATS) are common remedies.
Q: What is the default learning rate for Adam?
A: The default learning rate for Adam is typically set to 0.001, but it can be adjusted based on specific tasks and datasets.