Optimizers in Deep Learning: A Detailed Guide

ayush Last Updated : 04 Apr, 2025


Deep learning is a subfield of machine learning used for complex tasks such as speech recognition and text classification. A deep learning model consists of inputs, outputs, hidden layers, activation functions, a loss function, and so on. Every deep learning algorithm tries to generalize from its training data and make predictions on unseen data. To do that, we need a model that maps example inputs to outputs, along with an optimization algorithm. The optimization algorithm finds the values of the parameters (weights) that minimize the error when mapping inputs to outputs. This article covers these optimization algorithms, or optimizers, in deep learning. In this guide, we will learn about the different optimizers used in building a deep learning model, their pros and cons, and the factors that could make you choose one optimizer over another for your application.

This article was published as a part of the Data Science Blogathon.

What is an Optimizer?
What are Optimizers in Deep Learning?
- Choosing the Right Optimizer
Important Deep Learning Terms
Gradient Descent Deep Learning Optimizer
Stochastic Gradient Descent Deep Learning Optimizer
Stochastic Gradient Descent With Momentum Deep Learning Optimizer
Mini Batch Gradient Descent Deep Learning Optimizer
Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer
RMS Prop (Root Mean Square) Deep Learning Optimizer
- RMS Prop Formula
AdaDelta Deep Learning Optimizer
Adam Optimizer in Deep Learning
- Adam Optimizer Formula
Hands-on Optimizers

What is an Optimizer?

In deep learning, an optimizer is the component that adjusts a neural network’s parameters during training. Its primary role is to minimize the model’s error, or loss function, and thereby improve performance. Different optimization algorithms, known as optimizers, use different strategies to converge efficiently towards parameter values that give better predictions.

What are Optimizers in Deep Learning?

In deep learning, optimizers are algorithms that adjust a model’s parameters throughout training, aiming to minimize a predefined loss function. They drive the learning process of neural networks by iteratively refining the weights and biases based on feedback from the data. Well-known optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop. Each has its own update rule, learning-rate handling, and momentum strategy, but all share the goal of converging on good model parameters and thereby improving overall performance.

Choosing the Right Optimizer

Optimizer algorithms affect both the accuracy a deep learning model reaches and how quickly it trains. They adjust the neural network’s weights, and in some cases the effective learning rate, at each training step to minimize the loss function. Given that deep learning models often have millions of parameters, selecting the right optimization algorithm is crucial for effective training. Therefore, a solid understanding of these algorithms is vital for data scientists venturing into the field. Here are some points to keep in mind when choosing an optimizer:

Optimizers adjust weights and learning rates in machine learning models.
The choice of optimizer depends on the specific application.
Beginners may be tempted to try all optimizers to find the best one.
Randomly selecting optimizers can waste time with large datasets.
A single epoch can be time-consuming when working with hundreds of gigabytes of data.
This guide covers various deep-learning optimizers, including Gradient Descent and others.
Optimizers discussed include Stochastic Gradient Descent, Mini-Batch Gradient Descent, Adagrad, RMSProp, AdaDelta, and Adam.
By the end of the article, readers will be able to compare different optimizers and understand how each one works.

Important Deep Learning Terms

Before proceeding, there are a few terms that you should be familiar with.

Epoch – One complete pass of the algorithm over the whole training dataset; the number of epochs is the number of such passes.
Sample – A single row of a dataset.
Batch – The number of samples used for one update of the model parameters.
Learning rate – A parameter that scales how much the model weights are changed at each update.
Cost Function/Loss Function – A function that measures the cost, which is the difference between the predicted value and the actual value.
Weights/Bias – The learnable parameters of a model that control the signal between two neurons.
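To see how these terms fit together, here is a small arithmetic sketch. It assumes the numbers used later in the hands-on section (the 60,000-image MNIST training set, a batch size of 64, and 10 epochs); the helper name updates_per_epoch is made up for this illustration.

```python
import math


def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of parameter updates in one full pass over the data."""
    # The last batch may hold fewer than batch_size samples, hence the ceiling.
    return math.ceil(num_samples / batch_size)


num_samples = 60_000  # size of the MNIST training set
batch_size = 64
epochs = 10

per_epoch = updates_per_epoch(num_samples, batch_size)
total_updates = per_epoch * epochs
print(per_epoch, total_updates)  # 938 9380
```

So one epoch with this batch size means 938 weight updates, and the whole run means 9,380 of them.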

Now let’s explore each optimizer.

Gradient Descent Deep Learning Optimizer

Gradient Descent can be considered the popular kid among the class of optimizers in deep learning. This optimization algorithm uses calculus to consistently modify the values and achieve the local minimum. Before moving ahead, you might question what a gradient is.

In simple terms, imagine you are holding a ball at the top edge of a bowl. When you release the ball, it rolls along the steepest direction and eventually settles at the bottom of the bowl. The gradient gives the steepest direction, and moving against it takes the ball towards the local minimum, which is the bottom of the bowl.

Gradient descent updates a parameter x with the formula

$$x_{\text{new}} = x - \alpha \, f'(x)$$

The above equation shows how the parameter is updated using the gradient f'(x) of the cost function. Here alpha is the step size that represents how far to move against the gradient with each iteration.

Gradient descent works as follows:

Initialize Coefficients: Start with initial coefficients.
Evaluate Cost: Calculate the cost associated with these coefficients.
Search for Lower Cost: Look for a cost value lower than the current one.
Update Coefficients: Move towards the lower cost by updating the coefficients’ values.
Repeat Process: Continue this process iteratively.
Reach Local Minimum: Stop when a local minimum is reached, where further cost reduction is not possible.
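The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the cost function f(x) = (x - 3)^2, the starting point, the step size alpha = 0.1, and the stopping tolerance are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeatedly step against the gradient until the step becomes tiny."""
    x = x0
    for _ in range(max_iters):
        step = alpha * grad(x)  # how far to move on this iteration
        x = x - step            # move against the gradient
        if abs(step) < tol:     # no meaningful cost reduction is left
            break
    return x


def cost(x):
    # Example cost function with its minimum at x = 3 (an assumption)
    return (x - 3.0) ** 2


def cost_gradient(x):
    # Derivative of the example cost function
    return 2.0 * (x - 3.0)


x_min = gradient_descent(cost_gradient, x0=0.0)
print(round(x_min, 4), round(cost(x_min), 8))
```

Because the example cost is convex, the loop ends very close to x = 3. On a nonconvex cost the same loop would simply stop at whichever local minimum it rolls into.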


Gradient descent works well for many purposes. However, it has some downsides too. It is expensive to calculate the gradients if the size of the data is huge, because every update uses the whole dataset. Gradient descent also works well for convex functions, but on nonconvex functions it can settle in a local minimum or stall at a saddle point, and it gives no guidance on how far to travel along the gradient.

Stochastic Gradient Descent Deep Learning Optimizer

At the end of the previous section, you learned why there might be better options than using gradient descent on massive data. To tackle the challenges large datasets pose, we have stochastic gradient descent, a popular approach among optimizers in deep learning. The term stochastic denotes the element of randomness upon which the algorithm relies. In stochastic gradient descent, instead of processing the entire dataset during each iteration, we randomly select batches of data. This implies that only a few samples from the dataset are considered at a time, allowing for more efficient and computationally feasible optimization in deep learning models.

The procedure is first to select the initial parameters w and the learning rate η. Then, at each pass, randomly shuffle the data and update the parameters from one sample (or one small batch) at a time until an approximate minimum is reached.

Since we are not using the whole dataset but batches of it for each iteration, the path taken by the algorithm is much noisier than that of the gradient descent algorithm. Thus, SGD needs a higher number of iterations to reach the local minimum. However, each iteration is far cheaper, so even with more iterations the total computation cost is still less than that of the gradient descent optimizer. So the conclusion is that if the data is enormous and computational time is an essential factor, stochastic gradient descent should be preferred over the batch gradient descent algorithm.
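As a rough sketch of this procedure, the snippet below fits a line y = w*x + b with stochastic gradient descent, updating the parameters from one randomly ordered sample at a time. The toy dataset, the squared-error loss, the learning rate, and the function name sgd_linear_regression are assumptions made for the illustration.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by updating on one shuffled sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0               # initial parameters
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)      # the random element of SGD
        for x, y in samples:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b
            w -= lr * error * x
            b -= lr * error
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for the example)
data = [(k / 10, 2.0 * (k / 10) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_regression(data)
print(round(w, 3), round(b, 3))
```

Each update looks at a single sample, so it is cheap, but the parameters wander more on the way to the minimum than they would with full-batch gradient descent.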

Stochastic Gradient Descent With Momentum Deep Learning Optimizer

As discussed in the earlier section, stochastic gradient descent takes a much noisier path than the gradient descent algorithm. Because of this, it requires a larger number of iterations to reach the optimal minimum, and hence training can be slow. To overcome the problem, we use stochastic gradient descent with momentum.

Momentum helps the loss function converge faster. Stochastic gradient descent oscillates back and forth across the direction of the gradient and updates the weights accordingly. Adding a fraction of the previous update to the current update damps these oscillations and speeds up the process. One thing to remember while using this algorithm is that the learning rate should be decreased when a high momentum term is used.

[Figure: convergence paths of SGD without momentum (left) and SGD with momentum (right)]

In the above image, the left part shows the convergence graph of the stochastic gradient descent algorithm. At the same time, the right side shows SGD with momentum. From the image, you can compare the path chosen by both algorithms and realize that using momentum helps reach convergence in less time. You might be thinking of using a large momentum and learning rate to make the process even faster. But remember that while increasing the momentum, the possibility of passing the optimal minimum also increases. This might result in poor accuracy and even more oscillations.
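The effect of momentum can be tried out on a toy problem. The sketch below is only an illustration under stated assumptions: it uses exact (not stochastic) gradients of a made-up elongated bowl f(x, y) = 0.5 * (x^2 + 25 * y^2), a learning rate of 0.03, and a momentum of 0.9, and it counts how many updates are needed before the gradient becomes small.

```python
import math


def grad(x, y):
    # Gradient of the assumed test function f(x, y) = 0.5 * (x**2 + 25 * y**2)
    return x, 25.0 * y


def descend(lr=0.03, momentum=0.0, tol=1e-3, max_steps=10_000):
    """Return the number of updates needed to make the gradient norm < tol."""
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0  # previous update (the "velocity")
    for step in range(max_steps):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            return step
        # New update = fraction of the previous update + current gradient step
        vx = momentum * vx - lr * gx
        vy = momentum * vy - lr * gy
        x, y = x + vx, y + vy
    return max_steps


plain_steps = descend(momentum=0.0)
momentum_steps = descend(momentum=0.9)
print(plain_steps, momentum_steps)
```

With these particular settings the run with momentum needs fewer updates than the plain one. Pushing the momentum or the learning rate much higher makes the updates overshoot the minimum, which is the risk described above.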


Mini Batch Gradient Descent Deep Learning Optimizer

In this variant of gradient descent, instead of using all the training data, we use only a subset of the dataset to calculate the loss function at each step. Since each update uses a batch of samples rather than a single sample, it needs fewer updates per epoch than stochastic gradient descent, and since each update does not touch the whole dataset, it is cheaper than a batch gradient descent update. That is why mini-batch gradient descent is usually faster in practice than both stochastic gradient descent and batch gradient descent. As the algorithm uses batching, you do not need to load all the training data into memory, which makes the process more efficient to implement. Moreover, the cost function in mini-batch gradient descent is noisier than in batch gradient descent but smoother than in stochastic gradient descent. Because of this, mini-batch gradient descent provides a good balance between speed and accuracy.

Despite all that, the mini-batch gradient descent algorithm has some downsides too. It requires a hyperparameter called ‘mini-batch-size,’ which you must tune to achieve the required accuracy. A batch size of 32 is a common starting point. Also, in some cases, it results in poor final accuracy. This gives a reason to look for other alternatives too.
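The batching itself is simple to sketch. The generator below is a hypothetical helper (iterate_minibatches is not part of any library) that shuffles the sample indices once per epoch and yields one mini-batch at a time, so only the current batch has to be materialized.

```python
import random


def iterate_minibatches(samples, batch_size, seed=None):
    """Yield shuffled mini-batches that together cover every sample once."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)  # a new random order for this epoch
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [samples[i] for i in batch_indices]


# Ten toy samples split into batches of four: the last batch is smaller
batches = list(iterate_minibatches(list(range(10)), batch_size=4, seed=0))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

A training loop would compute the gradient on each yielded batch and apply one weight update per batch.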

Adagrad (Adaptive Gradient Descent) Deep Learning Optimizer

The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms. This is because it uses a different learning rate for each parameter at each iteration. The change in learning rate depends on how much a parameter has been updated during training: the larger the gradients a parameter has accumulated, the smaller its learning rate becomes. This modification is highly beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to have the same learning rate for all the features. The Adagrad algorithm uses the formula below to update the weights. Here alpha(t) denotes the learning rate at iteration t, η is a constant, g(i) is the gradient at iteration i, and ε is a small positive number to avoid division by 0.

$$w_{t+1} = w_t - \alpha_t \, g_t, \qquad \alpha_t = \frac{\eta}{\sqrt{\sum_{i=1}^{t} g_i^2 + \epsilon}}$$

The benefit of using Adagrad is that it removes the need to modify the learning rate manually. It is often more reliable than gradient descent algorithms and their variants, and it can reach convergence at a higher speed.

One downside of the AdaGrad optimizer is that it decreases the learning rate aggressively and monotonically. There might be a point when the learning rate becomes extremely small. This is because the squared gradients in the denominator keep accumulating, and thus the denominator part keeps on increasing. Small learning rates prevent the model from acquiring more knowledge, which compromises its accuracy.
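Both the benefit and the downside can be seen in a small sketch. The gradients below are made up for the illustration: one "dense" parameter receives a gradient on every step, while one "sparse" parameter receives a gradient only on every tenth step. The function name adagrad_step and the base learning rate of 0.1 are assumptions as well.

```python
import math


def adagrad_step(weights, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad update, applied in place to weights and accum."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # squared gradients keep accumulating
        weights[i] -= lr / math.sqrt(accum[i] + eps) * g


lr = 0.1
weights = [0.0, 0.0]
accum = [0.0, 0.0]
for t in range(1, 101):
    dense_grad = 1.0                             # gradient on every step
    sparse_grad = 1.0 if t % 10 == 0 else 0.0    # gradient on every 10th step
    adagrad_step(weights, [dense_grad, sparse_grad], accum, lr=lr)

# Learning rate each parameter would see on its next update
effective_lr = [lr / math.sqrt(a + 1e-8) for a in accum]
print([round(rate, 4) for rate in effective_lr])  # [0.01, 0.0316]
```

The rarely updated parameter keeps a larger learning rate, which is what makes Adagrad attractive for sparse features. The frequently updated one has already shrunk to a tenth of the base rate and will keep shrinking.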

RMS Prop (Root Mean Square) Deep Learning Optimizer

RMS prop is one of the popular optimizers among deep learning enthusiasts. This is notable because it was never formally published, yet it is still very well known in the community. RMS prop is essentially an extension of the earlier Rprop algorithm. It addresses the problem of varying gradients: some gradients may be small while others are huge, so defining a single learning rate might not be the best idea. Rprop uses only the sign of the gradient, adapting the step size individually for each weight. In this algorithm, the signs of the last two gradients for a weight are first compared. If they have the same sign, we’re going in the right direction, so we increase the step size by a small fraction. If they have opposite signs, we must decrease the step size. Then we limit the step size and can go for the weight update.

The problem with Rprop is that it doesn’t work well with large datasets or when we want to perform mini-batch updates. So, achieving the robustness of Rprop and the efficiency of mini-batches simultaneously was the main motivation behind the rise of RMS prop. RMS prop also improves on the AdaGrad optimizer, because it avoids the monotonically decreasing learning rate.

RMS Prop Formula

The algorithm mainly focuses on accelerating the optimization process by decreasing the number of function evaluations to reach the local minimum. The algorithm keeps the moving average of squared gradients for every weight and divides the gradient by the square root of the mean square.

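$$v_t = \gamma \, v_{t-1} + (1 - \gamma) \, g_t^2$$

where gamma is the forgetting factor, g(t) is the gradient at step t, and v(t) is the moving average of squared gradients. Weights are updated by the formula below, where η is the learning rate and ε is a small positive number that avoids division by 0:

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \, g_t$$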

In simpler terms, if there exists a parameter due to which the cost function oscillates a lot, we want to penalize the update of this parameter. Suppose you built a model to classify a variety of fish. The model relies mainly on the factor ‘color’ to differentiate between the fish. Due to this, it makes a lot of errors. RMS Prop penalizes the parameter ‘color’ so that the model can rely on other features too. This prevents the algorithm from adapting too quickly to changes in the parameter ‘color’ compared to other parameters. This algorithm has several benefits compared to earlier versions of gradient descent algorithms. It converges quickly and requires less tuning than gradient descent algorithms and their variants.

The problem with RMS Prop is that the learning rate has to be defined manually, and the suggested value doesn’t work for every application.
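A minimal sketch of the RMS prop update shows why its learning rate does not keep shrinking the way AdaGrad's does. The constant gradient of 1.0, the forgetting factor of 0.9, the learning rate of 0.001, and the function name rmsprop_step are assumptions for this illustration.

```python
import math


def rmsprop_step(w, g, v, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMS prop update for a single weight."""
    v = gamma * v + (1.0 - gamma) * g * g  # moving average of squared gradients
    w = w - lr / math.sqrt(v + eps) * g    # divide by the root of the mean square
    return w, v


lr = 0.001
w, v = 0.0, 0.0
adagrad_sum = 0.0  # what AdaGrad would accumulate for the same gradients
for _ in range(100):
    g = 1.0  # constant made-up gradient
    w, v = rmsprop_step(w, g, v, lr=lr)
    adagrad_sum += g * g

rmsprop_lr = lr / math.sqrt(v + 1e-8)
adagrad_lr = lr / math.sqrt(adagrad_sum + 1e-8)
print(round(rmsprop_lr, 6), round(adagrad_lr, 6))  # 0.001 0.0001
```

After 100 identical gradients the moving average has settled near the squared gradient, so RMS prop still steps at roughly the base learning rate, while the accumulated sum has cut the AdaGrad rate by a factor of ten.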

AdaDelta Deep Learning Optimizer

AdaDelta can be seen as a more robust version of the AdaGrad optimizer. It is based upon adaptive learning and is designed to deal with significant drawbacks of AdaGrad and RMS prop optimizer. The main problem with the above two optimizers is that the initial learning rate must be defined manually. One other problem is the decaying learning rate which becomes infinitesimally small at some point. Due to this, a certain number of iterations later, the model can no longer learn new knowledge.

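To deal with these problems, AdaDelta uses two state variables: one stores a leaky average of the second moment of the gradient, and the other stores a leaky average of the second moment of the change of the parameters in the model.

$$s_t = \rho \, s_{t-1} + (1 - \rho) \, g_t^2$$
$$g'_t = \frac{\sqrt{\Delta x_{t-1} + \epsilon}}{\sqrt{s_t + \epsilon}} \, g_t$$
$$x_t = x_{t-1} - g'_t$$
$$\Delta x_t = \rho \, \Delta x_{t-1} + (1 - \rho) \, {g'_t}^2$$

Here s(t) and Δx(t) denote the state variables, g'(t) denotes the rescaled gradient, Δx(t-1) denotes the leaky average of the squared rescaled gradients, ρ is the decay rate of the leaky averages, and ε represents a small positive number to handle division by 0.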
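The two state variables can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the decay rate rho = 0.9, epsilon = 1e-5, the test function f(x) = x^2, and the name adadelta_step are all chosen for the example. Note that no learning rate appears anywhere in the update.

```python
import math


def adadelta_step(x, g, s, delta, rho=0.9, eps=1e-5):
    """One AdaDelta update for a single parameter."""
    s = rho * s + (1.0 - rho) * g * g  # leaky average of squared gradients
    g_rescaled = math.sqrt(delta + eps) / math.sqrt(s + eps) * g
    x = x - g_rescaled                 # update with the rescaled gradient
    # Leaky average of squared rescaled gradients (squared parameter changes)
    delta = rho * delta + (1.0 - rho) * g_rescaled ** 2
    return x, s, delta


x, s, delta = 1.0, 0.0, 0.0
history = [x]
for _ in range(100):
    g = 2.0 * x  # gradient of the assumed test function f(x) = x**2
    x, s, delta = adadelta_step(x, g, s, delta)
    history.append(x)

print(round(history[1], 4), round(history[10], 4))
```

The first steps are very small because both state variables start at zero, which fits the slow start AdaDelta shows in the experiment later in this guide.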

Adam Optimizer in Deep Learning

Adam optimizer, short for Adaptive Moment Estimation optimizer, serves as an optimization algorithm commonly used in deep learning. It extends the stochastic gradient descent (SGD) algorithm and updates the weights of a neural network during training.

The name ‘Adam’ comes from ‘adaptive moment estimation,’ highlighting its ability to adaptively adjust the learning rate for each network weight individually. Unlike SGD, which maintains a single learning rate throughout training, Adam optimizer dynamically computes individual learning rates based on the past gradients and their second moments.

The creators of the Adam optimizer incorporated the beneficial features of other optimization algorithms such as AdaGrad and RMSProp. Like RMSProp, Adam keeps a decaying average of the squared gradients, which is the uncentered second moment (the variance without subtracting the mean). Unlike RMSProp, it also keeps a decaying average of the gradients themselves, the first moment, and corrects both averages for the bias introduced by initializing them at zero.

By incorporating both the first moment (mean) and second moment (uncentered variance) of the gradients, Adam optimizer achieves an adaptive learning rate that can efficiently navigate the optimization landscape during training. This adaptivity helps in faster convergence and improved performance of the neural network.

In summary, Adam optimizer is an optimization algorithm that extends SGD by dynamically adjusting learning rates based on individual weights. It combines the features of AdaGrad and RMSProp to provide efficient and adaptive updates to the network weights during deep learning training.

Adam Optimizer Formula

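The Adam optimizer has several benefits, due to which it is used widely. It has been adopted as a benchmark in deep learning papers and is recommended as a default optimization algorithm. Moreover, the algorithm is straightforward to implement, runs fast, has low memory requirements, and typically requires less tuning than other optimization algorithms.

The formulas below represent the working of the Adam optimizer. Here g(t) is the gradient at step t, β1 and β2 are the decay rates of the moving averages of the gradients and of the squared gradients, η is the learning rate, and ε is a small positive number that avoids division by 0.

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$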
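As a sketch of how the two moving averages work together, here is a single-weight version of the Adam update with bias correction. The function name adam_step and the example gradients are assumptions; the default values shown for the learning rate, the two decay rates, and epsilon are the commonly cited ones.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the step count from 1."""
    m = beta1 * m + (1.0 - beta1) * g      # average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias correction for the
    v_hat = v / (1.0 - beta2 ** t)         # zero-initialized averages
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First update from w = 1.0 for a small and a very large made-up gradient
w_small, _, _ = adam_step(w=1.0, g=2.0, m=0.0, v=0.0, t=1, lr=0.1)
w_large, _, _ = adam_step(w=1.0, g=2000.0, m=0.0, v=0.0, t=1, lr=0.1)
print(round(w_small, 6), round(w_large, 6))  # 0.9 0.9
```

Both weights move by about the learning rate, even though one gradient is a thousand times larger than the other. This scaling by the second moment is what gives each weight its own effective learning rate.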

If the Adam optimizer combines the good properties of the other algorithms and is such a strong default, then why shouldn’t you use Adam in every application? And what was the need to learn about other algorithms in depth? This is because even Adam has some downsides. It tends to prioritize fast convergence, whereas plainer algorithms like stochastic gradient descent often end up generalizing better to unseen data, at the cost of slower training. So, the optimization algorithm should be picked depending on the requirements and the type of data.

The above visualizations create a better picture in mind and help in comparing the results of various optimization algorithms.

Hands-on Optimizers

We have learned enough theory, and now we need to do some practical analysis. It’s time to try what we have learned and compare the results by choosing different optimizers on a simple neural network. As we are talking about keeping things simple, what’s better than the MNIST dataset? We will train a simple model using some basic layers, keeping the batch size and epochs the same but with different optimizers. For the sake of fairness, we will use the default values with each optimizer.

The steps for building the network are given below:

Import Necessary Libraries

# Keras provides the MNIST dataset and the layers used below
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

# Load the MNIST training and test splits
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)

Load the Dataset

# Add a channel dimension: (samples, 28, 28) -> (samples, 28, 28, 1)
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)

# One-hot encode the digit labels
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

# Scale pixel values from [0, 255] to [0, 1]
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

Build the Model

batch_size = 64
num_classes = 10
epochs = 10

def build_model(optimizer):
    # Small CNN: one convolution block followed by a dense classifier
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    # The optimizer is the only thing that changes between runs
    model.compile(loss=keras.losses.categorical_crossentropy, optimizer=optimizer, metrics=['accuracy'])
    return model

Train the Model

optimizers = ['Adadelta', 'Adagrad', 'Adam', 'RMSprop', 'SGD']

# Train a fresh model with each optimizer, using its default settings
for i in optimizers:
    model = build_model(i)
    hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_test, y_test))

We have run our model with a batch size of 64 for 10 epochs. After trying the different optimizers, the results we get are pretty interesting. Before analyzing the results, what do you think will be the best optimizer for this dataset?

Table Analysis

Optimizer	Epoch 1 (val accuracy | val loss)	Epoch 5 (val accuracy | val loss)	Epoch 10 (val accuracy | val loss)	Total time (min:sec)
Adadelta	0.4612 | 2.2474	0.7776 | 1.6943	0.8375 | 0.9026	8:02
Adagrad	0.8411 | 0.7804	0.9133 | 0.3194	0.9286 | 0.2519	7:33
Adam	0.9772 | 0.0701	0.9884 | 0.0344	0.9908 | 0.0297	7:20
RMSprop	0.9783 | 0.0712	0.9846 | 0.0484	0.9857 | 0.0501	10:01
SGD with momentum	0.9168 | 0.2929	0.9585 | 0.1421	0.9697 | 0.1008	7:04
SGD	0.9124 | 0.3157	0.9569 | 0.1451	0.9693 | 0.1040	6:42

The above table shows the validation accuracy and loss at different epochs. It also contains the total time that the model took to run on 10 epochs for each optimizer. From the above table, we can make the following analysis.

The adam optimizer shows the best accuracy in a satisfactory amount of time.
RMSprop shows similar accuracy to that of Adam but with a comparatively much larger computation time.
Surprisingly, the SGD algorithm took the least time to train and produced good results as well. But to reach the accuracy of the Adam optimizer, SGD will require more iterations, and hence the computation time will increase.
SGD with momentum shows similar accuracy to SGD with unexpectedly larger computation time. This means the value of momentum taken needs to be optimized.
Adadelta shows poor results both with accuracy and computation time.
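The same comparison can be reproduced with a few lines of Python. The numbers below are copied from the table above (validation accuracy at epoch 10 and total training time); the dictionary layout and the helper to_seconds are just one assumed way to organize them.

```python
# Epoch-10 validation accuracy and total time, taken from the table above
results = {
    "Adadelta": (0.8375, "8:02"),
    "Adagrad": (0.9286, "7:33"),
    "Adam": (0.9908, "7:20"),
    "RMSprop": (0.9857, "10:01"),
    "SGD with momentum": (0.9697, "7:04"),
    "SGD": (0.9693, "6:42"),
}


def to_seconds(min_sec: str) -> int:
    """Convert a 'minutes:seconds' string into seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)


by_accuracy = sorted(results, key=lambda name: results[name][0], reverse=True)
by_time = sorted(results, key=lambda name: to_seconds(results[name][1]))

print("Most accurate first:", by_accuracy)
print("Fastest first:", by_time)
```

Sorting both ways makes the trade-off easy to see: Adam leads on accuracy, SGD on training time, and RMSprop pays the most time for its accuracy.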

You can analyze the accuracy of each optimizer with each epoch from the below graph.

We’ve now reached the end of this comprehensive guide. To refresh your memory, we will go through a summary of every optimization algorithm that we have covered in this guide.

Conclusion

SGD is a very basic algorithm and is often slow to converge on its own. One more problem with the algorithm is that it uses a constant learning rate for every epoch. Moreover, it is not able to handle saddle points very well. Adagrad generally works better than stochastic gradient descent because it adapts the learning rate for each parameter. It is best when used for dealing with sparse data. RMSProp shows results similar to the gradient descent algorithm with momentum; it differs in how the algorithm scales the update for each weight.

Lastly comes the Adam optimizer, which inherits the good features of RMSProp and other algorithms. The Adam optimizer generally gives better results than the other optimization algorithms, has a faster computation time, and requires fewer parameters to tune. Adam serves as the recommended default optimizer for most applications due to its effectiveness and versatility. Choosing the Adam optimizer for your application might give you the best chance of getting good results.

But by the end, we learned that even Adam optimizer has some downsides. Also, there are cases when algorithms like SGD might be beneficial and perform better than Adam optimizer. So, it is of utmost importance to know your requirements and the type of data you are dealing with to choose the best optimization algorithm and achieve outstanding results.


Key Takeaways

Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient Descent, Adagrad, RMS Prop, AdaDelta, and Adam are all popular deep-learning optimizers.
Each optimizer has its own strengths and weaknesses, and the right choice depends on the specific deep-learning task and the characteristics of the data.
The choice of optimizer can significantly impact the speed and quality of convergence during training, as well as the final performance of the deep learning model.

Frequently Asked Questions

Q1. What are some of the use cases where a deep learning model is trained?

A. Deep learning models are trained for image and speech recognition, natural language processing, recommendation systems, fraud detection, autonomous vehicles, predictive analytics, medical diagnosis, text generation, and video analysis.

Q2. How does artificial intelligence contribute to deep learning optimization through optimizers?

A. AI enhances deep learning optimizers by automating and improving neural network training using algorithms like gradient descent, adaptive learning rates, and momentum. AI-powered optimizers like Adam, Adagrad, and RMSProp adjust learning rates and hyperparameters for efficient optimization.

Q3. What role do optimizers play in computer vision with deep learning?

A. In computer vision, deep learning optimizers minimize loss by adjusting model parameters, ensuring optimal training results. The right optimizer enhances training speed and accuracy, crucial for high-performance computer vision applications.

Q4. What is the definition of an optimizer?

A. An optimizer in machine learning is an algorithm that adjusts model parameters to minimize or maximize a specific objective function, such as minimizing loss in neural network training, by iteratively updating parameter values based on gradients or other criteria.
