The Challenge of Vanishing/Exploding Gradients in Deep Neural Networks

Yash Last Updated : 06 Dec, 2024

9 min read

Training deep neural networks often encounters challenges like the vanishing gradient and exploding gradient problems. These issues can hinder learning by either shrinking gradients to near-zero or causing them to grow uncontrollably. This article explains the problem of exploding and vanishing gradients while training a deep neural network and the techniques that can be used to overcome this impediment. In this article, you will get a clarification about vanishing and exploding gradients and their problem and solutions. With that, you will learn about exploding and vanishing agents, what they are, and how their problem is solved in this article. So, let’s begin.

Overview:

Explore the concepts of vanishing gradient là gì and the exploding gradient issue, clarifying these fundamental challenges in training deep neural networks.
Discuss how the gradients shrink excessively (vanishing gradient) or grow uncontrollably (exploding gradient), hindering the training process and model stability.
Analyze why these issues occur, focusing on activation functions and improper weight initialization as key factors contributing to vanishing and exploding gradients.
Highlight techniques like proper weight initialization, non-saturating activation functions, batch normalization, and gradient clipping to address these challenges and ensure stable model training.
Provides examples and coding practices to integrate solutions into machine learning models, making it easier to mitigate vanishing gradient and exploding gradient problems.

This article was published as a part of the Da ta Science Blogathon

A Glimpse of the Backpropagation Algorithm
Understanding the Problems
- What is the Vanishing Gradient?
- What is an Exploding Gradient?
Why Do the Gradients Even Vanish/Explode?
How to Know if Our Model is Suffering From the Exploding/Vanishing Gradient Problem?
Solutions
Conclusion
Frequently Asked Questions

A Glimpse of the Backpropagation Algorithm

We know that the backpropagation algorithm is the heart of neural network training. Let’s have a glimpse of this algorithm, which has proved to be a harbinger of the evolution and revolution of Deep Learning.

exploding and vanishing gradients back propagation — Source: Google Images

After propagating the input features forward to the output layer through the various hidden layers consisting of different/same activation functions, we arrive at a predicted probability of a sample belonging to the positive class ( generally, for classification tasks).

Now, the backpropagation algorithm propagates backwards from the output layer to the input layer, calculating the error gradients on the way.
After computing the cost function’s gradients with respect to each parameter (weights and biases) in the neural network, the algorithm uses these gradients to take a gradient descent step towards the minimum, updating the value of each parameter in the network.

exploding and vanishing gradients gif — GIF Source: gyfcat.com

Understanding the Problems

What is the Vanishing Gradient?

As the backpropagation algorithm advances downwards(or backward) from the output layer towards the input layer, the gradients often get smaller and smaller and approach zero, eventually leaving the weights of the initial or lower layers nearly unchanged. As a result, the gradient descent never converges to the optimum. This is known as the vanishing gradients problem.

What is an Exploding Gradient?

On the contrary, the gradients keep getting larger in some cases as the backpropagation algorithm progresses. This, in turn, causes large weight updates and causes the gradient descent to diverge. This is known as the exploding gradient problem.

Why Do the Gradients Even Vanish/Explode?

Certain activation functions, like the logistic function (sigmoid), have a huge difference between the variance of their inputs and the outputs. In simpler words, they shrink and transform a larger input space into a smaller output space between the range of [0,1].

exploding and vanishing gradients why — Image source: Google Images

Observing the above graph of the Sigmoid function, we can see that for larger inputs (negative or positive), it saturates at 0 or 1 with a derivative very close to zero. Thus, when the backpropagation algorithm chips in, it virtually has no gradients to propagate backwards in the network, and whatever little residual gradients exist keep on diluting as the algorithm progresses down through the top layers. So, this leaves nothing for the lower layers.

Similarly, suppose the initial weights assigned to the network generate some large loss in some cases. Now, the gradients can accumulate during an update and result in very large gradients, which eventually results in large updates to the network weights and leads to an unstable network. The parameters can sometimes become so large that they overflow, resulting in NaN values.

exploding and vanishing gradients google images — Image Source: Google Images

How to Know if Our Model is Suffering From the Exploding/Vanishing Gradient Problem?

Following are some signs that can indicate that our gradients are vanishing and exploding gradients :

Exploding	Vanishing
There is an exponential growth in the model parameters.	The parameters of the higher layers change significantly whereas the parameters of lower layers would not change much (or not at all).
The model weights may become NaN during training.	The model weights may become 0 during training.
The model experiences avalanche learning.	The model learns very slowly and perhaps the training stagnates at a very early stage just after a few iterations.

Certainly, neither do we want our signal to explode or saturate nor do we want it to die out. The signal needs to flow properly both in the forward direction when making predictions as well as in the backward direction while calculating gradients.

Solutions

Now that we understand the vanishing/exploding gradients problems, we can learn some techniques to fix them.

Proper Weight Initialization

In their paper, researchers Xavier Glorot, Antoine Bordes, and Yoshua Bengio proposed a way to alleviate this problem remarkably.

For the proper flow of the signal, the authors argue that:

The variance of outputs of each layer should be equal to the variance of its inputs.
The gradients should have equal variance before and after flowing through a layer in the reverse direction.

Although both conditions cannot hold for any layer in the network unless the number of inputs to the layer (fanin) equals the number of neurons in the layer (fanout), they proposed a well-proven compromise that works incredibly well in practice. They randomly initialize the connection weights for each layer in the network using the following equation, popularly known as Xavier initialization (after the author’s first name) or Glorot initialization (after his last name).

where  fan_avg= ( fan_in+ fan_out) / 2

Normal distribution with mean 0 and variance σ² = 1/ fan_avg
Or a uniform distribution between -r and +r , with r = sqrt( 3 / fan_avg )

Following are some more very popular weight initialization strategies for different activation functions, they only differ by the scale of variance and by the usage of either fan_avg or fan_in

for uniform distribution, calculate r as: r = sqrt( 3*σ²)

exploding and vanishing gradients 3 — Source : A book from Aurelien Geron

Using the above initialization strategies can significantly speed up the training and increase the odds of gradient descent converging at a lower generalization error.

Wait, but how do we put these strategies into code ??

Relax! We will not need to hardcode anything; Keras does it for us.

Keras uses Xavier’s initialization strategy with uniform distribution.
If we wish to use a strategy different from the default one, this can be done using the kernel_initializer parameter while creating the layer. For example :

keras.layer.Dense(25, activation = "relu", kernel_initializer="he_normal")

keras.layer.Dense(25, activation = "relu", kernel_initializer="he_uniform")

If we wish to use the initialization based on fan_avgrather than fanin , we can use the VarianceScaling initializer like this :

he_avg_init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(20, activation="sigmoid", kernel_initializer=he_avg_init)

Using Non-saturating Activation Functions

In an earlier section, while studying the nature of the sigmoid activation function, we observed that its saturating nature for larger inputs (negative or positive) was a major reason behind the vanishing and exploding gradients, thus making it non-recommendable to use in the hidden layers of the network.

So to tackle the issue regarding the saturation of activation functions like sigmoid and tanh, we must use some other non-saturating functions like ReLu and its alternatives.

ReLU ( Rectified Linear Unit )

Relu(z) = max(0,z)

Outputs 0 for any negative input.
Range: [0, infinity]

Unfortunately, the ReLu function is also not a perfect pick for the intermediate network layers “in some cases”. It suffers from a problem known as dying ReLus wherein some neurons just die out, meaning they keep on throwing 0 as outputs with the advancement in training.

Read about the dying relus problem in detail here.

Some popular alternative functions of the ReLU that mitigate the problem of vanishing gradients when used as activation for the intermediate layers of the network are LReLU, PReLU, ELU, Sand ELU :

LReLU (Leaky ReLU)

LeakyReLUα(z) = max(αz, z)

The amount of “leak” is controlled by the hyperparameter α, it is the slope of the function for z < 0.
The smaller slope for the leak ensures that the neurons powered by leaky Relu never die; although they might venture into a state of coma for a long training phase they always have a chance to eventually wake up.
The model can also train α, learning its value during training. This variant, where α is considered a parameter rather than a hyperparameter, is called parametric leaky ReLU (PReLU).

ELU (Exponential Linear Unit)

For z < 0, it takes on negative values, which allow the unit to have an average output closer to 0 thus alleviating the vanishing gradient problem.

For z < 0, the gradients are non-zero. This avoids the dead neurons problem.
For α = 1, the function is smooth everywhere. This speeds up the gradient descent since it does not bounce right and left around z=0.
A scaled version of this function ( SELU: Scaled ELU ) is often used in Deep Learning.

Batch Normalization

Using the initialization along with any variant of the ReLU activation function can significantly reduce the chances of vanishing/exploding problems at the beginning. However, it does not guarantee that the problem won’t reappear during training.

In 2015, Sergey Ioffe and Christian Szegedy proposed a paper in which they introduced batch normalization, a technique for addressing the problem of vanishing/exploding gradients.

The following key points explain the intuition behind BN and how it works:

It consists of adding an operation in the model just before or after the activation function of each hidden layer.
This operation simply zero-centres and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting.
In other words, the operation lets the model learn the optimal scale and mean of each layer’s input.
The algorithm needs to estimate each input’s mean and standard deviation to zero-centre and normalise the inputs.
It does so by evaluating the mean and standard deviation of the input over the current mini-batch (hence the name “Batch Normalization”).

model = keras.models.Sequential([keras.layers.Flatten(input_shape=[28, 28]),keras.layers.BatchNormalization(),keras.layers.Dense(300, activation="relu"),keras.layers.BatchNormalization(),keras.layers.Dense(100, activation="relu"),keras.layers.BatchNormalization(),keras.layers.Dense(10, activation="softmax")])

We just added batch normalization after each layer (dataset: FMNIST)model.summary()

Gradient Clipping

Another popular technique to mitigate the exploding gradient problem is to clip the gradients during backpropagation so that they never exceed some threshold. This is called Gradient Clipping.

This optimizer will clip every component of the gradient vector to a value between –1.0 and 1.0.
This means we will clip all the partial derivatives of the loss for each trainable parameter between –1.0 and 1.0.

optimizer = keras.optimizers.SGD(clipvalue = 1.0)

The threshold is a hyperparameter we can tune.
The orientation of the gradient vector may change due to this: for example, let the original gradient vector be [0.9, 100.0], pointing mostly in the direction of the second axis. Once we clip it by some value, we get [0.9, 1.0], which now points somewhere around the diagonal between the two axes.
We should clip by norm rather than by value to ensure that the orientation remains intact even after clipping.

optimizer = keras.optimizers.SGD(clipnorm = 1.0)

If the threshold we pick is less than the ℓ2 norm, we will clip the whole gradient. For example, if clipnorm=1, we will clip the vector [0.9, 100.0] to [0.00899, 0.999995], thus preserving its orientation.

Conclusion

Understanding and addressing vanishing gradient and exploding gradient problems is crucial for training stable and effective deep neural networks. These issues arise from the inherent challenges of propagating gradients in deep architectures. Still, they can be mitigated with the right techniques, such as proper weight initialization, activation functions like ReLU, gradient clipping, and batch normalization. By implementing these solutions, practitioners can ensure smoother convergence and improved performance, paving the way for building robust and scalable deep learning models. Solving these challenges is not just about technical fixes—it’s about unlocking the full potential of deep learning across diverse applications.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Frequently Asked Questions

Q1. What are exploding gradients and vanishing gradients?

A. Exploding gradients occur when model gradients grow uncontrollably during training, causing instability. Vanishing gradients happen when gradients shrink excessively, hindering effective learning and updates.

Q2. What is meant by the vanishing gradient problem?

A. The vanishing gradient problem refers to the phenomenon where gradients diminish as they are propagated backwards through layers, making it difficult for models to learn and update weights, especially in deep networks.

Q3. What is the vanishing exploding gradient problem in RNN?

A. In RNNs, the vanishing gradient problem impedes learning long-term dependencies due to diminishing gradients. In contrast, the exploding gradient problem causes instability and divergent updates due to considerable gradients.

Q4. What is an exploding gradient plot?

A. An exploding gradient plot visualizes the sudden increase in gradient values during training, indicating instability and potential divergence. This issue often requires gradient clipping or other techniques to mitigate.

Yash

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Deep Learning

Feed Forward Networks

Gradient Descent

Loss Function

Activation Functions

Introduction to Neural networks

Forward and Backward Propagation

Optimizers

Learning Rate Schedulers

NN on Structured Data

Improving the Deep Learning Model

Deep Learning Model Optimization

Unsupervised Deep Learning

AutoDL

Model Deployment

Introduction to PyTorch

The Challenge of Vanishing/Exploding Gradients in Deep Neural Networks

Table of contents

A Glimpse of the Backpropagation Algorithm

Understanding the Problems

What is the Vanishing Gradient?

What is an Exploding Gradient?

Why Do the Gradients Even Vanish/Explode?

How to Know if Our Model is Suffering From the Exploding/Vanishing Gradient Problem?

Solutions

Proper Weight Initialization

Wait, but how do we put these strategies into code ??

Using Non-saturating Activation Functions

ReLU ( Rectified Linear Unit )

LReLU (Leaky ReLU)

ELU (Exponential Linear Unit)

Batch Normalization

Gradient Clipping

Conclusion

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit