One of the most common problems data science professionals face is avoiding over-fitting. Have you encountered a situation when your model performs very well on the training data but cannot accurately predict the test data? The reason is your model is overfitting. The solution to such a problem is regularization. In this article, you will learn what batch normalization is.
Note: If you are more interested in learning concepts in an audio-visual format, we have explained this entire article in the video below. If not, you may continue reading.
Regularization techniques help improve a model and allow it to converge faster. We have several regularization tools at our end. Some are early stopping, dropout, weight initialization techniques, and batch normalization in CNN. Regularization helps prevent the model from overfitting, and the learning process becomes more efficient.
In this article, you will learn about batch normalization, also called batch normalisation, and its significance in deep learning. We will explore how batch normalisation in deep learning enhances model performance, stabilizes training, and accelerates convergence.
Before entering into Batch normalization, let’s understand the term “Normalization”.
Normalization is a data pre-processing tool that brings numerical data to a common scale without distorting its shape.
Generally, when we input the data to a machine learning algorithm or deep learning algorithm, we tend to change the values to a balanced scale. We normalize partly to ensure that our model can generalize appropriately.
Now, back to batch normalization is a process that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer.
But what is the reason behind the term “Batch” in batch normalization? A typical neural network is trained using a collected input data set called batch. Similarly, the normalizing process takes place in batches, not as a single input.
Let’s understand this through an example. We have a deep neural network, as shown in the following image.
Initially, our inputs X1, X2, X3, and X4 are normalized as they come from the pre-processing stage. When the input passes through the first layer, it transforms as a sigmoid function applied over the dot product of input X and the weight matrix W. Similarly, this transformation will take place for the second layer and continue until the last layer L, as shown in the following image.
Although our input X was normalized with time, the output will no longer be on the same scale. As the data pass through multiple layers of the neural network and L activation functions are applied, it leads to an internal co-variate shift in the data.
Since we now have a clear idea of why we need Batch Normalization in CNN, let’s understand how it works. It is a two-step process. First, the input is normalized, and later, rescaling and offsetting are performed.
Normalization is the process of transforming the data to have a mean zero and standard deviation one. In this step we have our batch input from layer h, first, we need to calculate the mean of this hidden activation.
Here, m is the number of neurons at layer h.
Once we have achieved our goal, the next step is to calculate the standard deviation of the hidden activations.
Further, as we have the mean and the standard deviation ready. We will normalize the hidden activations using these values. For this, we will subtract the mean from each input and divide the whole value with the sum of standard deviation and the smoothing term (ε).
The smoothing term(ε) assures numerical stability within the operation by stopping a division by a zero value.
In the final operation, the input is re-scaled and offset. Here, two components of the BN algorithm come into the picture: γ(gamma) and β (beta). These parameters are used for re-scaling (γ) and shifting(β) of the vector containing values from the previous operations.
These two are learnable parameters. During the training, the neural network ensures the optimal values of γ and β are used. This will enable the accurate normalization of each batch.
Batch normalization is a deep learning technique that helps our models learn and adapt quickly. It’s like a teacher who helps students by breaking down complex topics into simpler parts.
Imagine you’re trying to hit a moving target with a dart. It would be much harder than hitting a stationary one, right? Similarly, in deep learning, our target keeps changing during training due to the continuous updates in weights and biases. This is known as the “internal covariate shift”. Batch normalization helps us stabilize this moving target, making our task easier.
Batch normalization works by normalizing the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. However, these normalized values may not follow the original distribution. To tackle this, batch normalization introduces two learnable parameters, gamma and beta, which can shift and scale the normalized values.
Now, let’s look into the advantages the BN process offers.
By Normalizing the hidden layer activation, the Batch Normalization in CNN speeds up the training process.
It solves the problem of internal covariate shift. Through this, we ensure that the input for every layer is distributed around the same mean and standard deviation. If you are unaware of what is an internal covariate shift, look at the following example.
Suppose we train an image classification model that classifies images into Dog or Not Dog. Let’s say we have images of white dogs only. These images will also have a certain distribution. Using these images will update the model’s parameters.
Later, if we get a new set of images consisting of non-white dogs, these new images will have a slightly different distribution from the previous images. Now, the model will change its parameters according to these new images. Hence, the distribution of the hidden activation will also change. This change in hidden activation is known as an internal covariate shift.
However, according to a study by MIT researchers, batch normalization does not solve the problem of internal covariate shift.
In this research, they trained three models
Model-1: standard VGG network without batch normalization.
Model-2: Standard VGG network with batch normalization.
Model-3: Standard VGG with batch normalization and random noise.
This random noise has a non-zero mean and non-unit variance and is added after the batch normalization layer. This experiment reached two conclusions.
The second conclusion was the training accuracy of the second and third models is higher than the first model. So it can be concluded that internal co-variate shift might not be a contributing factor in the performance of the batch normalization.
Batch normalization smoothens the loss function, which, in turn, improves the model’s training speed by optimizing the model parameters.
Batch normalization is a topic of huge research interest, and a large number of researchers are working on it. If you are looking for further details on this, I recommend you go through the following links.
TensorFlow can be easily implemented using the tf.keras.layers.BatchNormalization
layer. Here’s a simple example of how to use it in a model:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Conv2D, MaxPooling2D
model = Sequential([
Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu'),
BatchNormalization(),
Conv2D(64, (3, 3), activation='relu'),
BatchNormalization(),
MaxPooling2D(),
Dense(10, activation='softmax')
])
In this example, Batch Normalization is applied after convolutional layers, which is a common practice to help stabilize the training of the model
To summarize, this article explained Batch Normalization and how it improves neural network performance. However, we need not perform all this manually, as deep learning libraries like PyTorch and TensorFlow handle the implementation complexities. Still, as a data scientist, it is worth understanding the intricacies of the back end.
A. Use batch normalization when training deep neural networks to stabilize and accelerate learning, improve model performance, and reduce sensitivity to network initialization and learning rates.
A. Normalization scales input data to a standard range, like 0 to 1. BatchNormalization normalizes intermediate layers’ activations during training, adjusting mean and variance to improve convergence.
A. In Keras, batch normalization standardizes each layer’s inputs to have a mean of zero and variance of one, thus stabilizing and accelerating the training process.
A. Batch normalization acts as a regularization method by reducing overfitting. It introduces noise through mini-batch statistics, which provides a slight regularizing effect similar to dropout.