In neural networks, loss functions tell us how well the model is performing at the current instant, i.e., how good or poor its predictions are. To train the network to perform better on unseen data, we use this loss: we aim to minimize it, because a lower loss implies a better-performing model. Optimization, then, means minimizing (or maximizing) a mathematical expression. In this article, we'll take a deep dive into the world of gradient-based optimizers for deep learning models, discuss the foundational mathematics behind them, and weigh their advantages and disadvantages.
As discussed in the introduction, optimizers update the parameters of a neural network, such as its weights and biases, to minimize the loss function. The loss function describes the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.
Let us imagine a climber hiking down a hill with no sense of direction. He doesn't know the right way to reach the valley, but he can tell whether he is moving closer to it (going downhill) or further away (going uphill). If he keeps taking steps in the correct direction, he will reach his aim, i.e., the valley.
This is the intuition behind optimizers: to reach the global minimum of the loss function.
The different variants of gradient descent optimizers are discussed below.
Gradient descent is an optimization algorithm used when training deep learning models. It works best on convex, differentiable functions, and it updates the model's parameters iteratively to minimize a given function toward a local minimum.
The parameter update rule is:

θj := θj − α ⋅ ∂J(θ)/∂θj

In the above formula, θj is the j-th model parameter, α is the learning rate, and J(θ) is the cost function. As you can see, the gradient is the partial derivative of J (the cost function) with respect to θj.
Note that as we get closer to the global minimum, the slope (gradient) of the curve becomes less and less steep. This results in a smaller value of the derivative, which in turn shrinks the effective step size automatically, even though the learning rate itself stays fixed.
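To see this effect numerically, here is a minimal sketch (using the illustrative one-parameter loss J(θ) = θ², which is not from the original article) showing how the gradient, and with it the effective step size, shrinks as θ approaches the minimum at zero:

theta = 4.0           # start away from the minimum of J(theta) = theta**2 at 0
learning_rate = 0.1
for step in range(5):
    gradient = 2 * theta                      # derivative of theta**2
    theta = theta - learning_rate * gradient
    print(f"step {step}: gradient = {gradient:.3f}, theta = {theta:.3f}")

The printed gradient shrinks on every step, so learning_rate * gradient (the actual distance moved) shrinks automatically near the minimum.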
Gradient descent is the most basic but most widely used optimizer. It directly uses the derivative of the loss function and the learning rate to reduce the loss, trying to reach the global minimum.
Thus, the gradient descent optimization algorithm has many applications, from fitting simple regression models to training deep neural networks.
The update equation described above computes the gradient of the cost function J(θ) with respect to the network parameters θ over the entire training dataset.
Our aim is to reach the bottom of the graph (cost vs. weight), or a point where we can no longer move downhill: a local minimum.
In general, the gradient represents the slope of a function, and its components are partial derivatives. They describe the change in the loss function in response to a small change in the function's parameters. This tells us which way to step next to reduce the loss function's output.
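As a rough illustration of this idea, a partial derivative can be approximated by nudging one parameter slightly and watching how the loss changes. The loss function and the numerical_gradient helper below are illustrative assumptions, not part of the article:

def loss(theta):
    # Illustrative loss: a quadratic bowl in two parameters
    return theta[0] ** 2 + 3 * theta[1] ** 2

def numerical_gradient(f, theta, eps=1e-6):
    # Approximate each partial derivative with a small finite difference
    grad = []
    for j in range(len(theta)):
        nudged = list(theta)
        nudged[j] += eps                          # tiny change in one parameter
        grad.append((f(nudged) - f(theta)) / eps)  # resulting change in loss
    return grad

print(numerical_gradient(loss, [1.0, 2.0]))   # approximately [2.0, 12.0]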
The learning rate controls the size of the steps our optimization algorithm takes toward the global minimum. To ensure that the gradient descent algorithm reaches the minimum, we must set the learning rate to an appropriate value that is neither too low nor too high.
Taking very large steps, i.e., a large learning rate, may overshoot the global minimum, so the model never reaches the optimal value of the loss function. On the contrary, taking very small steps, i.e., a small learning rate, will take forever to converge.
Thus, the step size also depends on the gradient value.
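A small sketch makes this trade-off concrete. Reusing the illustrative quadratic loss J(θ) = θ² from before, it compares a well-chosen learning rate with one that is too small and one that is too large:

for learning_rate in (0.1, 0.001, 1.1):
    theta = 4.0
    for _ in range(20):
        theta = theta - learning_rate * (2 * theta)   # gradient of theta**2
    print(f"learning rate {learning_rate}: theta = {theta:.3f} after 20 steps")

With 0.1, theta converges smoothly toward 0; with 0.001, it barely moves; with 1.1, each step overshoots the minimum so badly that theta diverges.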
As we discussed, the gradient points in the direction of steepest increase. However, we aim to find the minimum point in the valley, so we have to move in the opposite direction of the gradient. Therefore, we update the parameters in the negative gradient direction to minimize the loss.
Algorithm: θ=θ−α⋅∇J(θ)
In code, Batch Gradient Descent looks something like this:
for x in range(epochs):
    # Gradient of the loss computed over the entire dataset at once
    params_gradient = find_gradient(loss_function, data, parameters)
    # Move against the gradient, scaled by the learning rate
    parameters = parameters - learning_rate * params_gradient
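As a concrete, runnable counterpart to the loop above, here is a minimal sketch of batch gradient descent fitting a straight line under a mean-squared-error loss; the toy dataset and the closed-form gradients are illustrative assumptions, not from the original article:

import numpy as np

# Toy dataset: y = 2x + 1 plus a little noise
rng = np.random.default_rng(seed=0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0          # the parameters we are fitting
learning_rate = 0.1
for epoch in range(200):
    error = (w * X + b) - y
    # Gradients of the mean-squared-error loss over the *whole* dataset
    grad_w = 2.0 * np.mean(error * X)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")   # close to the true values 2 and 1

Note how every epoch touches all 100 points before a single parameter update is made, which is exactly what makes batch gradient descent memory-hungry on large datasets.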
The SGD algorithm comes into the picture as an extension of gradient descent, overcoming some of its disadvantages. One disadvantage of the gradient descent algorithm is that it requires a lot of memory: the entire dataset must be loaded at once to compute the derivative of the loss function. In the SGD algorithm, we instead compute the derivative one data point at a time, i.e., we update the model's parameters much more frequently. The parameters are updated after the loss is computed on each training example.
So, if we have a dataset containing 1,000 rows, SGD will update the model parameters 1,000 times in one complete pass over the dataset (one epoch), instead of just once as in gradient descent.
Algorithm: θ = θ − α⋅∇J(θ; x(i), y(i)), where (x(i), y(i)) is a single training example
We want training to be even faster, so we take a gradient descent step for each individual training example. Comparing how SGD and GD descend toward the minimum yields some insights:
It is observed that SGD takes more iterations than GD to reach the minimum, whereas GD reaches it in fewer steps. The SGD path is also noisier: because the parameters are updated after every single example, the updates have high variance, and the loss function fluctuates with varying intensity.
The code snippet simply adds a loop over the training examples and computes the gradient for each one:
import numpy as np

for x in range(epochs):
    # Shuffle so the examples are visited in a different order each epoch
    np.random.shuffle(data)
    for example in data:
        # Gradient from a single training example
        params_gradient = find_gradient(loss_function, example, parameters)
        parameters = parameters - learning_rate * params_gradient
The MB-GD algorithm comes into the picture as an extension of SGD, overcoming the problem of SGD's long training time. That's not all: it also mitigates the memory problem of plain gradient descent. It is therefore considered the best among all the variants of gradient descent. The MB-GD algorithm takes a batch of points, i.e., a subset of the dataset, to compute the derivative.
It is observed that after several iterations, the derivative of the loss function for MB-GD is almost the same as for GD. However, MB-GD takes more iterations to reach the minimum than GD, though each iteration is cheaper to compute because it only processes a batch of points.
Therefore, the weight update depends on the derivative of the loss for a batch of points. The updates in MB-GD are noisier than in GD, because the derivative computed on a batch does not always point exactly towards the minimum.
This algorithm divides the dataset into batches and updates the model parameters after every batch.
Algorithm: θ = θ − α⋅∇J(θ; B(i)), where B(i) is a batch of training examples
In the code snippet, instead of iterating over examples, we now iterate over mini-batches of size 30:
for x in range(epochs):
    # Shuffle so the batches differ from epoch to epoch
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=30):
        # Gradient computed over one mini-batch of 30 examples
        params_gradient = find_gradient(loss_function, batch, parameters)
        parameters = parameters - learning_rate * params_gradient
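The get_batches helper is not defined in the article; a minimal sketch of what it might look like, assuming data supports slicing (e.g., a NumPy array), is:

def get_batches(data, batch_size):
    # Yield consecutive mini-batches; the last one may be smaller
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]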
In conclusion, gradient-based optimization techniques such as batch, stochastic, and mini-batch gradient descent play a crucial role in enhancing neural network performance by fine-tuning model parameters to minimize the loss function. Each optimizer offers unique advantages and faces specific challenges that affect convergence speed, memory efficiency, and stability. By understanding these methods, data scientists can tailor their training strategies to the characteristics of the dataset, significantly improving the model's chances of reaching optimal performance and facilitating advances across a wide range of applications.
Q1. What is a gradient-based optimizer?
A. A gradient-based optimizer is an algorithm that adjusts the parameters of a machine learning model by utilizing gradients (derivatives) of the loss function. It aims to minimize the loss by iteratively updating parameters in the direction of the steepest descent, thus improving model performance.
Q2. What is a gradient descent optimizer?
A. A gradient descent optimizer is a specific type of gradient-based optimizer that updates model parameters by calculating the gradient of the loss function with respect to the parameters. It adjusts parameters iteratively to converge towards a local or global minimum of the loss function, enhancing model accuracy.
Q3. What is gradient-based learning?
A. Gradient-based learning refers to a learning paradigm where algorithms optimize a model by minimizing a loss function using gradients. By computing these gradients, the model adjusts its parameters iteratively, leading to improved performance. This concept is foundational in training deep learning and neural network models.
Q4. What is the gradient search method?
A. The gradient search method is an optimization technique that finds the minimum or maximum of a function by following the direction of the gradient. It iteratively evaluates the function's slope and updates parameters accordingly, gradually approaching the optimal solution. This method is widely used in machine learning and mathematical optimization.