Gradient Descent Algorithm: How Does it Work in Machine Learning?

Crypto1 Last Updated : 04 Apr, 2025


Imagine you’re lost in a dense forest with no map or compass. What do you do? You follow the path of steepest descent, taking each step in the direction where the ground slopes downward most sharply, which brings you closer to your destination. Similarly, gradient descent is the go-to algorithm for navigating the complex landscape of machine learning and deep learning. It helps models find the optimal set of parameters by iteratively adjusting them in the opposite direction of the gradient. This article takes a deep dive into gradient descent, exploring its different flavors, applications, and challenges. Get ready to sharpen your optimization skills and join the ranks of the machine learning elite!

In this article, you will learn about gradient descent in machine learning, understand how gradient descent works, and explore the gradient descent algorithm’s applications.

Learning objectives:

Gradient Descent Basics: A simple rundown on how gradient descent helps optimize machine learning models by minimizing the cost function.
Types and Implementation: A quick look at the different types of gradient descent (batch, stochastic, and mini-batch) and how you can implement them in Python.
Challenges and Applications: Insight into common challenges like local optima and overfitting, and how gradient descent is used in models like linear regression and neural networks.

This article was published as a part of the Data Science Blogathon.

What is a Cost Function?
What is Gradient Descent?
Example of Gradient Descent Algorithm
Gradient Descent Python Implementation
How Does Gradient Descent Work?
Types of Gradient Descent Algorithm
Plotting the Gradient Descent Algorithm
- Alpha – The Learning Rate
- Local Minima
Code Implementation of Gradient Descent in Python
Advantages and Disadvantages
- Advantages
- Disadvantages
Challenges of Gradient Descent Algorithm
Frequently Asked Questions

What is a Cost Function?

A cost function measures the performance of a model on given data. It quantifies the error between predicted values and expected values and presents it as a single real number.

After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal of reducing the cost, we modify the parameters by applying the gradient descent algorithm over the given data. Here’s the mathematical representation for it:

Figure: Cost function for the gradient descent algorithm (Source: Coursera)
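As one concrete instance, and assuming the mean squared error that the linear regression code later in this article differentiates, the cost for weights $w$ and bias $b$ over $m$ training examples can be written as below. The symbols $w$, $b$, $x_i$, and $y_i$ are notation chosen here for illustration and are not taken from the source figure.

```latex
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - (w^\top x_i + b) \right)^2

\frac{\partial J}{\partial w} = -\frac{2}{m} \sum_{i=1}^{m} x_i \left( y_i - (w^\top x_i + b) \right),
\qquad
\frac{\partial J}{\partial b} = -\frac{2}{m} \sum_{i=1}^{m} \left( y_i - (w^\top x_i + b) \right)
```

The two partial derivatives are the quantities gradient descent uses to decide how to change $w$ and $b$ at each step.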

What is Gradient Descent?

Gradient descent is an optimization algorithm used in machine learning to minimize the cost function by iteratively adjusting parameters in the direction of the negative gradient, aiming to find the optimal set of parameters.

The cost function represents the discrepancy between the predicted output of the model and the actual output. Gradient descent aims to find the parameters that minimize this discrepancy and improve the model’s performance.

The algorithm operates by calculating the gradient of the cost function, which indicates the direction and magnitude of the steepest ascent. However, since the objective is to minimize the cost function, gradient descent moves in the opposite direction of the gradient, known as the negative gradient direction.

By iteratively updating the model’s parameters in the negative gradient direction, gradient descent gradually converges towards the optimal set of parameters that yields the lowest cost. The learning rate, a hyperparameter, determines the step size taken in each iteration, influencing the speed and stability of convergence.

Gradient descent can be applied to various machine learning algorithms, including linear regression, logistic regression, neural networks, and support vector machines. It provides a general framework for optimizing models by iteratively refining their parameters based on the cost function.

Example of Gradient Descent Algorithm

Let’s say you are playing a game in which the players start at the top of a mountain and are asked to reach the lake at its lowest point. Additionally, they are blindfolded. So, what approach do you think would get you to the lake?

Take a moment to think about this before you read on.

The best way is to feel the ground around you and find where the land descends. From that position, take a step in the descending direction and repeat this process until you reach the lowest point.

Finding the lowest point in a hilly landscape.

Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.

To find the local minimum of a function using gradient descent, we must take steps proportional to the negative of the gradient (move away from the gradient) of the function at the current point. If we take steps proportional to the positive of the gradient (moving towards the gradient), we will approach a local maximum of the function, and the procedure is called Gradient Ascent.

Gradient descent was originally proposed by Cauchy in 1847. It is also known as steepest descent.

The goal of the gradient descent algorithm is to minimize the given function (say, cost function). To achieve this goal, it performs two steps iteratively:

Compute the gradient (slope), the first-order derivative of the function at the current point
Make a step (move) in the direction opposite to the gradient, that is, move from the current point by alpha times the gradient at that point, against the direction in which the slope increases

Alpha is called Learning rate – a tuning parameter in the optimization process. It decides the length of the steps.
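The two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name gradient_descent_1d and the example function f(x) = (x - 3)^2 are assumptions made here for demonstration.

```python
def gradient_descent_1d(grad, x0, alpha, num_steps):
    """Repeat x <- x - alpha * grad(x) and return every visited point."""
    x = x0
    path = [x]
    for _ in range(num_steps):
        x = x - alpha * grad(x)  # step against the gradient, scaled by alpha
        path.append(x)
    return path

# Hypothetical example: f(x) = (x - 3)^2 has gradient 2 * (x - 3)
# and its minimum at x = 3.
path = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0, alpha=0.1, num_steps=50)
print("start:", path[0], "after one step:", path[1], "end:", path[-1])
```

Each step shrinks the distance to the minimum, and the steps get shorter as the gradient gets smaller.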

Gradient Descent Python Implementation

Here is how you can implement gradient descent in Python:

import numpy as np

def gradient_descent(X, y, learning_rate, num_iters):
  """
  Performs gradient descent to find optimal weights and bias for linear regression.

  Args:
      X: A numpy array of shape (m, n) representing the training data features.
      y: A numpy array of shape (m,) representing the training data target values.
      learning_rate: The learning rate to control the step size during updates.
      num_iters: The number of iterations to perform gradient descent.

  Returns:
      A tuple containing the learned weights and bias.
  """

  # Initialize weights with random values and the bias with zero
  m, n = X.shape
  weights = np.random.rand(n)
  bias = 0

  # Loop for the number of iterations
  for i in range(num_iters):
    # Predict y values using current weights and bias
    y_predicted = np.dot(X, weights) + bias

    # Calculate the error
    error = y - y_predicted

    # Calculate gradients for weights and bias
    weights_gradient = -2/m * np.dot(X.T, error)
    bias_gradient = -2/m * np.sum(error)

    # Update weights and bias using learning rate
    weights -= learning_rate * weights_gradient
    bias -= learning_rate * bias_gradient

  return weights, bias

# Example usage
X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([2, 4, 5])
learning_rate = 0.01
num_iters = 100

weights, bias = gradient_descent(X, y, learning_rate, num_iters)

print("Learned weights:", weights)
print("Learned bias:", bias)

This code defines a function called gradient_descent, which takes the training data, learning rate, and number of iterations as parameters. It carries out the following steps:
1. Initializes the weights to random values and the bias to zero.
2. Loops for the set number of iterations.
3. Computes the predicted y values using the current weights and bias.
4. Calculates the error between the actual and predicted y values.
5. Computes the gradients of the cost function with respect to the weights and bias.
6. Updates the weights and bias using the gradients and the learning rate.
7. Returns the learned weights and bias.

How Does Gradient Descent Work?

The algorithm aims to minimize the model’s cost function.
The cost function measures how well the model fits the training data and defines the difference between the predicted and actual values.
The cost function’s gradient is the derivative with respect to the model’s parameters and points in the direction of the steepest ascent.
The algorithm starts with an initial set of parameters and updates them in small steps to minimize the cost function.
In each iteration of the algorithm, it computes the gradient of the cost function with respect to each parameter.
The gradient tells us the direction of the steepest ascent, and by moving in the opposite direction, we can find the direction of the steepest descent.
The learning rate controls the step size, which determines how quickly the algorithm moves towards the minimum.
The process is repeated until the cost function converges to a minimum, indicating that the model has reached an optimal (possibly only locally optimal) set of parameters.
Different variations of gradient descent include batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with advantages and limitations.
Efficient implementation of gradient descent is essential for performing well in machine learning tasks. The choice of the learning rate and the number of iterations can significantly impact the algorithm’s performance.
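The idea of repeating until the cost converges can be sketched as follows. This is one possible stopping rule, assumed here for illustration: stop when the cost improves by less than a tolerance tol. The names mse, fit_until_converged, tol, and max_iters are hypothetical, and the model is a single-feature line y = w * x + b.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_until_converged(xs, ys, alpha=0.05, tol=1e-10, max_iters=100_000):
    """Run gradient descent until the cost improves by less than tol."""
    w, b = 0.0, 0.0
    m = len(xs)
    cost = mse(w, b, xs, ys)
    for it in range(1, max_iters + 1):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        grad_w = -2 / m * sum(e * x for e, x in zip(errors, xs))
        grad_b = -2 / m * sum(errors)
        w -= alpha * grad_w
        b -= alpha * grad_b
        new_cost = mse(w, b, xs, ys)
        if abs(cost - new_cost) < tol:  # cost has stopped changing
            return w, b, new_cost, it
        cost = new_cost
    return w, b, cost, max_iters

# Toy data generated by y = 2 * x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b, cost, iters = fit_until_converged(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, cost={cost:.2e}, iterations={iters}")
```

A tolerance on the cost is only one choice; a threshold on the size of the gradient or a fixed iteration budget are common alternatives.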

Types of Gradient Descent Algorithm

The choice of gradient descent algorithm depends on the problem at hand and the size of the dataset. Batch gradient descent is suitable for small datasets, while stochastic gradient descent algorithm is more suitable for large datasets. Mini-batch is a good compromise between the two and is often used in practice.

Batch Gradient Descent

Batch gradient descent updates the model’s parameters using the gradient of the entire training set. It calculates the average gradient of the cost function for all the training examples and updates the parameters in the opposite direction. For a convex cost function and a suitably small learning rate, batch gradient descent converges to the global minimum; for non-convex cost functions it may instead settle in a local minimum. It can also be computationally expensive and slow for large datasets.

Stochastic Gradient Descent

Stochastic gradient descent updates the model’s parameters using the gradient of one training example at a time. It randomly selects an example from the training dataset, computes the gradient of the cost function for that example, and updates the parameters in the opposite direction. Stochastic gradient descent is computationally cheap per update and can converge faster than batch gradient descent. However, its updates are noisy, and it may not settle exactly at the global minimum.
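A minimal sketch of the one-example-at-a-time update, assuming a single-feature linear model y = w * x + b with squared error. The function name sgd, the seed, and the toy data are illustrative assumptions; the data is noise-free, so the estimates settle at the true line, whereas on noisy data they would keep jittering around the minimum.

```python
import random

def sgd(xs, ys, alpha=0.02, num_epochs=1000, seed=0):
    """Stochastic gradient descent for y = w * x + b, one example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the examples in a random order
        for i in indices:
            error = ys[i] - (w * xs[i] + b)
            # Gradient of (error)^2 is -2*error*x for w and -2*error for b,
            # so stepping against it adds alpha * 2 * error * (x or 1).
            w += alpha * 2 * error * xs[i]
            b += alpha * 2 * error
    return w, b

# Toy data generated by y = 2 * x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = sgd(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```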

Mini-Batch Gradient Descent

Mini-batch gradient descent updates the model’s parameters using the gradient of a small subset of the training dataset, known as a mini-batch. It calculates the average gradient of the cost function for the mini-batch and updates the parameters in the opposite direction. The mini-batch gradient descent algorithm combines the advantages of batch and stochastic gradient descent. It is the most commonly used method in practice. It is computationally efficient and less noisy than stochastic gradient descent while still being able to converge to a good solution.
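The mini-batch variant can be sketched the same way, again assuming a single-feature linear model with squared error. The function name minibatch_gd, the batch_size of 2, and the toy data are assumptions made for illustration only.

```python
import random

def minibatch_gd(xs, ys, alpha=0.02, batch_size=2, num_epochs=1000, seed=0):
    """Mini-batch gradient descent for y = w * x + b."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)
        # Walk through the shuffled data one mini-batch at a time
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [ys[i] - (w * xs[i] + b) for i in batch]
            # Average gradient over the mini-batch only
            grad_w = -2 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = -2 / len(batch) * sum(errors)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

# Toy data generated by y = 2 * x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_gd(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Setting batch_size to 1 gives stochastic gradient descent, and setting it to the dataset size gives batch gradient descent.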

Gradient Descent and its Types

Plotting the Gradient Descent Algorithm

When we have a single parameter (theta), we can plot the dependent variable cost on the y-axis and theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes.

It can also be visualized by using contours. This shows a 3-D plot in two dimensions, with the parameters along the axes and the response (the cost) as contour rings. The value of the response increases away from the center and is the same at every point on a given ring. Along any one direction, the response grows with the distance of a point from the center.
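For the single-parameter case, the values that go on the two axes can be computed as in the sketch below. It assumes a one-parameter model y = theta * x with mean squared error; the data and the grid of theta values are made up for illustration.

```python
# Toy data generated by y = 2 * x, so the cost is lowest at theta = 2
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def cost(theta):
    """Mean squared error of the one-parameter model y = theta * x."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

thetas = [i * 0.5 for i in range(9)]  # 0.0, 0.5, ..., 4.0 (x-axis)
costs = [cost(t) for t in thetas]     # matching cost values (y-axis)

for t, c in zip(thetas, costs):
    print(f"theta={t:.1f}  cost={c:.3f}")
```

The thetas and costs lists could then be handed to a plotting library, for example matplotlib's plt.plot(thetas, costs), to draw the bowl-shaped curve.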

Alpha – The Learning Rate

We have the direction we want to move in. Now, we must decide the size of the step we must take.

The learning rate must be chosen carefully for the algorithm to reach a local minimum.

If the learning rate is too high, we might overshoot the minimum and keep bouncing around it without ever reaching it.
If the learning rate is too small, training might take too long.

Depending on the learning rate, four cases can occur:

The learning rate is optimal, and the model converges to the minimum.
The learning rate is too small. It takes more time but converges to the minimum.
The learning rate is higher than the optimal value. It overshoots but converges (1/C < η < 2/C, where η is the learning rate alpha and C is the curvature, i.e. the second derivative, of the cost function near the minimum).
The learning rate is very large. It overshoots and diverges, moving away from the minimum, and learning performance degrades.
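These four cases can be reproduced on a toy problem. The sketch below assumes the cost f(x) = x^2, whose second derivative is 2, so that 1/C = 0.5 and 2/C = 1; the function name run and the specific alpha values are chosen here for illustration.

```python
def run(alpha, x0=1.0, num_steps=20):
    """Gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = x0
    for _ in range(num_steps):
        x -= alpha * 2 * x
    return x

# 0.01: too small, 0.5: optimal for this function,
# 0.9: overshoots but converges, 1.1: overshoots and diverges
for alpha in (0.01, 0.5, 0.9, 1.1):
    print(f"alpha={alpha}: x after 20 steps = {run(alpha):.4f}")
```

With alpha = 0.9 the iterate flips sign on every step while shrinking, and with alpha = 1.1 it flips sign while growing.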

Note: As the gradient decreases while moving towards the local minima, the size of the step decreases. So, the learning rate (alpha) can be constant over the optimization and need not be varied iteratively.

Local Minima

The cost function may have many minimum points. Depending on the initial point (i.e., the initial parameters theta) and the learning rate, gradient descent may settle in any of these minima. Therefore, the optimization may converge to different points for different starting points and learning rates.
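The dependence on the starting point can be seen on a small non-convex example. The function f(x) = x^4 - 3x^2 + x used below is an assumption chosen for illustration; it has one minimum near x = -1.30 and a shallower one near x = 1.13.

```python
def f(x):
    return x ** 4 - 3 * x ** 2 + x

def f_grad(x):
    return 4 * x ** 3 - 6 * x + 1

def descend(x0, alpha=0.01, num_steps=500):
    """Plain gradient descent on f starting from x0."""
    x = x0
    for _ in range(num_steps):
        x -= alpha * f_grad(x)
    return x

x_left = descend(-2.0)   # starts on the left slope
x_right = descend(2.0)   # starts on the right slope
print(f"from -2.0: x={x_left:.4f}, f(x)={f(x_left):.4f}")
print(f"from  2.0: x={x_right:.4f}, f(x)={f(x_right):.4f}")
```

Both runs use the same learning rate and stop at a point where the gradient is essentially zero, yet only the run started on the left reaches the lower of the two minima.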

Code Implementation of Gradient Descent in Python

Advantages and Disadvantages

Advantages

Easy to use: Picture the parameters as a marble on a hill (the cost surface). It’s like rolling the marble yourself: no fancy tools needed, you just gotta push it in the right direction.

Fast updates: Each push (iteration) is quick, you don’t have to spend a lot of time figuring out how hard to push.

Memory efficient: You don’t need a big backpack to carry around extra information, just the marble and your knowledge of the hill.

Usually finds a good spot: Most of the time, the marble will end up in a pretty flat area, even if it’s not the absolute flattest (global minimum).

Disadvantages

Slow for giant hills (large datasets): If the hill is enormous, pushing the marble all the way down each time can be super slow. There are better ways to roll for these giants.

Can get stuck in shallow dips (local minima): The hill might have many dips, and the marble could get stuck in one that isn’t the absolute lowest. It depends on where you start pushing it from.

Finding the perfect push (learning rate): You need to figure out how hard to push the marble (learning rate). If you push too weakly, it’ll take forever to get anywhere. Push too hard, and it might roll right past the flat spot.

Challenges of Gradient Descent Algorithm

While gradient descent is a powerful optimization algorithm, it can also present some challenges affecting its performance. Some of these challenges include:

Local Optima: Gradient descent can converge to local optima instead of the global optimum, especially if the cost function has multiple peaks and valleys.
Learning Rate Selection: The choice of learning rate can significantly impact the performance of gradient descent. If the learning rate is too high, the algorithm may overshoot the minimum, and if it is too low, the algorithm may take too long to converge.
Overfitting: Driving the training cost as low as possible can overfit the training data if the model is too complex or training runs for too long. This can lead to poor generalization performance on new data.
Convergence Rate: The convergence rate of gradient descent can be slow for large datasets or high-dimensional spaces, making the algorithm computationally expensive.
Saddle Points: In high-dimensional spaces, the gradient is close to zero near saddle points, so gradient descent can stall on a plateau and make very slow progress toward a minimum.

Researchers have developed several variations of gradient descent algorithms to overcome these challenges, such as adaptive learning rate, momentum-based, and second-order methods. Additionally, choosing the right regularization method, model architecture, and hyperparameters can also help improve the performance of the gradient descent algorithm.
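As an illustration of the momentum-based idea, here is a minimal sketch of one common formulation (classical, or heavy-ball, momentum). The function name momentum_descent, the coefficient name beta, and the default values are assumptions made here, not something prescribed by this article.

```python
def momentum_descent(grad, x0, alpha=0.1, beta=0.9, num_steps=200):
    """Gradient descent with a velocity term (classical momentum)."""
    x, velocity = x0, 0.0
    for _ in range(num_steps):
        # Keep a fraction beta of the previous step and add the new gradient step
        velocity = beta * velocity - alpha * grad(x)
        x += velocity
    return x

# Hypothetical example: f(x) = (x - 3)^2 with gradient 2 * (x - 3)
x_min = momentum_descent(lambda x: 2 * (x - 3), x0=0.0)
print(f"x_min={x_min:.4f}")
```

Because the velocity accumulates past gradients, the iterate keeps moving through flat regions where a single gradient step would be tiny.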

Conclusion

The gradient descent algorithm is a cornerstone of machine learning optimization techniques. Much like finding your way out of a dense forest by following the path of the steepest descent, gradient descent guides ML models toward optimal performance by iteratively adjusting parameters to minimize the cost function. Its effectiveness in navigating the complex landscape of model training makes it a standard choice, whether applied to linear regression models, neural networks, or other deep learning architectures.

We hope you liked the article! Gradient descent is a powerful optimization technique used in machine learning, and the examples above illustrate how it minimizes error and improves model accuracy through iterative updates.

Importance of Gradient Descent

By mastering gradient descent, you equip yourself with a powerful tool to enhance machine learning models, making them more accurate and reliable. Whether working with small datasets or scaling up to deep learning applications, understanding and effectively implementing gradient descent will significantly elevate your optimization and machine learning expertise.


Frequently Asked Questions

Q1. What is a gradient-based algorithm?

A. The gradient-based algorithm is an optimization method that finds the minimum or maximum of a function using its gradient. In machine learning, these algorithms adjust model parameters iteratively, reducing error by calculating the gradient of the loss function for each parameter.

Q2. What is the best gradient descent algorithm?

A. The “best” gradient descent algorithm depends on the specific problem and context. But Adam (Adaptive Moment Estimation) is widely regarded as one of the most effective and popular algorithms. This is due to its adaptive learning rate and momentum, which help to accelerate convergence and improve performance on a wide range of tasks.

Q3. What are the three types of gradient descent?

A. There are three types of gradient descent: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. These methods differ in updating the model’s parameters and the size of the data batches used in each iteration.

Q4. What is gradient descent in a linear regression model?

A. Gradient descent is an optimization algorithm that minimizes the cost function in linear regression. It iteratively updates the model’s parameters by computing the partial derivatives of the cost function with respect to each parameter and adjusting them in the opposite direction of the gradient.

Q5. Which gradient descent variant is faster?

A. SGD is usually faster than batch gradient descent, especially for large datasets, but it can be noisier. Mini-batch gradient descent gives a good balance between speed and stability.
