Gradient descent is an optimization algorithm widely used in machine learning and deep learning. It iteratively adjusts a model's parameters to minimize a cost function, moving toward a local minimum. In linear regression it is used to find the weights and bias, and in deep learning it drives the parameter updates in backpropagation.
The objective of the algorithm is to identify model parameters, such as weights and bias, that minimize the model's error on the training data.
A gradient measures how much the output of a function changes when its input changes slightly. For a function of one variable, the gradient is simply the slope:

gradient = dy / dx

where dy is the change in y and dx is the change in x.
Learning Rate:
The algorithm designer can set the learning rate. If the learning rate is too small, updates are tiny and the algorithm needs many iterations to reach a good solution; if it is too large, the steps can overshoot the minimum and the error may oscillate or even diverge.
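To make the role of the learning rate concrete, here is a minimal sketch (not from the original article) that minimizes f(x) = x², whose gradient is 2x, with two different learning rates:

# Minimal gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
# The function, starting point, and learning rates are illustrative choices.
def gradient_descent(lr, epochs=20, x=10.0):
    for _ in range(epochs):
        grad = 2 * x       # gradient of x^2 at the current point
        x = x - lr * grad  # the update step: move against the gradient
    return x

print(gradient_descent(lr=0.01))  # ~6.68: too small, still far from the minimum at 0
print(gradient_descent(lr=0.1))   # ~0.12: converges much closer to 0 in 20 steps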
There are three popular types that mainly differ in the amount of data they use:
Batch gradient descent, also known as vanilla gradient descent, computes the error for every example in the training dataset, but the model is updated only after all training samples have been evaluated. One complete pass through the dataset is called a training epoch.
Some benefits of batch gradient descent are its computational efficiency and the stable error gradient it produces, which leads to stable convergence. A drawback is that this stable gradient can sometimes settle on a point of convergence that is not the best the model can achieve. It also requires the entire training dataset to be in memory and available to the algorithm.
import numpy as np

class GDRegressor:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        # Initialize the intercept to 0 and all coefficients to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            # Predictions with the current parameters (full batch)
            y_hat = np.dot(X_train, self.coef_) + self.intercept_
            # Gradient of the MSE loss w.r.t. the intercept
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)
            # Gradient of the MSE loss w.r.t. the coefficients
            coef_der = -2 * np.dot((y_train - y_hat), X_train) / X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
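A quick usage sketch on synthetic data (the dataset and split below are illustrative, not from the article):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression problem
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gd = GDRegressor(learning_rate=0.01, epochs=100)
gd.fit(X_train, y_train)   # prints the learned intercept and coefficients
preds = gd.predict(X_test)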
Advantages

- Produces a stable error gradient and stable convergence
- Computationally efficient: only one update per epoch

Disadvantages

- The stable gradient can converge to a suboptimal point
- Requires the whole training dataset in memory, which scales poorly
- Updates are slow on large datasets, since every example is processed before each update
By contrast, stochastic gradient descent (SGD) updates the parameters once for each individual training example in the dataset. Depending on the problem, this can make SGD faster than batch gradient descent. One benefit is that the frequent updates give a fairly detailed picture of how quickly the model is improving.
However, these frequent updates are more computationally expensive than the batch approach. Their frequency can also produce noisy gradients, which may cause the error rate to fluctuate rather than decrease gradually.
Advantages

- Frequent updates give immediate feedback on model performance
- Only one example needs to be held in memory at a time
- The noise in the updates can help the model escape shallow local minima

Disadvantages

- Per-example updates are computationally expensive overall
- Noisy gradients make the error fluctuate instead of decreasing smoothly
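Before the scikit-learn implementation, here is a minimal from-scratch sketch of SGD for linear regression, mirroring GDRegressor above but updating the parameters one randomly chosen example at a time (an illustrative sketch, not the article's original code):

import random
import numpy as np

class SGDRegressorScratch:
    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            for j in range(X_train.shape[0]):
                # Pick one random example and update the parameters immediately
                idx = random.randint(0, X_train.shape[0] - 1)
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                coef_der = -2 * (y_train[idx] - y_hat) * X_train[idx]
                self.coef_ = self.coef_ - (self.lr * coef_der)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_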
Implementation of the SGD classifier in sklearn:

from sklearn.linear_model import SGDClassifier

X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X, y)
# Output: SGDClassifier(max_iter=5)
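Once fitted, the classifier can predict the class of new points using the standard scikit-learn API:

print(clf.predict([[2., 2.]]))  # e.g. [1]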
Since mini-batch gradient descent combines the ideas of batch gradient descent and SGD, it is the preferred technique. It splits the training dataset into small batches and performs an update for each batch, striking a balance between the efficiency of batch gradient descent and the robustness of stochastic gradient descent.
Mini-batch sizes typically range from 50 to 256, although, as with other machine learning techniques, there is no fixed rule because the right size depends on the application. This is the most popular variant in deep learning and is the method typically used when training a neural network.
import random
import numpy as np

class MBGDRegressor:
    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # Initialize the intercept to 0 and all coefficients to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            for j in range(int(X_train.shape[0] / self.batch_size)):
                # Sample a random mini-batch of row indices
                idx = random.sample(range(X_train.shape[0]), self.batch_size)
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                # Gradients of the MSE loss averaged over the mini-batch
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx]) / self.batch_size
                self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
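A quick usage sketch, reusing the illustrative synthetic data from the GDRegressor example above:

mbgd = MBGDRegressor(batch_size=32, learning_rate=0.01, epochs=100)
mbgd.fit(X_train, y_train)   # prints the learned intercept and coefficients
preds = mbgd.predict(X_test)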
Advantages

- Balances the update speed of SGD with the stability of batch gradient descent
- Does not require the whole dataset in memory; batches can be sized to fit the hardware
- Batched computation vectorizes well on CPUs and GPUs

Disadvantages

- Adds an extra hyperparameter, the batch size, that must be tuned
- Gradients are still noisier than full-batch gradients
Configure Mini-Batch Gradient Descent:
Mini-batch gradient descent is the variant of gradient descent recommended for most applications, especially in deep learning.
Mini-batch sizes, commonly called “batch sizes” for brevity, are often tailored to some aspect of the computing architecture on which the implementation runs, for example a power of 2 that fits the memory of the GPU or CPU hardware, such as 32, 64, 128, or 256.
The batch size acts as a slider on the learning process. Smaller values let the learning process converge quickly, at the cost of noise during training. Larger values converge more slowly but yield more accurate estimates of the error gradient.
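To see this trade-off in practice, one could train the MBGDRegressor defined above with several batch sizes and compare the resulting fits (an illustrative sketch reusing the earlier synthetic data):

for batch_size in [8, 32, 128]:
    # Smaller batches: noisier updates; larger batches: smoother but slower convergence
    model = MBGDRegressor(batch_size=batch_size, learning_rate=0.01, epochs=100)
    model.fit(X_train, y_train)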
In this article, we learned about different types of gradient descent. The key takeaways from the article are:

- Gradient descent minimizes a cost function by repeatedly updating parameters in the direction opposite the gradient, with the step size controlled by the learning rate.
- Batch gradient descent updates the parameters once per epoch, giving stable but memory-hungry convergence.
- Stochastic gradient descent updates after every single example, which is fast to react but noisy.
- Mini-batch gradient descent updates after small batches, balancing the two, and is the standard choice for training neural networks.