A deep neural network consists of multiple layers: an input layer, one or more hidden layers, and an output layer. To develop any deep learning model, one must choose appropriate values for a number of hyperparameters, such as the activation function, batch size, and learning rate, in order to fine-tune each of these layers.
Hyperparameters control the learning process, so their values directly affect other model parameters such as weights and biases, which in turn determine how well the model performs. The accuracy of a machine learning model is most often improved by fine-tuning these hyperparameters. It is therefore imperative for a machine learning researcher to understand them well.
In this article, we will dig into some of these hyperparameters and look at how changing them affects model performance.
We have considered the Pima Indian Diabetes Dataset here, which contains information on 768 women from a population near Phoenix, Arizona, USA. The dependent variable is diabetes status: 268 of the 768 women belong to the ‘positive’ class, while the remaining 500 belong to the ‘negative’ class. The dataset has 8 attributes: pregnancies, OGTT (Oral Glucose Tolerance Test), blood pressure, skin thickness, insulin, BMI (Body Mass Index), age, and diabetes pedigree function.
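For reference, the data can be loaded and split along these lines. This is a minimal sketch: the file name pima-indians-diabetes.csv and the 80/20 split are assumptions, not details from the original setup.

import numpy as np
from sklearn.model_selection import train_test_split

# Load the dataset: 8 attribute columns followed by the binary outcome column
dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')
X = dataset[:, 0:8]   # the 8 attributes described above
y = dataset[:, 8]     # 1 = positive class, 0 = negative class

# Hold out a test set for evaluating each hyperparameter setting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)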
The neural network we used had an 8-node input layer, followed by a first hidden layer with 12 nodes and a second hidden layer with 8 nodes, all of which were connected to a single node in the output layer. Since we were dealing with a binary classification problem, the output layer used a sigmoid activation function to produce a binary output value.
The following table shows the settings of the different experiments we carried out to study the effect of these hyperparameters on model performance.
Neural networks are trained with the gradient descent technique, in which an estimate of the error (the difference between actual and predicted values) on a subset of the training dataset is used to update the weights at every iteration. The batch size is the number of training examples used to estimate the error gradient, and it is an important hyperparameter that influences the dynamics of the learning algorithm. In mini-batch gradient descent, the batch size is set to more than one but less than the total number of examples in the training dataset.
In Python, the batch size can be specified while training the model as follows:
# Train the model with a given batch size; BATCH_SIZE and NB_EPOCHS are set per experiment
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=NB_EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose=0,
                    shuffle=False)
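To reproduce the batch-size comparison, the model can be rebuilt and retrained for each candidate value. A minimal sketch, assuming the architecture described earlier; build_model is a hypothetical helper and the list of batch sizes is illustrative, neither is from the original code.

from keras.models import Sequential
from keras.layers import Dense

def build_model():
    # Fresh model per run so each batch size trains from scratch (hypothetical helper)
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

histories = {}
for batch_size in [8, 16, 32, 64, 128]:   # illustrative candidate values
    model = build_model()
    histories[batch_size] = model.fit(
        X_train, y_train, validation_data=(X_test, y_test),
        epochs=NB_EPOCHS, batch_size=batch_size, verbose=0, shuffle=False)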
The plots above show the classification accuracy of the model on the train and test datasets for different batch sizes when using mini-batch gradient descent.
An activation function in a neural network transforms the weighted sum of a node's inputs into that node's output. Without activation functions, a neural network reduces to a linear regression model; it is the non-linear transformation of the inputs that makes a network capable of learning and performing more complex tasks. Commonly used non-linear activation functions include sigmoid, tanh, and ReLU.
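For intuition, the three functions can be written out directly; a minimal NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes values into (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)         # keeps positive values, zeroes out negatives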
In Python, the activation function for each hidden layer can be specified while building the model using Keras as follows:
from keras.models import Sequential
from keras.layers import Dense

# Two hidden layers with ReLU activations and a sigmoid output for binary classification
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu', kernel_initializer='uniform'))
model.add(Dense(8, activation='relu', kernel_initializer='uniform'))
model.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))
The plots above show the classification accuracy of the model on the train and test datasets with different activation functions in the hidden layers.
While training a deep learning model, the weights and biases associated with each node of a layer are updated at every iteration, with the objective of minimizing the loss function. This adjustment of weights is carried out by algorithms such as stochastic gradient descent, known as optimizers.
Other algorithms, such as adaptive learning rate optimizers, adjust the learning rate throughout training in response to the performance of the model. Perhaps the simplest implementation is to reduce the learning rate once the model's performance plateaus, such as by a factor of two or an order of magnitude. Although no single method works best on all problems, several adaptive learning rate methods (AdaGrad, RMSProp, Adam, and AdaMax) have proven robust across many types of neural network architectures and problems; all of them maintain and adapt a separate learning rate for each weight in the model.
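In Keras, the plateau-based schedule described above is available as the ReduceLROnPlateau callback. A minimal sketch; the factor and patience values here are illustrative choices, not settings from the original experiments.

from keras.callbacks import ReduceLROnPlateau

# Halve the learning rate if validation loss has not improved for 5 epochs
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=5, min_lr=1e-6)

model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=NB_EPOCHS, batch_size=BATCH_SIZE,
          callbacks=[reduce_lr], verbose=0)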
In Python, we can specify the optimizer while building the model in Keras as follows:
from keras import optimizers

# Candidate optimizers, each with a learning rate of 0.001
sgd = optimizers.SGD(lr=0.001)
rmsprop = optimizers.RMSprop(lr=0.001)
adagrad = optimizers.Adagrad(lr=0.001)
adam = optimizers.Adam(lr=0.001)
adamax = optimizers.Adamax(lr=0.001)

# Compile the model with the chosen optimizer
model.compile(loss='binary_crossentropy', optimizer=rmsprop, metrics=['accuracy'])
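To compare the optimizers as in the experiments above, each run needs a freshly built model and a fresh optimizer instance. A sketch under those assumptions, repeating the architecture defined earlier:

from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

results = {}
for name, make_opt in [('sgd', lambda: optimizers.SGD(lr=0.001)),
                       ('rmsprop', lambda: optimizers.RMSprop(lr=0.001)),
                       ('adagrad', lambda: optimizers.Adagrad(lr=0.001)),
                       ('adam', lambda: optimizers.Adam(lr=0.001)),
                       ('adamax', lambda: optimizers.Adamax(lr=0.001))]:
    # Rebuild the network so each optimizer trains a fresh model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=make_opt(),
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=NB_EPOCHS,
              batch_size=BATCH_SIZE, verbose=0)
    _, results[name] = model.evaluate(X_test, y_test, verbose=0)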
The plots above compare the classification accuracy of the model on the train and test datasets for each of these optimizers.
In this article, we investigated the effects of various hyperparameter configurations (batch size, activation function, and optimizer) on the performance of a deep learning model for diabetes prediction.