Every ML engineer and data scientist must understand the significance of hyperparameter tuning techniques when selecting the right machine/deep learning model and improving its performance.
To put it simply, model selection is a major exercise in every machine learning project, and it depends heavily on choosing the right set of hyperparameters, all of which are indispensable for training a model. Hyperparameters are settings of the selected model that cannot be learnt from the data; they must be provided before the model enters the training stage. Ultimately, the performance of a machine learning model improves with a better choice of hyperparameters and tuning techniques. The main intention of this article is to make you aware of hyperparameter tuning.
Hyperparameter tuning means tweaking these settings of the model, which is often a prolonged process.
Before going into detail, let’s ask ourselves some valuable questions on hyperparameter tuning in machine learning. I am sure this will help you a lot with this magic word. I have experienced it personally and explain it here.
In this article, you will explore hyperparameter tuning, including various hyperparameter optimization techniques and methods. Discover how these hyperparameter tuning methods can significantly enhance your machine learning model’s performance.
This article was published as a part of the Data Science Blogathon.
As we know, some parameters are learned internally from the given dataset and are used in making predictions, classifications and so on. These are the so-called model parameters. They vary with the nature of the data, and we cannot control them since they depend on the data. Examples are ‘m‘ and ‘c‘ in the linear equation y = mx + c, which are the coefficient values learned from the given dataset.
Another set of parameters controls the behaviour of the model/algorithm and can be adjusted to obtain an improved model with optimal performance. These are the so-called hyperparameters, and adjusting them is hyperparameter tuning in machine learning.
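To make the distinction concrete, here is a minimal sketch using scikit-learn's LinearRegression. The toy data (generated from y = 2x + 1) is an assumption made purely for illustration and is not part of the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data from y = 2x + 1 (illustrative values, not the article's dataset)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# fit_intercept is a hyperparameter: we choose it before training
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# coef_ and intercept_ are model parameters: they are learned from the data
m = model.coef_[0]
c = model.intercept_
print(f"learned slope m = {m:.2f}, learned intercept c = {c:.2f}")
print("hyperparameters:", model.get_params())
```

The values of m and c change whenever the data changes, while the entries returned by get_params() stay exactly as we configured them.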
Even the best algorithm(s) will only sparkle if you make the best choice of hyperparameters.
If you ask me what hyperparameters are in simple words, the one-word answer is configuration.
Without thinking too much, I can say a quick example of a hyperparameter is the train-test split ratio (80-20) in our simple linear regression model.
Yes! Now I can see that you’re really starting to feel what HPs could be and how they optimize the model. That’s why I said earlier, in plain language, that these are configuration values.
Let me give one more example: you can compare this with selecting the font and its size for better readability and clarity when you want your document to be perfect and precise.
Coming back to machine learning, recall Ridge Regression (L2 regularization) and Lasso Regression (L1 regularization). In the regularization term we have lambda (λ), the penalty factor, which helps us get a smooth surface instead of an irregular graph.
This term pushes the coefficient (β) values towards zero in magnitude. For more details, please refer to my earlier articles and tutorials on Study of Regularization Techniques of Linear Models and Its Roles. λ is nothing but a hyperparameter, crucial for optimizing model performance.
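The shrinking effect of the penalty factor can be observed directly. The sketch below is a small illustration on synthetic data (an assumption, generated with make_regression); note that scikit-learn calls λ `alpha` in its Ridge estimator.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Size of the learned coefficient vector for this penalty strength
    norm = float(np.linalg.norm(ridge.coef_))
    norms.append(norm)
    print(f"alpha={alpha:>8}: ||beta|| = {norm:.3f}")
```

As alpha grows, the norm of the coefficient vector gets smaller, which is the "push towards zero" described above.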
For better clarity and understanding, here is one more classical representation for you.
From the above equation, you can get a better view of what model parameters and hyperparameters are.
Hyperparameters are supplied as arguments to the model algorithm at initialization, as key-value pairs, and their values are picked by the data scientist who is building the model iteratively.
As we know, every algorithm has a list of HPs, and our job is to figure out the combination of HPs that gives the best results by tweaking them strategically. The set of all possible combinations is called the hyperparameter space. Somewhere in it lies the combination that gives the optimal results, no doubt about that, but finding this combo is not so easy: we have to search throughout the space.
Here every combination of selected HP values is treated as a separate “MODEL” and has to be evaluated on the spot. For this reason, there are two generic approaches to searching the HP space effectively, GridSearchCV and RandomizedSearchCV, and newer methods like Hyperband are gaining popularity due to their efficiency. Here CV denotes cross-validation.
Before applying the above-mentioned search options to the data/model, we must split the data into 3 different sets. I can hear your inner voice: we are already splitting the dataset into train and test, so why one more track? Yes, there is a valid reason for it, which is to prevent “DATA LEAKAGE” during training, validation and testing. Remember, we shouldn’t touch the test dataset until we are ready to move the model into production deployment.
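One common way to get the three sets is to call train_test_split twice. The sketch below assumes a 60/20/20 split and uses placeholder arrays; both choices are illustrative and not taken from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features (illustrative only)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First carve off the test set (20% of all rows)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder: 25% of 80 rows = 20 validation rows
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Hyperparameters are then compared on the validation set (or through cross-validation on the training portion), and the test set is scored only once at the end.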
Well! Let’s quickly understand what data leakage in ML is. It mainly comes from not following some of the recommended best practices during the data science/machine learning life cycle, and the result is data leakage. So what is the issue? After training, the model tests with near-perfect accuracy and is scheduled to move into production. At this moment, all is well.
But when actual, real-time data is fed to this model in the production environment, you get poor scores. By this time, you may wonder why this happened and how to fix it. It happens because of how the data was split into training and testing subsets: during training, the model had knowledge of the data it would later be asked to predict. The test score was therefore overly optimistic, and the predictions after deployment are inaccurate.
The bottom line is that we should avoid doing anything to our training dataset that involves having knowledge of the test dataset to ensure our model performs as a generalized model in production.
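A frequent source of leakage is preprocessing fitted on the full dataset. As a hedged sketch (synthetic data and a StandardScaler plus LogisticRegression pipeline are my assumptions, not the article's setup), wrapping the steps in a scikit-learn pipeline keeps the scaler's statistics tied to the training rows only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler on X_train only, never on X_test
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
uses_train_stats = np.allclose(scaler.mean_, X_train.mean(axis=0))
test_acc = pipe.score(X_test, y_test)
print("scaler fitted on training rows only:", uses_train_stats)
print("test accuracy:", round(test_acc, 3))
```

The same pipeline object can be passed to a search such as GridSearchCV, so that the scaler is refitted inside every cross-validation fold.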
We will go through the available hyperparameters across various algorithms and discuss how to implement these factors to impact the model effectively.
Now it is time to discuss a few hyperparameters in machine learning and their influence on the model.
With train_test_split, we set the test and train sizes for the given dataset, along with random_state, which seeds the random number generator so that the same split is generated every time. If we omit it, a different seed is used on each run and we get different train and test sets, which makes the model hard to trace during evaluation and its results hard to reproduce.
train_test_split( X, y, test_size=0.4, random_state=0)
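The effect of random_state is easy to check. This small sketch uses placeholder arrays (an assumption for illustration) and splits them twice with the same seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)

same_split = np.array_equal(a_test, b_test)
print("same seed, same split:", same_split)
print("train rows:", len(a_train), "test rows:", len(a_test))
```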
The parameter C in the Logistic Regression classifier is the inverse of the regularization parameter λ, that is, C = 1/λ, so a smaller C means stronger regularization.
LogisticRegression(C=1000.0, random_state=0)
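To see the inverse relationship in practice, the following sketch (synthetic data from make_classification, an assumption for illustration) fits the same classifier with a small, a medium and a large C and compares the size of the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data, used only for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    clf = LogisticRegression(C=C, max_iter=5000, random_state=0).fit(X, y)
    # Small C = strong penalty = coefficients pulled towards zero
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>8}: ||coef|| = {coef_norms[C]:.3f}")
```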
As we know, the k-nearest neighbours algorithm (KNN) is a non-parametric method used for regression and classification problems, predominantly classification. Its main hyperparameters are the number of neighbours (n_neighbors) and the power parameter (p) of the Minkowski metric.
KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
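The power parameter p selects which Minkowski distance is used: p=1 gives the Manhattan distance and p=2 the Euclidean distance. The helper below is a hypothetical function written for this illustration, not a scikit-learn API:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

d1 = minkowski([0, 0], [3, 4], p=1)  # Manhattan: |3| + |4|
d2 = minkowski([0, 0], [3, 4], p=2)  # Euclidean: sqrt(3^2 + 4^2)
print("p=1:", d1)
print("p=2:", d2)
```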
SVC(kernel='linear', C=1.0, random_state=0)
Here, the criterion is the function to measure the quality of a split, max_depth is the maximum depth of the tree, and random_state is the seed used by the random number generator.
DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
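As a quick check of what max_depth does, this sketch (on synthetic data, an assumption for illustration) compares a depth-limited tree with an unrestricted one:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

shallow = DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(
    criterion="entropy", max_depth=None, random_state=0).fit(X, y)

# max_depth caps how far the tree may grow; None lets it grow until pure leaves
print("shallow: depth", shallow.get_depth(), "leaves", shallow.get_n_leaves())
print("deep:    depth", deep.get_depth(), "leaves", deep.get_n_leaves())
```

A deeper tree fits the training data more closely, which is exactly why max_depth is worth tuning rather than leaving unrestricted.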
Lasso(alpha=0.1), where alpha is the regularization parameter.
PCA(n_components = 4)
Perceptron(max_iter=40, eta0=0.1, random_state=0)
Overall, hyperparameters influence the factors below while you design your model. Please remember this.
So, once we start thinking about introducing hyperparameter tuning into our model, the overall architecture would look like the one below.
In the ML world, many hyperparameter optimization techniques are available.
Note: When we implement hyperparameter optimization techniques, we must include cross-validation in the flow as well, so that we do not miss the best combinations that work on both training and test data.
Manual search: the name is self-explanatory. The data scientist experiments with different combinations of hyperparameters and their values for the selected model, trains each one, picks the model with the best performance, and then goes on to testing and production deployment. Of course, you are absolutely right to think that this method consumes immense effort. Machine learning frameworks can automate much of this iterative process.
Let’s try this with a simple dataset. This example will serve as a practical tutorial on applying hyperparameter tuning to real-world data.
# Import necessary packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
df = pd.read_csv("pima-indians-diabetes.csv")
print(df.head())
The DataFrame is ready after loading the CSV and the required libraries.
# Train Test Split
y = df['class']
X = df.drop(['class'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=30)
The train and test sets are ready, with the target and feature variables identified.
Since we’re planning a manual search, I am creating 5 sets of hyperparameters for DecisionTreeClassifier and fitting the models:
# sets of hyperparameters
params_1 = {'criterion': 'gini', 'splitter': 'best', 'max_depth': 50}
params_2 = {'criterion': 'entropy', 'splitter': 'random', 'max_depth': 70}
params_3 = {'criterion': 'gini', 'splitter': 'random', 'max_depth': 60}
params_4 = {'criterion': 'entropy', 'splitter': 'best', 'max_depth': 80}
params_5 = {'criterion': 'gini', 'splitter': 'best', 'max_depth': 40}
# Separate models
model_1 = DecisionTreeClassifier(**params_1)
model_2 = DecisionTreeClassifier(**params_2)
model_3 = DecisionTreeClassifier(**params_3)
model_4 = DecisionTreeClassifier(**params_4)
model_5 = DecisionTreeClassifier(**params_5)
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)
model_4.fit(X_train, y_train)
model_5.fit(X_train, y_train)
# Prediction sets
preds_1 = model_1.predict(X_test)
preds_2 = model_2.predict(X_test)
preds_3 = model_3.predict(X_test)
preds_4 = model_4.predict(X_test)
preds_5 = model_5.predict(X_test)
print(f'Accuracy on Model 1: {round(accuracy_score(y_test, preds_1), 3)}')
print(f'Accuracy on Model 2: {round(accuracy_score(y_test, preds_2), 3)}')
print(f'Accuracy on Model 3: {round(accuracy_score(y_test, preds_3), 3)}')
print(f'Accuracy on Model 4: {round(accuracy_score(y_test, preds_4), 3)}')
print(f'Accuracy on Model 5: {round(accuracy_score(y_test, preds_5), 3)}')
Output (sample run; exact values vary between runs because the trees are not given a random_state):
Accuracy on Model 1: 0.693
Accuracy on Model 2: 0.693
Accuracy on Model 3: 0.693
Accuracy on Model 4: 0.736
Accuracy on Model 5: 0.688
Look at how the accuracy differs with the different parameters that we passed in. But this is a tedious job: we have to chase a large number of permutations and combinations to find the best one. I hope you can understand the pain and the code management involved.
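The repetition in a manual search can be reduced with a simple loop. The sketch below is one possible way to do it, not the article's original code: it uses synthetic data in place of the CSV, compares the candidates on a validation split rather than the test set, and fixes random_state on the trees so that the run is repeatable. All three choices are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes CSV (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=30)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=30)

# The same five hyperparameter sets as in the manual search
param_sets = [
    {"criterion": "gini", "splitter": "best", "max_depth": 50},
    {"criterion": "entropy", "splitter": "random", "max_depth": 70},
    {"criterion": "gini", "splitter": "random", "max_depth": 60},
    {"criterion": "entropy", "splitter": "best", "max_depth": 80},
    {"criterion": "gini", "splitter": "best", "max_depth": 40},
]

results = []
for i, params in enumerate(param_sets, start=1):
    model = DecisionTreeClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    results.append((acc, params))
    print(f"Accuracy on Model {i}: {round(acc, 3)}")

# Keep the candidate with the highest validation accuracy
best_acc, best_params = max(results, key=lambda r: r[0])
print("Best:", best_params, round(best_acc, 3))
```

Looping also removes the risk of copy-paste slips between numbered variables, but the number of combinations to write down by hand still grows quickly.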
To implement grid search, Scikit-Learn provides a class called GridSearchCV. The computation time can be long, but it reduces the manual effort by avoiding ‘n’ lines of code. The class itself performs the search and returns the best-performing model and its score. A model is built for each permutation of the given hyperparameters, and internally each one is evaluated and ranked across the given cross-validation folds.
Let’s implement this with the given dataset.
Getting KNeighborsClassifier object for my operation.
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
Fitting my KNN object on the training split
knn_clf.fit(X_train, y_train)
Output
KNeighborsClassifier()
Importing other required libraries
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
Preparing the grid of hyperparameters, with 8 values of n_neighbors and 4 different algorithms:
param_grid = {'n_neighbors': list(range(1,9)),'algorithm': ('auto', 'ball_tree', 'kd_tree' , 'brute') }
Defining the number of folds for GridSearchCV and fitting it on the training split:
gs = GridSearchCV(knn_clf,param_grid,cv=10)
gs.fit(X_train, y_train)
Output
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),param_grid={'algorithm': ('auto', 'ball_tree', 'kd_tree', 'brute'),'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8]})
We will print all 32 combinations (4 algorithms × 8 values of n_neighbors).
gs.cv_results_['params']
Output 32 combinations
[{'algorithm': 'auto', 'n_neighbors': 1},
{'algorithm': 'auto', 'n_neighbors': 2},
{'algorithm': 'auto', 'n_neighbors': 3},
{'algorithm': 'auto', 'n_neighbors': 4},
{'algorithm': 'auto', 'n_neighbors': 5},
{'algorithm': 'auto', 'n_neighbors': 6},
{'algorithm': 'auto', 'n_neighbors': 7},
{'algorithm': 'auto', 'n_neighbors': 8},
{'algorithm': 'ball_tree', 'n_neighbors': 1},
{'algorithm': 'ball_tree', 'n_neighbors': 2},
{'algorithm': 'ball_tree', 'n_neighbors': 3},
{'algorithm': 'ball_tree', 'n_neighbors': 4},
{'algorithm': 'ball_tree', 'n_neighbors': 5},
{'algorithm': 'ball_tree', 'n_neighbors': 6},
{'algorithm': 'ball_tree', 'n_neighbors': 7},
{'algorithm': 'ball_tree', 'n_neighbors': 8},
{'algorithm': 'kd_tree', 'n_neighbors': 1},
{'algorithm': 'kd_tree', 'n_neighbors': 2},
{'algorithm': 'kd_tree', 'n_neighbors': 3},
{'algorithm': 'kd_tree', 'n_neighbors': 4},
{'algorithm': 'kd_tree', 'n_neighbors': 5},
{'algorithm': 'kd_tree', 'n_neighbors': 6},
{'algorithm': 'kd_tree', 'n_neighbors': 7},
{'algorithm': 'kd_tree', 'n_neighbors': 8},
{'algorithm': 'brute', 'n_neighbors': 1},
{'algorithm': 'brute', 'n_neighbors': 2},
{'algorithm': 'brute', 'n_neighbors': 3},
{'algorithm': 'brute', 'n_neighbors': 4},
{'algorithm': 'brute', 'n_neighbors': 5},
{'algorithm': 'brute', 'n_neighbors': 6},
{'algorithm': 'brute', 'n_neighbors': 7},
{'algorithm': 'brute', 'n_neighbors': 8}]
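The size of the grid can be checked before running any training. The sketch below uses scikit-learn's ParameterGrid with the same grid as above; the variable names `combos` and `n_cv_fits` are my own.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_neighbors": list(range(1, 9)),
    "algorithm": ("auto", "ball_tree", "kd_tree", "brute"),
}

# Every combination the grid search will try
combos = list(ParameterGrid(param_grid))

# With cv=10, each combination is trained and scored on 10 folds
n_cv_fits = len(combos) * 10
print("combinations:", len(combos))
print("cross-validation fits with cv=10:", n_cv_fits)
print("first combination:", combos[0])
```

This makes the cost of grid search explicit: adding one more hyperparameter with 5 candidate values would multiply the number of fits by 5.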
Let’s get the best parameter from the list.
gs.best_params_
Output
{'algorithm': 'auto', 'n_neighbors': 6}
The cross-validation process computes a mean score for every combination; let’s get those results
gs.cv_results_['mean_test_score']
Output
array([0.68134172, 0.71701607, 0.71331237, 0.71509434, 0.72075472,
0.73944794, 0.72085954, 0.73392732, 0.68134172, 0.71701607,
0.71331237, 0.71509434, 0.72075472, 0.73944794, 0.72085954,
0.73392732, 0.68134172, 0.71701607, 0.71331237, 0.71509434,
0.72075472, 0.73944794, 0.72085954, 0.73392732, 0.68134172,
0.71701607, 0.71331237, 0.71509434, 0.72075472, 0.73944794,
0.72085954, 0.73392732])
That’s fine, but which is the best accuracy in the list above? This is simple: we already found that the best parameters are {‘algorithm’: ‘auto’, ‘n_neighbors’: 6}. Comparing the list of 32 combinations with the list of accuracies, the answer is 0.73944794, the highest value in the list and the best mean cross-validated accuracy on the training data. The scores repeat for each algorithm because that setting only changes how the neighbours are searched for, not which neighbours are found.
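Instead of reading the best value off the array by eye, it can be pulled from the fitted search object. The following sketch runs a smaller grid on synthetic data (both are assumptions for illustration) and uses the `best_score_`, `best_index_` and `best_params_` attributes of GridSearchCV:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

gs = GridSearchCV(KNeighborsClassifier(),
                  {"n_neighbors": list(range(1, 9))}, cv=5)
gs.fit(X, y)

scores = gs.cv_results_["mean_test_score"]
# best_score_ is the highest mean cross-validated score in the list
print("max of mean_test_score:", scores.max())
print("best_score_:           ", gs.best_score_)
print("best_params_:          ", gs.best_params_)
```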
Accuracy of the best model on the test set
print(gs.score(X_test,y_test))
Output
0.70129870
The grid search discussed above usually increases the computational cost, so GS is sometimes considered inefficient, since it attempts every combination of the given hyperparameters. Randomized search instead trains models on randomly sampled hyperparameter combinations, so the number of models trained is smaller than in grid search.
In simple terms, in random search we train and test our model on random combinations drawn from the given grid of hyperparameters.
Getting RandomForestClassifier object for my operation.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
Building my RandomForestClassifier object
# build a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)
Specifying the list of parameters and distributions
param_dist = {"max_depth": [3, None],
"max_features": sp_randint(1, 11),
"min_samples_split": sp_randint(2, 11),
"min_samples_leaf": sp_randint(1, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
Defining the sample, distributions and cross-validation
samples = 8 # number of random samples
randomCV = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=samples,cv=3)
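To see what "random samples" means here, the sampled combinations can be listed without training anything. This sketch uses scikit-learn's ParameterSampler on a reduced version of the distributions; the reduced dictionary and the bound of 9 (for a dataset with 8 feature columns) are assumptions for illustration. Keep in mind that the upper bound of scipy's randint is exclusive, so sp_randint(1, 9) yields 1 to 8 and sp_randint(1, 11) yields 1 to 10.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Reduced distributions, for illustration only
param_dist = {
    "max_depth": [3, None],
    "max_features": sp_randint(1, 9),   # integers 1..8 (upper bound excluded)
    "bootstrap": [True, False],
}

# Draw 8 random combinations, as n_iter=8 does inside RandomizedSearchCV
sampled = list(ParameterSampler(param_dist, n_iter=8, random_state=0))
for combo in sampled:
    print(combo)
```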
All parameters are set, so let’s fit the model
randomCV.fit(X, y)
print(randomCV.best_params_)
Output
{'bootstrap': False, 'criterion': 'gini', 'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 7, 'min_samples_split': 8}
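The cross-validation process computes a mean score for every sampled combination; let’s get those results. A nan entry marks a combination whose fits failed, here most likely because max_features was sampled above the number of feature columns in the dataset.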
randomCV.cv_results_['mean_test_score']
Output
array([0.73828125, 0.69010417, 0.7578125 , 0.75911458, 0.73828125,
nan, nan, 0.7421875 ])
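Accuracy on the test set. Note that randomCV was fitted on the full X and y above, so the test rows were already seen during training and this score is optimistic; fitting on X_train and y_train gives an honest estimate.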
print(randomCV.score(X_test,y_test))
Output
0.8744588744588745
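For comparison, here is a hedged sketch of the same kind of search fitted on the training split only, so that the test score is not inflated by leakage. It uses synthetic data with 8 features in place of the CSV, a smaller forest, and fixed seeds; all of these are assumptions for illustration, and the numbers it prints are not comparable to the output above.

```python
import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in with 8 feature columns (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

param_dist = {
    "max_depth": [3, None],
    "max_features": sp_randint(1, 9),      # never exceeds the 8 features
    "min_samples_split": sp_randint(2, 11),
    "min_samples_leaf": sp_randint(1, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_dist, n_iter=5, cv=3, random_state=0)

# Fit on the training split only; the test rows stay unseen
search.fit(X_train, y_train)
test_acc = search.score(X_test, y_test)

print("best params:", search.best_params_)
print("mean CV scores:", search.cv_results_["mean_test_score"])
print("test accuracy:", round(test_acc, 3))
```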
You may now ask which technique is the best to go with. The straight answer is RandomizedSearchCV. Let’s see why.
GridSearchCV | RandomizedSearchCV |
Grid is well defined | Grid is not well defined |
Discrete values for HP-params | Continuous values and statistical distributions |
Defined size for the hyperparameter space | No such restriction |
Picks the best combination from the HP space | Picks samples from the HP space |
Samples are not created | Samples are created, specified by the range and n_iter |
Lower performance than RandomizedSearchCV | Better performance and results |
Guided flow to search for the best combination | As the name itself says, no guidance |
The pictorial representation below would give you the best understanding of GridSearchCV and RandomizedSearchCV.
Guys! So far we have made a detailed study of hyperparameters from the machine learning point of view. Please remember a few things before we go.
Hope you like the exploration of hyperparameter tuning! By employing various hyperparameter optimization techniques and methods, you can significantly enhance your model’s performance and achieve better results.
Thanks for your time; we will connect again on different topics. Until then, bye! Cheers!
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.
A. The objective function evaluates the performance of a machine learning model at various hyperparameter settings. The main goal during hyperparameter tuning is to find the optimal hyperparameters that maximize or minimize this function, which often requires extensive computational resources to explore the search space effectively.
A. Frameworks such as PyTorch and TensorFlow provide robust, flexible environments that simplify the implementation of complex models like SVMs or neural networks. They allow for efficient model training and tuning, utilizing powerful libraries that can handle the vast possible combinations of hyperparameters without excessive consumption of computational resources.
A. To prevent overfitting during hyperparameter tuning, it’s crucial to use a rigorous evaluation scheme such as cross-validation. This method helps ensure that the model generalizes well to new data. Additionally, adjusting the learning algorithm to include regularization terms and selecting a search space that restricts overly complex models can significantly reduce the risk of overfitting.
GridSearchCV is like a hyperparameter tuning machine in machine learning. You give it a bunch of options for different settings (hyperparameters) and it tries them all out to find the best combo for your model. It’s like trying different ingredients in a recipe to see which one tastes the best!