It all starts with mastering Python’s scikit-learn library. There are no two ways about it – sklearn offers us the path to learn, execute, and improve our machine learning models. It’s the first industry-grade Python library I learned, and it has served me supremely well ever since!
Sklearn is the Swiss Army Knife of data science libraries. It is an indispensable tool in your data science arsenal that will carve a path through seemingly insurmountable hurdles.
I personally love using the sklearn library. And that’s no surprise – sklearn offers a ton of flexibility, and its easy-to-understand documentation has made it the most popular machine learning library in Python.
Since sklearn is the bread and butter of our machine learning projects, it is imperative that we understand some lesser-known but powerful functions that will give you more control and functionality over the model building process.
If you’re new to sklearn and haven’t yet understood how it works, then go ahead and enroll in this free course. It will guide you through all the ins and outs of this wonderful Python library and set you up for your machine learning journey.
This is the fifth part of my Data Science hacks, tips, and tricks series. I highly recommend going through the previous articles to become a more efficient data scientist or analyst.
I have also converted my learning into a free course that you can check out:
Also, if you have your own Data Science hacks, tips, and tricks, you can share them with the open community on this GitHub repository: Data Science hacks, tips and tricks on GitHub.
We’ll cover the following sklearn hacks, tips, and tricks for data science in this article:

- Generate dummy data with make_regression()
- Impute missing values with IterativeImputer
- Select features with SelectFromModel
- Build a baseline model with DummyClassifier
- Visualize results with plot_confusion_matrix
- Transform heterogeneous data with ColumnTransformer
- Save your model with Pickle
Let’s start our first hack with the most essential component – data. You can generate your own random data to perform linear regression by using sklearn’s make_regression() function. It’s very useful in situations where you need to debug your algorithm or simply when you need a small random dataset. The advantage of this function is that it gives you complete control over the behavior of your data.
make_regression() generates a dataset in which the independent variable and the dependent variable have a linear relationship.
You can tweak the parameters to obtain the desired dataset. Let’s understand how it works in this example:
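Here’s a minimal sketch of how it could look (the parameter values below are illustrative, not a prescribed setup):

```python
# Generate a toy regression dataset with a known linear relationship
from sklearn.datasets import make_regression
import pandas as pd

# 100 samples, 5 features (only 2 of them actually informative),
# with Gaussian noise added to the target; random_state makes it reproducible
X, y, coef = make_regression(
    n_samples=100,
    n_features=5,
    n_informative=2,
    noise=10.0,
    coef=True,          # also return the true underlying coefficients
    random_state=42,
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
df["target"] = y
print(df.head())
print("True coefficients:", coef)
```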
A lot of times, we stick to simple, conventional methods for imputing missing values, such as using the mean/median for numerical features and the mode for categorical ones. But why limit ourselves when we have advanced methods at our disposal?
IterativeImputer is a multivariate imputation method, i.e., it uses the entire set of features to estimate the missing values.
With the IterativeImputer strategy, a machine learning model is built to estimate the missing values: each feature with missing values is modeled as a function of the other features, and this is done in a round-robin fashion.
In simple words, the feature with missing values becomes “y”, the dependent variable, and the other feature columns become “X”, the independent variables.
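Here’s a minimal sketch on a toy array (the data is made up for illustration):

```python
import numpy as np

# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with missing values marked as np.nan
X = np.array([
    [1.0,    2.0,    np.nan],
    [3.0,    np.nan, 6.0],
    [np.nan, 8.0,    9.0],
    [7.0,    10.0,   12.0],
])

# Each column with missing values is regressed on the other columns,
# cycling through the columns round-robin for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```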
There are various ways to select features for your model, but I have a favorite function for this job – sklearn’s SelectFromModel. It is a meta-transformer for selecting features based on importance weights. You can choose from a range of estimators, but keep in mind that the estimator must have either a feature_importances_ or coef_ attribute after fitting.
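Here’s a minimal sketch using a random forest as the estimator (the dataset and estimator are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Any estimator exposing feature_importances_ (or coef_) after fitting works here
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected = selector.fit_transform(X, y)

print("Original number of features:", X.shape[1])
print("Selected number of features:", X_selected.shape[1])
```

By default, features whose importance falls below the mean importance are dropped; you can control this cutoff with the threshold parameter.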
How would you judge your machine learning model? What is the basis of your comparison? The solution is a baseline model.
A baseline model is constructed with very simple, basic rules. After building a baseline model, you move on to more complex solutions that must outperform it. Sklearn provides a very simple class to do the job – DummyClassifier. It supports various strategies, such as:

- stratified: generates predictions according to the training set’s class distribution
- most_frequent: always predicts the most frequent class in the training set
- prior: always predicts the class that maximizes the class prior
- uniform: generates predictions uniformly at random
- constant: always predicts a constant label provided by the user
Please note that this classifier is only useful as a simple baseline to compare against real classifiers – it should not be used for real problems.
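Here’s a minimal sketch using the most_frequent strategy (the dataset is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Always predicts the most frequent class seen during training
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

print("Baseline accuracy:", dummy.score(X_test, y_test))
```

Any real classifier you build afterwards should comfortably beat this score.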
A great addition to the latest sklearn release is the new plot_confusion_matrix function. It generates an extremely intuitive and customizable confusion matrix for your classifier.
Bonus tip: You can specify the format of the numbers appearing in the boxes using the values_format parameter (‘n’ for whole numbers, ‘.2f’ for floats, etc.).
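Here’s a minimal sketch (the classifier and dataset are illustrative). One caveat: plot_confusion_matrix was later removed in scikit-learn 1.2 in favor of ConfusionMatrixDisplay.from_estimator, so use that on newer versions:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import plot_confusion_matrix  # scikit-learn < 1.2
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# values_format='n' shows whole numbers in the matrix cells
plot_confusion_matrix(clf, X_test, y_test, values_format='n')
plt.show()
```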
Real-world data is rarely homogeneous, i.e., it contains columns with different data types. This makes it a challenge to apply different transformations to different columns.
Sklearn’s ColumnTransformer provides an easy fix for this. ColumnTransformer applies transformers to the columns of an array or a Pandas DataFrame. You simply pass in a list of tuples of the form (name, transformer, columns), where each transformer is applied to the specified subset of columns.
ColumnTransformer comes in very handy during the data preprocessing stage and is widely used in data pipelines.
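Here’s a minimal sketch on a made-up DataFrame with numerical and categorical columns (the column names and transformers are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data
df = pd.DataFrame({
    "age":    [25, 32, 47, 51],
    "salary": [50000, 64000, 120000, 98000],
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# Each tuple is (name, transformer, columns);
# sparse_threshold=0 forces a dense output array for easy printing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "salary"]),
        ("cat", OneHotEncoder(), ["city"]),
    ],
    sparse_threshold=0,
)

X_transformed = preprocessor.fit_transform(df)
print(X_transformed)
```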
We typically create a function to write a reusable piece of code, but what should we do when we want to reuse our model? We use Pickle!
After training a machine learning model, you may want to use it without having to retrain it from scratch – this is known as model persistence. Let us see how you can save your machine learning model using Pickle:
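Here’s a minimal sketch (the model and file name are illustrative):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the trained model to disk
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (even in a fresh session), load it back and predict without retraining
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:5]))
```

For large models with big NumPy arrays, joblib.dump and joblib.load are a common alternative.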
In this article, we covered seven useful sklearn hacks, tips, and tricks across various sklearn modules and functions to help you become a better and more efficient data scientist. I hope these hacks help you with day-to-day niche tasks and save you a lot of time.
Let me know your Data Science hacks, tips, and tricks in the comments section below!