Machine Learning is one of the fastest-growing technologies of the modern era. New innovations in ML and AI are made every day, helping the world leap forward. Earlier, a newcomer to the ML field found it difficult to create accurate machine learning models; now there are AutoML libraries that help beginners build an accurate model with much less work involved. Many AutoML libraries take the data as input and provide a good model with better accuracy for that data. In today’s article, we discuss one of the commonly used AutoML libraries: EvalML.
Have you heard of Automated Machine Learning? Well, it’s okay if you haven’t! Automated Machine Learning, or AutoML, is simply the process of automating real-world machine learning tasks. Using AutoML proves to be a great benefit not only in efficiency but also in the quality and accuracy of the resulting ML model. In the future, we can certainly expect more research in Automated Machine Learning (AutoML), and it will play a crucial role in Data Science.
With the automated process in AutoML, we can validate whether a machine learning model is the best one to use or whether it should be replaced with another. Glancing at its industrial applications, we see that AutoML can optimize operations, create business models, and increase product quality, all with the use of advanced insights and analytics, thus providing value to your business. You can even build and operate ML models without deep data science skills. That does not mean it is a tool purely for non-ML experts, though; knowing ML is still a prerequisite.
EvalML is an open-source AutoML library written in Python that automates a large part of the machine learning process and lets us easily evaluate which machine learning pipeline works best for a given dataset. It builds and optimizes ML pipelines using specific objective functions, and it can automatically perform feature selection, model building, hyper-parameter tuning, cross-validation, and more. It also offers a wide range of tools for understanding models. It is complemented by Featuretools, a framework for automated feature engineering, and Compose, a framework for automated prediction engineering.
Run the command below to get EvalML installed on your PC. Note that your PC should have Python version 3.5 or above.

Install via pip (from PyPI):

pip install evalml

Alternatively, if you have a license key, EvalML can be installed through the extra index URL:

pip install evalml --extra-index-url https://install.featurelabs.com/<license>/
Objective functions are what EvalML tries to maximize or minimize during a pipeline search. Since this feedback from the pipelines drives the optimization of the models, it is important to have a suitable objective function. With EvalML we can train and optimize a model for a given problem either by choosing one of the domain-specific objective functions or by defining a custom objective function; you just need to determine the objective of your use case. A minimal sketch of a custom objective is shown below.
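As an illustration, here is a minimal sketch of a custom binary-classification objective, following the subclassing pattern shown in EvalML's documentation. The class name FalseNegativeCost, the 5:1 cost ratio, and the assumption that labels are encoded as 0/1 are all made up for this example.

import numpy as np
from evalml.objectives import BinaryClassificationObjective

class FalseNegativeCost(BinaryClassificationObjective):
    """Illustrative objective: a missed positive (false negative) costs 5 units,
    a false alarm (false positive) costs 1 unit. Lower scores are better."""
    name = "False Negative Cost"
    greater_is_better = False
    score_needs_proba = False
    perfect_score = 0.0
    is_bounded_like_percentage = False
    expected_range = [0, float("inf")]

    def objective_function(self, y_true, y_predicted, X=None, **kwargs):
        y_true = np.asarray(y_true)
        y_predicted = np.asarray(y_predicted)
        # per-row cost: 5 for false negatives, 1 for false positives, 0 otherwise
        cost = np.where((y_true == 1) & (y_predicted == 0), 5.0, 0.0)
        cost += np.where((y_true == 0) & (y_predicted == 1), 1.0, 0.0)
        return cost.mean()

An instance of such an objective can then be passed to AutoMLSearch through its objective parameter in place of a built-in objective name.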
Let us dive deeper into some industry applications that will definitely get you closer to understanding EvalML.
EvalML can also run data checks that flag problems with the data before modeling; some of the checks it performs include detecting highly null columns, class imbalance, and target leakage (a small sketch of one such check follows). EvalML supports a wide range of supervised learning problems such as regression, binary classification, and multiclass classification.
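For illustration, here is a minimal, self-contained sketch of running EvalML's class imbalance data check on a toy dataset; the toy data and the 0.2 threshold are made up for this example.

import pandas as pd
from evalml.data_checks import ClassImbalanceDataCheck

# toy data with a 90/10 class split (illustrative only)
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([0] * 90 + [1] * 10)

# warn when the minority class makes up less than 20% of the rows
check = ClassImbalanceDataCheck(threshold=0.2)
results = check.validate(X_demo, y_demo)
print(results)  # messages describing the imbalanced target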
We will now walk through the usage of EvalML for an NLP task and for a regression problem.
Importing the dataset
The data is a spam text-message classification dataset.
from urllib.request import urlopen
import pandas as pd

# load the spam text-message dataset
data = urlopen('https://featurelabsstatic.s3.amazonaws.com/spam_text_messages_modified.csv')
df = pd.read_csv(data)
df.head()
Now separate our data into independent features and dependent features.
X = df.drop('Category', axis=1)
y = df['Category']
The normalized value counts for ham and spam are:
y.value_counts(normalize=True)
ham 0.750084
spam 0.249916
Name: Category, dtype: float64
Now let’s import our AutoML library EvalML.
import evalml
Let's perform a train-test split to create the training set and the test set.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary')
Since our problem is a binary classification problem, we are setting the problem type as “binary”.
EvalML also supports several other problem types besides binary classification, such as multiclass classification, regression, and time series variants of these; the short sketch below lists the problem types available in the installed version.
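If you want to see the full list programmatically, EvalML exposes a ProblemTypes enum; a small sketch (the exact set of values depends on the installed EvalML version):

from evalml.problem_types import ProblemTypes

# print every problem type supported by the installed EvalML version
print([problem_type.value for problem_type in ProblemTypes])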
Let’s check the input data,
X_train.head()
Now let’s import the AutoMLSearch from EvalML and begin the pipeline search.
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary',
                      max_batches=1, optimize_thresholds=True)
automl.search()
Let's look at the scores of the different pipelines:
automl.rankings
So the best pipeline is
best_pipeline = automl.best_pipeline
best_pipeline
Output
GeneratedPipeline(parameters={'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1},})
Let's describe the best pipeline and find out which model is used and what its hyperparameters are.
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
Let's evaluate the pipeline on the test data.
scores = best_pipeline.score(X_test, y_test,
                             objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy : {scores["Accuracy Binary"]}')
Accuracy : 0.9732441471571907
Our model gives good accuracy. The fitted best pipeline can also be used to generate predictions directly, as sketched below.
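As a quick illustration, predictions for the held-out test set can be obtained from the fitted pipeline (the exact return type, a pandas-like Series, depends on the EvalML version):

# predicted labels for the held-out test set
predictions = best_pipeline.predict(X_test)
print(predictions)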
Now let's find the best pipeline for a regression problem using the EvalML library. The dataset we are using here is scikit-learn's Boston house price dataset. So let's import the necessary libraries and the dataset.
Importing necessary libraries and loading dataset
import pandas as pd
import evalml
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2+

data = load_boston()
X = data.data
y = data.target
X = pd.DataFrame(X)
X.head()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='regression')
X_train.head()
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="regression",
                      max_batches=1, optimize_thresholds=True)
automl.search()
The ranking of the different pipelines is:
automl.rankings
So the best pipeline is
best_pipeline = automl.best_pipeline
best_pipeline
Output
GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})
Let's describe the best pipeline and find out which model is used and what its hyperparameters are.
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
This is the best pipeline for our dataset.
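To mirror the evaluation step from the classification example, here is a short sketch of scoring the regression pipeline on the test set with EvalML's core regression objectives; the exact objective names in the printed dictionary depend on the EvalML version.

scores = best_pipeline.score(X_test, y_test,
                             objectives=evalml.objectives.get_core_objectives('regression'))
print(scores)  # core regression objectives such as R2, MAE, and MSE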
We have so far discussed all the basics you need to know about AutoML and EvalML, and we have gone through its application to text classification and regression. Yet there is a lot more to know and explore. Note that EvalML can also be used for multiclass classification, time series analysis, etc. I hope you liked this article!!
Thank You…