Machine Learning is one of the fastest-growing technologies of the modern era. New innovations in ML and AI are made every day, helping the world leap forward. Earlier, a newcomer to the ML field found it difficult to create accurate machine learning models; now there are AutoML libraries that help beginners build an accurate model with much less work involved. Many AutoML libraries take the data as input and provide a good model with better accuracy for that data. In today’s article, we discuss one of the commonly used AutoML libraries: EvalML.
Have you heard of Automated Machine Learning? Well, it’s okay if you haven’t! Automated Machine Learning, or AutoML, is simply the process of automating real-world machine learning tasks. Using AutoML proves to be a great benefit not only in efficiency but also in the quality and accuracy of the resulting ML model. In the future, we can certainly expect more research in Automated Machine Learning (AutoML), and it will play a crucial role in Data Science.
With the automated process in AutoML, we can validate whether a machine learning model is the best one to use or whether it should be replaced with another. Glancing at its industrial applications, we see that AutoML can optimize operations, create business models, and increase product quality, all with the use of advanced insights and analytics, thus providing value to your business. You can even build and operate ML models without deep data science skills. That does not mean it is a tool purely for non-ML experts, though; knowing ML is still a prerequisite.
EvalML is an open-source AutoML library written in Python that automates a large part of the machine learning process and lets us easily evaluate which machine learning pipeline works best for a given dataset. It builds and optimizes ML pipelines using specific objective functions, and it can automatically perform feature selection, model building, hyper-parameter tuning, cross-validation, and more. It also offers a wide range of tools for understanding models. It is complemented by Featuretools, a framework for automated feature engineering, and Compose, a framework for automated prediction engineering.
Run the command below to get EvalML installed on your PC. Note that your PC should have Python version 3.5 or above.

Install via pip (from PyPI):

pip install evalml

Alternatively, if you have a license key, EvalML can be installed through the extra index URL:

pip install evalml --extra-index-url https://install.featurelabs.com/<license>/
Objective functions are what EvalML tries to maximize or minimize during a pipeline search. Since this feedback from the pipelines drives the optimization of the models, it is important to have a suitable objective function. With EvalML we can train and optimize a model for a given problem either by choosing one of the domain-specific objective functions or by defining a custom objective function; you just need to determine the objective of your use case. A minimal sketch of a custom objective is shown below.
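As an illustration, here is a minimal sketch of a custom binary-classification objective, following the subclassing pattern shown in EvalML's documentation. The class name FalseNegativeCost, the 5:1 cost ratio, and the assumption that labels are encoded as 0/1 are all made up for this example.

import numpy as np
from evalml.objectives import BinaryClassificationObjective

class FalseNegativeCost(BinaryClassificationObjective):
    """Illustrative objective: a missed positive (false negative) costs 5 units,
    a false alarm (false positive) costs 1 unit. Lower scores are better."""
    name = "False Negative Cost"
    greater_is_better = False
    score_needs_proba = False
    perfect_score = 0.0
    is_bounded_like_percentage = False
    expected_range = [0, float("inf")]

    def objective_function(self, y_true, y_predicted, X=None, **kwargs):
        y_true = np.asarray(y_true)
        y_predicted = np.asarray(y_predicted)
        # per-row cost: 5 for false negatives, 1 for false positives, 0 otherwise
        cost = np.where((y_true == 1) & (y_predicted == 0), 5.0, 0.0)
        cost += np.where((y_true == 0) & (y_predicted == 1), 1.0, 0.0)
        return cost.mean()

An instance of such an objective can then be passed to AutoMLSearch through its objective parameter in place of a built-in objective name.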
Let us dive deeper into some industry applications that will definitely get you closer to understanding EvalML.
EvalML can also run data checks that flag problems with the data before modeling; some of the checks it performs include detecting highly null columns, class imbalance, and target leakage (a small sketch of one such check follows). EvalML supports a wide range of supervised learning problems such as regression, binary classification, and multiclass classification.
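For illustration, here is a minimal, self-contained sketch of running EvalML's class imbalance data check on a toy dataset; the toy data and the 0.2 threshold are made up for this example.

import pandas as pd
from evalml.data_checks import ClassImbalanceDataCheck

# toy data with a 90/10 class split (illustrative only)
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([0] * 90 + [1] * 10)

# warn when the minority class makes up less than 20% of the rows
check = ClassImbalanceDataCheck(threshold=0.2)
results = check.validate(X_demo, y_demo)
print(results)  # messages describing the imbalanced target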
We will now walk through the usage of EvalML for an NLP task and for a regression problem.
Importing the dataset
The data is a spam text-message classification dataset.
from urllib.request import urlopen
import pandas as pd

# load the spam text-message dataset
data = urlopen('https://featurelabsstatic.s3.amazonaws.com/spam_text_messages_modified.csv')
df = pd.read_csv(data)
df.head()
Now separate our data into independent features and dependent features.
X = df.drop('Category', axis=1)
y = df['Category']
The normalized value counts for ham and spam are:
y.value_counts(normalize=True)
ham 0.750084
spam 0.249916
Name: Category, dtype: float64
Now let’s import our AutoML library EvalML.
import evalml
Let's perform a train-test split to create the training set and the test set.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary')
Since our problem is a binary classification problem, we are setting the problem type as “binary”.
EvalML also supports several other problem types besides binary classification, such as multiclass classification, regression, and time series variants of these; the short sketch below lists the problem types available in the installed version.
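If you want to see the full list programmatically, EvalML exposes a ProblemTypes enum; a small sketch (the exact set of values depends on the installed EvalML version):

from evalml.problem_types import ProblemTypes

# print every problem type supported by the installed EvalML version
print([problem_type.value for problem_type in ProblemTypes])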
Let’s check the input data,
X_train.head()
Now let’s import the AutoMLSearch from EvalML and begin the pipeline search.
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary',
                      max_batches=1, optimize_thresholds=True)
automl.search()
Let's look at the scores of the different pipelines:
automl.rankings
So the best pipeline is
best_pipeline = automl.best_pipeline
best_pipeline
Output
GeneratedPipeline(parameters={'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1},})
Let's describe the best pipeline and find out which model is used and what its hyperparameters are.
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
Let's evaluate the pipeline on the test data.
scores = best_pipeline.score(X_test, y_test,
                             objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy : {scores["Accuracy Binary"]}')
Accuracy : 0.9732441471571907
Our model gives good accuracy. The fitted best pipeline can also be used to generate predictions directly, as sketched below.
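As a quick illustration, predictions for the held-out test set can be obtained from the fitted pipeline (the exact return type, a pandas-like Series, depends on the EvalML version):

# predicted labels for the held-out test set
predictions = best_pipeline.predict(X_test)
print(predictions)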
Now let's find the best pipeline for a regression problem using the EvalML library. The dataset we are using here is scikit-learn's Boston house price dataset. So let's import the necessary libraries and the dataset.
Importing necessary libraries and loading dataset
import pandas as pd
import evalml
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2+

data = load_boston()
X = data.data
y = data.target
X = pd.DataFrame(X)
X.head()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='regression')
X_train.head()
from evalml import AutoMLSearch

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="regression",
                      max_batches=1, optimize_thresholds=True)
automl.search()
The ranking of the different pipelines is:
automl.rankings
So the best pipeline is
best_pipeline = automl.best_pipeline
best_pipeline
Output
GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})
Let's describe the best pipeline and find out which model is used and what its hyperparameters are.
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
This is the best pipeline for our dataset.
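To mirror the evaluation step from the classification example, here is a short sketch of scoring the regression pipeline on the test set with EvalML's core regression objectives; the exact objective names in the printed dictionary depend on the EvalML version.

scores = best_pipeline.score(X_test, y_test,
                             objectives=evalml.objectives.get_core_objectives('regression'))
print(scores)  # core regression objectives such as R2, MAE, and MSE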
We have so far discussed all the basics you need to know about AutoML and EvalML, and we have gone through its application to text classification and regression. Yet there is a lot more to know and explore. Note that EvalML can also be used for multiclass classification, time series analysis, etc. I hope you liked this article!!
Thank You…