Have you ever solved a Machine Learning problem in just one go?
Solving a problem using machine learning isn’t straightforward. It involves a series of steps to arrive at an accurate solution. This sequence of steps for solving an ML problem is known as the ML Pipeline (or ML Cycle).
As shown in the figure, the Machine Learning pipeline consists of different steps like:
Understand Problem Statement, Hypothesis Generation, Exploratory Data Analysis, Data Preprocessing, Feature Engineering, Feature Selection, Model Building, Model Tuning, and Model Deployment.
I would recommend going through the articles below for an in-depth understanding of the Machine Learning pipeline:
The process of solving a machine learning problem demands a lot of time and human effort. Hip Hip Hooray! It’s no longer a tedious and time-consuming process, thanks to AutoML, which provides near-instant solutions to ML problems.
AutoML is all about automatically building a high-performance model with minimal human intervention.
AutoML libraries offer low-code and no-code programming.
You’ve probably heard of the terms “low-code” and “no-code.”
Though no-code platforms make it simple to train a machine learning model using a drag-and-drop interface, they are limited in terms of flexibility. Low-code ML, on the other hand, is the sweet spot: it offers both flexibility and easy-to-use code.
In this article, let us understand how to build a text classification model within a few lines of code using PyCaret, a low-code AutoML library.
PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within a few minutes.
PyCaret is essentially a low-code library that replaces hundreds of lines of scikit-learn code with 5-6 lines. It increases a team’s productivity and helps the team focus on understanding the problem and on feature engineering rather than model optimization.
PyCaret is built on top of the scikit-learn library. As a result, all the machine learning algorithms available in scikit-learn are available in PyCaret. As of now, PyCaret can solve problems related to Classification, Regression, Clustering, Anomaly Detection, Text Classification, Association Rule Mining, and Time Series.
Now, let us discuss the reasons behind using PyCaret.
PyCaret automatically builds a benchmark model for a given dataset within 5-6 lines of code. Let’s see how PyCaret simplifies each step in the machine learning pipeline.
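To give you a flavor, here’s a minimal sketch of PyCaret’s classification workflow; the file path and the target column name 'target' are placeholders, not from this article’s dataset:

```python
# A minimal sketch of PyCaret's classification workflow; the file path
# and the target column name 'target' are placeholders.
import pandas as pd
from pycaret.classification import setup, compare_models

df = pd.read_csv("your_dataset.csv")  # placeholder path

# setup() infers column types and builds the preprocessing pipeline
clf = setup(data=df, target="target", session_id=42)

# compare_models() trains the available algorithms and ranks them by
# cross-validated performance, returning the best-performing model
best_model = compare_models()
```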
Next, we will focus on solving a text classification problem in PyCaret.
Let’s solve a text classification problem in PyCaret using two different techniques: topic modeling and bag of words. I will touch upon each approach in detail.
Topic Modeling, as the name suggests, is a technique to identify the different topics present in text data.
Topics are defined as a repeating group of statistically significant tokens (or words) in a corpus. Here, statistically significant means words that are important within a document; generally, frequently occurring words with higher TF-IDF scores are considered statistically significant.
Topic modeling is an unsupervised technique to automatically find the hidden topics in text data. It can also be thought of as a text mining approach for finding recurring patterns in text documents.
A common use case of topic modeling is as follows:
Let’s say you work for a legal firm engaged by a company where some money has been embezzled, and you know that key information is lying in the emails sent around the company.
As explained earlier, the objective of topic modeling is to extract different topics from the raw text. But, what’s the underlying algorithm to achieve it?
This brings us to the different algorithms/techniques for topic modeling: Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NNMF), and Latent Semantic Analysis (LSA).
I would recommend going through the following resources to read about these algorithms in detail:
Coming to topic modeling itself, it’s a 2-step process:
1. Identify the important topics present in the corpus.
2. Assign each document a score for every topic.
Having understood topic modeling, let’s see how to solve text classification with it, using an example.
Consider a corpus of documents. A topic modeling algorithm such as LDA first identifies the most important topics in the documents, and then assigns each document a score for every topic.
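To make this concrete, here’s a minimal sketch using scikit-learn’s LDA implementation on a toy corpus of my own (not the article’s data):

```python
# A minimal sketch of topic modeling with LDA in scikit-learn;
# the toy corpus below is illustrative, not the article's dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the team won the cricket match",
    "the election results were announced today",
    "the batsman scored a century in the match",
    "the minister addressed the press after the election",
]

# Convert the documents into a document-term frequency matrix
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)

# Fit LDA with 2 topics and score each document against every topic
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(dtm)

# Each row is a document, each column a topic score
print(pd.DataFrame(doc_topic, columns=["topic_1", "topic_2"]))
```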
This document-topic score matrix acts as the feature matrix for a machine learning algorithm. Next, we’ll look at the bag of words approach.
Bag Of Words (BOW) is another popular algorithm for representing text as numbers. It relies on the frequency of words in a document. BOW has numerous applications, such as document classification, topic modeling, and text similarity. In BOW, every document is represented by the frequencies of the words it contains, so a word’s frequency reflects its importance in the document.
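As a quick illustration, here’s a minimal sketch of the BOW representation, built with scikit-learn’s CountVectorizer on two toy sentences of my own:

```python
# A minimal sketch of the Bag Of Words representation;
# the two sentences are toy examples.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the game is fun", "the game is boring and the story is weak"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# Each row is a document, each column a word, each cell a word count
print(pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out()))
```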
Go through the article below for a detailed understanding of Bag Of Words:
In the next section, we will solve the text classification problem in PyCaret.
Let us understand the problem statement prior to solving it.
Understanding Problem Statement
Steam is a video game digital distribution service with a vast global community of gamers. Many gamers write reviews on a game’s page and can choose whether they would recommend the game to others. Determining this sentiment automatically from the review text would help Steam tag reviews extracted from other forums across the internet and better judge the popularity of games.
Given the review text and the user’s recommendation, the task is to predict whether the reviewer recommended the game titles in the test set, on the basis of the review text and other information.
In simpler terms, the task at hand is to identify whether a given user review is good or bad. You can download the dataset from here.
For classifying the Steam game reviews using PyCaret, I’ve discussed 2 different approaches in the article.
We will implement the BOW approach now.
Note: This tutorial is implemented on Google Colab; I would recommend running the code there.
You can install PyCaret just like any other Python library.
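For example, in a Colab notebook cell:

```python
# Install PyCaret from PyPI; the '!' runs a shell command in Colab/Jupyter
!pip install pycaret
```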
As PyCaret doesn’t support a count vectorizer, import CountVectorizer from sklearn.feature_extraction.text.
Then, I initialize a CountVectorizer object named ‘tf_vectorizer’.
What exactly does the fit_transform function do to your data? It learns the vocabulary of the corpus and returns the document-term matrix in a single step. Let’s convert the output of fit_transform to a data frame.
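Putting those steps together, a sketch of the feature extraction might look like this; the file name 'train.csv' and the text column 'user_review' are my assumptions about the dataset’s schema:

```python
# A sketch of the BOW feature extraction; 'train.csv' and the text
# column 'user_review' are assumptions about the dataset's schema.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv("train.csv")

# Initialize the count vectorizer; max_features caps the vocabulary size
tf_vectorizer = CountVectorizer(max_features=5000, stop_words="english")

# fit_transform learns the vocabulary and returns the sparse
# document-term matrix in a single step
tf_features = tf_vectorizer.fit_transform(train["user_review"])

# Convert the sparse matrix into a data frame of word-count features
features_df = pd.DataFrame(
    tf_features.toarray(),
    columns=tf_vectorizer.get_feature_names_out(),
)
```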
Now, concatenate the features and the target along the columns.
Next, we will split the dataset into train and test data.
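Continuing the sketch, with the target column name 'user_suggestion' again being my assumption about the dataset’s schema:

```python
# Continuing the sketch; 'user_suggestion' is my assumption for the
# name of the target column.
from sklearn.model_selection import train_test_split

# Concatenate the BOW features and the target along the columns
data = pd.concat(
    [features_df, train["user_suggestion"].reset_index(drop=True)], axis=1
)

# Split into training and hold-out sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
```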
Now that feature extraction is done, let’s use these features to build different models. The next step is to set up the environment in PyCaret.
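Here’s a minimal sketch of those PyCaret steps, using the same column-name assumptions as above:

```python
# A sketch of the PyCaret steps, with the same column-name assumptions.
from pycaret.classification import setup, create_model, tune_model

# setup() initializes the PyCaret environment and preprocessing pipeline
clf = setup(data=train_data, target="user_suggestion", session_id=42)

# Train a baseline LightGBM model with cross-validation
lightgbm = create_model("lightgbm")

# Tune its hyperparameters over PyCaret's predefined search grid
tuned_lightgbm = tune_model(lightgbm)
```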
From the above output, we can observe that the tuned model’s metrics are better than the base model’s.
Here, I’ve used the tuned model, ‘tuned_lightgbm’, to predict the flag values for our processed dataset.
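A sketch of that prediction step; predict_model returns the data with the prediction columns appended:

```python
# Generate predictions on the hold-out data with the tuned model;
# predict_model returns the data with prediction columns appended
from pycaret.classification import predict_model

predictions = predict_model(tuned_lightgbm, data=test_data)
print(predictions.head())
```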
PyCaret, which trains machine learning models in a low-code environment, piqued my interest. From your preferred notebook environment, PyCaret helps you go from preparing data to deploying models in minutes. Before using PyCaret, I tried other traditional methods to solve the JanataHack NLP hackathon problem, but the results weren’t very satisfactory!
PyCaret has proved to be remarkably fast and efficient in comparison with other open-source machine learning libraries, and it has the added advantage of replacing several lines of code with just a few.
Here’s one more thing to notice: if you skip the first part of my approach, where I applied the count vectorizer embedding to the dataset, and move straight to setting up and creating models with PyCaret, then all the transformations, such as one-hot encoding and imputing missing values, happen behind the scenes automatically, and you still end up with a data frame of predictions, just like the one we got!
I hope I’ve made clear my overall approach for the hackathon.