Welcome to Pywedge – A Fast Guide to Preprocess and Build Baseline Models

Venkatesh Last Updated : 09 Oct, 2020


This article was published as a part of the Data Science Blogathon.

Introduction

The machine learning process involves various stages, such as:

Data Preparation Stage
1. Understanding the data
2. Handling the missing values
3. Handling categorical variables
4. Standardizing the data
5. Handling class imbalance
Modeling Stage
1. Splitting or cross-validating the dataset
2. Choosing the appropriate machine learning algorithm
3. Examining the baseline performance of the chosen algorithm
4. Fine-tuning the model
Prediction Stage
1. Predicting on the hold-out test data
2. Checking for under/overfitting

The above list is obviously not exhaustive, and it can be overwhelming to carry out all of these steps for each & every dataset (barring AutoML!). It is often said that 80% of the time in machine learning is spent on data collection, cleaning & pre-processing, and only 20% on running the model.
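To make the stages above concrete, here is a minimal sketch of doing them by hand with pandas and scikit-learn. The dataframe, its column names and the choice of logistic regression are all made up for illustration; this is not what pywedge runs internally.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one numeric column with a gap,
# one categorical column, and a binary target.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38, 29, 60],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "target": [0, 1, 0, 1, 1, 0, 0, 1],
})

# Data preparation: fill missing values and encode the categorical column.
df["age"] = df["age"].fillna(df["age"].median())
X = pd.get_dummies(df.drop(columns="target"), columns=["city"], dtype=float)
y = df["target"]

# Modeling: split, fit the scaler on the training part only, fit a baseline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Prediction: compare train and test scores to spot under/overfitting.
train_acc = model.score(scaler.transform(X_train), y_train)
test_acc = model.score(scaler.transform(X_test), y_test)
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

Even on a toy dataframe this is a fair amount of code to repeat for every new dataset, which is the pain point described here.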

The Preprocessing chunk!

When I started learning Python & data science, like every data science aspirant, I wanted to try out the machine learning models. Quickly I started working on various datasets. It was very nice participating in hackathons & learning the practical application of machine learning.

But when I wanted to explore the modeling part further, on various types of datasets & with various models, I found that carrying out the pre-processing steps and running baseline models on each dataset was time-consuming. It left less time for exploring modeling techniques, which is where I wanted to spend quality time.

Of course, in this era of exploding AutoML, the complete machine learning task can be done by AutoML. But there is a special kind of feel in building the complete machine learning model yourself!

Also, when running any AutoML tool, it returns various models & the top-performing model’s predictions, but I wasn’t able to extract the cleaned datasets from it to run some deep learning models afterwards. I had to pre-process the data manually before running the deep learning models, and their outputs may not be comparable to the AutoML results because the data was pre-processed separately (unless the exact AutoML pre-processing steps are replicated).

Here comes the idea of a package/library…

Pywedge

Pywedge is a pip-installable Python package that intends to:

Quickly preprocess the data using the user’s preferred choice of pre-processing techniques & return the cleaned datasets to the user in the first step.
In the second step, offer a baseline class with a classification summary method & a regression summary method, each of which can return ten different baseline models and so point the user to the best performing baseline model.

The intention of pywedge is to help the user by quickly preprocessing the data and pointing out the best performing baseline model for the given dataset, so that the user can spend quality time tuning that model.

Without wasting much more of your valuable time, let me dive into pywedge experiments.

Classification using pywedge

Let’s take the cross-sell classification dataset from the Analytics Vidhya hackathon for the example below:

!pip install pywedge
import pywedge as pw
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/taknev83/datasets/master/train_crosssell_classification.csv')
test = pd.read_csv('https://raw.githubusercontent.com/taknev83/datasets/master/test_crosssell_classification.csv')
sample_submission = pd.read_csv('https://raw.githubusercontent.com/taknev83/datasets/master/sample_submission_crosssell_classification.csv')
train.info()


This dataset contains approximately 380k rows, with a mix of numerical & categorical columns.

Instantiate the Pre_process_data class as below,

ppd = pw.Pre_process_data(train, test, c='id', y='Response')

The Pre_process_data class takes the following arguments:

train = train dataframe
test = hold-out test dataframe (without the target column)
c = any redundant column to be removed (such as an ID column; at present only a single column can be removed, and a subsequent version will support removing multiple columns)
y = target column name as a string
type = 'Classification' or 'Regression', as a string (omitted in the example above)

Run the dataframe_clean method under Pre_process_data class as below,

new_X, new_y, new_test = ppd.dataframe_clean()

The dataframe_clean method interactively asks the user to select the preprocessing choice as below,


The existing class balance summary table is provided for the user’s information. Here the classes are imbalanced, so we will select oversampling in the next few steps.

The user is asked to select cat codes or get dummies to convert categorical variables. Let’s select get dummies.
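The difference between the two encoding choices can be sketched with plain pandas. The column name and values below are hypothetical, and the snippet only illustrates the two techniques; it is not pywedge’s internal code.

```python
import pandas as pd

# Toy categorical column (made up for illustration).
df = pd.DataFrame({"fuel": ["petrol", "diesel", "petrol", "electric"]})

# Option 1: category codes. Each category becomes one integer in a single column.
codes = df["fuel"].astype("category").cat.codes

# Option 2: dummy (one-hot) variables. Each category becomes its own 0/1 column.
dummies = pd.get_dummies(df["fuel"], prefix="fuel", dtype=int)

print(codes.tolist())             # [2, 0, 2, 1]
print(dummies.columns.tolist())   # ['fuel_diesel', 'fuel_electric', 'fuel_petrol']
```

Category codes keep the dataframe narrow but impose an arbitrary ordering on the categories, while dummy variables avoid that ordering at the cost of one extra column per category.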

In the next step, it asks which standardization method to use. Let’s select StandardScaler.
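As a quick reminder of what this choice does, here is a small sketch using scikit-learn’s StandardScaler on made-up numbers. It is assumed, not confirmed here, that pywedge applies the same scikit-learn class.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (values made up for illustration).
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_new = np.array([[2.0, 400.0]])

# Learn the mean and standard deviation from the training data only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# ...and reuse the same statistics for any later data, such as the test set.
X_new_scaled = scaler.transform(X_new)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_new_scaled)
```

After scaling, every training column has zero mean and unit variance, so features measured in large units no longer dominate distance-based or gradient-based models.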

In the next step, it asks if we want to apply SMOTE to oversample. Let’s select yes.

Once SMOTE is completed, the dataframe_clean method returns new_X, new_y & new_test.
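For readers curious about what SMOTE does, the following is a simplified sketch of its core idea: new minority-class points are created by interpolating between a minority point and one of its nearest minority neighbours. The function name smote_like_oversample and the toy data are hypothetical, and this is not the code pywedge runs; in practice a maintained implementation such as the SMOTE class in imbalanced-learn is the usual choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is normally its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one random neighbour each
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Imbalanced toy data: 90 majority points and 10 minority points.
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 1.0, size=(10, 2))

# Generate enough synthetic minority points to balance the classes.
X_synth = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, X_synth])
y_bal = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_synth)))

print(np.bincount(y_bal))   # [90 90]
```

Because each synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies rather than being exact duplicates.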

Assign the new_X, new_y & new_test to new variables X, y & so_test for future use.

X = new_X
y = new_y
so_test = new_test

Instantiate the baseline class as below,

blm = pw.baseline_model(X,y)

Call the classification_summary method from baseline_model class as below,

blm.classification_summary()

The classification summary provides the top 10 features by importance (calculated using AdaBoost feature importance).
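A ranking of this kind can be sketched with scikit-learn’s AdaBoostClassifier and its feature_importances_ attribute. The synthetic data, the feature names f0 to f14 and the hyperparameters are assumptions for illustration; pywedge’s exact settings are not shown in this article.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic classification data with hypothetical feature names f0..f14.
X_arr, y = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(15)])

# Fit AdaBoost and rank the features by their importance scores.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)

print(top10)
```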

The classification summary asks the user for the test size. Let’s take 20% as the test size.

Next comes the cool part: the summary of baseline models.

For this baseline model summary, it’s observed that the CatBoost classifier performs well, and the user can explore tuning the hyperparameters of the CatBoost classifier to achieve further refined results. The user can do this hyperparameter tuning separately, with the cleaned dataset received from pywedge.
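The idea behind such a summary can be sketched as a loop over a few scikit-learn classifiers that collects the same metrics for each. The four models, the synthetic data and the column names below are assumptions chosen to keep the example short; pywedge’s own list of ten models and its exact metrics are not reproduced here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a 20% test split, mirroring the test size chosen above.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# A small, hypothetical set of baseline models with default-ish settings.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "KNeighbors": KNeighborsClassifier(),
}

# Fit each model and record its test-set metrics.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    })

# Sort so the strongest baseline appears first.
summary = (
    pd.DataFrame(rows).sort_values("roc_auc", ascending=False).reset_index(drop=True)
)
print(summary)
```

The model at the top of such a table is a natural candidate for the separate hyperparameter tuning mentioned here.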

One quick interesting point here: if we run the same classification_summary method without oversampling, the baseline model results look quite different.

The accuracy seems to be above 80%, but observe the roc_score: all the scores are around 50%, which quickly shows the importance of oversampling in class-imbalanced datasets.
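Why accuracy can look healthy while the ROC score sits near 50% is easy to reproduce on made-up data. The sketch below uses a scikit-learn DummyClassifier that always predicts the majority class; the 90/10 class split is an assumption for illustration, and how pywedge computes its roc_score is not shown in this article.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced toy target: 90 negatives and 10 positives (made up for illustration).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features are irrelevant for this classifier

# A "model" that ignores the features and always predicts the majority class.
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

acc = accuracy_score(y, pred)
auc = roc_auc_score(y, pred)
print(f"accuracy={acc:.2f}, roc_auc={auc:.2f}")   # accuracy=0.90, roc_auc=0.50
```

A model that never predicts the minority class still scores 90% accuracy here, but its ROC AUC of 0.5 reveals that it cannot separate the classes at all.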

Whoa! You read all the way through this, many thanks.

In the same way, regression analysis can be done using a few lines of code, so let me not clutter this blog with more examples.

The code examples are available in my GitHub repo.

Pywedge is in BETA & the following additions are planned:

Handling NLP columns
Handling time series datasets
Handling stock-price-specific analysis
A separate method to produce good charts

Please feel free to pip install pywedge, use it & share your valuable feedback; it will motivate me to fine-tune pywedge. Thanks for reading 🙂

Venkatesh

Jeff

Hello, Appreciate your efforts in trying to make life easier with Pywedge. I took it for a spin and ran into the following error.....

import pywedge as pw
ppd = pw.Pre_process_data(train, test, c='url', y='status', type="Classification")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in ()
      1 import pywedge as pw
----> 2 ppd = pw.Pre_process_data(train, test, c='url', y='status', type="Classification")

AttributeError: module 'pywedge' has no attribute 'Pre_process_data'
