Machine learning has become an essential tool for organizations of all sizes to gain insights and make data-driven decisions. However, the success of ML projects is heavily dependent on the quality of data used to train models. Poor data quality can lead to inaccurate predictions and poor model performance. Understanding the importance of data quality in ML and the various techniques used to ensure high-quality data is crucial.
This article will cover the basics of ML and the importance of data quality in the success of ML models. It will also delve into the ETL pipeline and the techniques used for data cleaning, preprocessing, and feature engineering. By the end of this article, you will have a solid understanding of the importance of data quality in ML and the techniques used to ensure high-quality data, which will help you apply them in real-world projects and improve the performance of your ML models.
Machine learning is a form of artificial intelligence that enables computers to learn and improve based on experience without explicit programming. It plays a crucial role in making predictions, identifying patterns in data, and making decisions without human intervention. This results in a more accurate and efficient system.
Machine learning is an essential part of our lives and is used in applications ranging from virtual assistants to self-driving cars, healthcare, finance, transportation, and e-commerce.
Data is one of the critical components of any machine learning model, and a model's performance always depends on the quality of the data you feed it. Let's examine why data is so essential for machine learning.
We are surrounded by vast amounts of information every day. Tech giants like Amazon, Facebook, and Google collect enormous volumes of data daily. But why are they collecting it? If you have noticed Amazon and Google recommending exactly the products you were looking for, you already know the answer.
Ultimately, data plays an essential role in every machine learning technique. In short, data is the fuel that drives machine learning, and the availability of high-quality data is critical to creating accurate and reliable models. Many data types are used in machine learning, including categorical, numerical, time series, and text data. Data is collected and prepared through an ETL pipeline. What is an ETL pipeline?
Data preparation for machine learning is often organized as an ETL pipeline: extraction, transformation, and loading.
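To make the three stages concrete, here is a minimal, illustrative sketch of an ETL pipeline; the function bodies and the file names (source.csv, clean_data.csv) are placeholders for this example rather than part of any standard API:
import pandas as pd

def extract(path):
    # Extract: read raw data from a source (a CSV file here; it could be a database or an API)
    return pd.read_csv(path)

def transform(df):
    # Transform: clean and reshape the data (dropping missing values as a simple example)
    return df.dropna()

def load(df, path):
    # Load: write the prepared data to its destination
    df.to_csv(path, index=False)

# Run the pipeline end to end
load(transform(extract("source.csv")), "clean_data.csv")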
Here is an example of how we extract data from a CSV file.
Python Code:
import pandas as pd
#read csv file
df = pd.read_csv("data.csv")
#extract specific data
name = df["name"]
age = df["age"]
address = df["address"]
#print extracted data
print("Name:", name)
print("Age:", age)
print("Address:", address)
import json
import pandas as pd
#load json file
with open("data.json", "r") as json_file:
    data = json.load(json_file)
#convert json data to a DataFrame
df = pd.DataFrame(data)
#write to csv
df.to_csv("data.csv", index=False)
Here’s a simple code snippet that shows how we load data using pandas:
import pandas as pd
df = pd.read_csv('data.csv')
After collecting the data, we often perform data injection, which means adding new data to an existing data store. This can be done for various reasons: to update the database with fresh records, to add more diverse data that improves the performance of machine learning models, or to correct errors in the original dataset, usually through automation with some handy tools.
There are three common ways to do this: in batches, in real time, or as a continuous stream.
Here is a code example of how we add new rows to an existing DataFrame with the pandas library (DataFrame.append was removed in recent pandas versions, so pd.concat is used here instead):
import pandas as pd
# Create a DataFrame with an initial row
df = pd.DataFrame([{'Name': 'John', 'Age': 30, 'Country': 'US'}])
# Add another row (pd.concat replaces the removed DataFrame.append)
new_row = pd.DataFrame([{'Name': 'Jane', 'Age': 25, 'Country': 'UK'}])
df = pd.concat([df, new_row], ignore_index=True)
# Print the DataFrame
print(df)
The next stage of the data pipeline is data cleaning.
Data cleaning is the removal or correction of errors in data. This may include dropping missing values, removing duplicates, and managing outliers. Cleaning data is an iterative process, and new insights may require you to go back and make changes. In Python, the pandas library is often used to clean data.
Cleaning data matters for several reasons: errors, duplicates, and extreme values can mislead the model and degrade the accuracy of its predictions.
Here’s code that shows how to drop missing values, remove duplicates, and fill missing values using the pandas library:
# Drop rows with missing values
df = df.dropna()
# Remove duplicate rows
df = df.drop_duplicates()
# Alternatively, fill missing values with a placeholder
df = df.fillna(value=-1)
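Dropping rows or filling with a constant is not always appropriate. A common alternative, shown here as a small sketch assuming a numeric Age column, is to impute missing values with a column statistic:
# Impute missing ages with the column mean instead of a constant
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Or use the median, which is more robust to outliers
df['Age'] = df['Age'].fillna(df['Age'].median())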
Here is another example of how we clean the data by using various techniques:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Sarah', None],
        'Age': [30, 25, 35, 32, None],
        'Country': ['US', 'UK', 'Canada', 'Australia', None]}
df = pd.DataFrame(data)
# Drop missing values
df = df.dropna()
# Remove duplicates
df = df.drop_duplicates()
# Handle outliers
df = df[df['Age'] < 40]
# Print the cleaned DataFrame
print(df)
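The fixed age threshold above is only for illustration. A more general way to manage outliers, sketched here for the same Age column, is to filter on the interquartile range:
# Compute the interquartile range (IQR) of the Age column
q1, q3 = df['Age'].quantile(0.25), df['Age'].quantile(0.75)
iqr = q3 - q1
# Keep only rows whose Age lies within 1.5 * IQR of the quartiles
df = df[(df['Age'] >= q1 - 1.5 * iqr) & (df['Age'] <= q3 + 1.5 * iqr)]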
It’s also good to clearly understand the data and the features before applying any cleaning methods, and to test the model’s performance after cleaning the data.
The third stage of the data pipeline is data pre-processing.
Data processing is preparing data for use in machine learning models. This is an essential step in machine learning because it ensures that the data is in a format that the model can use and that any errors or inconsistencies are resolved.
Data processing usually involves a combination of data cleaning, data transformation, and data standardization. The specific steps depend on the type of data and the machine learning model you are using, but they typically include encoding categorical variables, scaling numerical features, and resolving inconsistent formats.
Data processing is essential in machine learning because it ensures that the data is in a form the model can use and that any errors or inconsistencies are removed. This improves the model’s performance and accuracy of the prediction.
Here is some simple code that shows how to use the LabelEncoder class to convert categorical variables to numeric values and the MinMaxScaler class to scale numeric variables.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, LabelEncoder
# Create a sample DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Sarah'],
        'Age': [30, 25, 35, 32],
        'Country': ['US', 'UK', 'Canada', 'Australia'],
        'Gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
# Convert categorical variables to numerical
encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])
# One hot encoding
onehot_encoder = OneHotEncoder()
country_encoded = onehot_encoder.fit_transform(df[['Country']])
df = pd.concat([df, pd.DataFrame(country_encoded.toarray())], axis=1)
df = df.drop(['Country'], axis=1)
# Scale numerical variables
scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
# Print the preprocessed DataFrame
print(df)
The final stage of the data pipeline is feature engineering.
Feature engineering transforms raw data into features that can be used as input for machine learning models. This involves identifying and extracting the most critical information from the raw data and converting it into a format the model can use. Feature engineering is essential in machine learning because it can significantly impact model performance.
Different techniques can be used for feature engineering, such as encoding, scaling, binning continuous variables, and creating interaction or aggregate features.
Feature engineering requires a good understanding of your data, the problem to be solved, and the machine learning algorithms to use. This process is iterative and experimental and may require several iterations to find the optimal feature set that improves the performance of our model.
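As a small illustration, here is a sketch of two common techniques using pandas; the Income column and the bin edges are made-up values for this example, not part of any dataset used earlier:
import pandas as pd

df = pd.DataFrame({'Age': [30, 25, 35, 32],
                   'Income': [40000, 32000, 55000, 48000]})
# Bin a continuous variable into categories
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 30, 45, 100], labels=['young', 'middle', 'senior'])
# Create a ratio feature from two existing columns
df['IncomePerYearOfAge'] = df['Income'] / df['Age']
print(df)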
Here is an example of a complete ETL pipeline using the pandas and scikit-learn libraries:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, LabelEncoder
# Extract data from CSV file
df = pd.read_csv('data.csv')
# Data cleaning
df = df.dropna()
df = df.drop_duplicates()
# Data transformation
encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])
onehot_encoder = OneHotEncoder()
country_encoded = onehot_encoder.fit_transform(df[['Country']])
df = pd.concat([df, pd.DataFrame(country_encoded.toarray())], axis=1)
df = df.drop(['Country'], axis=1)
scaler = MinMaxScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
# Load data into a new CSV file
df.to_csv('cleaned_data.csv', index=False)
In this example, the data is first retrieved from a CSV file using the pandas read_csv() function. Data cleaning is then done by removing missing values and duplicates. The transformation step uses LabelEncoder to convert categorical variables to numeric values, OneHotEncoder to one-hot encode the Country column, and MinMaxScaler to scale the numerical Age column. Finally, the cleaned data is written to a new CSV file using the pandas to_csv() function.
Note that this example is a very simplified version of an ETL pipeline. In a real scenario, the pipeline may be more complex and involve additional steps, such as data validation, feature selection, and model-specific transformations. In addition, data traceability is essential: tracking the origin of the data, its changes, and where it is stored. This not only helps you understand the quality of your data but also helps you debug and review your pipeline. It is also important to clearly understand the data and features before applying preprocessing methods and to check the model’s performance afterwards.
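As a minimal sketch of such traceability (the log_stage helper and the lineage.log file name are illustrative assumptions, not a standard tool), each pipeline stage can record where the data came from and what shape it had:
import json
from datetime import datetime

def log_stage(stage, df, source, log_path="lineage.log"):
    # Record what happened to the data at this stage of the pipeline
    entry = {
        "stage": stage,
        "source": source,
        "rows": len(df),
        "columns": list(df.columns),
        "timestamp": datetime.now().isoformat(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Example: log the cleaned data produced by the pipeline above
log_stage("cleaning", df, source="data.csv")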
Data quality is critical to the success of machine learning models. By taking care of every step of the process, from data collection to cleaning, processing, and validation, you can ensure that your data is of the highest quality. This allows your model to make more accurate predictions, leading to better results and successful machine learning projects.
Now you know the importance of data quality in machine learning. Here are some of the key takeaways from this article:
Key Takeaways
High-quality data is the foundation of accurate and reliable machine learning models.
The ETL pipeline (extract, transform, load) structures how data is collected, prepared, and stored.
Data cleaning removes missing values, duplicates, and outliers before modeling.
Preprocessing converts categorical variables to numbers and scales numerical features into a model-ready format.
Feature engineering transforms raw data into features that can significantly improve model performance.
Thanks for reading! Want to share something not mentioned above? Thoughts? Feel free to comment below.