Python Treatment for Outliers in Data Science

Shanthababu Pandian Last Updated : 24 Oct, 2024

6 min read

This article was published as a part of the Data Science Blogathon.

Before we get started the discussion on Outliers, we should understand exactly what Feature Improvements mean under Feature Engineering.

What is Feature Engineering?

When we have a LOT OF FEATURES in the given dataset, feature engineering can become quite a challenging and interesting module.
The number of features could significantly impact the model considerably, So that feature engineering is an important task in the Data Science life cycle.

Feature Improvements

In the Feature Engineering family, we are having many key factors are there, let’s discuss Outlier here. This is one of the interesting topics and easy to understand in Layman’s terms.

Outlier

An outlier is an observation of a data point that lies an abnormal distance from other values in a given population. (odd man out)
- Like in the following data point (Age)
  - 18,22,45,67,89,125,30
An outlier is an object(s) that deviates significantly from the rest of the object collection.
- List of Cities
  - New York, Las Angles, London, France, Delhi, Chennai
It is an abnormal observation during the Data Analysis stage, that data point lies far away from other values.
- List of Animals
  - cat, fox, rabbit, fish
An outlier is an observation that diverges from well-structured data.
The root cause for the Outlier can be an error in measurement or data collection error.
Quick ways to handling Outliers.
- Outliers can either be a mistake or just variance. (As mentioned, examples)
- If we found this is due to a mistake, then we can ignore them.
- If we found this is due to variance, in the data, we can work on this.

In the picture of the Apples, we can find the out man out?? Is it? Hope can Yes!

But the huge list of a given feature/column from the .csv file could be a really challenging one for naked eyes.

First and foremost, the best way to find the Outliers are in the feature is the visualization method.

What are the Possibilities for an Outlier?

Of course! It would be below quick reasons.

Incorrect data entry or error during data processing
Missing values in a dataset.
Data did not come from the intended sample.
Errors occur during experiments.
Not an errored, it would be unusual from the original.
Extreme distribution than normal.

That’s fine, but you might have questions about Outlier if you’re a real lover of Data Analytics, Data mining, and Data Science point of view.

Let’s have a quick discussion on those.

Understand more about Outlier

Outliers tell us that the observations of the given data set, how the data point(s) differ significantly from the overall perspective. Simply saying odd one/many. this would be an error during data collection.
Generally, Outliers affect statistical results while doing the EDA process, we could say a quick example is the MEAN and MODE of a given set of data set, which will be misleading that the data values would be higher than they really are.
the CORRELATION COEFFICIENT is highly sensitive to outliers. Since it measures the strength of a linear relationship between
two variables. the relationship dependent of the data. correlation is a non-resistant measure and r (correlation coefficient) is strongly affected by
outliers.
- Positive Relationship
  - When the correlation coefficient is closer to value 1
- Negative Relationship
  - When the correlation coefficient is closer to value -1
- Independent
  - When X and Y are independent, then the correlation coefficient is close to zero (0)
We could understand the data collection process from the Outliers and its observations. An analysis of how it occurs and how to minimize and set the process in future data collection guidelines.
Even though the Outliers increase the inconsistent results in your dataset during analysis and the power of statistical impacts significant, there would challenge and roadblocks to remove them in few situations.
DO or DO NOT (Drop Outlier)
- Before dropping the Outliers, we must analyze the dataset with and without outliers and understand better the impact of the results.
- If you observed that it is obvious due to incorrectly entered or measured, certainly you can drop the outlier. No issues on that case.
- If you find that your assumptions are getting affected, you may drop the outlier straight away, provided that no changes in the results.
- If the outlier affects your assumptions and results. No questions simply drop the outlier and proceed with your further steps.

Finding Outliers

So far we have discussed what is Outliers, how it affects the given dataset, and Either can we drop them or NOT. Let see now how to find from the given dataset. Are you ready!

We will look at simple methods first, Univariate and Multivariate analysis.

Univariate method: I believe you’re familiar with Univariate analysis, playing around one variable/feature from the given data set. Here to look at the Outlier we’re going to apply the BOX plot to understand the nature of the Outlier and where it is exactly.

Let see some sample code. Just I am taking titanic.csv as a sample for my analysis, here I am considering age for my analysis.

plt.figure(figsize=(5,5))
sns.boxplot(y='age',data=df_titanic)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_titanic = pd.read_csv('titanic.csv')
plt.figure(figsize=(5,5))
sns.boxplot(y='Age',data=df_titanic)
plt.show()

You can see the outliers on the top portion of the box plot visually in the form of dots.

Multivariate method: Just I am taking titanic.csv as a sample for my analysis, here I am considering age and passenger class for my analysis.

plt.figure(figsize=(8,5))
sns.boxplot(x='pclass',y='age',data=df_titanic)

We can very well use Histogram and Scatter Plot visualization technique to identify the outliers.

On top of this, we have with
mathematically to find the Outliers as follows Z-Score and Inter Quartile Range (IQR) Score methods

Z-Score method: In which the distribution of data in the form mean is 0 and the standard deviation (SD) is 1 as Normal Distribution format.

Let’s consider below the age group of kids, which was collected during data science life cycle stage one, and proceed for analysis, before going into further analysis, Data scientist wants to remove outliers. Look at code and output, we could understand the essence of finding outliers using the Z-score method.

import numpy as np  
kids_age = [1, 2, 4, 8, 3, 8, 11, 15, 12, 6, 6, 3, 6, 7, 12,9,5,5,7,10,10,11,13,14,14] 
mean = np.mean(voting_age) 
std = np.std(voting_age) 
print('Mean of the kid''s age in the given series :', mean) 
print('STD Deviation of kid''s age in the given series :', std)
threshold = 3
outlier = [] 
for i in voting_age: 
    z = (i-mean)/std 
    if z > threshold: 
        outlier.append(i) 
print('Outlier in the dataset is (Teen agers):', outlier)

Output

Mean of the kid’s age in the given series: 2.6666666666666665
STD Deviation of kids age in the given series: 3.3598941782277745
The outlier in the dataset is (Teenagers): [15]

(IQR) Score method: In which data has been divided into quartiles (Q1, Q2, and Q3). Please refer to the picture Outliers Scaling above. Ranges as below.

25th percentile of the data – Q1
50th percentile of the data – Q2
75th percentile of the data – Q3

Let’s have the junior boxing weight category series from the given data set and will figure out the outliers.

import numpy as np  
import seaborn as sns
# jr_boxing_weight_categories
jr_boxing_weight_categories = [25,30,35,40,45,50,45,35,50,60,120,150] 
Q1 = np.percentile(jr_boxing_weight_categories, 25, interpolation = 'midpoint')
Q2 = np.percentile(jr_boxing_weight_categories, 50, interpolation = 'midpoint')  
Q3 = np.percentile(jr_boxing_weight_categories, 75, interpolation = 'midpoint')

IQR = Q3 - Q1
print('Interquartile range is', IQR)
low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 * IQR
print('low_limit is', low_lim)
print('up_limit is', up_lim)
outlier =[]
for x in jr_boxing_weight_categories:
    if ((x> up_lim) or (x<low_lim)):
         outlier.append(x)
print(' outlier in the dataset is', outlier)

Output

Interquartile range is 20.0
low_limit is 5.0
up_limit is 85.0
the outlier in the dataset is [120, 150]

sns.boxplot(jr_boxing_weight_categories)

Loot at the boxplot we could understand where the outliers are sitting in the plot.

So far, we have discussed what is Outliers, how it looks like, Outliers are good or bad for data set, how to visualize using matplotlib /seaborn and stats methods.

Now, will conclude correcting or removing the outliers and taking appropriate decision. we can use the same Z- score and (IQR) Score with the condition we can correct or remove the outliers on-demand basis. because as mentioned earlier Outliers are not errors, it would be unusual from the original.

Hope this article helps you to understand the Outliers in the zoomed view in all aspects. let’s come up with another topic shortly. until then bye for now! Thanks for reading! Cheers!!

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.

Shanthababu Pandian

Shanthababu Pandian has over 23 years of IT experience, specializing in data architecting, engineering, analytics, DQ&G, data science, ML, and Gen AI. He holds a BE in electronics and communication engineering and three Master’s degrees (M.Tech, MBA, M.S.) from a prestigious Indian university. He has completed postgraduate programs in AIML from the University of Texas and data science from IIT Guwahati. He is a director of data and AI in London, UK, leading data-driven transformation programs focusing on team building and nurturing AIML and Gen AI. He helps global clients achieve business value through scalable data engineering and AI technologies. He is also a national and international speaker, author, technical reviewer, and blogger.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Python Treatment for Outliers in Data Science

What is Feature Engineering?

Feature Improvements

Outlier

What are the Possibilities for an Outlier?

Understand more about Outlier

Finding Outliers

Output

Output

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect