Fighting Data Bias – Everyone’s Responsibility

Last Updated: 30 Mar, 2021
This article was published as a part of the Data Science Blogathon.

What is Bias?

Most of you have probably watched, or at least heard of, the popular Netflix series ‘The Queen’s Gambit’. The series excellently captures the struggles of women in society and offers one of the best examples of gender bias.

Picture credit: Phil Bray/Netflix

We all know that society has been biased for ages. Bias based on gender, race, age, socioeconomic status, etc. has consciously or unconsciously been part of human thoughts and actions. In modern society, with the help of rising awareness, most of us are coming forward to fight the discrimination and prejudice that affect human decision-making.

But what about the decision-making done by intelligent systems and applications that are increasingly becoming an inevitable part of our lives?

These intelligent applications are built on data supplied by humans. When bias is present in human thoughts and actions, it is no surprise that the intelligent applications we develop inherit this bias from us.

 

What is Data Bias?

Consider an NLP application that completes an analogy as ‘Father is to doctor as mother is to nurse’.

The above NLP example is directly linked to the gender inequality present in society.
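Analogies like this can be probed directly in pretrained word embeddings. Below is a minimal sketch using gensim’s downloader API; the choice of the ‘glove-wiki-gigaword-100’ vectors is an assumption, and the exact top-ranked words depend on the embedding used.

```python
# A minimal sketch: probe analogy-style gender associations in pretrained
# word embeddings (requires internet access to download the vectors).
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

# "father" is to "doctor" as "mother" is to ... ?
# Vector arithmetic: doctor - father + mother
candidates = model.most_similar(positive=["doctor", "mother"],
                                negative=["father"], topn=5)
for word, score in candidates:
    print(f"{word}: {score:.3f}")
# Biased embeddings often rank "nurse" among the top answers.
```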

Consider more examples:

Why was one of the popular AI-based recruitment tools biased against women applicants?

Why did Siri and Alexa show gender bias initially?

Why, as many reports show, do image processing applications fail to recognize women, especially dark-skinned women?

Why did an AI-based decision support application show bias against people belonging to a particular race?

Why is there bias in the output of these ML/AI applications?

Because the Machine Learning/AI applications that we design learn from the data that we feed them, and that data, consciously or unconsciously, carries the prejudices and inequalities that exist in the human world.

We are racing to build smart cities and smart buildings using the advancements in technology. What happens when an automatic door opener in your office fails to recognize a person because of their skin color?

 

How serious are the implications of neglecting bias in the data?

As Data Scientists, Data Analysts, Machine Learning Engineers, and AI practitioners, we know that if our data sample does not represent the whole population, our results will not generalize to that population, which means we do not get accurate results for the groups the sample misses.

Machine Learning models built on such data perform worse on underrepresented groups.

Picture source: https://link.springer.com/article/10.1007/s13555-020-00372-0

Consider an example from healthcare, a critical domain where data bias can have devastating results.

AI algorithms developed to detect skin cancer as accurately as an experienced dermatologist failed to detect skin cancers in people with dark skin. Refer to the picture shown above.

 

Why did this happen?

Because the dataset was imbalanced. The majority of the images on which the algorithms were trained belonged to light-skinned individuals, since the training data was collected from regions where most people are light-skinned. Hence, the algorithms failed to detect the disease when given images of dark-skinned people.
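One common mitigation is to reweight the training data so the underrepresented group is not drowned out. The sketch below is a hypothetical illustration, not the approach used in the study above: the `skin_tone` group labels, the toy data, and the choice of classifier are all assumptions.

```python
# A minimal sketch of reweighting an imbalanced dataset by group frequency.
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.linear_model import LogisticRegression

# Toy data: 950 light-skinned vs. 50 dark-skinned examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                 # stand-in image features
y = rng.integers(0, 2, size=1000)              # 1 = malignant, 0 = benign
skin_tone = np.array(["light"] * 950 + ["dark"] * 50)

# Weight each sample inversely to its group's frequency so the minority
# group contributes comparably to the training loss.
weights = compute_sample_weight(class_weight="balanced", y=skin_tone)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=weights)
```

Reweighting is only a partial fix; collecting more representative images remains the better remedy.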

Another AI application, developed to identify the early stages of Alzheimer’s disease, administers auditory tests: it takes the way a person speaks as input and analyzes that data to identify the disease. But because the algorithm was trained on data from native English speakers, when a non-native English speaker took the test, it wrongly identified their pauses and mispronunciations as indicators of the disease (an example of a false positive).

 

Picture source: https://www.onartificialintelligence.com/articles/18060/new-findings-on-human-speech-recognition

 

What are the consequences of a wrong diagnosis in the two examples above?

Where in the development process have we gone wrong?

 

How can AI bias occur?

There are multiple factors behind these AI biases. There is no single root cause.

1. Missing diverse demographic categories.

Picture Source: https://www2.stat.duke.edu/courses

Sampling errors are also largely the result of improper data collection methods.

Datasets that do not include diverse demographic categories will be imbalanced or skewed, and such gaps are easy to overlook during the data cleaning phase. A quick audit, as sketched below, can catch them early.
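This is a minimal sketch of such an audit; the file name and the `gender` and `age_group` columns are hypothetical.

```python
# A minimal sketch: audit demographic representation before training.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("applicants.csv")  # hypothetical dataset

# Inspect how each demographic group is represented.
for col in ["gender", "age_group"]:
    print(df[col].value_counts(normalize=True), "\n")

# Stratify the train/test split so the evaluation data keeps the same
# demographic mix as the full dataset.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["gender"], random_state=42
)
```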

2. Bias inherited from humans.

As discussed above, in supervised learning, bias can be introduced into the data during labeling by humans, most of the time unintentionally, because unconscious bias is present in humans. As this data teaches the AI algorithm how to analyze inputs and make predictions, the bias carries over into the output.
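One way to surface labeling bias is to have multiple annotators label the same samples and measure their agreement; low agreement flags items where human judgment, and potentially human bias, diverges. Here is a minimal sketch using Cohen’s kappa (the annotator labels below are made up):

```python
# A minimal sketch: measure agreement between two hypothetical annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1.0 warrant a label review
```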

3. During the feature engineering phase

Bias can also be introduced during the feature engineering phase.

For example, while developing an ML application for predicting loan approvals, including features like race and gender would induce bias.

On the contrary, while developing an AI application for healthcare, removing those same features, like race and gender, from the dataset would result in the errors explained in the healthcare examples above.
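For the loan example, a common first step is to keep sensitive attributes out of the model’s inputs while retaining them for auditing. The dataframe and column names below are hypothetical, and note that proxy features can reintroduce the bias even after these columns are dropped.

```python
# A minimal sketch for a hypothetical loan-approval dataset: exclude
# sensitive attributes from the model features, but keep them for audits.
import pandas as pd

df = pd.read_csv("loan_applications.csv")      # hypothetical file
SENSITIVE = ["race", "gender"]

X = df.drop(columns=SENSITIVE + ["approved"])  # model inputs
y = df["approved"]                             # target
audit_groups = df[SENSITIVE]                   # kept for per-group evaluation

# Caution: proxy features (e.g., zip code) can still encode race or gender,
# so dropping the columns alone does not guarantee fairness.
```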

Research on handling AI Bias

AI is widely used not only in popular domains but also in very sensitive domains like healthcare and criminal justice. Hence, the debate on biased data and fairness of the output is always on in the data and AI communities.

A great deal of research and study is going on to identify how bias is induced into AI systems and how to handle it to reduce errors. Responsible AI and ethical AI are also being widely adopted to tackle the problem of bias along with other AI challenges.

Are we not responsible for reducing this data bias?

One of the primary goals of using AI in decision support systems should be to make decisions less biased than humans.

Should we leave this biased-data problem to the researchers and carry on with our regular data cleaning tasks, trying only to improve the accuracy of our algorithms as part of our development work?

As Artificial Intelligence grows deeper and deeper into our lives, bias in the data used to develop these applications can have serious implications not only for human life but for the entire planet.

Hence, it is everyone’s responsibility to work towards identifying and handling bias at the early stages of development.

What is our part in reducing data bias?

Every Machine Learning engineer and AI practitioner has to take responsibility for identifying and removing bias while developing artificially intelligent applications.

Here are some of the steps we can consider to take this forward.

We should not blindly build and develop applications with whatever data is available to us.

We need to work with researchers, too, to ensure that diverse data is available for our model development.

During the data collection phase, we have to gain enough domain knowledge of the problem we are working on to be able to assess whether the data collected includes diverse groups and where bias might creep in.

During the feature engineering phase, we should study the features in depth, combined with more research on the problem domain we are working in, to eliminate any features that may induce bias.

Explainable AI and interpretable AI also help us build trust in algorithms by ensuring fairness, inclusivity, transparency, and reliability.

Testing and evaluating models carefully by measuring accuracy levels for different demographic categories and sensitive groups may also help in reducing algorithmic bias (see the sketch after this list).

Finally, we must also ensure that topics related to AI bias and its implications are included in every data course, because handling data bias and saving the world from its adverse effects is the responsibility of every data enthusiast.
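The per-group evaluation mentioned in the list above can be as simple as slicing the test set by group and comparing accuracy. This is a minimal sketch; the fitted classifier `clf`, the test arrays, and the group labels are all assumptions.

```python
# A minimal sketch of per-group evaluation, assuming a fitted classifier
# `clf`, test data `X_test`/`y_test`, and a parallel array of group labels.
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy_by_group(clf, X_test, y_test, groups):
    """Report accuracy separately for each demographic group."""
    preds = clf.predict(X_test)
    for g in np.unique(groups):
        mask = groups == g
        acc = accuracy_score(y_test[mask], preds[mask])
        print(f"{g}: accuracy = {acc:.3f} ({mask.sum()} samples)")

# Large gaps between groups signal that the model underperforms on
# underrepresented populations and needs rebalancing or reweighting.
```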

 

About Me:

Prasuna Atreyapurapu:

I am a data lover | Big Data consultant | Data Engineer | Data Science and Machine Learning trainer | Mentor | Curriculum Strategist and Content Developer.

https://www.linkedin.com/in/prasuna-atreyapurapu-a954064/

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
