Breast cancer is a serious medical condition that affects millions of women worldwide. Although the medical field has made real progress in diagnosing and treating breast cancer, spotting it at an early stage remains difficult. Anomaly detection can surface tiny yet vital patterns in breast cancer data that might not be visible to the naked eye. By increasing the accuracy of screening methods, we can potentially save many lives and help patients beat breast cancer. In this era of computer-assisted healthcare, anomaly detection is a powerful tool that can change how we approach breast cancer screening and treatment.
In this article, we will explore anomaly detection in breast cancer data: loading and preprocessing the dataset, visualizing it, and building an Isolation Forest model to flag unusual data points.
Breast cancer occurs when breast cells grow uncontrollably and can be found in various parts of the breast. It can metastasize by spreading through blood vessels and lymph vessels to other areas of the body.
Ignoring cancer symptoms or delaying treatment lowers the chance of survival: complications multiply, treatment at later stages may no longer work, and healthcare costs rise sharply. Early treatment greatly improves the odds of overcoming the disease, which is why it is important to catch breast cancer at the earliest possible stage.
There are several types of breast cancer; common ones include ductal carcinoma in situ (DCIS), invasive ductal carcinoma, invasive lobular carcinoma, and inflammatory breast cancer.
For the diagnosis of breast cancer, imaging and tissue sampling are typically combined. Mammography is one of the most widely used ways to screen for breast cancer, and a biopsy is used to confirm a diagnosis. MRI (magnetic resonance imaging) is another valuable method, often recommended for patients at high risk of breast cancer.
Many machine learning algorithms can be used to detect breast cancer, including SVMs, decision trees, and neural networks. Using these algorithms, we can predict cancer at an early stage, which helps slow the spread of the disease and increases the probability of saving the patient's life; a quick baseline sketch appears after the dataset description below.
The data set used for this project is sourced from the UCI Machine Learning Repository, containing 569 instances of breast cancer and 30 attributes. Interested readers may download the data set by clicking on the following link: here. Alternatively, the data set is available in the scikit-learn library, a popular machine-learning library for Python. By working through this blog, readers will gain a better understanding of the complexities involved in detecting anomalies in breast cancer data and how to effectively use the data set for machine learning purposes.
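Since the data set also ships with scikit-learn, a minimal sketch like the following (an illustration with assumed parameter choices, not the article's pipeline) loads it directly and fits a quick SVM baseline of the kind mentioned above:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the same 569-sample, 30-feature data set bundled with scikit-learn
data = load_breast_cancer()
print(data.data.shape)  # (569, 30)

# Hold out 30% of the data and fit a scaled SVM classifier
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.3, random_state=42)
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_tr, y_tr)
print('Test accuracy:', clf.score(X_te, y_te))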
The goal of the project is to understand the data and detect irregular occurrences within the breast cancer measurements. We will use scikit-learn's Isolation Forest implementation in Python to build and train a model that finds anomalous data points in the dataset. Ultimately, we will interpret the results and draw meaningful conclusions from the data.
The project pipeline includes the following steps: importing the libraries, loading and exploring the data, preprocessing (handling missing values and encoding), visualizing the data, selecting features, and building and evaluating the anomaly detection model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data.csv')
df.head(5)
df.columns
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'], dtype='object')
print('length of data is', len(df))
length of data is 569
df.shape
(569, 33)
df.info()
df.dtypes
np.sum(df.isnull().any(axis=1))
0
print('Count of columns in the data is: ', len(df.columns))
print('Count of rows in the data is: ', len(df))
Count of columns in the data is: 33
Count of rows in the data is: 569
df['diagnosis'].unique()
array(['M', 'B'], dtype=object)
df['diagnosis'].nunique()
2
Handling missing values is one of the most important preprocessing steps when a dataset contains them. Missing values can cause errors in the program, or simply mean the data was never recorded in the first place, and there are many techniques to deal with them depending on the nature of the data.
No single technique is always suitable for handling missing values. In some cases we drop a row or column when its missing values are too numerous, irrelevant to the data, or unlikely to be useful in building a model. Here we will use the isnull() function to find the missing values; a sketch of other common strategies follows the check below.
def null_values(data):
    null_values = data.isnull().sum()
    null_values = null_values[null_values > 0]
    null_values.sort_values(inplace=True)
    print(null_values)

null_values(df)
Series([ ], dtype: int64)
All values in the data are present.
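This data turned out to be complete, but had there been gaps, a minimal sketch like the following (illustrative choices of columns and fill values, not steps from this project) shows the drop/fill strategies mentioned above:
# Drop rows that contain any missing value (use axis=1 to drop columns instead)
df_dropped = df.dropna()
# Or fill gaps in a numeric column with its mean
df_filled = df.fillna({'radius_mean': df['radius_mean'].mean()})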
In the data pre-processing phase, the next step is encoding the data into a form suitable for model building. This involves converting categorical variables into numerical form (i.e., changing the data type of a variable from object to int64), scaling the data into a standard range, or applying any other transformations needed to create a clean dataset. In this project-based blog, we will use the LabelEncoder method from the sklearn.preprocessing library to convert categorical variables into numerical ones so they can be used to train the model.
To elaborate, encoding the data matters even for visualization: many plots cannot interpret categorical variables because they are based on numerical calculations. Although we use the LabelEncoder method in this project-based blog, methods like one-hot encoding or binary encoding can also be used, depending on the needs of the model.
Scaling the data to a standard range is also necessary to ensure the variables are weighted equally and the model is not biased towards one particular feature. This can be achieved using methods such as standardization or normalization, sketched after the encoding step below.
In the code below, we first import LabelEncoder from sklearn.preprocessing and create an object of that class. We then call its fit_transform function to transform the specified variable into a numerical datatype.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['diagnosis'] = le.fit_transform(df['diagnosis'])
df.head()
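For completeness, here is a hedged sketch of the alternatives mentioned above (standard scaling and one-hot encoding); these are illustrations on assumed columns, not steps used later in this project:
from sklearn.preprocessing import StandardScaler

# Standardize selected numeric features to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['radius_mean', 'texture_mean']])

# One-hot encode a column with pandas (illustrative; diagnosis is already numeric here)
one_hot = pd.get_dummies(df['diagnosis'], prefix='diagnosis')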
To understand the data and its anomalies better, we will try different types of visualizations: scatter plots, histograms, box plots, and more. These help us identify outliers and patterns in the data that are not obvious from the raw values, which in turn helps us construct an effective anomaly detection model; a small box plot sketch follows this paragraph.
In addition, we can use techniques such as clustering or regression analysis to analyze the data further and understand its various properties. In general, our main objective is to build a reliable model that can accurately detect and flag unusual or unexpected patterns in the data, helping us find issues before they cause major harm or disrupt operations.
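As one example of the box plots mentioned above (a hedged sketch, not a plot from the original article), points beyond the whiskers are candidate outliers:
# Box plot of radius_mean split by diagnosis
plt.figure(figsize=(8, 6))
sns.boxplot(x='diagnosis', y='radius_mean', data=df)
plt.title('radius_mean by Diagnosis')
plt.show()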
#Number of Malignant(M) and Benign(B) cells
plt.figure(figsize=(8, 6))
sns.countplot(x='diagnosis', data=df, palette=['#FFC0CB', '#ADD8E6'],
              edgecolor='black', linewidth=1.5)
plt.title('Diagnosis Count', fontsize=20, fontweight='bold')
plt.xlabel('Diagnosis', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Annotate each bar with its count
ax = plt.gca()
for patch in ax.patches:
    plt.text(x=patch.get_x() + 0.4, y=patch.get_height() + 2,
             s=str(int(patch.get_height())), fontsize=12)
plt.show()
plt.figure(figsize=(25,15))
sns.heatmap(df.corr(),annot=True, cmap='coolwarm')
Kernel Density Estimation Plot showing the distribution of ‘radius_mean’ among benign and malignant tumors in a breast cancer dataset
def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, shade=True)
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()
    plt.show()

plot_distribution(df, var='radius_mean', target='diagnosis')
Scatter Plot showcasing the relationship between ‘radius_mean’ and ‘texture_mean’ in benign and malignant tumors of a breast cancer dataset.
def plot_scatter(df, var1, var2, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(plt.scatter, var1, var2, alpha=0.5)
    facet.add_legend()
    plt.show()

plot_scatter(df, var1='radius_mean', var2='texture_mean', target='diagnosis')
import plotly.express as px
fig = px.parallel_coordinates(df, dimensions=['radius_mean', 'texture_mean', 'perimeter_mean',
                                              'area_mean', 'smoothness_mean', 'compactness_mean',
                                              'concavity_mean', 'concave points_mean', 'symmetry_mean',
                                              'fractal_dimension_mean'],
                              color='diagnosis', color_continuous_scale=px.colors.sequential.Plasma,
                              labels={'radius_mean': 'Radius Mean', 'texture_mean': 'Texture Mean',
                                      'perimeter_mean': 'Perimeter Mean', 'area_mean': 'Area Mean',
                                      'smoothness_mean': 'Smoothness Mean', 'compactness_mean': 'Compactness Mean',
                                      'concavity_mean': 'Concavity Mean', 'concave points_mean': 'Concave Points Mean',
                                      'symmetry_mean': 'Symmetry Mean', 'fractal_dimension_mean': 'Fractal Dimension Mean'},
                              title='Breast Cancer Diagnosis by Mean Characteristics')
fig.show()
The model development process uses Python's scikit-learn library to build and train an Isolation Forest model, which identifies anomalous data points. Isolation Forest is an unsupervised learning algorithm known for its effectiveness in anomaly detection. It builds a forest of random isolation trees, each trained on a randomly selected subset of the data; because anomalies are easier to separate from the rest of the data, they end up with shorter average path lengths across the trees, and outliers are flagged on that basis.
Using this technique, we can surface hidden outliers and patterns in the data. Overall, the Isolation Forest algorithm is a valuable tool for anomaly detection in breast cancer data, with the potential to improve how we approach screening and treatment of the disease.
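To make the path-length idea concrete, here is a tiny self-contained sketch on made-up toy data (an illustration, separate from the project's pipeline below): the single far-away point is the easiest to isolate and therefore receives the lowest anomaly score.
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: a tight 2-D cluster plus one distant point
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X_toy)
scores = iso.score_samples(X_toy)  # lower (more negative) = more anomalous
print(scores.argmin())  # 100 -> the distant point is scored as most anomalous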
from sklearn.feature_selection import SelectKBest, f_classif
# Split the data into features and target
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']
X.head()
y.head()
# Performing feature selection using SelectKBest and f_classif
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)
SelectKBest(k=5)
# Getting the indices of the selected features
selected_indices = selector.get_support(indices=True)
# Getting the names of the selected features
selected_features = X.columns[selected_indices].tolist()
# Printing the selected features
print(selected_features)
x = df[selected_features]
y = df['diagnosis']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
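Note that without a fixed seed the split changes on every run. If reproducibility matters, one might pass random_state (and optionally stratify, to preserve class proportions); this is an assumption on our part, not part of the original code:
# Hypothetical variant: reproducible, class-balanced split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=42)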
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
# Fit an Isolation Forest model on the training data
clf = IsolationForest(n_estimators=100, max_samples="auto", contamination="auto", random_state=42)
clf.fit(X_train)
IsolationForest(random_state=42)
# Using the model to predict outliers in the test data
y_pred = clf.predict(X_test)
y_pred = np.where(y_pred == -1, 1, 0)  # Convert -1 (outlier) to 1, and 1 (inlier) to 0
y_pred
array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0])
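The classification_report imported earlier is never used above; a hedged way to close the loop is to compare the flagged outliers against the true diagnosis labels. This is only a rough sanity check, since Isolation Forest is unsupervised and there is no guarantee that its outliers line up with malignant cases:
# Compare unsupervised outlier flags with the true labels (sanity check only)
print(classification_report(y_test, y_pred, target_names=['inlier/benign', 'outlier/malignant']))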
# Plot how the detected inliers and outliers distribute across the true diagnosis labels
plt.figure(figsize=(10,10))
plt.hist(y_test[y_pred==0], bins=20, alpha=0.5, label="Inliers")
plt.hist(y_test[y_pred==1], bins=20, alpha=0.5, label="Outliers")
plt.xlabel("Diagnosis (0: benign, 1: malignant)")
plt.ylabel("Frequency")
plt.title("Outliers detected by Isolation Forest")
plt.legend()
plt.show()
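As a second detector, we apply the Local Outlier Factor (LOF) algorithm, which compares the local density of each point with the densities of its nearest neighbors; points whose density is substantially lower than their neighbors' are flagged as anomalies (fit_predict returns -1 for them and 1 for inliers).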
import plotly.graph_objs as go
from sklearn.neighbors import LocalOutlierFactor
model = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# fit_predict both fits the model and labels each point: -1 = anomaly, 1 = inlier
y_pred1 = model.fit_predict(X)
# Creating scatter plot of the first two features, colored by LOF label
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=X.iloc[:, 0],
        y=X.iloc[:, 1],
        mode='markers',
        marker=dict(
            color=y_pred1,
            colorscale='viridis'
        ),
        hovertemplate='Feature 1: %{x}<br>Feature 2: %{y}<extra></extra>'
    )
)
fig.update_layout(
    title='Local Outlier Factor Anomaly Detection',
    xaxis_title='Feature 1',
    yaxis_title='Feature 2'
)
# Add legend annotations by splitting points into normal (1) and anomaly (-1) groups
normal = X[y_pred1 == 1]
anomaly = X[y_pred1 == -1]
normal_points = go.Scatter(x=normal.iloc[:, 0], y=normal.iloc[:, 1], mode='markers',
                           marker=dict(color='yellow'), showlegend=True, name='Normal')
anomaly_points = go.Scatter(x=anomaly.iloc[:, 0], y=anomaly.iloc[:, 1], mode='markers',
                            marker=dict(color='darkviolet'), showlegend=True, name='Anomaly')
fig.add_trace(normal_points)
fig.add_trace(anomaly_points)
fig.show()
In this project-based blog, we explored anomaly detection in breast cancer data. We used Python's scikit-learn library to construct and train an Isolation Forest model for detecting anomalous data points in the dataset. The model was able to uncover outliers and hidden patterns in the data and helped us draw meaningful conclusions.
By refining the accuracy of screening methods, we can potentially save countless lives in the fight against breast cancer. Machine learning and data visualization techniques like these help us better understand the complications involved in detecting anomalies in breast cancer data, and take a step towards more effective screening and treatment. Altogether, the project was a success and points to a promising direction for breast cancer data analysis and anomaly detection.
Q. How can breast abnormalities be detected?
A. Breast abnormalities can be detected through various methods, including regular self-examinations, clinical breast examinations by healthcare professionals, and imaging tests. Recently, AI technology has also been applied to anomaly detection.
Q. How is breast cancer detected?
A. Breast cancer can be detected through a combination of screening methods, such as mammograms, clinical breast examinations, and breast self-exams. These screenings can help identify suspicious lumps, changes in breast size or shape, nipple discharge, or other abnormalities that may indicate the presence of breast cancer.
Q. What are the five warning signs of breast cancer?
A. The five warning signs of breast cancer include a new lump or mass in the breast or underarm, changes in breast size or shape, nipple discharge or inversion, skin dimpling or puckering, and redness or thickening of the breast skin.
Q. Is there a blood test for breast cancer?
A. One blood test used in breast cancer care is the circulating tumor DNA (ctDNA) test. It analyzes fragments of tumor DNA that circulate in the bloodstream, allowing for the detection of genetic mutations associated with breast cancer.