Breast cancer is a serious medical condition that affects millions of women worldwide. Although the medical field has made real progress in diagnosing and treating breast cancer, spotting it at an early stage remains difficult. Anomaly detection can surface tiny yet vital patterns in breast cancer data that might not be visible to the naked eye. By increasing the accuracy of screening methods, we can potentially save many lives and help patients beat breast cancer. In this era of computer-assisted healthcare, anomaly detection is a powerful tool that can change how we approach breast cancer screening and treatment.
In this article, we will explore anomaly detection in breast cancer data: loading and preprocessing the dataset, visualizing it, and building an Isolation Forest model to flag unusual data points.
Breast cancer occurs when breast cells grow uncontrollably and can be found in various parts of the breast. It can metastasize by spreading through blood vessels and lymph vessels to other areas of the body.
Ignoring cancer symptoms or delaying treatment lowers the chance of survival: complications multiply, treatment at later stages may no longer work, and healthcare costs rise sharply. Early treatment greatly improves the odds of overcoming the disease, which is why it is important to catch breast cancer at the earliest possible stage.
There are several types of breast cancer; common ones include ductal carcinoma in situ (DCIS), invasive ductal carcinoma, invasive lobular carcinoma, and inflammatory breast cancer.
For the diagnosis of breast cancer, imaging and tissue sampling are typically combined. Mammography is one of the most widely used ways to screen for breast cancer, and a biopsy is used to confirm a diagnosis. MRI (magnetic resonance imaging) is another valuable method, often recommended for patients at high risk of breast cancer.
Many machine learning algorithms can be used to detect breast cancer, including SVMs, decision trees, and neural networks. Using these algorithms, we can predict cancer at an early stage, which helps slow the spread of the disease and increases the probability of saving the patient's life; a quick baseline sketch appears after the dataset description below.
The data set used for this project is sourced from the UCI Machine Learning Repository, containing 569 instances of breast cancer and 30 attributes. Interested readers may download the data set by clicking on the following link: here. Alternatively, the data set is available in the scikit-learn library, a popular machine-learning library for Python. By working through this blog, readers will gain a better understanding of the complexities involved in detecting anomalies in breast cancer data and how to effectively use the data set for machine learning purposes.
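Since the data set also ships with scikit-learn, a minimal sketch like the following (an illustration with assumed parameter choices, not the article's pipeline) loads it directly and fits a quick SVM baseline of the kind mentioned above:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the same 569-sample, 30-feature data set bundled with scikit-learn
data = load_breast_cancer()
print(data.data.shape)  # (569, 30)

# Hold out 30% of the data and fit a scaled SVM classifier
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.3, random_state=42)
clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X_tr, y_tr)
print('Test accuracy:', clf.score(X_te, y_te))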
The goal of the project is to understand the data and detect irregular occurrences within the breast cancer measurements. We will use scikit-learn's Isolation Forest implementation in Python to build and train a model that finds anomalous data points in the dataset. Ultimately, we will interpret the results and draw meaningful conclusions from the data.
The project pipeline includes the following steps: importing the libraries, loading and exploring the data, preprocessing (handling missing values and encoding), visualizing the data, selecting features, and building and evaluating the anomaly detection model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data.csv')
df.head(5)
df.columns
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'], dtype='object')
print('length of data is', len(df))
length of data is 569
df.shape
(569, 33)
df.info()
df.dtypes
np.sum(df.isnull().any(axis=1))
0
print('Count of columns in the data is: ', len(df.columns))
print('Count of rows in the data is: ', len(df))
Count of columns in the data is: 33
Count of rows in the data is: 569
df['diagnosis'].unique()
array(['M', 'B'], dtype=object)
df['diagnosis'].nunique()
2
Handling missing values is one of the most important preprocessing steps when a dataset contains them. Missing values can cause errors in the program, or simply mean the data was never recorded in the first place, and there are many techniques to deal with them depending on the nature of the data.
No single technique is always suitable for handling missing values. In some cases we drop a row or column when its missing values are too numerous, irrelevant to the data, or unlikely to be useful in building a model. Here we will use the isnull() function to find the missing values; a sketch of other common strategies follows the check below.
def null_values(data):
    null_values = data.isnull().sum()
    null_values = null_values[null_values > 0]
    null_values.sort_values(inplace=True)
    print(null_values)

null_values(df)
Series([ ], dtype: int64)
All values in the data are present.
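This data turned out to be complete, but had there been gaps, a minimal sketch like the following (illustrative choices of columns and fill values, not steps from this project) shows the drop/fill strategies mentioned above:
# Drop rows that contain any missing value (use axis=1 to drop columns instead)
df_dropped = df.dropna()
# Or fill gaps in a numeric column with its mean
df_filled = df.fillna({'radius_mean': df['radius_mean'].mean()})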
In the data pre-processing phase, the next step is encoding the data into a form suitable for model building. This involves converting categorical variables into numerical form (i.e., changing the data type of a variable from object to int64), scaling the data into a standard range, or applying any other transformations needed to create a clean dataset. In this project-based blog, we will use the LabelEncoder method from the sklearn.preprocessing library to convert categorical variables into numerical ones so they can be used to train the model.
To elaborate, encoding the data matters even for visualization: many plots cannot interpret categorical variables because they are based on numerical calculations. Although we use the LabelEncoder method in this project-based blog, methods like one-hot encoding or binary encoding can also be used, depending on the needs of the model.
Scaling the data to a standard range is also necessary to ensure the variables are weighted equally and the model is not biased towards one particular feature. This can be achieved using methods such as standardization or normalization, sketched after the encoding step below.
In the code below, we first import LabelEncoder from sklearn.preprocessing and create an object of that class. We then call its fit_transform function to transform the specified variable into a numerical datatype.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['diagnosis'] = le.fit_transform(df['diagnosis'])
df.head()
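For completeness, here is a hedged sketch of the alternatives mentioned above (standard scaling and one-hot encoding); these are illustrations on assumed columns, not steps used later in this project:
from sklearn.preprocessing import StandardScaler

# Standardize selected numeric features to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['radius_mean', 'texture_mean']])

# One-hot encode a column with pandas (illustrative; diagnosis is already numeric here)
one_hot = pd.get_dummies(df['diagnosis'], prefix='diagnosis')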
To understand the data and its anomalies better, we will try different types of visualizations: scatter plots, histograms, box plots, and more. These help us identify outliers and patterns in the data that are not obvious from the raw values, which in turn helps us construct an effective anomaly detection model; a small box plot sketch follows this paragraph.
In addition, we can use techniques such as clustering or regression analysis to analyze the data further and understand its various properties. In general, our main objective is to build a reliable model that can accurately detect and flag unusual or unexpected patterns in the data, helping us find issues before they cause major harm or disrupt operations.
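As one example of the box plots mentioned above (a hedged sketch, not a plot from the original article), points beyond the whiskers are candidate outliers:
# Box plot of radius_mean split by diagnosis
plt.figure(figsize=(8, 6))
sns.boxplot(x='diagnosis', y='radius_mean', data=df)
plt.title('radius_mean by Diagnosis')
plt.show()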
#Number of Malignant(M) and Benign(B) cells
plt.figure(figsize=(8, 6))
sns.countplot(x='diagnosis', data=df, palette=['#FFC0CB', '#ADD8E6'],
              edgecolor='black', linewidth=1.5)
plt.title('Diagnosis Count', fontsize=20, fontweight='bold')
plt.xlabel('Diagnosis', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Annotate each bar with its count
ax = plt.gca()
for patch in ax.patches:
    plt.text(x=patch.get_x() + 0.4, y=patch.get_height() + 2,
             s=str(int(patch.get_height())), fontsize=12)
plt.show()
plt.figure(figsize=(25,15))
sns.heatmap(df.corr(),annot=True, cmap='coolwarm')
Kernel Density Estimation Plot showing the distribution of ‘radius_mean’ among benign and malignant tumors in a breast cancer dataset
def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, shade=True)
    facet.set(xlim=(0, df[var].max()))
    facet.add_legend()
    plt.show()

plot_distribution(df, var='radius_mean', target='diagnosis')
Scatter Plot showcasing the relationship between ‘radius_mean’ and ‘texture_mean’ in benign and malignant tumors of a breast cancer dataset.
def plot_scatter(df, var1, var2, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(plt.scatter, var1, var2, alpha=0.5)
    facet.add_legend()
    plt.show()

plot_scatter(df, var1='radius_mean', var2='texture_mean', target='diagnosis')
import plotly.express as px
fig = px.parallel_coordinates(df, dimensions=['radius_mean', 'texture_mean', 'perimeter_mean',
                                              'area_mean', 'smoothness_mean', 'compactness_mean',
                                              'concavity_mean', 'concave points_mean', 'symmetry_mean',
                                              'fractal_dimension_mean'],
                              color='diagnosis', color_continuous_scale=px.colors.sequential.Plasma,
                              labels={'radius_mean': 'Radius Mean', 'texture_mean': 'Texture Mean',
                                      'perimeter_mean': 'Perimeter Mean', 'area_mean': 'Area Mean',
                                      'smoothness_mean': 'Smoothness Mean', 'compactness_mean': 'Compactness Mean',
                                      'concavity_mean': 'Concavity Mean', 'concave points_mean': 'Concave Points Mean',
                                      'symmetry_mean': 'Symmetry Mean', 'fractal_dimension_mean': 'Fractal Dimension Mean'},
                              title='Breast Cancer Diagnosis by Mean Characteristics')
fig.show()
The model development process uses Python's scikit-learn library to build and train an Isolation Forest model, which identifies anomalous data points. Isolation Forest is an unsupervised learning algorithm known for its effectiveness in anomaly detection. It builds a forest of random isolation trees, each trained on a randomly selected subset of the data; because anomalies are easier to separate from the rest of the data, they end up with shorter average path lengths across the trees, and outliers are flagged on that basis.
Using this technique, we can surface hidden outliers and patterns in the data. Overall, the Isolation Forest algorithm is a valuable tool for anomaly detection in breast cancer data, with the potential to improve how we approach screening and treatment of the disease.
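To make the path-length idea concrete, here is a tiny self-contained sketch on made-up toy data (an illustration, separate from the project's pipeline below): the single far-away point is the easiest to isolate and therefore receives the lowest anomaly score.
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: a tight 2-D cluster plus one distant point
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(n_estimators=100, random_state=0).fit(X_toy)
scores = iso.score_samples(X_toy)  # lower (more negative) = more anomalous
print(scores.argmin())  # 100 -> the distant point is scored as most anomalous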
from sklearn.feature_selection import SelectKBest, f_classif
# Split the data into features and target
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']
X.head()
y.head()
# Performing feature selection using SelectKBest and f_classif
selector = SelectKBest(score_func=f_classif, k=5)
selector.fit(X, y)
SelectKBest(k=5)
# Getting the indices of the selected features
selected_indices = selector.get_support(indices=True)
# Getting the names of the selected features
selected_features = X.columns[selected_indices].tolist()
# Printing the selected features
print(selected_features)
x = df[selected_features]
y = df['diagnosis']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
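Note that without a fixed seed the split changes on every run. If reproducibility matters, one might pass random_state (and optionally stratify, to preserve class proportions); this is an assumption on our part, not part of the original code:
# Hypothetical variant: reproducible, class-balanced split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=42)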
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
# Fit an Isolation Forest model on the training data
clf = IsolationForest(n_estimators=100, max_samples="auto", contamination="auto", random_state=42)
clf.fit(X_train)
IsolationForest(random_state=42)
# Using the model to predict outliers in the test data
y_pred = clf.predict(X_test)
y_pred = np.where(y_pred == -1, 1, 0)  # Convert -1 (outlier) to 1, and 1 (inlier) to 0
y_pred
array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0])
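The classification_report imported earlier is never used above; a hedged way to close the loop is to compare the flagged outliers against the true diagnosis labels. This is only a rough sanity check, since Isolation Forest is unsupervised and there is no guarantee that its outliers line up with malignant cases:
# Compare unsupervised outlier flags with the true labels (sanity check only)
print(classification_report(y_test, y_pred, target_names=['inlier/benign', 'outlier/malignant']))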
# Plot how the detected inliers and outliers distribute across the true diagnosis labels
plt.figure(figsize=(10,10))
plt.hist(y_test[y_pred==0], bins=20, alpha=0.5, label="Inliers")
plt.hist(y_test[y_pred==1], bins=20, alpha=0.5, label="Outliers")
plt.xlabel("Diagnosis (0: benign, 1: malignant)")
plt.ylabel("Frequency")
plt.title("Outliers detected by Isolation Forest")
plt.legend()
plt.show()
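As a second detector, we apply the Local Outlier Factor (LOF) algorithm, which compares the local density of each point with the densities of its nearest neighbors; points whose density is substantially lower than their neighbors' are flagged as anomalies (fit_predict returns -1 for them and 1 for inliers).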
import plotly.graph_objs as go
from sklearn.neighbors import LocalOutlierFactor
model = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# fit_predict both fits the model and labels each point: -1 = anomaly, 1 = inlier
y_pred1 = model.fit_predict(X)
# Creating scatter plot of the first two features, colored by LOF label
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=X.iloc[:, 0],
        y=X.iloc[:, 1],
        mode='markers',
        marker=dict(
            color=y_pred1,
            colorscale='viridis'
        ),
        hovertemplate='Feature 1: %{x}<br>Feature 2: %{y}<extra></extra>'
    )
)
fig.update_layout(
    title='Local Outlier Factor Anomaly Detection',
    xaxis_title='Feature 1',
    yaxis_title='Feature 2'
)
# Add legend annotations by splitting points into normal (1) and anomaly (-1) groups
normal = X[y_pred1 == 1]
anomaly = X[y_pred1 == -1]
normal_points = go.Scatter(x=normal.iloc[:, 0], y=normal.iloc[:, 1], mode='markers',
                           marker=dict(color='yellow'), showlegend=True, name='Normal')
anomaly_points = go.Scatter(x=anomaly.iloc[:, 0], y=anomaly.iloc[:, 1], mode='markers',
                            marker=dict(color='darkviolet'), showlegend=True, name='Anomaly')
fig.add_trace(normal_points)
fig.add_trace(anomaly_points)
fig.show()
In this project-based blog, we explored anomaly detection in breast cancer data. We used Python's scikit-learn library to construct and train an Isolation Forest model for detecting anomalous data points in the dataset. The model was able to uncover outliers and hidden patterns in the data and helped us draw meaningful conclusions.
By refining the accuracy of screening methods, we can potentially save countless lives in the fight against breast cancer. Machine learning and data visualization techniques like these help us better understand the complications involved in detecting anomalies in breast cancer data, and take a step towards more effective screening and treatment. Altogether, the project was a success and points to a promising direction for breast cancer data analysis and anomaly detection.
Q. How can breast abnormalities be detected?
A. Breast abnormalities can be detected through various methods, including regular self-examinations, clinical breast examinations by healthcare professionals, and imaging tests. Recently, AI technology has also been applied to anomaly detection.
Q. How is breast cancer detected?
A. Breast cancer can be detected through a combination of screening methods, such as mammograms, clinical breast examinations, and breast self-exams. These screenings can help identify suspicious lumps, changes in breast size or shape, nipple discharge, or other abnormalities that may indicate the presence of breast cancer.
Q. What are the five warning signs of breast cancer?
A. The five warning signs of breast cancer include a new lump or mass in the breast or underarm, changes in breast size or shape, nipple discharge or inversion, skin dimpling or puckering, and redness or thickening of the breast skin.
Q. Is there a blood test for breast cancer?
A. One blood test used in breast cancer care is the circulating tumor DNA (ctDNA) test. It analyzes fragments of tumor DNA that circulate in the bloodstream, allowing for the detection of genetic mutations associated with breast cancer.