Diabetes mellitus is usually classified into two main types: type 1 and type 2. Type 2 diabetes, the more common form, is a chronic condition posing significant challenges to global healthcare systems. The increasing prevalence of this disease demands innovative approaches for early detection and effective management. Recent advancements in artificial intelligence and machine learning techniques offer promising solutions for predicting diabetes. Utilizing extensive datasets, including essential health indicators such as blood pressure, body mass index (BMI), and glucose levels, machine learning models can identify patterns and risk factors associated with diabetes.
In this context, various machine learning algorithms, including decision trees, random forests, support vector machines (SVM), and neural networks, have been employed to develop robust classifiers for diabetes prediction. Notably, algorithms such as logistic regression, Naive Bayes, and k-nearest neighbors (KNN) have shown significant potential in the prediction of type 1 or type 2 diabetes in prediabetic patients.
The deployment of these models in real-world healthcare settings can facilitate early diagnosis and intervention, potentially reducing the burden of diabetes-related complications. This article delves into the methodologies, data analysis, and evaluation metrics of different machine-learning approaches for diabetes prediction, highlighting their implications in clinical practice and public health. Through systematic review and performance analysis, we aim to provide a comprehensive overview of the current landscape for early detection of diabetes and future directions in using machine learning for diabetes prediction models.
This article was published as a part of the Data Science Blogathon.
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process, aimed at summarizing the main characteristics of a dataset, often with visual methods. Here are the critical aspects of EDA:
EDA is iterative and involves going back and forth between different steps to refine the understanding of the data. The goal is to make sense of the data, detect essential features, and generate questions or hypotheses for further analysis. It is an integral part of data science and helps make informed decisions for subsequent data modeling and analysis steps.
Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on building systems that learn from data and make decisions based on it. Instead of being explicitly programmed to perform a task, these systems use algorithms to identify patterns and make predictions or decisions. Here are the key types of machine learning:
Supervised Learning: The model is trained on labeled data, where both the input data and the corresponding output are provided. Examples include regression (predicting continuous values) and classification (categorizing data into discrete classes). Applications include spam detection, image classification, and medical diagnosis.
Unsupervised Learning: The model is trained on unlabeled data and must find patterns and relationships within the data on its own. Examples include clustering (grouping similar data points) and association (finding rules that describe large portions of the data). Applications include customer segmentation, market basket analysis, and anomaly detection.
Semi-supervised Learning: It combines a small amount of labeled data with a large amount of unlabeled data during training. It is useful in situations where acquiring labeled data is expensive or time-consuming, such as medical image analysis.
Reinforcement Learning: The model learns by interacting with an environment and receiving rewards or penalties based on its actions. Examples include algorithms for playing games, robotic control, and self-driving cars. Applications include game playing (like AlphaGo), robotics, and recommendation systems.
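To make the difference between the first two types concrete, here is a minimal sketch on synthetic data (the two point clouds below are an assumption made up for illustration, not part of the diabetes dataset). The supervised model is given the labels, while the clustering algorithm only sees the inputs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)  # labels, used only by the supervised model

# Supervised learning: the classifier is trained on both X and the labels y.
clf = LogisticRegression().fit(X, y)
supervised_accuracy = clf.score(X, y)

# Unsupervised learning: the clustering algorithm is given X only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_sizes = np.bincount(kmeans.labels_)

print("Supervised training accuracy:", supervised_accuracy)
print("Cluster sizes found without labels:", cluster_sizes)
```

Diabetes prediction, as covered in this article, falls into the supervised category because every patient record comes with a known outcome.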
Machine learning offers several advantages over traditional statistical models and other methods for diabetes prediction, making it particularly well-suited for this application. Here are key reasons why machine learning is often better for diabetes prediction:
Machine learning algorithms, such as decision trees, random forests, and neural networks, excel at capturing complex, non-linear relationships between features that traditional linear models might miss.
Example: The relationship between blood glucose levels, age, BMI, and diabetes risk is often non-linear and may involve complex interactions that machine learning models can better capture.
Machine learning models can automatically perform feature selection and engineering, identifying the most relevant features for predicting diabetes.
Example: Algorithms like LASSO (Least Absolute Shrinkage and Selection Operator) or random forests can rank features by importance, potentially uncovering hidden predictors of diabetes.
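As a rough illustration of how LASSO performs feature selection, the sketch below fits scikit-learn's Lasso on synthetic regression data (an assumption for demonstration; the target here is continuous, not the diabetes outcome). Only the first two columns influence the target, and the L1 penalty shrinks the coefficients of the irrelevant columns to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, but only the first two drive the target.
rng = np.random.default_rng(42)
n_samples = 300
X = rng.normal(size=(n_samples, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# alpha controls the strength of the L1 penalty (0.1 is an illustrative choice).
lasso = Lasso(alpha=0.1).fit(X, y)

# Features with a non-zero coefficient are the ones LASSO "selected".
selected = np.flatnonzero(lasso.coef_)

print("Coefficients:", lasso.coef_.round(3))
print("Selected feature indices:", selected)
```

The surviving coefficients are slightly shrunk towards zero as well, which is the price LASSO pays for the sparsity.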
Machine learning models can handle large datasets with many features and observations, improving the predictions’ robustness and accuracy.
Example: With access to extensive patient records, including medical history, lifestyle factors, and genetic information, machine learning models can provide more accurate predictions than models limited to smaller datasets.
Machine learning models, particularly in dynamic environments, can be updated and retrained with new data to improve their accuracy and adapt to changes in population or disease characteristics.
Example: As new research reveals more about the genetic markers associated with diabetes, machine learning models can incorporate this information to enhance prediction accuracy.
Machine learning models can integrate and analyze diverse data types, including structured data (e.g., lab results) and unstructured data (e.g., doctor’s notes, medical imaging).
Example: Combining lab results, lifestyle information, and genomic data in a single model can lead to more comprehensive and accurate diabetes predictions.
Machine learning models generally outperform traditional models in predictive accuracy due to their ability to learn from large datasets and capture complex patterns.
Example: Studies have shown that machine learning models, like gradient boosting machines or deep neural networks, often provide higher accuracy in diabetes prediction compared to logistic regression.
Machine learning models can identify high-risk individuals earlier than traditional methods, enabling timely interventions and potentially preventing the onset of diabetes.
Example: Early identification through predictive modeling can lead to lifestyle modifications or medical treatments that delay or prevent diabetes.
Diabetes prediction using machine learning means using computer programs to guess if someone might get diabetes. These programs look at things like health history and lifestyle to make their guess. They learn from many examples of people with and without diabetes to make better guesses. For instance, they might look at how much sugar someone eats or if they exercise regularly. By doing this, they can give early warnings to people at risk of getting diabetes so they can take better care of themselves.
The Pima Indians Diabetes Dataset is a publicly available dataset widely used for diabetes research and predictive modeling. It contains 768 observations of females of Pima Indian heritage aged 21 years or older. The dataset includes eight medical predictor variables and one target variable. The predictor variables are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.
The target variable is ‘Outcome’, which indicates whether the patient had diabetes (1) or not (0). This dataset is particularly useful for testing machine learning algorithms for binary classification tasks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from mlxtend.plotting import plot_decision_regions
import missingno as msno
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
We start by reading the dataset, which is in CSV format:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head()
Output:
First, let’s see the columns available in our dataset:
diabetes_df.columns
Output:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
To get more information about the dataset:
diabetes_df.info()
Output:
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
To understand the statistics of the dataset:
diabetes_df.describe()
Output:
The same statistics are often easier to read when transposed, so that each feature becomes a row. Here, T stands for transpose:
diabetes_df.describe().T
Output:
Let’s check if our dataset has any null values:
diabetes_df.isnull().head(10)
Output:
To get the total number of null values in the dataset:
diabetes_df.isnull().sum()
Output:
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
The code above reports no missing values, which is misleading. In this dataset, missing values in several columns are encoded as 0. Therefore, we must replace those 0 values with NaN and then impute them properly.
Replace 0 values with NaN:
diabetes_df_copy = diabetes_df.copy(deep=True)
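# Columns where a value of 0 actually means "not measured"
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
diabetes_df_copy[zero_as_missing] = diabetes_df_copy[zero_as_missing].replace(0, np.nan)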
# Showing the count of NaNs
print(diabetes_df_copy.isnull().sum())
Output:
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
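Before replacing anything, it can help to count the zeros directly, so that the NaN counts afterwards can be cross-checked. The following is a minimal sketch on a tiny hand-made table (the values are illustrative assumptions; only the column names come from the dataset). Note that ‘Pregnancies’ is left alone, since zero is a legitimate value there:

```python
import numpy as np
import pandas as pd

# Tiny hand-made sample using the dataset's column names (values are illustrative).
sample = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 0],
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is not physiologically plausible and is treated as missing.
zero_means_missing = ["Glucose", "Insulin", "BMI"]

# Count the zeros per column before touching the data.
zero_counts = (sample[zero_means_missing] == 0).sum()

# Replace zeros with NaN only in those columns, working on a copy.
cleaned = sample.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
nan_counts = cleaned.isnull().sum()

print(zero_counts)
print(nan_counts)
```

The zero counts and the NaN counts should match column by column; if they do not, something other than zeros was already missing.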
We replaced the zeros with NaN values so that missing measurements are not mistaken for real ones. Next, we will impute these missing values with the mean or median of the respective column.
First, let’s visualize the distribution of each feature in the original dataset, before any imputation:
p = diabetes_df.hist(figsize=(20, 20))
Output:
Here we see the distribution of each feature, both the predictors and the target. Why do we need to know the distribution of the data? The answer is simple: it is the best way to start analyzing a dataset, because it shows how often each value occurs in graphical form and tells us the range of the data.
Imputing Missing Values
Now, we will impute the missing values. We’ll use the mean value for ‘Glucose’ and ‘BloodPressure’, and the median value for ‘SkinThickness’, ‘Insulin’, and ‘BMI’:
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply to the DataFrame
diabetes_df_copy['Glucose'] = diabetes_df_copy['Glucose'].fillna(diabetes_df_copy['Glucose'].mean())
diabetes_df_copy['BloodPressure'] = diabetes_df_copy['BloodPressure'].fillna(diabetes_df_copy['BloodPressure'].mean())
diabetes_df_copy['SkinThickness'] = diabetes_df_copy['SkinThickness'].fillna(diabetes_df_copy['SkinThickness'].median())
diabetes_df_copy['Insulin'] = diabetes_df_copy['Insulin'].fillna(diabetes_df_copy['Insulin'].median())
diabetes_df_copy['BMI'] = diabetes_df_copy['BMI'].fillna(diabetes_df_copy['BMI'].median())
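The article does not spell out why some columns get the mean and others the median. A common reason, which we assume here, is that the median is less affected by extreme values in skewed columns. The small sketch below uses made-up numbers to show how a single outlier changes the two fill values:

```python
import numpy as np
import pandas as pd

# Made-up measurements with one extreme value and one missing entry.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0, np.nan])

# Both mean() and median() skip NaN by default.
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())

print("Fill value using the mean:  ", mean_filled.iloc[-1])
print("Fill value using the median:", median_filled.iloc[-1])
```

The mean is pulled far above the typical measurements by the single outlier, while the median stays close to them, which is why the median is often preferred for columns with long tails.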
Let’s visualize the distribution of each feature again after imputing the missing values:
p = diabetes_df_copy.hist(figsize=(20, 20))
Output:
Inference: Here we use the histogram plot again, this time to see how the distributions change once the missing values have been imputed. Columns with no missing values, such as Age, are unchanged, while heavily imputed columns such as Insulin and SkinThickness now show a pronounced spike at the imputed median. This is expected, because many missing entries were filled with the same value.
To further verify that there are no null values left in the imputed dataset, we can use the Missingno library:
p = msno.bar(diabetes_df_copy)
Output:
color_wheel = {1: "#0392cf", 2: "#7bc043"}
colors = diabetes_df["Outcome"].map(lambda x: color_wheel.get(x + 1))
print(diabetes_df.Outcome.value_counts())
p = diabetes_df.Outcome.value_counts().plot(kind="bar")
Output:
0 500
1 268
Name: Outcome, dtype: int64
Inference: The above visualization indicates that our dataset is imbalanced. The number of non-diabetic patients (500) is almost double that of diabetic patients (268).
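One simple way to keep this imbalance from distorting the evaluation is a stratified train/test split, which preserves the class ratio in both parts. The sketch below recreates only the class counts reported above (the single feature column is a placeholder assumption) and uses the stratify argument of train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels mirroring the class counts reported above: 500 zeros and 268 ones.
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, for illustration only

positive_rate = y.mean()  # share of diabetic patients in the full dataset

# stratify=y keeps the class proportions (almost) identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7, stratify=y
)

print("Positive rate overall:", round(positive_rate, 3))
print("Positive rate in train:", round(y_train.mean(), 3))
print("Positive rate in test: ", round(y_test.mean(), 3))
```

Stratification does not fix the imbalance itself; it only makes sure the test set reflects it. Options such as class weights or resampling address the imbalance during training.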
Finally, let’s examine the distribution and outliers for the ‘Insulin’ feature using both a histogram with a density curve and a boxplot:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(diabetes_df['Insulin'], kde=True, ax=axes[0])
diabetes_df['Insulin'].plot.box(ax=axes[1])
plt.show()
Output:
Inference: The histogram with its density curve helps us understand the distribution of the ‘Insulin’ feature, while the boxplot reveals any outliers present. This combined approach provides a comprehensive view of the data, highlighting any potential issues that need to be addressed during further analysis.
Let’s analyze the correlation between all the features in the dataset before any data cleaning. This will help us understand the relationships between different features:
plt.figure(figsize=(12, 10))
# Using seaborn to create a heatmap for the correlation matrix
p = sns.heatmap(diabetes_df.corr(), annot=True, cmap='RdYlGn')
Output:
Inference: The heatmap above shows the correlation coefficients between each pair of features in the dataset. The correlation coefficient ranges from -1 to 1, where values close to 1 indicate a strong positive linear relationship, values close to -1 indicate a strong negative linear relationship, and values near 0 indicate little or no linear relationship.
By examining the heatmap, we can identify which features strongly correlate with each other and with the target variable ‘Outcome’. This information is crucial for feature selection and engineering steps in the machine learning pipeline.
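When the heatmap gets crowded, it can be easier to read off a ranked list of correlations with the target. The sketch below does this on synthetic data (the columns and the rule that generates the outcome are assumptions for illustration), sorting features by the absolute value of their correlation with ‘Outcome’:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: the outcome is driven mostly by glucose.
rng = np.random.default_rng(1)
n = 500
glucose = rng.normal(120, 30, n)
age = rng.normal(35, 10, n)
noise = rng.normal(0, 1, n)
outcome = (glucose + rng.normal(0, 15, n) > 130).astype(int)

df = pd.DataFrame({"Glucose": glucose, "Age": age, "Noise": noise, "Outcome": outcome})

# Take the 'Outcome' column of the correlation matrix, drop the self-correlation,
# and sort by absolute strength so strong negative correlations are not hidden.
corr_with_outcome = (
    df.corr()["Outcome"].drop("Outcome").sort_values(key=abs, ascending=False)
)

print(corr_with_outcome.round(3))
```

Keep in mind that this only captures linear relationships; a feature with a weak correlation can still be useful to a non-linear model.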
Before scaling the data, let’s take a quick look at the first few rows of the imputed data:
diabetes_df_copy.head()
Output:
Next, we will apply standard scaling to the predictor columns. Standard scaling rescales each feature to zero mean and unit variance, so that features measured on larger numeric ranges do not dominate the machine-learning model:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_df_copy.drop(['Outcome'], axis=1)),
columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
X.head()
Output:
After scaling, the values of all features are now on the same scale. This helps our machine learning model perform better because no single feature will dominate due to its larger values.
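Under the hood, standard scaling applies z = (x - mean) / std to each column, where std is the population standard deviation. The following sketch checks this on a small made-up array (the numbers are illustrative assumptions) by comparing StandardScaler with the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([
    [50.0, 1.0],
    [100.0, 2.0],
    [150.0, 3.0],
    [200.0, 6.0],
])

scaled = StandardScaler().fit_transform(X)

# Manual version: subtract the column mean, divide by the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Column means after scaling:", scaled.mean(axis=0).round(6))
print("Column stds after scaling: ", scaled.std(axis=0).round(6))
print("Matches manual formula:", np.allclose(scaled, manual))
```

In a modeling workflow, the scaler is usually fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.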
Let’s also take a look at our target variable, ‘Outcome’:
y = diabetes_df_copy.Outcome
y.head()
Output:
0 1
1 0
2 1
3 0
4 1
Name: Outcome, dtype: int64
The ‘Outcome’ column shows whether a patient has diabetes (1) or not (0). Understanding the target variable is essential to build an accurate predictive model.
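First, we need to split the dataset into features (X) and target (y). Note that the code below rebuilds X and y from the original diabetes_df, so the models in this section are trained on the raw data rather than on the imputed and scaled features prepared above, and the reported accuracies correspond to that raw data: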
X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']
Now, we will split the data into training and testing sets using the train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
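Because the split above uses the original diabetes_df, the imputation and scaling steps are not part of the models that follow. One way to include them, sketched below under our own assumptions, is a scikit-learn Pipeline that imputes, scales, and fits the classifier using the training split only. The data here is synthetic, with NaN standing in for the zeros-as-missing values:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (column names borrowed, values made up).
rng = np.random.default_rng(7)
n = 400
glucose = rng.normal(120, 30, n)
bmi = rng.normal(32, 6, n)
y = (glucose + rng.normal(0, 10, n) > 125).astype(int)
X = pd.DataFrame({"Glucose": glucose, "BMI": bmi})

# Simulate missing measurements in roughly 10% of the BMI values.
missing_mask = rng.random(n) < 0.1
X.loc[missing_mask, "BMI"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7, stratify=y
)

# Each step is fitted on the training data only, then reused on the test data.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=200, random_state=7)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 3))
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the scaler matters more for models like SVM or KNN; it is kept here so the same pipeline can be reused with those models.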
Building the model using Random Forest:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
Check the accuracy of the model on the training dataset:
rfc_train = rfc.predict(X_train)
from sklearn import metrics
print("Training Accuracy =", format(metrics.accuracy_score(y_train, rfc_train)))
Output:
Training Accuracy = 1.0
A training accuracy of 1.0 means the model fits the training data perfectly, which is a sign of overfitting. Now, let’s check the accuracy on the test data:
predictions = rfc.predict(X_test)
print("Test Accuracy =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Test Accuracy = 0.7677165354330708
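The gap between the training and test accuracy can be examined more systematically with cross-validation and a depth limit on the trees. The sketch below is an illustration on synthetic data (generated with make_classification and deliberate label noise, both our assumptions), comparing an unconstrained forest with a shallower one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; flip_y adds label noise so that memorizing is possible.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, flip_y=0.2, random_state=7
)

full_forest = RandomForestClassifier(n_estimators=100, random_state=7)
shallow_forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=7)

# Accuracy on the same data the model was trained on (optimistic).
train_acc_full = full_forest.fit(X, y).score(X, y)
train_acc_shallow = shallow_forest.fit(X, y).score(X, y)

# 5-fold cross-validated accuracy (a more realistic estimate).
cv_scores_full = cross_val_score(full_forest, X, y, cv=5)
cv_scores_shallow = cross_val_score(shallow_forest, X, y, cv=5)

print("Unconstrained: train", round(train_acc_full, 3), "cv", round(cv_scores_full.mean(), 3))
print("max_depth=4:   train", round(train_acc_shallow, 3), "cv", round(cv_scores_shallow.mean(), 3))
```

A smaller gap between training and cross-validated accuracy suggests less overfitting, although it does not by itself guarantee a better test score.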
Get the classification report and confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Output:
Building the model using a Decision Tree:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
Make predictions on the testing data:
predictions = dtree.predict(X_test)
print("Test Accuracy =", format(metrics.accuracy_score(y_test, predictions)))
Output:
Test Accuracy = 0.7322834645669292
Get the classification report and confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
Output:
Building the model using XGBoost:
from xgboost import XGBClassifier
xgb_model = XGBClassifier(gamma=0)
xgb_model.fit(X_train, y_train)
Output:
Make predictions on the testing data:
xgb_pred = xgb_model.predict(X_test)
print("Test Accuracy =", format(metrics.accuracy_score(y_test, xgb_pred)))
Output:
Test Accuracy = 0.7401574803149606
Get the classification report and confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, xgb_pred))
print(classification_report(y_test, xgb_pred))
Output:
Building the model using a Support Vector Machine (SVM):
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)
Make predictions on the testing data:
svc_pred = svc_model.predict(X_test)
print("Test Accuracy =", format(metrics.accuracy_score(y_test, svc_pred)))
Output:
Test Accuracy = 0.7401574803149606
Get the classification report and confusion matrix:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, svc_pred))
print(classification_report(y_test, svc_pred))
Output:
Among the models tested, the Random Forest model performed the best with a test accuracy of 0.7677, ahead of XGBoost and SVM (both 0.7402) and the Decision Tree (0.7323).
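Since the dataset is imbalanced, accuracy alone can hide how many diabetic patients a model misses. The sketch below uses a small set of made-up labels and predictions (not results from the models above) to show how precision and recall follow from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions for ten patients (1 = diabetic).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted diabetics, how many really are
recall = tp / (tp + fn)     # of the real diabetics, how many were found

print("Accuracy:", accuracy)
print("Precision:", round(precision, 3), "(sklearn:", round(precision_score(y_true, y_pred), 3), ")")
print("Recall:", recall, "(sklearn:", recall_score(y_true, y_pred), ")")
```

In this toy example the accuracy looks reasonable, yet half of the diabetic patients are missed, which is exactly the kind of detail the classification report exposes.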
Knowing the importance of each feature is essential as it shows how much each contributes to the model’s predictions.
Let’s retrieve the feature importances from the Random Forest model:
rfc.feature_importances_
Output:
array([0.07684946, 0.25643635, 0.08952599, 0.08437176, 0.08552636, 0.14911634, 0.11751284, 0.1406609 ])
From the above output, it’s not very clear which feature is most important. Therefore, we will create a visualization to better understand the feature’s importance.
We will now plot the feature importances to get a clearer picture:
pd.Series(rfc.feature_importances_, index=X.columns).plot(kind='barh')
Output:
From the graph above, it is clear that ‘Glucose’ is the most important feature in this dataset. Visualizing feature importance helps us identify which features influence the model’s predictions most.
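A sorted table can complement the bar chart. The sketch below pairs the importance values printed earlier with the eight predictor names, in the same column order as the dataset, and ranks them from most to least important:

```python
import pandas as pd

# Predictor names in the dataset's column order (everything except 'Outcome').
feature_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Importance values copied from the rfc.feature_importances_ output shown above.
importances = [
    0.07684946, 0.25643635, 0.08952599, 0.08437176,
    0.08552636, 0.14911634, 0.11751284, 0.1406609,
]

ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print(ranked.round(3))
print("Sum of importances:", round(ranked.sum(), 6))
```

Random forest importances are normalized to sum to 1, so each value can be read as a share of the total. Because the forest above was built without a fixed random_state, the exact numbers will vary slightly between runs.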
import pickle
# First, we use the dumps() function to serialize the model to a bytes object in memory
saved_model = pickle.dumps(rfc)
# Then we load the model back from those bytes
rfc_from_pickle = pickle.loads(saved_model)
# Lastly, we use the restored model to make predictions
rfc_from_pickle.predict(X_test)
Output:
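The snippet above keeps the serialized model in memory. To reuse a model across sessions, it is usually written to a file instead. Here is a minimal sketch using pickle.dump() and pickle.load(); the stand-in model, the synthetic data, and the file name rfc_model.pkl are all assumptions for illustration:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (not the article's rfc).
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
model = RandomForestClassifier(n_estimators=50, random_state=7).fit(X, y)

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rfc_model.pkl")  # hypothetical file name

    # dump() writes the serialized model to an open binary file.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # load() reads it back and rebuilds the model object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Restored model gives identical predictions:", same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. It is also safest to load a model with the same scikit-learn version that was used to save it.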
Finally, let’s look at the dataset’s head and tail so that we can pick a few rows from each and check whether our model gives the correct prediction for them.
diabetes_df.head()
Output:
diabetes_df.tail()
Output:
Passing a data point to the model returns either 1 or 0, i.e., whether or not the person is predicted to have diabetes.
rfc.predict([[0, 137, 40, 35, 168, 43.1, 2.228, 33]]) # 4th patient
Output:
array([1], dtype=int64)
According to our model, this patient has diabetes.
Another one:
rfc.predict([[10, 101, 76, 48, 180, 32.9, 0.171, 63]]) # 763rd patient
Output:
array([0], dtype=int64)
This patient does not have diabetes.
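A hard 0 or 1 hides how confident the model is. As a sketch, the example below trains a stand-in model on synthetic data (two made-up features and a made-up labeling rule, both assumptions) and uses predict_proba to get class probabilities for a new patient. Passing a DataFrame with the training column names also avoids the feature-name warning that a plain list triggers:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data with two illustrative features.
feature_names = ["Glucose", "BMI"]
rng = np.random.default_rng(3)
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 300),
    "BMI": rng.normal(32, 6, 300),
})
y = (X["Glucose"] > 130).astype(int)  # made-up rule for the sake of the example

model = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)

# A new patient, given as a one-row DataFrame with the same column names.
new_patient = pd.DataFrame([[180.0, 35.0]], columns=feature_names)

predicted_class = model.predict(new_patient)[0]
class_probabilities = model.predict_proba(new_patient)[0]  # order follows model.classes_

print("Predicted class:", predicted_class)
print("Probabilities for classes", model.classes_, ":", class_probabilities)
```

In a screening setting, the probability can be compared against a threshold other than 0.5, for example a lower one if missing a diabetic patient is considered more costly than a false alarm.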
Data Gathering: Collect a thorough dataset detailing individuals’ health records, daily routines, and physical measurements for predicting diabetes through machine learning.
Data Preprocessing: Clean the data by eliminating inconsistencies and errors. This step helps ensure the dataset is suitable for training machine learning algorithms to detect diabetes.
Feature Selection: Identify and choose important attributes such as blood sugar levels, BMI, family history, and age. These characteristics are essential for accurately predicting diabetes.
Model Training: Train models with machine learning algorithms, such as random forests or neural networks, on the prepared dataset. During training, the models learn from examples to identify patterns that suggest the presence of diabetes.
Model Evaluation: Assess the performance of the trained models using accuracy, precision, recall, and F1-score metrics. This step checks how reliably the models predict diabetes.
Prediction: Use the trained machine learning models to estimate the probability of an individual developing diabetes based on the data provided.
Continuous Monitoring: Set up a mechanism to continuously monitor and revise the models as new data arrives. This helps the models stay accurate and applicable for predicting diabetes in real-world settings.
While the machine learning models above are highly effective for predicting diabetes, several simpler alternative models can also be used, each with advantages and limitations. Here are some of the leading alternative models for diabetes prediction:
Logistic Regression: It is a traditional statistical method used for binary classification problems.
Naive Bayes: It is a probabilistic classifier based on Bayes’ theorem with an assumption of independence among predictors.
K-Nearest Neighbors (KNN): It is a non-parametric method that classifies a data point based on the majority class of its k nearest neighbors.
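These three alternatives can be compared side by side with a few lines of scikit-learn. The sketch below is an illustration on synthetic data (generated with make_classification, an assumption standing in for the diabetes dataset) and reports the mean 5-fold cross-validated accuracy of each model. Logistic regression and KNN are wrapped in a pipeline with StandardScaler because both are sensitive to feature scale:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the same shape as the diabetes dataset.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=7
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean accuracy over 5 cross-validation folds for each model.
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

The ranking on synthetic data says nothing about the ranking on real patient data; the point is only that the same evaluation loop works for every candidate model.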
Machine learning offers powerful techniques for disease prediction in healthcare by analyzing various health indicators and lifestyle factors. This comprehensive analysis explored several machine learning algorithms, such as random forests, decision trees, XGBoost, and support vector machines, for building effective diabetes prediction models.
The random forest model emerged as the top performer, achieving an accuracy of 0.77 on the test dataset. We also gained valuable insights into feature importance, with glucose levels being the most influential predictor of diabetes in this dataset. Visualizing the data distributions, correlations, and outliers further enhanced our understanding.
While machine learning excels at diabetes prediction, we discussed alternative methods like logistic regression, naive Bayes, and k-nearest neighbors, each with strengths and limitations. Selecting the right approach depends on factors like dataset size, model interpretability needs, and the complexity of the underlying relationships.
Looking ahead, continued research integrating more extensive and more diverse patient datasets and exploring advanced neural network architectures holds immense potential for improving diabetes prediction accuracy. Additionally, deploying these predictive models in clinical settings can facilitate early intervention, risk stratification, and tailored treatment plans, ultimately improving outcomes for individuals at risk of developing diabetes.
Machine learning techniques such as decision trees, logistic regression, neural networks, and random forests are commonly used to predict diabetes. These algorithms examine data such as blood sugar levels and lifestyle choices to estimate the probability of developing diabetes.
Support Vector Machines (SVM) are often chosen to predict diabetes because of their capability to deal with intricate, high-dimensional datasets. An SVM separates individuals into diabetic and non-diabetic groups based on the input factors, which makes it a suitable basis for diabetes prediction models.
Yes, artificial intelligence, particularly through the use of machine learning, can help detect diabetes by analyzing the medical records and physical symptoms of patients. This enables early identification of at-risk individuals and facilitates the implementation of preventive measures and personalized healthcare plans.
I went through this Diabetes Prediction Using Machine Learning article and noticed the following. I am not sure I observed it correctly, but I noticed something in the code. Before model building, you used StandardScaler to scale the input features 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age' and assigned the result to the X variable. And from the diabetes_df_copy dataframe you took the Outcome feature into y. But when you started model building and split the data using train_test_split, you used the diabetes_df dataframe (which is not the imputed version of the dataframe) instead of the already scaled data in X. Also, before you imputed the data, you copied the data from diabetes_df into the diabetes_df_copy dataframe. I think you were supposed to use the scaled data in X for all the ML models. Can you please correct me if I observed incorrectly?
Good demonstration of a real-world machine learning process. As a clinician I have concerns about using this dataset without some medical expertise. For example, you can't have triceps thickness or insulin levels of zero. This means the test was not done. Is imputation legitimate in this situation? A pregnancy level of zero could mean it was not asked or it could mean no pregnancies. We don't know which. The column on pedigree is complicated and probably should be deleted and not used