In today’s AI-driven landscape, we’ve grown to rely on data-driven choices, making the lives of working professionals more straightforward. From managing supply chains to approving loans, data is the linchpin. The magic of data science extends to healthcare, offering promising outcomes. This article unfolds the human side of ‘Building a Medication Prediction Model from Medical Data Analysis.’ It explores the transformative journey where data science meets modern medicine, unveiling patterns that have the potential to revolutionize healthcare. Join us as we delve into a realm where insights and innovation converge, shaping a future prioritizing personalized treatments and compassionate care.
In this article, we’ll explore how to analyze a medical dataset to create a model that predicts which medications a patient should take when faced with a specific diagnosis. It sounds intriguing, so let’s dive right in!
This article was published as a part of the Data Science Blogathon.
We will be downloading and using the Open Source dataset from kaggle:
The dataset contains
Here, we shall import relevant libraries for the required for our exercise. Then we shall load the dataset into a dataframe and view few rows.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report
df = pd.read_csv("/kaggle/input/hospital-antibiotics-usage/Hopsital Dataset.csv")
print(df.shape)
df.head()
Output:
The column names are pretty intuitive and they do make sense. Let us examine by looking at a few statistics.
# "Let's look at some stats"
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 833 non-null object
1 Date of Data Entry 833 non-null object
2 Gender 833 non-null object
3 Diagnosis 833 non-null object
4 Name of Drug 833 non-null object
5 Dosage (gram) 833 non-null object
6 Route 833 non-null object
7 Frequency 833 non-null object
8 Duration (days) 833 non-null object
9 Indication 832 non-null object
dtypes: object(10)
memory usage: 65.2+ KB
Observations:
Let’s clean some columns. In the previous step, we saw that all columns are integers. So, for starters, we will convert Age, Dosage and Duration to numeric values. Similarly, we will convert the Date of Data Entry into datetime type. Instead of directly converting them, we will create new columns i.e. we will create a Age2 column that will be a numeric version of Age column and so on.
df['Age2'] = pd.to_numeric(df['Age'],errors='coerce')
df['Dosage (gram)2'] = pd.to_numeric(df['Dosage (gram)'],errors='coerce')
df['Duration (days)2'] = pd.to_numeric(df['Duration (days)'],errors='coerce')
df['Date of Data Entry2'] = pd.to_datetime(df['Date of Data Entry'],errors='coerce')
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 833 non-null object
1 Date of Data Entry 833 non-null object
2 Gender 833 non-null object
3 Diagnosis 833 non-null object
4 Name of Drug 833 non-null object
5 Dosage (gram) 833 non-null object
6 Route 833 non-null object
7 Frequency 833 non-null object
8 Duration (days) 833 non-null object
9 Indication 832 non-null object
10 Age2 832 non-null float64
11 Dosage (gram)2 831 non-null float64
12 Duration (days)2 831 non-null float64
13 Date of Data Entry2 831 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(10)
memory usage: 91.2+ KB
After converting, we see some null values. Let us look at that data.
df[(df['Dosage (gram)2'].isnull())
| (df['Duration (days)2'].isnull())
| (df['Age2'].isnull())
| (df['Date of Data Entry2'].isnull())
]#import csv
Output:
Seems like some garbage values in dataset. Let’s remove them.
Now, we will also substitute the values of new column in the older columns and drop the newly created columns
df = df[~((df['Dosage (gram)2'].isnull())
| (df['Duration (days)2'].isnull())
| (df['Age2'].isnull())
| (df['Date of Data Entry2'].isnull()))
]
df['Age'] = df['Age2'].astype('int')
df['Dosage (gram)'] = df['Dosage (gram)2']
df['Date of Data Entry'] = df['Date of Data Entry2']
df['Duration (days)'] = df['Duration (days)2'].astype('int')
df = df.drop(['Age2','Dosage (gram)2','Date of Data Entry2','Duration (days)2'],axis=1)
print(df.shape)
df.head()
Output:
Now, we shall look at some statistics across all columns. We will use the describe function and pass the include argument to get the values statistics across all columns. Let us examine that
df.describe(include='all')
Output:
Observations:
Here, we will try to examine the Route and Frequency column. We will use the value_counts() function.
display(df['Route'].value_counts())
print()
df['Frequency'].value_counts()
Output:
Route
IV 534
Oral 293
IM 4
Name: count, dtype: int64
Frequency
BD 430
TDS 283
OD 110
QID 8
Name: count, dtype: int64
Observations:
Here, we will examine the Diagnosis column. We will use the value_counts() function.
df['Diagnosis'].value_counts()
Output:
Upon running the above code, we see that there are 263 different values. Also, we can see that each patient can have multiple diagnosis (separated by comma). So, let us try to build a wordcloud to make more sense of this column. We will try to remove stopwords from the wordcloud to filter out unnecessary noise.
text = " ".join(diagnosis for diagnosis in adf.Diagnosis)
print ("There are {} words in the combination of all diagnosis.".format(len(text)))
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Output:
There are 35359 words in the combination of all diagnosis.
Observations:
Let’s look at the top and bottom 10 words/phrases. We will use the Counter function within the collections library.
A = collections.Counter([i.strip().lower()
for i in text.split(',') if (i.strip().lower()) not in stopwords ])
print('Top 10 words/phrases')
display(A.most_common(10))
print('\nBottom 10 words/phrases')
display(A.most_common()[-11:-1])
Output:
Top 10 words/phrases
[('col', 77),
('chest infection', 68),
('ihd', 55),
('copd', 40),
('hypertension', 38),
('ccf', 36),
('type 2 dm', 32),
("koch's lung", 28),
('ckd', 28),
('uraemic gastritis', 18)]
Bottom 10 words/phrases
[('poorly controlled dm complicated uti', 1),
('poorly controlled dm septic shock', 1),
('hypertension hepatitis', 1),
('uti hepatitis', 1),
('uti uti', 1),
('cff ppt by chest infection early col', 1),
('operated spinal cord', 1),
('hematoma and paraplegia he', 1),
('rvi with col glandular fever', 1),
('fever with confusion ccf', 1)]
Observations:
Here, we will examine the Indication column. We will use the value_counts() function.
display(df['Indication'].value_counts())
print()
df['Indication'].value_counts(1)
Output:
Indication
chest infection 92
col 32
uti 30
type 2 dm 25
prevention of infection 22
..
pad(lt u.l) 1
old stroke 1
fainting attack 1
cheat infection 1
centepede bite 1
Name: count, Length: 220, dtype: int64
Indication
chest infection 0.110843
col 0.038554
uti 0.036145
type 2 dm 0.030120
prevention of infection 0.026506
...
pad(lt u.l) 0.001205
old stroke 0.001205
fainting attack 0.001205
cheat infection 0.001205
centepede bite 0.001205
Name: proportion, Length: 220, dtype: float64
We see that there are 220 values of this Indication column. Seems like some spelling errors are prevalent as well – chest vs cheat. We can clean the same. For current exercise – let us consider top 25% indications.
top_indications = df['Indication'].value_counts(1).reset_index()
top_indications['cum_proportion'] = top_indications['proportion'].cumsum()
top_indications = top_indications[top_indications['cum_proportion']<0.25]
top_indications
Output:
Indication | proportion | cum_proportion | |
---|---|---|---|
0 | chest infection | 0.110843 | 0.110843 |
1 | col | 0.038554 | 0.149398 |
2 | uti | 0.036145 | 0.185542 |
3 | type 2 dm | 0.030120 | 0.215663 |
4 | prevention of infection | 0.026506 | 0.242169 |
We will use this dataframe in the bivariate analysis and modeling exercise later.
Here, we will examine this column. We will use the value_counts() function.
display(df['Name of Drug'].nunique())
display(df['Name of Drug'].value_counts())
Output:
55
Name of Drug
ceftriaxone 221
co-amoxiclav 162
metronidazole 59
cefixime 58
septrin 37
clarithromycin 32
levofloxacin 31
amoxicillin+flucloxacillin 29
ceftazidime 24
cefepime 14
cefipime 13
clindamycin 12
rifaximin 10
amikacin 9
cefoperazone 9
coamoxiclav 9
meropenem 8
ciprofloxacin 7
gentamicin 5
pen v 5
rifampicin 5
azithromycin 5
cifran 4
mirox 4
amoxicillin 4
streptomycin 4
ceftazidine 4
clarthromycin 4
amoxicillin+flucoxacillin 3
cefoparazone+sulbactam 3
linezolid 3
ofloxacin 3
norfloxacin 3
imipenem 2
flucloxacillin 2
ceftiaxone 2
cefaziclime 2
ceftriaxone+sulbactam 2
cefexime 2
pipercillin+tazobactam 1
amoxicillin+flucloxiacillin 1
pentoxifylline 1
menopem 1
levefloxacin 1
pentoxyfylline 1
doxycyclin 1
amoxicillin+flucoxiacillin 1
vancomycin 1
cefteiaxone 1
dazolic 1
amoxicillin+flucloaxcin 1
amoxiclav 1
doxycycline 1
cefoperazone+sulbactam 1
nitrofurantoin 1
Observations:
Let us consider the top 5 drugs here. We will create a column cum_proportion to store the cumulative proportion that each drug is contributing.
top_drugs = (df['Name of Drug'].value_counts(1).reset_index())
top_drugs['cum_proportion'] = top_drugs['proportion'].cumsum()
top_drugs = top_drugs.head()
top_drugs
Output:
Name of Drug | proportion | cum_proportion | |
---|---|---|---|
0 | ceftriaxone | 0.265945 | 0.265945 |
1 | co-amoxiclav | 0.194946 | 0.460890 |
2 | metronidazole | 0.070999 | 0.531889 |
3 | cefixime | 0.069795 | 0.601685 |
4 | septrin | 0.044525 | 0.646209 |
The top 5 drugs are given to 64.6% of the patients.
PS : If we correct the spelling errors – this could be more than 64.6%.
We will use this dataframe in the bivariate analysis and modelling exercise later.
Here, we will consider the top_indications and the top_drugs dataframes we created and try to see the distribution across them i.e. we compare top 5 drug name vs top 25% Indication. We will use pivot_table() function.
(df[(df['Indication'].isin(top_indications['Indication']))
&(df['Name of Drug'].isin(top_drugs['Name of Drug']))]
.pivot_table(index='Indication',columns='Name of Drug',values='Age',aggfunc='count')
)
Output:
Name of Drug | cefixime | ceftriaxone | co-amoxiclav | metronidazole | septrin |
---|---|---|---|---|---|
Indication | |||||
chest infection | 7.0 | 22.0 | 27.0 | 3.0 | 1.0 |
col | 1.0 | 14.0 | 3.0 | 3.0 | 1.0 |
prevention of infection | 2.0 | 6.0 | 3.0 | 2.0 | 1.0 |
type 2 dm | 2.0 | 7.0 | 4.0 | 1.0 | NaN |
uti | 3.0 | 9.0 | 2.0 | NaN | NaN |
Observations:
Note: These medicines are prescribed under several other factors – like the Age, Gender, Diagnosis, and Medical history of the patient.
Here, we will try to understand if a few conditions appear for older patients. We will consider the top_indications and examine the mean and median values of age.
(df[df['Indication'].isin(top_indications['Indication'])]
.groupby('Indication')['Age']
.agg(['mean','median','count'])
)
Output:
mean | median | count | |
---|---|---|---|
Indication | |||
chest infection | 57.173913 | 61.5 | 92 |
col | 48.031250 | 48.0 | 32 |
prevention of infection | 42.136364 | 46.0 | 22 |
type 2 dm | 63.560000 | 62.0 | 25 |
uti | 50.233333 | 53.5 | 30 |
Observations:
Here, we will try to understand if specific drugs are used for older patients. We will consider the top_drugs and examine the mean and median values of age.
(df[df['Name of Drug'].isin(top_drugs['Name of Drug'])]
.groupby('Name of Drug')['Age']
.agg(['mean','median','count']).sort_values(by='median')
)
Output:
mean | median | count | |
---|---|---|---|
Name of Drug | |||
septrin | 44.513514 | 40.0 | 37 |
cefixime | 43.137931 | 42.0 | 58 |
ceftriaxone | 50.484163 | 49.0 | 221 |
metronidazole | 53.661017 | 54.0 | 59 |
co-amoxiclav | 56.518519 | 60.0 | 162 |
Observations:
Here, we will try to help the Pharmacist or the Prescribing Doctor. The problem statement is to identify which Drug will be given to the patient basis the Diagnosis, Age and Gender column.
Few considerations and assumptions:
Let us see if we can try to beat that. We will create a copy of the dataframe for the modelling exercise.
adf = df.copy()
adf['Output'] = np.where(df['Name of Drug'].isin(top_drugs['Name of Drug']),
df['Name of Drug'],'Other')
adf['Output'].value_counts()
Output:
Output
Other 294
ceftriaxone 221
co-amoxiclav 162
metronidazole 59
cefixime 58
septrin 37
Name: count, dtype: int64
We have seen this distribution earlier as well. It means that the classes are not evenly distributed. We will need to keep this in mind while initializing the model.
We will try to capture more frequent words/ngrams in the diagnosis column. We will use the Count Vectorizer module.
vectorizer = CountVectorizer(max_features=150,stop_words='english',
ngram_range=(1,3))
X = vectorizer.fit_transform(adf['Diagnosis'].str.lower())
vectorizer.get_feature_names_out()
Output:
array(['abscess', 'acute', 'acute ge', 'af', 'aki', 'aki ckd',
'aki ckd retro', 'alcoholic', 'anaemia', 'art', 'bite', 'bleeding',
'bone', 'ca', 'cap', 'ccf', 'ccf chest', 'ccf chest infection',
'ccf increased', 'ccf increased lft', 'ccf koch', 'ccf koch lung',
'cerebral', 'cerebral infarct', 'chest', 'chest infection',
'chest infection pre', 'chronic', 'ckd', 'ckd chest infection',
'ckd retro', 'col', 'col portal', 'col portal hypertension',
'copd', 'copd chest', 'copd chest infection', 'debility',
'debility excessive', 'debility excessive vomitting', 'diabetes',
'disease', 'disease renal', 'disease renal impairment', 'dm',
'dm ihd', 'edema', 'effusion', 'encephalopathy', 'excessive',
'excessive vomitting', 'excessive vomitting uraemic', 'failure',
'fever', 'gastritis', 'gastritis hcv', 'gastritis hcv aki', 'ge',
'general', 'general debility', 'general debility excessive',
'gi bleeding', 'hcv', 'hcv aki', 'hcv aki ckd', 'hcv col', 'heart',
'hepatic', 'hepatic encephalopathy', 'hepatitis', 'ht',
'ht disease', 'ht disease renal', 'hypertension', 'ihd',
'impairment', 'impairment koch', 'impairment koch lungs',
'increased', 'increased lft', 'infarct', 'infection',
'infection pre', 'infection pre diabetes', 'koch', 'koch lung',
'koch lung copd', 'koch lungs', 'koch lungs ccf', 'left',
'left sided', 'leg', 'lft', 'lung', 'lung copd', 'lung copd chest',
'lungs', 'lungs ccf', 'lungs ccf increased', 'marrow', 'multiple',
'multiple myeloma', 'multiple myeloma ckd', 'myeloma',
'myeloma ckd', 'newly', 'old', 'pleural', 'pleural effusion',
'pneumonia', 'portal', 'portal hypertension', 'pre',
'pre diabetes', 'pulmonary', 'pulmonary edema', 'renal',
'renal impairment', 'renal impairment koch', 'retro', 'right',
'rvi', 'rvi stage', 'rvi stage ht', 'septic', 'septic shock',
'severe', 'severe anaemia', 'severe anaemia multiple', 'shock',
'sided', 'snake', 'snake bite', 'stage', 'stage ht',
'stage ht disease', 'stroke', 'tb', 'type', 'type dm',
'type dm ihd', 'uraemic', 'uraemic gastritis',
'uraemic gastritis hcv', 'uti', 'uti type', 'uti type dm',
'vomitting', 'vomitting uraemic', 'vomitting uraemic gastritis'],
dtype=object)
The above list showcases the top 150 ngrams observed in the diagnosis column after removing the stopwords.
Here, we will create a single dataframe containing the features we just made and the age and gender column as input. We will use Label Encoder to convert the Drug Names into numeric values.
feature_df = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
feature_df['Age'] = adf['Age'].fillna(0).astype('int')
feature_df['Gender_Male'] = np.where(adf['Gender']=='Male',1,0)
le = LabelEncoder()
feature_df['Output'] = le.fit_transform(adf['Output'])
Now, we will do a train test split. We will keep 20% of the data as the test set. We will use the random_state argument to ensure reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
feature_df.drop('Output',axis=1).fillna(-1),
feature_df['Output'],
test_size=0.2, random_state=42)
Here, I have tried using the Random Forest model. You can try with other models as well. We will use the random_state argument to ensure reproducibility. We will use the class_weight parameter since we had seen earlier that the classes are not evenly distributed.
clf = RandomForestClassifier(max_depth=6, random_state=0,
class_weight='balanced')
clf.fit(X_train, y_train)
Let us see the accuracy and other metrics for the training dataset,
# accuracy on X_train data
final_accuracy = clf.score(X_train, y_train)
print("final_accuracy is : ",final_accuracy)
# creating a confusion matrix for determining and visualizing the accuracy score
clf_predict = clf.predict(X_train)
print(classification_report(y_train, clf_predict,
target_names=list(mapping_df['Actual_Name'])))
Output:
final_accuracy is : 0.411144578313253
precision recall f1-score support
Other 0.84 0.30 0.44 236
cefixime 0.16 0.86 0.27 49
ceftriaxone 0.74 0.26 0.39 176
co-amoxiclav 0.48 0.47 0.48 127
metronidazole 0.39 0.59 0.47 49
septrin 0.45 0.93 0.61 27
accuracy 0.41 664
macro avg 0.51 0.57 0.44 664
weighted avg 0.64 0.41 0.43 664
Similarly, let us see accuracy and other metrics for the test dataset.
# accuracy on X_test data
final_accuracy = clf.score(X_test, y_test)
print("final_accuracy is : ",final_accuracy)
# creating a confusion matrix for determining and visualizing the accuracy score
clf_predict = clf.predict(X_test)
print(classification_report(y_test, clf_predict,
target_names=list(mapping_df['Actual_Name'])))
Output:
final_accuracy is : 0.38323353293413176
precision recall f1-score support
Other 0.71 0.38 0.49 58
cefixime 0.08 0.56 0.14 9
ceftriaxone 0.36 0.09 0.14 45
co-amoxiclav 0.64 0.71 0.68 35
metronidazole 0.31 0.40 0.35 10
septrin 0.44 0.40 0.42 10
accuracy 0.38 167
macro avg 0.42 0.42 0.37 167
weighted avg 0.53 0.38 0.41 167
Key Observations:
In this article, we analyzed a medical dataset end to end. Then, we cleaned the dataset. We saw basic statistics, distribution, and even a wordcloud to understand the columns. Then, we created a problem statement to help the Pharmacist or the Prescribing Doctor with the modern medicine. It was to identify which Drug would be given to the patient basis the Diagnosis, Age, and Gender column.
Some things to try out to improve the model performance:
Thanks for reading my article. Feel free to connect with me on LinkedIn to discuss this.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.