Building a Medication Prediction Model from Medical Data Analysis

aakash93 Last Updated : 12 Dec, 2023

12 min read

Introduction

In today’s AI-driven landscape, we’ve grown to rely on data-driven choices, making the lives of working professionals more straightforward. From managing supply chains to approving loans, data is the linchpin. The magic of data science extends to healthcare, offering promising outcomes. This article unfolds the human side of ‘Building a Medication Prediction Model from Medical Data Analysis.’ It explores the transformative journey where data science meets modern medicine, unveiling patterns that have the potential to revolutionize healthcare. Join us as we delve into a realm where insights and innovation converge, shaping a future prioritizing personalized treatments and compassionate care.

medication prediction model | Data Science

Learning Objectives

In this article, we’ll explore how to analyze a medical dataset to create a model that predicts which medications a patient should take when faced with a specific diagnosis. It sounds intriguing, so let’s dive right in!

This article was published as a part of the Data Science Blogathon.

Dataset
Loading Libraries and Data set
Basic Statistics
Data Preprocessing
Looking at Some Stats
Univariate EDA Route and Frequency Column
Univariate EDA Diagnosis Column
Univariate EDA Indication Column
Univariate EDA Name of Drug column
Bivariate Analysis Indication vs Name of Drug
Bivariate Analysis Indication vs Age
Bivariate Analysis Name of Drug vs Age
Modeling Approach
Feature Engineering
Dataset Creation
Modelling

Dataset

We will be downloading and using the Open Source dataset from kaggle:

Link here

The dataset contains

Age and gender of the patient
Diagnosis of the patient
Antibiotics used to treat patient
Dosage of the antibiotics in grams
Route of application of antibiotics
Frequency of usage of antibiotics
Duration of treatment using antibiotics in days
Indiction of antibiotics

Loading Libraries and Data set

Here, we shall import relevant libraries for the required for our exercise. Then we shall load the dataset into a dataframe and view few rows.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,classification_report

df = pd.read_csv("/kaggle/input/hospital-antibiotics-usage/Hopsital Dataset.csv")
print(df.shape)
df.head()

Output:

The column names are pretty intuitive and they do make sense. Let us examine by looking at a few statistics.

Basic Statistics

# "Let's look at some stats"
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Age                 833 non-null    object
 1   Date of Data Entry  833 non-null    object
 2   Gender              833 non-null    object
 3   Diagnosis           833 non-null    object
 4   Name of Drug        833 non-null    object
 5   Dosage (gram)       833 non-null    object
 6   Route               833 non-null    object
 7   Frequency           833 non-null    object
 8   Duration (days)     833 non-null    object
 9   Indication          832 non-null    object
dtypes: object(10)
memory usage: 65.2+ KB

Observations:

Every column is a string column
Let’s convert column Age, Dosage(gram), Duration(days) to numeric.
Let’s also convert Date of Data Entry into date time.
Indication has one null value, all other column don’t have null values.

Data Preprocessing

Let’s clean some columns. In the previous step, we saw that all columns are integers. So, for starters, we will convert Age, Dosage and Duration to numeric values. Similarly, we will convert the Date of Data Entry into datetime type. Instead of directly converting them, we will create new columns i.e. we will create a Age2 column that will be a numeric version of Age column and so on.

df['Age2'] = pd.to_numeric(df['Age'],errors='coerce')
df['Dosage (gram)2'] = pd.to_numeric(df['Dosage (gram)'],errors='coerce')
df['Duration (days)2'] = pd.to_numeric(df['Duration (days)'],errors='coerce')
df['Date of Data Entry2'] = pd.to_datetime(df['Date of Data Entry'],errors='coerce')
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Age                  833 non-null    object        
 1   Date of Data Entry   833 non-null    object        
 2   Gender               833 non-null    object        
 3   Diagnosis            833 non-null    object        
 4   Name of Drug         833 non-null    object        
 5   Dosage (gram)        833 non-null    object        
 6   Route                833 non-null    object        
 7   Frequency            833 non-null    object        
 8   Duration (days)      833 non-null    object        
 9   Indication           832 non-null    object        
 10  Age2                 832 non-null    float64       
 11  Dosage (gram)2       831 non-null    float64       
 12  Duration (days)2     831 non-null    float64       
 13  Date of Data Entry2  831 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(10)
memory usage: 91.2+ KB

After converting, we see some null values. Let us look at that data.

df[(df['Dosage (gram)2'].isnull())
  | (df['Duration (days)2'].isnull())
  | (df['Age2'].isnull())
  | (df['Date of Data Entry2'].isnull())
  ]#import csv

Output:

Seems like some garbage values in dataset. Let’s remove them.

Now, we will also substitute the values of new column in the older columns and drop the newly created columns

df = df[~((df['Dosage (gram)2'].isnull())
  | (df['Duration (days)2'].isnull())
  | (df['Age2'].isnull())
  | (df['Date of Data Entry2'].isnull()))
  ]

df['Age'] = df['Age2'].astype('int')
df['Dosage (gram)'] = df['Dosage (gram)2']
df['Date of Data Entry'] = df['Date of Data Entry2']
df['Duration (days)'] = df['Duration (days)2'].astype('int')
df = df.drop(['Age2','Dosage (gram)2','Date of Data Entry2','Duration (days)2'],axis=1)

print(df.shape)
df.head()

Output:

Looking at Some Stats

Now, we shall look at some statistics across all columns. We will use the describe function and pass the include argument to get the values statistics across all columns. Let us examine that

df.describe(include='all')

Output:

Observations:

We can see that the age is 1, and the max is 90.
Similarly, looking at the Data Entry column – we know that it contains data for 19-Dec-2019 between 1 and 7 pm.
Gender has 2 unique values.
Dosage has a minimum value of 0.02 grams and a max weight of 960 grams (seems extreme).
Duration has a value of 1 and a max value of 28 days.
Diagnosis, Name of Drug, Route, Frequency, and Indication columns have different values. We will examine them.

Univariate EDA Route and Frequency Column

Here, we will try to examine the Route and Frequency column. We will use the value_counts() function.

display(df['Route'].value_counts())
print()
df['Frequency'].value_counts()

Output:

Route
IV      534
Oral    293
IM        4
Name: count, dtype: int64

Frequency
BD     430
TDS    283
OD     110
QID      8
Name: count, dtype: int64

Observations:

There are 3 different routes of Drug treatment. The most used Route of Drug treatment is IV, and the least used is IM.
There are 4 different Frequency of Drug treatment. BD is the most used Frequency of Drug treatment, and the least used is QID.

Univariate EDA Diagnosis Column

Here, we will examine the Diagnosis column. We will use the value_counts() function.

df['Diagnosis'].value_counts()

Output:

Upon running the above code, we see that there are 263 different values. Also, we can see that each patient can have multiple diagnosis (separated by comma). So, let us try to build a wordcloud to make more sense of this column. We will try to remove stopwords from the wordcloud to filter out unnecessary noise.

text = " ".join(diagnosis for diagnosis in adf.Diagnosis)
print ("There are {} words in the combination of all diagnosis.".format(len(text)))

stopwords = set(STOPWORDS)

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Output:

There are 35359 words in the combination of all diagnosis.

combination of diagnosis | Modern Medicine | Data Science

Observations:

There are 35359 words in the combination of all diagnosis. A lot of them will be repetitive.
Looking at the wordcloud, Chest infection and koch lung seem to be more frequent diagnosis.
We can see other conditions like – diabetes, myeloma, etc.

Let’s look at the top and bottom 10 words/phrases. We will use the Counter function within the collections library.

A = collections.Counter([i.strip().lower() 
  for i in text.split(',') if (i.strip().lower()) not in stopwords ])
print('Top 10 words/phrases')
display(A.most_common(10))
print('\nBottom 10 words/phrases')
display(A.most_common()[-11:-1])

Output:

Top 10 words/phrases
[('col', 77),
 ('chest infection', 68),
 ('ihd', 55),
 ('copd', 40),
 ('hypertension', 38),
 ('ccf', 36),
 ('type 2 dm', 32),
 ("koch's lung", 28),
 ('ckd', 28),
 ('uraemic gastritis', 18)]

Bottom 10 words/phrases
[('poorly controlled dm complicated uti', 1),
 ('poorly controlled dm septic shock', 1),
 ('hypertension hepatitis', 1),
 ('uti hepatitis', 1),
 ('uti uti', 1),
 ('cff ppt by chest infection early col', 1),
 ('operated spinal cord', 1),
 ('hematoma and paraplegia he', 1),
 ('rvi with col glandular fever', 1),
 ('fever with confusion ccf', 1)]

Observations:

Among the top 10 words/phrases, col and chest infections are more frequent. 77 patients are diagnosed with col, and 68 are diagnosed with chest infection.
Among the bottom 10 words/phrases, we see that each of the listed ones occurs only once.

Univariate EDA Indication Column

Here, we will examine the Indication column. We will use the value_counts() function.

display(df['Indication'].value_counts())
print()
df['Indication'].value_counts(1)

Output:

Indication
chest infection            92
col                        32
uti                        30
type 2 dm                  25
prevention of infection    22
                           ..
pad(lt u.l)                 1
old stroke                  1
fainting attack             1
cheat infection             1
centepede bite              1
Name: count, Length: 220, dtype: int64

Indication
chest infection            0.110843
col                        0.038554
uti                        0.036145
type 2 dm                  0.030120
prevention of infection    0.026506
                             ...   
pad(lt u.l)                0.001205
old stroke                 0.001205
fainting attack            0.001205
cheat infection            0.001205
centepede bite             0.001205
Name: proportion, Length: 220, dtype: float64

We see that there are 220 values of this Indication column. Seems like some spelling errors are prevalent as well – chest vs cheat. We can clean the same. For current exercise – let us consider top 25% indications.

top_indications = df['Indication'].value_counts(1).reset_index()
top_indications['cum_proportion'] = top_indications['proportion'].cumsum()
top_indications = top_indications[top_indications['cum_proportion']<0.25]
top_indications

Output:

	Indication	proportion	cum_proportion
0	chest infection	0.110843	0.110843
1	col	0.038554	0.149398
2	uti	0.036145	0.185542
3	type 2 dm	0.030120	0.215663
4	prevention of infection	0.026506	0.242169

We will use this dataframe in the bivariate analysis and modeling exercise later.

Univariate EDA Name of Drug column

Here, we will examine this column. We will use the value_counts() function.

display(df['Name of Drug'].nunique())
display(df['Name of Drug'].value_counts())

Output:

55

Name of Drug
ceftriaxone                    221
co-amoxiclav                   162
metronidazole                   59
cefixime                        58
septrin                         37
clarithromycin                  32
levofloxacin                    31
amoxicillin+flucloxacillin      29
ceftazidime                     24
cefepime                        14
cefipime                        13
clindamycin                     12
rifaximin                       10
amikacin                         9
cefoperazone                     9
coamoxiclav                      9
meropenem                        8
ciprofloxacin                    7
gentamicin                       5
pen v                            5
rifampicin                       5
azithromycin                     5
cifran                           4
mirox                            4
amoxicillin                      4
streptomycin                     4
ceftazidine                      4
clarthromycin                    4
amoxicillin+flucoxacillin        3
cefoparazone+sulbactam           3
linezolid                        3
ofloxacin                        3
norfloxacin                      3
imipenem                         2
flucloxacillin                   2
ceftiaxone                       2
cefaziclime                      2
ceftriaxone+sulbactam            2
cefexime                         2
pipercillin+tazobactam           1
amoxicillin+flucloxiacillin      1
pentoxifylline                   1
menopem                          1
levefloxacin                     1
pentoxyfylline                   1
doxycyclin                       1
amoxicillin+flucoxiacillin       1
vancomycin                       1
cefteiaxone                      1
dazolic                          1
amoxicillin+flucloaxcin          1
amoxiclav                        1
doxycycline                      1
cefoperazone+sulbactam           1
nitrofurantoin                   1

Observations:

There are 55 unique drugs.
Again, we see spelling errors – ceftriaxone vs ceftiaxone. The unique drug count may be lower.

Let us consider the top 5 drugs here. We will create a column cum_proportion to store the cumulative proportion that each drug is contributing.

top_drugs = (df['Name of Drug'].value_counts(1).reset_index())
top_drugs['cum_proportion'] = top_drugs['proportion'].cumsum()
top_drugs = top_drugs.head()
top_drugs

Output:

	Name of Drug	proportion	cum_proportion
0	ceftriaxone	0.265945	0.265945
1	co-amoxiclav	0.194946	0.460890
2	metronidazole	0.070999	0.531889
3	cefixime	0.069795	0.601685
4	septrin	0.044525	0.646209

The top 5 drugs are given to 64.6% of the patients.

PS : If we correct the spelling errors – this could be more than 64.6%.

We will use this dataframe in the bivariate analysis and modelling exercise later.

Bivariate Analysis Indication vs Name of Drug

Here, we will consider the top_indications and the top_drugs dataframes we created and try to see the distribution across them i.e. we compare top 5 drug name vs top 25% Indication. We will use pivot_table() function.

(df[(df['Indication'].isin(top_indications['Indication']))
   &(df['Name of Drug'].isin(top_drugs['Name of Drug']))]
.pivot_table(index='Indication',columns='Name of Drug',values='Age',aggfunc='count')
)

Output:

Name of Drug	cefixime	ceftriaxone	co-amoxiclav	metronidazole	septrin
Indication
chest infection	7.0	22.0	27.0	3.0	1.0
col	1.0	14.0	3.0	3.0	1.0
prevention of infection	2.0	6.0	3.0	2.0	1.0
type 2 dm	2.0	7.0	4.0	1.0	NaN
uti	3.0	9.0	2.0	NaN	NaN

Observations:

for a chest infection, co-amoxiclav is the most recommended medicine, followed by ceftriaxone.
Similar observations can be drawn for other indications.

Note: These medicines are prescribed under several other factors – like the Age, Gender, Diagnosis, and Medical history of the patient.

Bivariate Analysis Indication vs Age

Here, we will try to understand if a few conditions appear for older patients. We will consider the top_indications and examine the mean and median values of age.

(df[df['Indication'].isin(top_indications['Indication'])]
.groupby('Indication')['Age']
.agg(['mean','median','count'])
)

Output:

	mean	median	count
Indication
chest infection	57.173913	61.5	92
col	48.031250	48.0	32
prevention of infection	42.136364	46.0	22
type 2 dm	63.560000	62.0	25
uti	50.233333	53.5	30

Observations:

Indications – col and prevention of infection are more observed in younger patients.
uti is observed in mid-age patients
Indications: chest infection and type 2 dm are more observed in older patients.

Bivariate Analysis Name of Drug vs Age

Here, we will try to understand if specific drugs are used for older patients. We will consider the top_drugs and examine the mean and median values of age.

(df[df['Name of Drug'].isin(top_drugs['Name of Drug'])]
.groupby('Name of Drug')['Age']
.agg(['mean','median','count']).sort_values(by='median')
)

Output:

	mean	median	count
Name of Drug
septrin	44.513514	40.0	37
cefixime	43.137931	42.0	58
ceftriaxone	50.484163	49.0	221
metronidazole	53.661017	54.0	59
co-amoxiclav	56.518519	60.0	162

Observations:

septrin and cefixime is prescribed to younger patients
ceftriaxone is prescribed to mid-age patients
metronidazole and co-amoxiclav is prescribed to older patients

Modeling Approach

Here, we will try to help the Pharmacist or the Prescribing Doctor. The problem statement is to identify which Drug will be given to the patient basis the Diagnosis, Age and Gender column.

Few considerations and assumptions:

For this exercise – we will consider only the top 5 drugs and mark the rest as Others.
So, for each value of (Diagnosis, Age, Gender) – the objective is to identify the drug that will be recommended. The baseline model is (1/6) = 16.67% accuracy.

Let us see if we can try to beat that. We will create a copy of the dataframe for the modelling exercise.

adf = df.copy()
adf['Output'] = np.where(df['Name of Drug'].isin(top_drugs['Name of Drug']),
                    df['Name of Drug'],'Other')
adf['Output'].value_counts()

Output:

Output
Other            294
ceftriaxone      221
co-amoxiclav     162
metronidazole     59
cefixime          58
septrin           37
Name: count, dtype: int64

We have seen this distribution earlier as well. It means that the classes are not evenly distributed. We will need to keep this in mind while initializing the model.

Feature Engineering

We will try to capture more frequent words/ngrams in the diagnosis column. We will use the Count Vectorizer module.

vectorizer = CountVectorizer(max_features=150,stop_words='english',
              ngram_range=(1,3))
X = vectorizer.fit_transform(adf['Diagnosis'].str.lower())
vectorizer.get_feature_names_out()

Output:

array(['abscess', 'acute', 'acute ge', 'af', 'aki', 'aki ckd',
       'aki ckd retro', 'alcoholic', 'anaemia', 'art', 'bite', 'bleeding',
       'bone', 'ca', 'cap', 'ccf', 'ccf chest', 'ccf chest infection',
       'ccf increased', 'ccf increased lft', 'ccf koch', 'ccf koch lung',
       'cerebral', 'cerebral infarct', 'chest', 'chest infection',
       'chest infection pre', 'chronic', 'ckd', 'ckd chest infection',
       'ckd retro', 'col', 'col portal', 'col portal hypertension',
       'copd', 'copd chest', 'copd chest infection', 'debility',
       'debility excessive', 'debility excessive vomitting', 'diabetes',
       'disease', 'disease renal', 'disease renal impairment', 'dm',
       'dm ihd', 'edema', 'effusion', 'encephalopathy', 'excessive',
       'excessive vomitting', 'excessive vomitting uraemic', 'failure',
       'fever', 'gastritis', 'gastritis hcv', 'gastritis hcv aki', 'ge',
       'general', 'general debility', 'general debility excessive',
       'gi bleeding', 'hcv', 'hcv aki', 'hcv aki ckd', 'hcv col', 'heart',
       'hepatic', 'hepatic encephalopathy', 'hepatitis', 'ht',
       'ht disease', 'ht disease renal', 'hypertension', 'ihd',
       'impairment', 'impairment koch', 'impairment koch lungs',
       'increased', 'increased lft', 'infarct', 'infection',
       'infection pre', 'infection pre diabetes', 'koch', 'koch lung',
       'koch lung copd', 'koch lungs', 'koch lungs ccf', 'left',
       'left sided', 'leg', 'lft', 'lung', 'lung copd', 'lung copd chest',
       'lungs', 'lungs ccf', 'lungs ccf increased', 'marrow', 'multiple',
       'multiple myeloma', 'multiple myeloma ckd', 'myeloma',
       'myeloma ckd', 'newly', 'old', 'pleural', 'pleural effusion',
       'pneumonia', 'portal', 'portal hypertension', 'pre',
       'pre diabetes', 'pulmonary', 'pulmonary edema', 'renal',
       'renal impairment', 'renal impairment koch', 'retro', 'right',
       'rvi', 'rvi stage', 'rvi stage ht', 'septic', 'septic shock',
       'severe', 'severe anaemia', 'severe anaemia multiple', 'shock',
       'sided', 'snake', 'snake bite', 'stage', 'stage ht',
       'stage ht disease', 'stroke', 'tb', 'type', 'type dm',
       'type dm ihd', 'uraemic', 'uraemic gastritis',
       'uraemic gastritis hcv', 'uti', 'uti type', 'uti type dm',
       'vomitting', 'vomitting uraemic', 'vomitting uraemic gastritis'],
      dtype=object)

The above list showcases the top 150 ngrams observed in the diagnosis column after removing the stopwords.

Dataset Creation

Here, we will create a single dataframe containing the features we just made and the age and gender column as input. We will use Label Encoder to convert the Drug Names into numeric values.

feature_df = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
feature_df['Age'] = adf['Age'].fillna(0).astype('int')
feature_df['Gender_Male'] = np.where(adf['Gender']=='Male',1,0)


le = LabelEncoder()
feature_df['Output'] = le.fit_transform(adf['Output'])

Now, we will do a train test split. We will keep 20% of the data as the test set. We will use the random_state argument to ensure reproducibility.

X_train, X_test, y_train, y_test = train_test_split(
  feature_df.drop('Output',axis=1).fillna(-1), 
  feature_df['Output'], 
  test_size=0.2, random_state=42)

Modelling

Here, I have tried using the Random Forest model. You can try with other models as well. We will use the random_state argument to ensure reproducibility. We will use the class_weight parameter since we had seen earlier that the classes are not evenly distributed.

clf = RandomForestClassifier(max_depth=6, random_state=0, 
  class_weight='balanced')
clf.fit(X_train, y_train)

Let us see the accuracy and other metrics for the training dataset,

# accuracy on X_train data
final_accuracy = clf.score(X_train, y_train)
print("final_accuracy is : ",final_accuracy)

# creating a confusion matrix for determining and visualizing the accuracy score
clf_predict = clf.predict(X_train)
print(classification_report(y_train, clf_predict,
                            target_names=list(mapping_df['Actual_Name'])))

Output:

final_accuracy is :  0.411144578313253
               precision    recall  f1-score   support

        Other       0.84      0.30      0.44       236
     cefixime       0.16      0.86      0.27        49
  ceftriaxone       0.74      0.26      0.39       176
 co-amoxiclav       0.48      0.47      0.48       127
metronidazole       0.39      0.59      0.47        49
      septrin       0.45      0.93      0.61        27

     accuracy                           0.41       664
    macro avg       0.51      0.57      0.44       664
 weighted avg       0.64      0.41      0.43       664

Similarly, let us see accuracy and other metrics for the test dataset.

# accuracy on X_test data
final_accuracy = clf.score(X_test, y_test)
print("final_accuracy is : ",final_accuracy)

# creating a confusion matrix for determining and visualizing the accuracy score
clf_predict = clf.predict(X_test)
print(classification_report(y_test, clf_predict,
                            target_names=list(mapping_df['Actual_Name'])))

Output:

final_accuracy is :  0.38323353293413176
               precision    recall  f1-score   support

        Other       0.71      0.38      0.49        58
     cefixime       0.08      0.56      0.14         9
  ceftriaxone       0.36      0.09      0.14        45
 co-amoxiclav       0.64      0.71      0.68        35
metronidazole       0.31      0.40      0.35        10
      septrin       0.44      0.40      0.42        10

     accuracy                           0.38       167
    macro avg       0.42      0.42      0.37       167
 weighted avg       0.53      0.38      0.41       167

Key Observations:

The model is 41.11% accurate on train data and 38.32% on test data.
If we look at the f1-score, we see that for Other drug names – the training dataset had a value of 0.44, and the test dataset had a value of 0.49
We can also see that the test f1-score is lower for cefixime and ceftriaxone. While cefixime has only 9 samples, ceftriaxone has 45 samples. So, we need to analyze these data points to understand the scope of improving our feature sets.

Conclusion

In this article, we analyzed a medical dataset end to end. Then, we cleaned the dataset. We saw basic statistics, distribution, and even a wordcloud to understand the columns. Then, we created a problem statement to help the Pharmacist or the Prescribing Doctor with the modern medicine. It was to identify which Drug would be given to the patient basis the Diagnosis, Age, and Gender column.

Key Takeaways

We considered the top 150 words/bigrams/trigrams from the Diagnosis column for the modeling exercise. Let’s say we dissected the diagnosis column and created 150 features.
We had the Age and Gender feature as well. So, in total, we had 152 features.
Using these 152 features, we tried to determine which medicine/drug would be prescribed.
The baseline/random model was 16.67% accurate. Our model was 41.11% accurate on train data and 38.32% on test data. That is a significant improvement over the baseline model.

Some things to try out to improve the model performance:

Try TF-IFD or embeddings to extract features.
Try different models and use grid search and cross-validation to optimize accuracy.
Try predicting more medicines.

Thanks for reading my article. Feel free to connect with me on LinkedIn to discuss this.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

aakash93

Data Scientist with extensive experience in solving many real world business problems across different domains. Possess fine blend of business knowledge, maths/stats and technology/programming.

Experienced in handling client facing roles, stakeholder management, effective communication with presentation & negotiation skills.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Building a Medication Prediction Model from Medical Data Analysis

Introduction

Learning Objectives

Table of contents

Dataset

Loading Libraries and Data set

Basic Statistics

Data Preprocessing

Looking at Some Stats

Univariate EDA Route and Frequency Column

Univariate EDA Diagnosis Column

Univariate EDA Indication Column

Univariate EDA Name of Drug column

Bivariate Analysis Indication vs Name of Drug

Bivariate Analysis Indication vs Age

Bivariate Analysis Name of Drug vs Age

Modeling Approach

Feature Engineering

Dataset Creation

Modelling

Conclusion

Key Takeaways

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state