For many applications, including online customer service, marketing, and finance, gender identification based on names is a crucial challenge. Given the variability of names across languages, it is difficult to build a name-based gender classification system that is accurate for all of them. This article discusses how NLP and Python can solve this problem, focusing on identifying gender from Indian names.
The business problem that we are going to solve, following standard NLP pipeline steps, is as follows:
“Given the name, identify the gender of the person”
This is a beginner-level NLP project and requires an understanding of a few concepts discussed below: label encoding, count vectorization, logistic regression, naive Bayes, XGBoost, and LSTMs.
The proposed solution is to build a name-based gender identification system that combines machine learning and deep learning. Although we know there are more than two genders, we will only consider ‘Male’ and ‘Female.’ Hence this becomes a binary classification problem.
For this project, we are going to use the Gender_Data dataset available on Kaggle.
This dataset contains a total of 53925 Indian names, of which 29014 are male and the remaining are female. The ‘Gender’ attribute contains the values 0 and 1, where 0 corresponds to a male name and 1 to a female name.
In this section, we are going to look at the NLP concepts and other topics that we shall use in building this project.
Label Encoding: This refers to the process of converting categorical labels into numeric labels. Here each categorical label is given a specific value based on its alphabetical ordering.
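For instance, here is a minimal sketch using scikit-learn’s LabelEncoder (the labels below are illustrative, not the actual dataset):

from sklearn.preprocessing import LabelEncoder
labels = ['M', 'F', 'F', 'M']  # illustrative gender labels
encoder = LabelEncoder()
print(encoder.fit_transform(labels))  # [1 0 0 1]
print(encoder.classes_)  # ['F' 'M'] -- classes are sorted alphabetically, so 'F' -> 0 and 'M' -> 1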
Count Vectorization: Count vectorization is the process of converting all the words in a corpus into numerical data based on their frequency in the corpus, producing a sparse matrix. Let us vectorize the following example:
text = ['this is an example', 'An ant ate the apple']
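As a minimal sketch of what CountVectorizer produces for this toy corpus (using the default word-level analyzer here; the model later in this article uses a character-level one, and get_feature_names_out assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import CountVectorizer
text = ['this is an example', 'An ant ate the apple']
cv_demo = CountVectorizer()  # default: word-level tokens
matrix = cv_demo.fit_transform(text)  # sparse document-term matrix
print(cv_demo.get_feature_names_out())  # ['an' 'ant' 'apple' 'ate' 'example' 'is' 'the' 'this']
print(matrix.toarray())  # [[1 0 0 0 1 1 0 1], [1 1 1 1 0 0 1 0]]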
Logistic Regression
Logistic regression is one of the most commonly used machine learning algorithms for solving classification problems. It predicts the likelihood that a data point belongs to class 0 or class 1. It works by fitting the linear regression output through a sigmoid function, generating an “S”-shaped curve, and a threshold point (typically 0.5) is used to distinguish the classes.
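Concretely, the sigmoid maps any real-valued score z to a probability p = 1 / (1 + e^(-z)); a data point with p ≥ 0.5 is assigned to class 1, and otherwise to class 0.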
Naïve Bayes
Naïve Bayes is a supervised learning algorithm widely used to classify texts and high-dimensional training data. It is capable of making very quick decisions and hence takes minimal training and testing time. It is named ‘Naïve’ because it assumes that the occurrence of one feature is entirely independent of the other features in the dataset. It works based on Bayes’ theorem, which is as follows:
P(A|B) = [P(A) * P(B|A)] / P(B)
where P(A|B) is the posterior probability, P(A) is the prior probability, P(B|A) is the likelihood, and P(B) is the marginal probability.
Naive Bayes is a fast and easy algorithm that can be used for both binary and multiclass classification problems. However, it presumes that the dataset’s features are uncorrelated, making it difficult to learn the relationship between the variables.
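As a worked illustration with made-up numbers: suppose 54% of names in a corpus are female, 40% of female names end in ‘a’, and 25% of all names end in ‘a’. Then P(female | ends in ‘a’) = (0.54 * 0.40) / 0.25 ≈ 0.86, so a name ending in ‘a’ would be classified as female.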
XGBoost
XGBoost is one of the most powerful machine learning algorithms in use today. It stands for eXtreme Gradient Boosting. It improves the performance of predictive models by building an ensemble of gradient-boosted decision trees, where each new tree corrects the errors made by the previous ones. XGBoost is fast, efficient, and scalable, making it a popular choice for people who need to train large models quickly.
LSTM
In machine learning, a Long Short-Term Memory (LSTM) network is a type of recurrent neural network that is very useful for sequence tasks such as machine translation and text generation. Unlike a plain recurrent network, an LSTM uses gating mechanisms (input, forget, and output gates) to decide what information to keep and what to discard, which lets it remember context over long sequences.
This makes LSTMs well suited to tasks where earlier context must inform later predictions. For example, you may want to model how someone speaks by remembering past sentences and using that information to generate the next sentence.
LSTMs are especially useful when the input is a long stream of similar elements (characters in a name, words in a sentence). With enough training data, an LSTM can learn to produce accurate outputs from such sequences, which is why they are so popular for tasks such as machine translation and text classification.
The work pipeline involved in this project is as follows:
We must first import the necessary libraries to work with the data and build a solution. Our project uses NumPy, Pandas, Matplotlib, Seaborn, Scikit-Learn, TensorFlow, and Keras.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
dataset = pd.read_csv("C:\\Users\\admin\\Desktop\\Python_anaconda\\Projects\\Name Gender\\Gender_Data.csv")
Now that we have our data ready, let us look into it to better understand what we will be working with.
Sample of the Dataset
dataset.head()
Column Names and Data Types of the Attributes
Identifying the data types of each attribute or column in the dataset helps decide what kind of pre-processing should be done.
print(dataset.columns)
print(dataset.dtypes)
We see that there are two attributes in the dataset. The ‘Name’ attribute corresponds to the name of the person, and the ‘Gender’ column represents whether they are male or female.
Replacing Column Values
Here, 0 and 1 in the ‘Gender’ column refer to male and female, respectively. However, for convenience, we shall replace them with ‘M’ and ‘F.’
dataset['Gender'] = dataset['Gender'].replace({0:"M",1:"F"})
The Shape of the Data
print(dataset.shape)
Running the above code snippet shows us that there are a total of 53982 rows and 2 columns. That is, there are 53982 names.
No. of Unique Names and Looking for Class Imbalance
print(len(dataset['Name'].unique()))
Among the 53982 Indian names, there are 53925 unique names, implying that there are 57 values that are repeated. These are the names that are used for both boys and girls and hence have been labeled multiple times.
Let us create a plot to see how many male and female names are present in the dataset.
sns.countplot(x='Gender',data = dataset)
plt.title('No. of male and female names in the dataset')
plt.xticks([0,1],('Female','Male'))
It is evident from the above graph that there is no major class imbalance.
Analyzing the Starting Letter of Names
Generally, a few letters are used far more often than others as the first letter of a name. Let us see how the names in our dataset are distributed across starting letters.
alphabets = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
             'Q','R','S','T','U','V','W','X','Y','Z']
startletter_count = {}
for i in alphabets:
    startletter_count[i] = len(dataset[dataset['Name'].str.startswith(i)])
print(startletter_count)
Visualizing the above information using a bar chart shows that around 6,000 names start with the letter “A”.
plt.figure(figsize = (16,8))
plt.bar(startletter_count.keys(),startletter_count.values())
plt.xlabel('Starting alphabet')
plt.ylabel('No. of names')
plt.title('Number of names starting with each letter')
Let us see which letters are the most common starting letters of names.
print('The 5 most common starting letters are : ',
      *sorted(startletter_count.items(), key=lambda item: item[1])[-5:][::-1])
Most Indian names start with the alphabets A, S, K, V, and M.
Analyzing the Ending Letter of Names
Similarly, now let us see what the common ending letters and their distribution across the names in the dataset are.
small_alphabets = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
                   'n','o','p','q','r','s','t','u','v','w','x','y','z']
endletter_count = {}
for i in small_alphabets:
    endletter_count[i] = len(dataset[dataset['Name'].str.endswith(i)])
print(endletter_count)
plt.figure(figsize = (16,8))
plt.bar(endletter_count.keys(),endletter_count.values())
plt.xlabel('Ending alphabet')
plt.ylabel('No. of names')
plt.title('Number of names ending with each letter')
The above bar graph depicts that approximately 16000 names end with the letter “a” and 14000 with the letter “n.”
print('The 5 most common ending letters are : ', *sorted(endletter_count.items(),
      key=lambda item: item[1])[-5:][::-1])
Executing the above code shows that most of the names end with the letters “a,” “n,” “i,” “h,” and “r.”
Word Cloud
Word clouds generally help us visualize textual data. We are going to build a word cloud representing the names in the dataset. The size of each name shall depend upon its frequency in the dataset.
# building a word cloud
text = " ".join(i for i in dataset.Name)
word_cloud = WordCloud(
width=3000,
height=2000,
random_state=1,
background_color="white",
colormap="BuPu",
collocations=False,
stopwords=STOPWORDS,
).generate(text)
plt.imshow(word_cloud)
plt.axis("off")
plt.show()
We can see that the names starting with the letter ‘A’ are prominently visible in the word cloud. This supports our earlier analysis that most of the names start with the letter ‘A’ in the dataset.
First, let us define the predictor variable ‘X’ and the target variable ‘Y.’ In our binary classification problem, ‘Name’ is the predictor, while ‘Gender’ is the target attribute. We need to determine the gender based on the name.
X = list(dataset['Name'])
Y = list(dataset['Gender'])
Encode the Labels
Now, we use the LabelEncoder class from Scikit-Learn to convert the ‘F’ and ‘M’ labels into a machine-readable format.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)
Count Vectorization
We vectorize the names to make the modeling process easier. The variable ‘X’ is transformed into an array of count vectors, with each character treated as a feature.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='char')
X = cv.fit_transform(X).toarray()
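As an optional sanity check (not part of the original pipeline, and assuming a recent scikit-learn version), you can inspect what the character-level vectorizer learned:

print(cv.get_feature_names_out())  # the individual characters found in the names
print(X.shape)  # (number of names, number of distinct characters)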
Splitting the Dataset
Now that our target and predictor variables are ready to be used for modeling, we split the dataset into training and testing sets. We shall split the data so that 33% of it is allocated for testing while the rest is used for the initial training of the models.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
Logistic Regression
Here, we are first going to build and test all the models and then later evaluate their performance. The first algorithm we will use is logistic regression. First, we will import the LogisticRegression function from Scikit-Learn and then create a model using it. Next, we fit the x_train and y_train into the model for training purposes. Lastly, we test the model on the test dataset that we created earlier.
from sklearn.linear_model import LogisticRegression
LR_model= LogisticRegression()
LR_model.fit(x_train,y_train)
LR_y_pred = LR_model.predict(x_test)
Naive Bayes
The pipeline for building the models shall remain the same.
from sklearn.naive_bayes import MultinomialNB
NB_model= MultinomialNB()
NB_model.fit(x_train,y_train)
NB_y_pred = NB_model.predict(x_test)
XGBoost
from xgboost import XGBClassifier
XGB_model = XGBClassifier(use_label_encoder=False)
XGB_model.fit(x_train,y_train)
XGB_y_pred = XGB_model.predict(x_test)
For evaluating the model’s performance, we are going to use accuracy as an evaluation measure and also build a confusion matrix to see how many right and wrong predictions were made by the respective model.
# function for confusion matrix
from sklearn.metrics import confusion_matrix
def cmatrix(model):
    y_pred = model.predict(x_test)
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    sns.heatmap(cm, fmt='d', cmap='BuPu', annot=True)
    plt.xlabel('Predicted Values')
    plt.ylabel('Actual Values')
    plt.title('Confusion Matrix')
import sklearn.metrics as metrics
#for logistic regression
print(metrics.accuracy_score(LR_y_pred,y_test))
print(metrics.classification_report(y_test, LR_y_pred))
cmatrix(LR_model)
# for naive bayes
print(metrics.accuracy_score(NB_y_pred,y_test))
print(metrics.classification_report(y_test, NB_y_pred))
cmatrix(NB_model)
# for XGBoost
print(metrics.accuracy_score(XGB_y_pred,y_test))
print(metrics.classification_report(y_test, XGB_y_pred))
Looking at the above outputs, the accuracy of logistic regression is 71%. It classified around 3000 women’s names as men and 2300 men’s names as women. Naive Bayes performed far less efficiently, with only 65% testing accuracy.
cmatrix(XGB_model)
Out of all the three mentioned algorithms, XGBoost seems to have performed better. It had a pretty good accuracy of 77%, with 4343 wrong predictions made from 17815 testing samples.
Although we have obtained good accuracy using XGBoost, we can further improve the classification using deep learning models. LSTM is one of the most widely used neural networks for text classification. We are going to build an LSTM network for gender classification and test its performance on our data.
Import Necessary Libraries
Building an LSTM network requires more advanced libraries like Keras and TensorFlow.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
Defining the LSTM Layers
max_words = 1000  # vocabulary size assumed for the embedding layer
max_len = 26      # length of each input vector
LSTM_model = Sequential()
LSTM_model.add(Embedding(max_words, 40, input_length=max_len))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(LSTM(100))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(Dense(64,activation='relu'))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(Dense(1,activation='sigmoid'))
LSTM_model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(LSTM_model.summary())
Training
Now that we have constructed the network, we are going to train it using x_train and y_train. We shall train for 100 epochs to give the model enough passes over the data to learn the patterns in the names.
LSTM_model.fit(x_train,y_train,epochs=100,batch_size=64)
This step will take some time to run.
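If you want to monitor generalization while training, one optional variant (the validation split and patience below are illustrative choices, not the article’s settings) is to hold out part of the training data and stop early once the validation loss stops improving:

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
LSTM_model.fit(x_train, y_train, epochs=100, batch_size=64,
               validation_split=0.1, callbacks=[early_stop])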
Only the last part of the training output is shown above. We can see that the LSTM has reached an accuracy of 85%, which is 8% more than XGBoost. Let us define a function that takes any name as input and classifies it using this LSTM model.
def predict(name):
    # vectorize the input name with the same CountVectorizer used in training
    name_samplevector = cv.transform([name]).toarray()
    prediction = LSTM_model.predict([name_samplevector])
    if prediction >= 0.5:
        out = 'Male ♂'
    else:
        out = 'Female ♀'
    print(name + ' is a ' + out)
Sample Test
predict('Yamini Ane')
We can see that the model has predicted the name ‘Yamini Ane’ as female. However, there could be some cases where the model makes wrong predictions. This could be because only Indian names were used for training the model.
Lastly, we are going to save this LSTM model for future use.
import pickle
pickle.dump(LSTM_model, open("LSTM_model.pkl", 'wb'))
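To reuse the saved model later, it can be unpickled in the same way. (Note that for Keras networks the native LSTM_model.save() / load_model() workflow is generally preferred; the snippet below is a sketch.)

loaded_model = pickle.load(open("LSTM_model.pkl", 'rb'))
sample = cv.transform(['Yamini Ane']).toarray()  # vectorize a name as before
print(loaded_model.predict(sample))  # probability that the name is male (class 1)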
This brings us to the end of the name gender classification project. Let us review our work. First, we defined our problem statement and looked into the algorithms we were going to use and the NLP implementation pipeline. Then we implemented the classification of gender based on names using the logistic regression, naïve Bayes, and XGBoost algorithms, and compared the performance of these models. Lastly, we built an LSTM network and showed that it works best for this name-based gender identification problem.
The key takeaways from this NLP project are that character-level count vectorization turns names into usable numeric features, that logistic regression, naive Bayes, and XGBoost provide baselines of 71%, 65%, and 77% accuracy, respectively, and that an LSTM network performs best, reaching 85% accuracy.
I hope you like my article on “Name Gender Classification Using NLP and Python.” The entire code can be found in my GitHub repository. You can connect with me here on LinkedIn.