Every machine learning enthusiast dreams of building a cool project. Understanding the theory alone isn’t enough; you need to work on projects, try to deploy them, and learn from them. Working in a specific domain such as NLP also gives you a wide range of problem statements to explore. In this article, I introduce one such project: a language detection model built with Natural Language Processing. It will take you through a real-world application of ML. So, let’s not wait any longer.
We are using the Language Detection dataset, which contains text samples in 17 different languages.
The languages are:
* English
* Portuguese
* French
* Greek
* Dutch
* Spanish
* German
* Russian
* Danish
* Italian
* Turkish
* Swedish
* Arabic
* Malayalam
* Hindi
* Tamil
* Kannada
Using this text, we have to build a model that can predict the language of a given piece of text. This is useful in many artificial intelligence applications and for computational linguists. Such prediction systems are widely used in electronic devices such as mobile phones and laptops, for example in machine translation, and also in robots. They also help in tracking and identifying multilingual documents. NLP is still an active area of research.
So let’s get started. First of all, we will import all the required libraries.
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")
Now let’s load the language detection dataset.
data = pd.read_csv("Language Detection.csv")
print(data.head(10))
As mentioned earlier, this dataset contains text samples in 17 different languages. So let’s count the number of samples for each language.
data["Language"].value_counts()
Output:
English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: Language, dtype: int64
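The counts above are far from uniform, which is worth keeping in mind when reading the accuracy later on. The short sketch below re-enters the counts by hand (labels spelled exactly as in the dataset output) and computes each language's share and the ratio between the largest and smallest class. The variable names are illustrative and not part of the original code.

```python
import pandas as pd

# Counts copied from the value_counts() output above; spellings follow the dataset.
counts = pd.Series(
    {
        "English": 1385, "French": 1014, "Spanish": 819, "Portugeese": 739,
        "Italian": 698, "Russian": 692, "Sweedish": 676, "Malayalam": 594,
        "Dutch": 546, "Arabic": 536, "Turkish": 474, "German": 470,
        "Tamil": 469, "Danish": 428, "Kannada": 369, "Greek": 365, "Hindi": 63,
    },
    name="Language",
)

total = counts.sum()                             # total number of text samples
share = (counts / total * 100).round(2)          # percentage of samples per language
imbalance_ratio = counts.max() / counts.min()    # largest class vs smallest class

print("Total samples:", total)
print(share)
print("Largest / smallest class:", round(imbalance_ratio, 1))
```

On the real DataFrame, `data["Language"].value_counts(normalize=True)` gives the same shares directly.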
Now we can separate the independent and dependent variables: the text is the independent variable and the language name is the dependent variable.
X = data["Text"]
y = data["Language"]
Our output variable, the language name, is categorical. To train the model we have to convert it into numerical form, so we perform label encoding on it using LabelEncoder from sklearn.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
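To see what label encoding does, here is a small sketch on a made-up list of labels (not taken from the dataset). It shows that LabelEncoder assigns integers in alphabetical order of the class names and that inverse_transform maps them back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, only for illustration.
sample_labels = ["French", "English", "Hindi", "English", "French"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sample_labels)

# classes_ holds the sorted class names; the integer code is the index into it.
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform recovers the original names from the integer codes.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

The same `inverse_transform` call is what turns the model's numeric prediction back into a language name later in the article.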
This dataset was created by scraping Wikipedia, so it contains many unwanted symbols and numbers that will affect the quality of our model. We should therefore apply some text preprocessing.
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols, newlines and numbers
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
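The cleaning step can also be wrapped in a small function so that it is easy to try on a single string. This is a sketch, not the author's code: the name `clean_text` and the sample sentence are made up, and the two substitutions are merged into one precompiled pattern. Note that the newline is written as `\n` and the square brackets are escaped; a bare `n` in the character class would delete every letter "n" from the text.

```python
import re

# Symbols, newlines, digits and square brackets to be replaced by a space.
SYMBOLS_AND_DIGITS = re.compile(r'[!@#$(),\n"%^*?:;~`0-9\[\]]')


def clean_text(text):
    """Replace unwanted symbols and digits with spaces, then lower-case the text."""
    text = SYMBOLS_AND_DIGITS.sub(" ", text)
    return text.lower()


# Hypothetical Wikipedia-style sentence with a citation marker and quotes.
sample = 'Nature [1] is "GREAT", isn\'t it?'
print(clean_text(sample).split())
```

Apostrophes are left in place because they are not in the pattern; CountVectorizer's default tokenizer splits on them later anyway.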
As we know, not only the output feature but also the input feature must be in numerical form. So we convert the text into numerical form by creating a bag-of-words model using CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
X.shape # (10337, 39419)
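A tiny example makes the bag-of-words representation concrete. The three sentences below are invented for illustration. The sketch also shows that fit_transform returns a SciPy sparse matrix; calling `.toarray()` on the full (10337, 39419) matrix creates a dense array of roughly 3 GB at the default 64-bit integer dtype, so on machines with limited memory it may be preferable to keep the sparse form, which MultinomialNB accepts directly.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, only for illustration.
docs = ["the cat sat", "the cat saw the dog", "le chat noir"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)   # sparse document-term count matrix

# Columns are ordered alphabetically by token.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# Dense view is fine for a toy corpus; each row counts the tokens of one document.
print(bow.toarray())
print("Sparse result:", issparse(bow), "shape:", bow.shape)
```

Each row is one document and each column one word from the vocabulary, so the second row has a 2 in the column for "the".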
We have preprocessed our input and output variables. The next step is to create a training set, for training the model, and a test set, for evaluating it. For this we use train_test_split.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
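The split above is random, so the reported accuracy will change slightly from run to run. As an optional variation (an assumption on my part, not what the original code does), `random_state` makes the split reproducible and `stratify` keeps the class proportions the same in both sets, which can matter for a small class such as Hindi with 63 samples. The sketch uses a made-up array with an imbalanced label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 15 of class 0 and 5 of class 1.
features = np.arange(40).reshape(20, 2)
labels = np.array([0] * 15 + [1] * 5)

x_tr, x_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.20,
    random_state=42,   # fixed seed so the split is the same on every run
    stratify=labels,   # preserve the 3:1 class ratio in train and test
)

print("Train class counts:", np.bincount(y_tr).tolist())
print("Test class counts:", np.bincount(y_te).tolist())
```

With the article's variables this would read `train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)`.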
We are almost there: the model creation part. We use the multinomial Naive Bayes algorithm (MultinomialNB) and train the model on the training set.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
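One possible way to keep the vectorizer and the classifier together is scikit-learn's pipeline, so that the same bag-of-words transformation is applied at training and at prediction time. This is a sketch of an alternative arrangement, not the code used in this article, and the four training sentences are invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data.
train_texts = [
    "the weather is nice today",
    "where is the train station",
    "le temps est beau aujourd'hui",
    "où est la gare",
]
train_langs = ["English", "English", "French", "French"]

# The pipeline vectorizes raw text and feeds the counts to Naive Bayes.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_langs)

guesses = pipeline.predict(["the station is nice", "la gare est beau"])
print(list(guesses))
```

Because the pipeline takes raw strings, there is no separate `cv.transform` step to remember when predicting.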
So we’ve trained our model using the training set. Now let’s predict the output for the test set.
y_pred = model.predict(x_test)
Now we can evaluate our model.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy is :",ac)
# Accuracy is : 0.9772727272727273
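classification_report is imported above but not used. Since overall accuracy can hide weak results on small classes, a per-language report is a useful complement. The sketch below runs it on a handful of made-up true and predicted labels; the numbers are illustrative only and are not results from the model.

```python
from sklearn.metrics import classification_report

# Hypothetical labels, only to show the shape of the report.
true_labels = ["English", "English", "French", "Hindi", "Hindi", "French"]
predicted = ["English", "French", "French", "Hindi", "English", "French"]

# Human-readable table of precision, recall and F1 per class.
print(classification_report(true_labels, predicted, zero_division=0))

# The same numbers as a nested dictionary, convenient for further processing.
report = classification_report(
    true_labels, predicted, output_dict=True, zero_division=0
)
print("Hindi recall:", report["Hindi"]["recall"])
```

With the article's encoded labels, a call along the lines of `classification_report(y_test, y_pred, target_names=le.classes_)` should print language names instead of integers, assuming every language appears in the test set.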
The accuracy of the model is about 0.977 on the test set, which is very good for this dataset. Now let’s plot the confusion matrix using a seaborn heatmap.
plt.figure(figsize=(15,10))
sns.heatmap(cm, annot = True)
plt.show()
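The heatmap above labels its axes with the integer class codes. A possible refinement, sketched here on made-up labels, is to pass the class names as tick labels and to format the annotations as integers. The non-interactive `Agg` backend is selected only so that the sketch also runs on a machine without a display; drop that line and call `plt.show()` in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when plotting interactively
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only for illustration.
class_names = ["English", "French", "Hindi"]
true_labels = ["English", "English", "French", "Hindi", "Hindi", "French"]
predicted = ["English", "French", "French", "Hindi", "English", "French"]

# Rows are true languages, columns are predicted languages, in class_names order.
matrix = confusion_matrix(true_labels, predicted, labels=class_names)

fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(
    matrix,
    annot=True,
    fmt="d",                   # show counts as integers
    xticklabels=class_names,
    yticklabels=class_names,
    ax=ax,
)
ax.set_xlabel("Predicted language")
ax.set_ylabel("True language")
plt.close(fig)
```

For the article's model, `le.classes_` would play the role of `class_names`.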
The graph will look like this:
Looking at each language, almost all the predictions are right. You are almost there!
Now let’s test the model prediction using text in different languages.
def predict(text):
    x = cv.transform([text]).toarray()  # converting text to bag of words model (vector)
    lang = model.predict(x)  # predicting the language
    lang = le.inverse_transform(lang)  # finding the language corresponding to the predicted value
    print("The language is", lang[0])  # printing the language
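A variation on the function above returns the predicted language together with the model's probability for it, using `predict_proba`. This is a sketch under stated assumptions: the helper name `predict_with_confidence` and the four training sentences are invented, and the toy model is trained inline only so that the example is self-contained. In the article's setting, `cv`, `model` and `le` would be the objects fitted earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real dataset.
texts = [
    "good morning my friend",
    "thank you very much",
    "buenos días mi amigo",
    "muchas gracias por todo",
]
langs = ["English", "English", "Spanish", "Spanish"]

le = LabelEncoder()
y = le.fit_transform(langs)
cv = CountVectorizer()
X = cv.fit_transform(texts)
model = MultinomialNB().fit(X, y)


def predict_with_confidence(text):
    """Return (language name, probability of that language) for one text."""
    x = cv.transform([text])
    probabilities = model.predict_proba(x)[0]
    best = int(np.argmax(probabilities))
    label = model.classes_[best]            # encoded class with highest probability
    return le.inverse_transform([label])[0], float(probabilities[best])


language, confidence = predict_with_confidence("muchas gracias amigo")
print(language, round(confidence, 3))

# Text made only of unseen words yields an all-zero vector, so the model
# falls back to the class priors (0.5 each in this toy setup).
unknown_language, unknown_confidence = predict_with_confidence("zzzz qqqq")
print(unknown_language, round(unknown_confidence, 3))
```

A low probability can be used as a signal that the input is too short or in a language the model has not seen.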
The predictions made by the model are very accurate. You can test it with text in the other languages as well.
The complete code is given below.
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")

# loading the dataset
data = pd.read_csv("Language Detection.csv")

# value count for each language
data["Language"].value_counts()

# separating the independent and dependent features
X = data["Text"]
y = data["Language"]

# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols, newlines and numbers
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)

# creating bag of words using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()

# train test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

# model creation and training
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

# prediction
y_pred = model.predict(x_test)

# model evaluation
from sklearn.metrics import accuracy_score, confusion_matrix
ac = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

# visualising the confusion matrix
plt.figure(figsize=(15,10))
sns.heatmap(cm, annot = True)
plt.show()

# function for predicting language
def predict(text):
    x = cv.transform([text]).toarray()
    lang = model.predict(x)
    lang = le.inverse_transform(lang)
    print("The language is", lang[0])

# English
predict("Analytics Vidhya provides a community based knowledge portal for Analytics and Data Science professionals")
# French
predict("Analytics Vidhya fournit un portail de connaissances basé sur la communauté pour les professionnels de l'analyse et de la science des données")
# Arabic
predict("توفر Analytics Vidhya بوابة معرفية قائمة على المجتمع لمحترفي التحليلات وعلوم البيانات")
# Spanish
predict("Analytics Vidhya proporciona un portal de conocimiento basado en la comunidad para profesionales de Analytics y Data Science.")
# Malayalam
predict("അനലിറ്റിക്സ്, ഡാറ്റാ സയൻസ് പ്രൊഫഷണലുകൾക്കായി കമ്മ്യൂണിറ്റി അധിഷ്ഠിത വിജ്ഞാന പോർട്ടൽ അനലിറ്റിക്സ് വിദ്യ നൽകുന്നു")
# Russian
predict("Analytics Vidhya - это портал знаний на базе сообщества для профессионалов в области аналитики и данных.")
That was an interesting project, right? I hope you now have an intuition for how such projects are worked out, and an overview of a basic NLP program. You need to analyze the data and preprocess it accordingly. A bag-of-words model is one way of representing your text data. Text extraction and vectorization are important steps for good predictions in NLP. Naive Bayes often performs well on text classification problems like this one, which is reflected in the accuracy we obtained.
You can also find the complete end-to-end project for the above language detection model on my GitHub.
Thank you for showing interest in the project. I hope you follow up with more amazing projects and familiarise yourself with real-life problem statements. Feel free to connect with me on LinkedIn.