The kind of knowledge treasured in ancient Indian scriptures is hardly obscure. Though these scriptures exist in many languages, most of them happen to be in Sanskrit. So, why not employ the power of Natural Language Processing and fiddle a little with Sanskrit text?
Sanskrit is one of the most ancient and unambiguous languages in the world. It is one of the few languages that distinguishes three grammatical genders (masculine, feminine, and neuter) and three grammatical numbers (singular, dual, and plural). Given its richness and unambiguity, a few studies argue that Sanskrit is one of the best-suited languages for Natural Language Processing. Though Sanskrit is no longer widely spoken, its text is available in abundance in Hindu scriptures and ancient Indian literature.
In this article, we will try our hand at NLP in Sanskrit. We will be performing the classification of Sanskrit Shlokas (verses).
Categorization of Sanskrit Shlokas
First, let us understand what Sanskrit Shlokas are and on what grounds we will classify them. A quick Google search defines the term ‘Shloka’ as follows:
Shloka: a couplet of Sanskrit verse, especially one in which each line contains sixteen syllables.
These couplets, written in Sanskrit, usually embody religious praises or knowledge of the ways of life.
The following is a typical example of a Shloka along with its English translation:
For our classification task, we will be classifying the Shlokas into the following three classes:
Chanakya Shlokas: These Shlokas are obtained from the Chanakya Niti Sastra, an anthology of Shlokas compiled from various Hindu sastras and attributed to the Indian philosopher Chanakya.
Vidur Niti Shlokas: These Shlokas belong to the Vidura Niti, an ethical philosophy narrated as a conversation, a rich discourse on polity and righteousness, between Vidura and King Dhritarashtra in the Mahabharata (a Hindu epic).
Sanskrit Slogans: These Shlokas are not attributed to any particular source. This can be treated as the ‘others’ category.
Dataset
We will be using the iNLTK Sanskrit Shlokas Dataset, which comprises about 500 Shlokas labeled as ‘Chanakya Slokas’, ‘Vidur Niti Slokas’, and ‘sanskrit-slogan’. The Shlokas have already been cleaned and split into training and validation CSV files.
The dataset can be obtained from Kaggle. Please note that the dataset is released under a Creative Commons license.
Building the Shloka Classifier
Let us tackle this stepwise. Firstly, as the classic data science advice says, we should get to know our data better and then build our model accordingly. We’ll proceed in three broad steps:
Exploratory Data Analysis
Data Pre-Processing
Model Building and Evaluation
Exploratory Data Analysis
Step-1: Import all the requisite Python libraries
#Import Necessary Libraries
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
Here, we’ve used Pandas to load the CSV dataset, NumPy to perform mathematical operations, WordCloud to build a visual of our text, and Matplotlib to plot graphs as required.
Step-2: Load the Dataset
#Loading the dataset
data = pd.read_csv('../input/sanskrit-shlokas-dataset/train.csv')
This is what our data looks like:
Step-3: Create text Vocabulary-Frequency distribution
Now we need to create a Vocabulary-Frequency distribution for our text. Vocabulary refers to the set of unique words in a text. So we need to store each unique word and its frequency of occurrence in our dataset.
For this, we first store all the Shlokas in the training dataset in a single string named ‘text’. Then we store each unique word as a key in a dictionary named ‘vocab’, with its frequency of occurrence as the value.
This distribution will help us identify the stopwords in our text. An excerpt from the distribution is shown below:
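The steps above can be sketched as follows. This is a minimal example on toy strings; in the article, the Shlokas come from the training DataFrame's 'Sloka' column, and the variable names 'text' and 'vocab' follow the article's description:

```python
# Minimal sketch of the vocabulary-frequency distribution described above.
from collections import Counter

# Toy stand-in for data['Sloka'].values
slokas = [
    "karmany evadhikaras te",
    "ma phalesu kadacana",
    "ma karma phala hetur bhur",
]

# Concatenate all Shlokas into a single string named 'text'
text = " ".join(slokas)

# Store each unique word as a key and its frequency of occurrence as the value
vocab = dict(Counter(text.split()))

print(vocab)
```

Words that occur with disproportionately high frequency across all classes are candidates for stopwords.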
Now, we’ll generate a bar chart for the frequencies of each output class label.
#Plot class label frequencies
Class = data['Class'].value_counts()
names = ['Chanakya Slokas','Vidur Niti Slokas','sanskrit-slogan']
values = [Class['Chanakya Slokas'],Class['Vidur Niti Slokas'],Class['sanskrit-slogan']]
plt.bar(range(len(values)), values, tick_label=names)
plt.show()
The bar chart turns out as follows:
We can see that the number of training examples corresponding to each class is nearly equal, i.e., our dataset is balanced.
By now, we have gained a decent insight into our data; let’s move on to pre-processing it.
Data Pre-Processing
Since we’ll be building an LSTM-based deep learning classifier, we first need to convert our training text into integer sequences. For this, we’ll use Keras’s Tokenizer. First, we fit the tokenizer on the entire training text so that it builds its vocabulary. Then we convert the training text into sequences using the texts_to_sequences() method. Finally, we pad all the generated sequences to make them equal in length; the model’s Embedding layer will later map these integer sequences to dense embeddings.
#Tokenize and pad the training text
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data['Sloka'].values)
X = tokenizer.texts_to_sequences(data['Sloka'].values)
X = pad_sequences(X)
After generating the sequences, our text is ready to be fed to a model. But note that our output classes are categorical in nature, so they must be one-hot encoded. You can use sklearn’s OneHotEncoder for this; however, here I’ve used Pandas’ get_dummies function.
#One Hot Encoding
Y= pd.get_dummies(data['Class'])
The One Hot Encoded Vector (Y) looks like this:
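As a quick illustration of what get_dummies produces, here is a toy example with made-up label rows (not the actual dataset):

```python
import pandas as pd

# Toy stand-in for the dataset's 'Class' column
labels = pd.Series(['Chanakya Slokas', 'Vidur Niti Slokas',
                    'sanskrit-slogan', 'Chanakya Slokas'])

# One column per class; a truthy value marks the row's class
Y = pd.get_dummies(labels)
print(Y)
```

Each row has exactly one "hot" entry, which is the format categorical_crossentropy expects.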
Model Building and Evaluation
Now that our data is ready and waiting to be fed to some model, let us get on to building one. Since we are dealing with a text classification use case, we’ll go ahead with an LSTM-based model. LSTMs (Long Short-Term Memory networks) combine the capabilities of RNNs (Recurrent Neural Networks) with memory cells, which makes them well suited to NLP tasks. If you’d like to learn about LSTMs in great detail, I’d recommend you go through this article.
The layer-by-layer description of the model can be seen below:
Please note that since ours is a case of multi-class classification (i.e., we are classifying our input text into more than two classes), we have used ‘softmax’ as the activation function in the output layer and ‘categorical_crossentropy’ as the loss function.
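A typical architecture matching this description can be sketched as below. This is a hedged sketch, not the author’s exact model: the layer sizes and dropout rates are assumptions, while num_words=500 matches the Tokenizer above and the 3-unit softmax output matches the three classes:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D

# Assumed hyperparameters: embedding size 128 and 64 LSTM units are
# illustrative choices, not values taken from the article.
model = Sequential([
    Embedding(input_dim=500, output_dim=128),   # maps token IDs to dense vectors
    SpatialDropout1D(0.2),                      # regularizes the embeddings
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(3, activation='softmax'),             # one unit per Shloka class
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()
```

The softmax output gives a probability per class, and the predicted class is the column with the highest probability.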
Now, we will fit the model to our training data.
history = model.fit(X, Y, epochs = 30, batch_size=32, verbose =1)
The training accuracy turns out to be 94.34% in this case.
Now, to test our model on new data, we need to prepare/pre-process our test data, just as we did with our training data.
#Loading the test data
test = pd.read_csv('../input/sanskrit-shlokas-dataset/valid.csv')
#Tokenize the input texts
X_test = tokenizer.texts_to_sequences(test['Sloka'].values)
X_test = pad_sequences(X_test)
#One Hot Encode the Output Classes
Y_test = pd.get_dummies(test['Class'])
So, we have loaded our test data, tokenized its input text, and one hot encoded its output classes. Thus, our test data is ready for model evaluation!
model.evaluate(X_test,Y_test)
Finally, we have obtained a test accuracy of 78%, which is quite decent.
What’s Next?
You’ve finally built a Sanskrit Shloka Classification model using LSTM from scratch. Give yourself a pat on the back! Though this was a small dataset, you can try finding new datasets or perhaps create one on your own to ensure a more robust model.
Here’s a quick recap of what this article encompasses:
We have successfully classified Shlokas into three categories: Chanakya Shlokas, Vidur Niti Shlokas, and Sanskrit Slogans.
We also learned how to extract stopwords from the language using a vocabulary-frequency distribution and visualize text using word clouds.
Finally, we built an LSTM model using TensorFlow and tuned its parameters to perform the multi-class classification task.
That’s all for this article; feel free to leave a comment with any feedback or questions.
Since you’ve read the article up till here, I’m certain our interests match, so please feel free to connect with me on LinkedIn or Instagram for any queries or potential opportunities.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
I'm Suvrat Arora, a Computer Science graduate. Enthusiastic about AI, Data Science, ML and NLP - I believe that storytelling is a significant aspect of life which has led me to develop a practice of documenting, organizing, and disseminating knowledge across domains, making me an active contributor on multiple platforms.