The kind of knowledge treasured in ancient Indian scriptures is hardly obscure. Though these scriptures exist in many languages, most of them happen to be in Sanskrit. So, why not employ the power of Natural Language Processing and fiddle a little with Sanskrit text?
Sanskrit is one of the most ancient and unambiguous languages in the world. It is one of the few languages that distinguishes three grammatical genders (masculine, feminine, and neuter) and three grammatical numbers (singular, dual, and plural). Given its richness and unambiguity, a few studies argue that Sanskrit is one of the best-suited languages for Natural Language Processing. Though Sanskrit is no longer widely spoken, its text is available in abundance in Hindu scriptures and ancient Indian literature.
In this article, we will try our hand at NLP in Sanskrit. We will be performing the classification of Sanskrit Shlokas (verses).
Categorization of Sanskrit Shlokas
First, let us understand what Sanskrit Shlokas are and on what grounds we will classify them. A quick Google search defines the term ‘Shloka’ as follows:
Shloka: a couplet of Sanskrit verse, especially one in which each line contains sixteen syllables.
These couplets, written in Sanskrit, usually embody religious praises or knowledge of the ways of life.
The following is a typical example of a Shloka along with its English translation:
For our classification task, we will be classifying the Shlokas into the following three classes:
Chanakya Shlokas: These Shlokas are obtained from the Chanakya Niti Sastra, an anthology of Shlokas compiled from various Hindu sastras and attributed to the Indian philosopher Chanakya.
Vidur Niti Shlokas: These Shlokas belong to the Vidura Niti, an ethical philosophy narrated as a conversation, a rich discourse on polity and righteousness, between Vidura and King Dhritarashtra in the Mahabharata (a Hindu epic).
Sanskrit Slogans: These Shlokas are not attributed to any particular source. This can be treated as the ‘others’ category.
Dataset
We will be using the iNLTK Sanskrit Shlokas Dataset, which comprises about 500 Shlokas labeled as ‘Chanakya Slokas’, ‘Vidur Niti Slokas’, and ‘sanskrit-slogan’. The Shlokas have already been cleaned and split into training and validation CSV files.
The dataset can be obtained from Kaggle. Please note that the dataset is released under a Creative Commons license.
Building the Shloka Classifier
Let us tackle this stepwise. Firstly, as the classic data science advice says, we should get to know our data better and then build our model accordingly. We’ll proceed in three broad steps:
Exploratory Data Analysis
Data Pre-Processing
Model Building and Evaluation
Exploratory Data Analysis
Step-1: Import all the requisite Python libraries
#Import Necessary Libraries
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
Here, we’ve used Pandas to load the CSV dataset, NumPy to perform mathematical operations, WordCloud to build a visual of our text, and Matplotlib to plot graphs as required.
Step-2: Load the Dataset
#Loading the dataset
data = pd.read_csv('../input/sanskrit-shlokas-dataset/train.csv')
This is what our data looks like:
Step-3: Create text Vocabulary-Frequency distribution
Now we need to create a Vocabulary-Frequency distribution for our text. Vocabulary refers to the set of unique words in a text. So we need to store each unique word and its frequency of occurrence in our dataset.
For this, we first store all the Shlokas in the training dataset in a single string named ‘text’. Then we store each unique word as a key in a dictionary named ‘vocab’, with its frequency of occurrence as the value.
This distribution will help us identify the stopwords in our text. An excerpt from the distribution is shown below:
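The steps above can be sketched as follows. This is a minimal example on toy strings; in the article, the Shlokas come from the training DataFrame's 'Sloka' column, and the variable names 'text' and 'vocab' follow the article's description:

```python
# Minimal sketch of the vocabulary-frequency distribution described above.
from collections import Counter

# Toy stand-in for data['Sloka'].values
slokas = [
    "karmany evadhikaras te",
    "ma phalesu kadacana",
    "ma karma phala hetur bhur",
]

# Concatenate all Shlokas into a single string named 'text'
text = " ".join(slokas)

# Store each unique word as a key and its frequency of occurrence as the value
vocab = dict(Counter(text.split()))

print(vocab)
```

Words that occur with disproportionately high frequency across all classes are candidates for stopwords.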
Now, we’ll generate a bar chart for the frequencies of each output class label.
#Plot class label frequencies
Class = data['Class'].value_counts()
names = ['Chanakya Slokas','Vidur Niti Slokas','sanskrit-slogan']
values = [Class['Chanakya Slokas'],Class['Vidur Niti Slokas'],Class['sanskrit-slogan']]
plt.bar(range(len(values)), values, tick_label=names)
plt.show()
The bar chart turns out as follows:
We can see that the number of training examples corresponding to each class is nearly equal, i.e., our dataset is balanced.
By now, we have gained a decent insight into our data; let’s move on to pre-processing it.
Data Pre-Processing
Since we’ll be building an LSTM-based deep learning classifier, we first need to convert our training text into integer sequences. For this, we’ll use Keras’s Tokenizer. First, we fit the tokenizer on the entire training text so that it builds its vocabulary. Then we convert the training text into sequences using the texts_to_sequences() method. Finally, we pad all the generated sequences to make them equal in length; the model’s Embedding layer will later map these integer sequences to dense embeddings.
#Tokenize and pad the training text
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=500, split=' ')
tokenizer.fit_on_texts(data['Sloka'].values)
X = tokenizer.texts_to_sequences(data['Sloka'].values)
X = pad_sequences(X)
After generating the sequences, our text is ready to be fed to a model. But note that our output classes are categorical in nature, so they must be one-hot encoded. You can use sklearn’s OneHotEncoder for this; however, here I’ve used Pandas’ get_dummies function.
#One Hot Encoding
Y= pd.get_dummies(data['Class'])
The One Hot Encoded Vector (Y) looks like this:
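As a quick illustration of what get_dummies produces, here is a toy example with made-up label rows (not the actual dataset):

```python
import pandas as pd

# Toy stand-in for the dataset's 'Class' column
labels = pd.Series(['Chanakya Slokas', 'Vidur Niti Slokas',
                    'sanskrit-slogan', 'Chanakya Slokas'])

# One column per class; a truthy value marks the row's class
Y = pd.get_dummies(labels)
print(Y)
```

Each row has exactly one "hot" entry, which is the format categorical_crossentropy expects.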
Model Building and Evaluation
Now that our data is ready and waiting to be fed to some model, let us get on to building one. Since we are dealing with a text classification use case, we’ll go ahead with an LSTM-based model. LSTMs (Long Short-Term Memory networks) combine the capabilities of RNNs (Recurrent Neural Networks) with memory cells, which makes them well suited to NLP tasks. If you’d like to learn about LSTMs in great detail, I’d recommend you go through this article.
The layer-by-layer description of the model can be seen below:
Please note that since ours is a case of multi-class classification (i.e., we are classifying our input text into more than two classes), we have used ‘softmax’ as the activation function in the output layer and ‘categorical_crossentropy’ as the loss function.
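A typical architecture matching this description can be sketched as below. This is a hedged sketch, not the author’s exact model: the layer sizes and dropout rates are assumptions, while num_words=500 matches the Tokenizer above and the 3-unit softmax output matches the three classes:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D

# Assumed hyperparameters: embedding size 128 and 64 LSTM units are
# illustrative choices, not values taken from the article.
model = Sequential([
    Embedding(input_dim=500, output_dim=128),   # maps token IDs to dense vectors
    SpatialDropout1D(0.2),                      # regularizes the embeddings
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(3, activation='softmax'),             # one unit per Shloka class
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.summary()
```

The softmax output gives a probability per class, and the predicted class is the column with the highest probability.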
Now, we will fit the model to our training data.
history = model.fit(X, Y, epochs = 30, batch_size=32, verbose =1)
The training accuracy turns out to be 94.34% in this case.
Now, to test our model on new data, we need to prepare/pre-process our test data, just as we did with our training data.
#Loading the test data
test = pd.read_csv('../input/sanskrit-shlokas-dataset/valid.csv')
#Tokenize the input texts
X_test = tokenizer.texts_to_sequences(test['Sloka'].values)
X_test = pad_sequences(X_test)
#One Hot Encode the Output Classes
Y_test = pd.get_dummies(test['Class'])
So, we have loaded our test data, tokenized its input text, and one hot encoded its output classes. Thus, our test data is ready for model evaluation!
model.evaluate(X_test,Y_test)
Finally, we have obtained a test accuracy of 78%, which is quite decent.
What’s Next?
You’ve finally built a Sanskrit Shloka Classification model using LSTM from scratch. Give yourself a pat on the back! Though this was a small dataset, you can try finding new datasets or perhaps create one on your own to ensure a more robust model.
Here’s a quick recap of what this article encompasses:
We have successfully classified Shlokas into three categories: Chanakya Shlokas, Vidur Niti Shlokas, and Sanskrit Slogans.
We also learned how to extract stopwords from the language using a vocabulary-frequency distribution and visualize text using word clouds.
Finally, we built an LSTM model using TensorFlow and tuned its parameters to perform the multi-class classification task.
That’s all for this article; feel free to leave a comment with any feedback or questions.
Since you’ve read the article up till here, I’m certain our interests match, so please feel free to connect with me on LinkedIn or Instagram for any queries or potential opportunities.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
I'm Suvrat Arora, a Computer Science graduate. Enthusiastic about AI, Data Science, ML and NLP - I believe that storytelling is a significant aspect of life which has led me to develop a practice of documenting, organizing, and disseminating knowledge across domains, making me an active contributor on multiple platforms.