NLP: Building a Stemmer for Punjabi in Python!

simran Last Updated : 20 May, 2021


This article was published as a part of the Data Science Blogathon

Introduction

My professor asked me to create a stemmer for the Punjabi language in Python. When I started looking online, I found a few papers on NLP for Punjabi, but I was not able to find a proper dataset.

IIT Bombay has created a WordNet database that contains data for many Indian languages, but it is accessed through a web interface backed by SQL, not through Python. Since my professor asked me to build the stemmer in Python, I couldn’t use that web interface.

So, I wrote the code for it in Python myself. I hope you like it.

First, let me introduce you to stemming and the algorithm used in this code.

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of related words with similar meanings, such as democracy, democratic, and democratization.

In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

What is stemming?

Stemming is the process of removing the affixes from words to get the root form of the word, without doing complete morphological analysis. The objective of stemming is to reduce similar words to the same stem. For instance,

am, are, is –> be
car, cars, car’s, cars’ –> car

The result of this mapping of text will be something like:

the boy’s cars are different colours –> the boy car be differ color

There is also a process called lemmatization that works along the same lines. Both stemming and lemmatization have the same goal, i.e., to reduce the inflected and related forms of a word to a common base form.

But stemming differs from lemmatization in the approach it uses to achieve this common goal: a stemmer chops off affixes using simple rules, while a lemmatizer uses a vocabulary and morphological analysis to return the dictionary form of the word.

For now, let us focus on stemming. To know more about stemming you can go check this link out!

To create a stemmer, I have used the suffix stripping algorithm.

Suffix stripping algorithm

As the name suggests, in this algorithm we strip the suffix from the word to get the root word. This algorithm doesn’t rely on a lookup table consisting of root words and inflected words. Instead, we follow a certain set of rules to remove these suffixes. These suffixes can be simple or compound. To know more about Punjabi grammar check out this link!
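As a minimal sketch of the idea, the snippet below strips the first matching suffix from a word using a small rule list. The rule list `RULES` and the function name `strip_suffix` are made up for illustration and use English suffixes; they are not taken from the repository.

```python
# Hypothetical rule list for illustration only (English, longest suffix first).
RULES = ["ing", "es", "s"]


def strip_suffix(word, rules=RULES, min_stem_len=2):
    """Remove the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word  # no rule applied


print(strip_suffix("cars"))        # car
print(strip_suffix("organizing"))  # organiz
print(strip_suffix("is"))          # is (too short to strip)
```

Note that no lookup table of root words is involved: the output depends only on the rules, which is why a stem such as "organiz" need not be a real word.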

Let’s start with the Python code. You can get this code here on GitHub.

I would suggest keeping the GitHub repository open alongside this article so you can follow what is happening in the code. So, let’s start.

First, I have created a Punjabi class. Inside this class, I have created a few functions.

In the __init__() function (the constructor, one of Python’s special methods), I have created a suffix dictionary. This dictionary stores the suffixes as key-value pairs.
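A suffix dictionary of this kind might look like the sketch below. Here each key is a suffix length (in Unicode code points) and each value is a list of suffixes of that length. Both the key scheme and the handful of Punjabi suffixes shown are assumptions for illustration; the actual entries are in the repository.

```python
class Punjabi:
    def __init__(self):
        # Hypothetical entries: key = suffix length in code points,
        # value = list of suffixes of that length.
        self.suffixes = {
            1: ["ਾ", "ੇ", "ੀ", "ੋ"],
            2: ["ਾਂ", "ਆਂ"],
            3: ["ੀਆਂ"],
        }


punjabi = Punjabi()
# Sanity check: every suffix really has the length its key claims.
for length, group in punjabi.suffixes.items():
    assert all(len(suffix) == length for suffix in group)
print(sorted(punjabi.suffixes, reverse=True))  # [3, 2, 1]
```

Grouping suffixes by length makes it easy to try the longest suffixes first, so that a compound suffix is removed before any shorter suffix it contains.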

Next, we define a function rreplace() that takes four parameters. It replaces the suffix with ‘ ‘ and thus forms a new word.

The first parameter, string, is the text in which the replacement is performed. The second parameter, ‘old’, is the text we want to replace, i.e. the part of the string that will be replaced.

The third parameter, ‘new’, is the text that takes the place of the old text in the string. The fourth parameter, count (None by default), is the number of replacements to make in the string. The function returns the final word after all the replacements.
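One way to write such a right-hand replace is with `str.rsplit`, as in the sketch below. This is an assumption about the behaviour described above, not a copy of the repository code; in particular, `count=None` is taken here to mean "replace every occurrence".

```python
def rreplace(string, old, new, count=None):
    """Replace occurrences of `old` with `new`, starting from the right.

    count=None replaces every occurrence; otherwise at most `count` are replaced.
    """
    max_split = -1 if count is None else count
    # Splitting from the right and re-joining with `new` swaps the
    # right-most occurrences first.
    return new.join(string.rsplit(old, max_split))


print(rreplace("walking", "ing", "", 1))  # walk
print(rreplace("a-b-c", "-", "+", 1))     # a-b+c
print(rreplace("a-b-c", "-", "+"))        # a+b+c
```

Working from the right matters for a stemmer: with `count=1`, only the suffix at the end of the word is touched, even if the same characters also appear earlier in the word.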

Then we define another function, gen_replacement(). This function returns the suffix from the suffixes dictionary we created initially. The entries under key = ‘1’ and key = ‘5’ contain letters preceded by a ‘laggan’ (called ‘matra’ in Hindi), so this function removes them and returns the new suffix.

Finally, we define the function stemmer(), which takes a text as a parameter and stems the words in it. We create a list, tag, that contains the keys of the suffix dictionary. We then start a for loop over the words obtained by splitting the text.

For each word in the split text, we start another for loop over every L in tag. Inside it, we first check the flag value. If the flag is equal to 1 (flag==1), we break out of the loop; otherwise, we check whether the word length is greater than L+1.

Here, we only consider words whose length is greater than 2 (shorter words are basically stop words, so we don’t actually need to stem them).

Next, we start one more for loop. For each ‘suf’ in suffixes[L], we check whether the word ends with that suffix. If not, we continue with the loop; otherwise, we call the rreplace function, calling the gen_replacement function inside that call. The new word is stored in the variable word1, and word1 is stored in the dictionary dict_punj {}. We then set the flag variable to 1 and break out of the loop.

At the end, we check whether the flag is still equal to zero. If so, no suffix matched, so we store the word as it is in dict_punj {}. In either case, the function returns the dictionary dict_punj.
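The loop described above can be sketched roughly as follows. The names `tag`, `flag` and `dict_punj` follow the article, while `stem_text`, `SUFFIXES` and the suffix entries are hypothetical. To keep the sketch self-contained, plain slicing stands in for the rreplace and gen_replacement calls, so this is an approximation of the flow rather than the repository code.

```python
# Hypothetical suffix table: key = suffix length, value = suffixes of that length.
SUFFIXES = {
    1: ["ਾ", "ੇ", "ੀ", "ੋ"],
    2: ["ਾਂ", "ਆਂ"],
    3: ["ੀਆਂ"],
}


def stem_text(text, suffixes=SUFFIXES):
    """Map each word in `text` to its stem by stripping at most one suffix."""
    tag = sorted(suffixes, reverse=True)  # try longer suffixes first
    dict_punj = {}
    for word in text.split():
        flag = 0  # becomes 1 once a suffix has been stripped
        for L in tag:
            if flag == 1:
                break
            if len(word) > L + 1:  # leave at least two characters of stem
                for suf in suffixes[L]:
                    if word.endswith(suf):
                        dict_punj[word] = word[: len(word) - L]
                        flag = 1
                        break
        if flag == 0:
            dict_punj[word] = word  # no rule matched, keep the word as it is
    return dict_punj


for word, stem in stem_text("ਕਿਤਾਬਾਂ ਘਰ").items():
    print(word, "->", stem)
# ਕਿਤਾਬਾਂ -> ਕਿਤਾਬ
# ਘਰ -> ਘਰ
```

The first word loses its two-character plural suffix, while the second is too short to pass the length check and is stored unchanged.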

That’s it. We are done with the functions and the class Punjabi. All we have to do now is create an object of the class Punjabi and use it to call the function stem() with a parameter containing text in the Punjabi language.

Conclusion

Although this algorithm is less efficient than other algorithms, it is much simpler than other methods. To increase the efficiency of your stemmer, combine this suffix stripping algorithm with other algorithms. To read about it, check this paper out.

Check out this repository to know further about NLP in Punjabi.

Thanks for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
