3 Important NLP Libraries for Indian Languages You Should Try Out Today!

Mohd Sanad Zaki Rizvi Last Updated : 12 Aug, 2024

13 min read

Overview

Ever wondered how to use NLP models in Indian languages?
This article is all about breaking boundaries and exploring 3 amazing libraries for Indian languages.
We will implement plenty of NLP tasks in Python using these 3 libraries and work with Indian languages.

Introduction

Language is a wonderful tool of communication. It has powered the human race for centuries and continues to be at the heart of our culture. The sheer number of languages in the world dwarfs our ability to master them all.

In fact, a person born and brought up in one part of the country might struggle to communicate with a fellow citizen from a different state (yes, I’m talking about India!). It’s a challenge a lot of us face in today’s borderless world.

This is a research area that Natural Language Processing (NLP) techniques have not yet managed to master. The majority of breakthroughs and state-of-the-art frameworks we see are developed for the English language. I have long wondered if we could build on that work to create NLP applications in vernacular languages.

Human beings by nature are diverse and multilingual, so it makes sense, right?

Since the Indian subcontinent itself has a multitude of languages, dialects and writing styles spoken by more than a billion people, we need tools to work with them. And that’s the topic of this article.

We will learn how to work with these languages using existing NLP tools, compare them relatively in terms of various parameters, and learn some challenges/limitations that this area faces.

Here’s what we’ll cover in this article:

What are the Languages of the Indian Subcontinent?
Text Processing for Indian Languages using Python
India nlp library in python

What are the Languages of the Indian Subcontinent?

The Indian Subcontinent comprises many nations. Here’s what Wikipedia says:

The Indian subcontinent is a term mainly used for the geographic region surrounded by the Indian Ocean: Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan and Sri Lanka.

These nations represent great diversity in languages, cultures, cuisines etc.

Even within India itself, a multitude of languages are spoken and used in day-to-day life, which shows the basic need to be able to build NLP-based applications in vernacular languages.

These are some of the languages of the Indian Subcontinent that are supported by the libraries we’ll see in this article (the list for each library is not exhaustive, as many languages, such as Hindi, are supported by more than one of them):

iNLTK: Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia, Marathi, Bengali, Tamil, Urdu
Indic NLP Library: Assamese, Sindhi, Sinhala, Sanskrit, Konkani, Kannada, Telugu
StanfordNLP: Many of the above languages

Text Processing for Indian Languages using Python

There are a handful of Python libraries we can use to perform text processing and build NLP applications for Indian languages. This article covers three of them: iNLTK, the Indic NLP Library, and StanfordNLP.

All of these libraries are prominent projects that researchers and developers are actively utilizing and improving for working with multiple languages. Each library has its own strengths and that’s why we will explore them one by one.

1. iNLTK (Natural Language Toolkit for Indic Languages)

As the name suggests, the iNLTK library is the Indian language equivalent of the popular NLTK Python package. This library is built with the goal of providing features that an NLP application developer will need.

Let’s explore the features of this library.

Installing iNLTK

iNLTK has a dependency on PyTorch 1.3.1, hence you have to install that first:

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

You can then install iNLTK using pip:

pip install inltk

Language support

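iNLTK currently supports 12 languages of the Indian Subcontinent: Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia, Marathi, Bengali, Tamil and Urdu.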

That’s quite a diverse collection of languages!

Setting the language

iNLTK has language models trained for different languages and in order to use one, we have to download its files first. We will be working with Hindi text, so let’s set “Hindi” as our language:

from inltk.inltk import setup
setup('hi')

This will download all the necessary files to make inferences for Hindi.

Tokenization

The first step we do to solve any NLP task is to break down the text into its smallest units or tokens. iNLTK supports tokenization of all the 12 languages I showed earlier:

	from inltk.inltk import tokenize

	hindi_text = """प्राचीन काल में विक्रमादित्य नाम के एक आदर्श राजा हुआ करते थे।
	अपने साहस, पराक्रम और शौर्य के लिए राजा विक्रम मशहूर थे।
	ऐसा भी कहा जाता है कि राजा विक्रम अपनी प्राजा के जीवन के दुख दर्द जानने के लिए रात्री के पहर में भेष बदल कर नगर में घूमते थे।"""

	# tokenize(input text, language code)
	tokenize(hindi_text, "hi")


Let’s look at the output of the above code:

The input text in Hindi is nicely split into words, and even the punctuation marks are captured. This was a basic task, so let’s now see some interesting applications of iNLTK!

Generate similar sentences from a given text input

Since iNLTK is internally based on a Language Model for each of the languages it supports, we can do interesting stuff like generate similar sentences given a piece of text!

	from inltk.inltk import get_similar_sentences

	# get similar sentences to the one given in hindi
	output = get_similar_sentences('मैं आज बहुत खुश हूं', 5, 'hi')

	print(output)


The first parameter is the input sentence. Next, we pass the number of similar sentences we want (here it’s 5) and then we pass the language code which is ‘hi’ for Hindi.

Here’s the model’s output:

This feature of iNLTK is very useful for text data augmentation as we can just multiply the sentences in our training data by populating it with sentences that have a similar meaning.

Identify the language of a text

Knowing what language a particular text is written in can be very useful when building vernacular applications or working with multilingual data. iNLTK provides this very useful functionality as well:

Above is an example of a sentence written in Malayalam that iNLTK correctly identifies.
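To build some intuition for language identification, here is a minimal sketch of a much simpler idea: guessing the script of a text from the Unicode block its characters fall in. This is not how iNLTK works (it relies on a trained model), and the helper name `guess_script` is my own. Keep in mind that a script is not a language: Hindi, Marathi, Nepali and Sanskrit all share Devanagari, which is exactly why a trained identifier is needed in practice.

```python
from collections import Counter

# Unicode blocks for a few Indic scripts (start and end are inclusive)
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def guess_script(text):
    """Hypothetical helper: return the script with the most characters, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (start, end) in SCRIPT_BLOCKS.items():
            if start <= code_point <= end:
                counts[name] += 1
                break
    return counts.most_common(1)[0][0] if counts else None


print(guess_script("മലയാളം"))
print(guess_script("आज मौसम अच्छा है"))
```

A heuristic like this can serve as a cheap first filter on multilingual data before a heavier model is called.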

Extract embedding vectors

When we are training machine learning or deep learning-based models for NLP tasks, we usually represent the text data by an embedding like TF-IDF, Word2vec, GloVe, etc. These embedding vectors capture the semantic information of the text input and are easier to work with for the models (as they expect numerical input).

iNLTK under the hood utilizes the ULMFiT method of training language models and hence it can generate vector embeddings for a given input text. Here’s an example:

	from inltk.inltk import get_embedding_vectors

	# get embedding for input words
	vectors = get_embedding_vectors("विश्लेषिकी विद्या", "hi")

	print(vectors)
	# print shape of the first word
	print("shape:", vectors[0].shape)


We get two embedding vectors, one for each word in the input sentence:

Notice that each word is denoted by an embedding of 400 dimensions.
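Many downstream models want a single vector per sentence rather than one per word. A common and simple way to get there is mean pooling, sketched below with toy 4-dimensional vectors standing in for the 400-dimensional ones. The function name `mean_pool` and the numbers are made up for illustration, and this is not something iNLTK does for you in this form.

```python
def mean_pool(word_vectors):
    """Average equal-length word vectors element-wise into one vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[i] for vec in word_vectors) / count for i in range(dim)]


# Toy vectors: two "words", four dimensions each (made-up values)
toy_vectors = [[1.0, 0.0, 2.0, 4.0], [3.0, 2.0, 0.0, 0.0]]
sentence_vector = mean_pool(toy_vectors)
print(sentence_vector)
```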

Text completion

Text completion is one of the most exciting aspects of language modeling. We can use it in multiple situations. Since iNLTK internally uses language models, you can easily use it to auto-complete the input text.

In this example, I have taken a Bengali sentence that says “The weather is nice today”:

	from inltk.inltk import setup
	from inltk.inltk import predict_next_words

	# download models for Bengali
	setup('bn')
	# predict the next words of the sentence "The weather is nice today"
	predict_next_words("আবহাওয়া চমৎকার", 10, "bn", 0.7)


Here, the fourth parameter is to adjust the “randomness” of the model to make different generations (you can play with this value). The model gives a prompt output:

'আবহাওয়া চমৎকারভাবে, সরলভাবে এক-একটি সৃষ্টির দিনক্ষণ'

This roughly translates to ‘The weather is excellent, simply a day of creation’ (according to Google Translate). It’s an interestingly smooth output, isn’t it?

We can often use text generation abilities of a language model to augment the text dataset, and since we usually have small datasets for vernacular languages, this feature of iNLTK comes in handy.
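One way to picture the “randomness” parameter is as a sampling temperature. That is my assumption about how it behaves, not something taken from iNLTK’s source. The sketch below uses made-up scores for three candidate next words and shows how dividing the scores by a temperature before the softmax changes the resulting probabilities: a low value concentrates probability on the top candidate, while a high value spreads it out.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next words
scores = [2.0, 1.0, 0.5]
for temperature in (0.3, 0.7, 1.5):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

If you want more varied augmented sentences, a higher value is worth trying; for safer, more predictable completions, a lower one.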

Finding similarity between two sentences

iNLTK provides an API to find semantic similarities between two pieces of text. This is a really useful feature! We can use the similarity score for feature engineering and even building sentiment analysis systems. Here’s how it works:

	from inltk.inltk import get_sentence_similarity

	# similarity of encodings is calculated by using cmp function whose default is cosine similarity
	get_sentence_similarity('मुझे भोजन पसंद है।', 'मैं ऐसे भोजन की सराहना करता हूं जिसका स्वाद अच्छा हो।', 'hi')


I have given two sentences as input above. The first one roughly translates to “I like food” while the second one means “I appreciate food that tastes good” in Hindi. The model gives out a cosine similarity of 0.67 which means that the sentences are pretty close, and that’s correct.

Apart from cosine similarity, you can pass your own comparison function to the cmp parameter if you want to use a custom distance metric.
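For reference, here is a minimal sketch of what such comparison functions could look like. I am assuming the comparison function receives two equal-length numeric vectors and returns a score; check iNLTK’s documentation for the exact contract. `euclidean_similarity` is a hypothetical alternative that I made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_similarity(a, b):
    """Hypothetical custom metric: map Euclidean distance into (0, 1]."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


# Toy sentence encodings (made-up values)
v1 = [0.2, 0.8, 0.1]
v2 = [0.25, 0.7, 0.0]
print(cosine_similarity(v1, v2))
print(euclidean_similarity(v1, v2))
```

Cosine similarity ignores vector length and looks only at direction, which is usually what we want for text encodings. A distance-based score like the second one also reacts to differences in magnitude.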

Additionally, there are many interesting features that the library provides and I urge you to check out iNLTK’s documentation page for more information.

2. Indic NLP Library

I find the Indic NLP Library quite useful for performing advanced text processing tasks for Indian languages. Just like iNLTK was targeted towards a developer working with vernacular languages, this library is for researchers working in this area.

Here is what the official documentation says about Indic NLP’s objective:

This library provides the following set of functionalities:

Text Normalization
Script Information
Tokenization
Word Segmentation
Script Conversion
Romanization
Indicization
Transliteration
Translation

We’ll explore all of them one by one in this article. But first, let’s have a look at the different languages this library supports out of the box and which functionality is available for what language:

As you can see, the Indic NLP Library supports a few more languages than iNLTK, including Konkani, Sindhi, Telugu, etc. Let’s explore the library further!

Installing the Indic NLP Library

You can install the library using pip:

pip install indic-nlp-library

# download the resource
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

Apart from its API, this library also provides certain scripts that are useful for NLP. You can clone the GitHub repository itself to get them:

# download the repo
git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git

Now that all the files are downloaded, you can set the path so that Python knows where to find these on your computer:

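	import os
	import sys

	# The path to the local git repo for Indic NLP library
	INDIC_NLP_LIB_HOME = r"indic_nlp_library"

	# The path to the local git repo for Indic NLP Resources
	INDIC_NLP_RESOURCES = r"indic_nlp_resources"

	# Add library to Python path (os.path.join keeps this portable across operating systems)
	sys.path.append(os.path.join(INDIC_NLP_LIB_HOME, "src"))

	from indicnlp import common
	from indicnlp import loader

	# Set environment variable for resources folder
	common.set_resources_path(INDIC_NLP_RESOURCES)

	# Initialize the library (the library's README calls this after setting the resources path)
	loader.load()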


The above steps might take some time due to the size of the resources. Once you are done with these steps, you are ready to start!

Splitting input text into sentences

The Indic NLP Library supports many basic text processing tasks like normalization, tokenization at the word level, etc. But sentence-level tokenization is what I find interesting, because different Indian languages follow different rules for it.

Here is an example of how to use this sentence splitter:

	from indicnlp.tokenize import sentence_tokenize

	indic_string="""तो क्या विश्व कप 2019 में मैच का बॉस टॉस है? यानी मैच में हार-जीत में \
	टॉस की भूमिका अहम है? आप ऐसा सोच सकते हैं। विश्वकप के अपने-अपने पहले मैच में बुरी तरह हारने वाली एशिया की दो टीमों \
	पाकिस्तान और श्रीलंका के कप्तान ने हालांकि अपने हार के पीछे टॉस की दलील तो नहीं दी, लेकिन यह जरूर कहा था कि वह एक अहम टॉस हार गए थे।"""

	# Split the sentences, language code "hi" is passed for Hindi
	sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')

	# print the sentences
	for t in sentences:
	    print(t)


Here is the output:

तो क्या विश्व कप 2019 में मैच का बॉस टॉस है? 

यानी मैच में हार-जीत में टॉस की भूमिका अहम है? 

आप ऐसा सोच सकते हैं। 

विश्वकप के अपने-अपने पहले मैच में बुरी तरह हारने वाली एशिया की दो टीमों पाकिस्तान और श्रीलंका के कप्तान ने हालांकि अपने हार के पीछे टॉस की दलील तो नहीं दी, लेकिन यह जरूर कहा था कि वह एक अहम टॉस हार गए थे।

Now, what if I tell you that you can do the same for all 15 Indian languages that Indic NLP Library supports? Fascinating, isn’t it?
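To see why a dedicated splitter is worth having, here is a deliberately naive sketch that splits after a danda (।), a double danda (॥), a question mark or an exclamation mark. It is not the library’s algorithm and the function name is my own. It does not handle abbreviations, numbers, or Hindi news text that ends sentences with a full stop, which are the kinds of cases a per-language splitter has to deal with.

```python
import re

# Split on whitespace that follows a danda, double danda, "?" or "!"
_SENTENCE_END = re.compile(r"(?<=[।॥?!])\s+")


def naive_sentence_split(text):
    """Hypothetical helper: rule-of-thumb sentence splitter for Hindi text."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


sample = "तो क्या टॉस बॉस है? आप ऐसा सोच सकते हैं। वह एक अहम टॉस हार गए थे।"
for sentence in naive_sentence_split(sample):
    print(sentence)
```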

Transliteration among various Indian Language Scripts

Transliteration is when you take a word written in one language and write it using the alphabet of a second language. Note that this is very different from “translation”, wherein you convert the word itself to the second language so that its meaning is maintained.

Here is an example to illustrate the difference:

Here is how you can perform transliteration using the Indic NLP Library:

	from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

	# Input text "Today the weather is good. Sun is bright and there are no signs of rain. Hence we can play today."
	input_text='आज मौसम अच्छा है। सूरज उज्ज्वल है और बारिश के कोई संकेत नहीं हैं। इसलिए हम आज खेल सकते हैं!'

	# Transliterate from Hindi to Telugu
	print(UnicodeIndicTransliterator.transliterate(input_text,"hi","te"))


In the above example, we have a sentence written in Hindi and we want to transliterate it to Telugu. This is the output of the model:

This is a near-perfect transliteration!
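Part of the reason this works so well is that the Unicode blocks for the major Indic scripts are laid out in parallel, so a character can often be moved to another script by shifting its code point by a fixed offset. The sketch below shows that idea for Devanagari to Telugu. It is a simplification that I wrote for illustration, not the library’s implementation: some Devanagari code points have no Telugu counterpart, and the library handles such exceptions that this sketch ignores.

```python
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
TELUGU_START = 0x0C00

# The danda and double danda sit in the Devanagari block but are shared
# by other scripts, so they are left untouched.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def devanagari_to_telugu(text):
    """Hypothetical helper: shift Devanagari characters into the Telugu block."""
    out = []
    for ch in text:
        code_point = ord(ch)
        in_block = DEVANAGARI_START <= code_point <= DEVANAGARI_END
        if in_block and code_point not in SHARED_PUNCTUATION:
            out.append(chr(code_point - DEVANAGARI_START + TELUGU_START))
        else:
            out.append(ch)  # spaces, Latin text, punctuation stay as they are
    return "".join(out)


print(devanagari_to_telugu("आज मौसम अच्छा है।"))
```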

Converting Indian Languages to Roman Script

This is a feature that will be very helpful when working with social media data of non-native English speakers as they have a tendency to mix and interchange language every now and then in their posts.

English is written in the Roman script, so we can use this library to romanize (that is, transliterate into Roman characters) any supported Indian language text:

	from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

	input_text='आज मौसम अच्छा है। इसलिए हम आज खेल सकते हैं!'

	# Transliterate Hindi to Roman
	print(ItransTransliterator.to_itrans(input_text, 'hi'))


Here is what the model gives as output:

aaja mausama achchaa hai. isalie hama aaja khela sakate hai !

Very cool, isn’t it? This is something most of us can relate to as a lot of times we type our local language using English alphabets (I’m looking at all you texting people!).

Understanding the phonetics of a character

The phonetics of a character describe its speech properties (for example, how it sounds and how the tongue is positioned to pronounce it).

Here is an example of a phonetic property that defines how the character “k” is spoken:

The Indian Sub-Continent languages have strong phonetics for their alphabet and that’s why in the Indic NLP Library, each character has a phonetic vector associated with it that defines its properties.

How is this useful? Well, you can take a character from a new language and learn almost everything about it, from whether it is a vowel or a consonant to how the tongue is positioned to pronounce it.

Here is an example where we take the simple Hindi character ‘आ’ :

	from indicnlp.langinfo import *

	# Input character
	c='आ'
	# Language is Hindi or 'hi'
	lang='hi'

	print('Is vowel?: {}'.format(is_vowel(c,lang)))
	print('Is consonant?: {}'.format(is_consonant(c,lang)))
	print('Is velar?: {}'.format(is_velar(c,lang)))
	print('Is palatal?: {}'.format(is_palatal(c,lang)))
	print('Is aspirated?: {}'.format(is_aspirated(c,lang)))
	print('Is unvoiced?: {}'.format(is_unvoiced(c,lang)))
	print('Is nasal?: {}'.format(is_nasal(c,lang)))


Here is the output:

How similar do two characters sound?

Many languages have multiple characters that have a similar sound or are spoken similarly but used in different settings in words. Can you think of any off the top of your head?

In English, it would be the characters “k” and “c”. While growing up, I’d often wonder why it was written as “school” but pronounced as “skool”? That’s exactly what I’m talking about here.

Similarly, in Hindi, we have characters ‘क’ and ‘ख’ that are confused a lot due to their sound being very similar.

Let’s find how phonetically similar these characters are using the Indic NLP Library:

	from indicnlp.script import indic_scripts as isc
	from indicnlp.script import phonetic_sim as psim

	c1='क'
	c2='ख'
	c3='भ'
	lang='hi'

	print('Similarity between {} and {}'.format(c1,c2))
	print(psim.cosine(
	    isc.get_phonetic_feature_vector(c1,lang),
	    isc.get_phonetic_feature_vector(c2,lang)
	))

	print('Similarity between {} and {}'.format(c1,c3))
	print(psim.cosine(
	    isc.get_phonetic_feature_vector(c1,lang),
	    isc.get_phonetic_feature_vector(c3,lang)
	))


I have also used a third character ‘भ’ for comparison purposes. Let’s see what output the model gives:

As expected, there is a higher similarity between ‘क’ and ‘ख’ than ‘क’ and ‘भ’.
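To see why the numbers come out this way, here is a toy version of the same computation. The five features below (consonant, velar, labial, aspirated, voiced) are hand-written by me for illustration and do not reproduce the library’s actual feature layout. ‘क’ and ‘ख’ differ only in aspiration, while ‘क’ and ‘भ’ differ in place of articulation, aspiration and voicing.

```python
import math

# Hypothetical feature order: (consonant, velar, labial, aspirated, voiced)
TOY_FEATURES = {
    "क": [1, 1, 0, 0, 0],  # velar, unaspirated, unvoiced
    "ख": [1, 1, 0, 1, 0],  # velar, aspirated, unvoiced
    "भ": [1, 0, 1, 1, 1],  # labial, aspirated, voiced
}


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_similarity(c1, c2):
    """Compare two characters using the toy feature table above."""
    return cosine(TOY_FEATURES[c1], TOY_FEATURES[c2])


print(toy_similarity("क", "ख"))
print(toy_similarity("क", "भ"))
```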

Splitting words into Syllables


We can use the Indic NLP Library to split words of Indian Languages into their syllables. This is really useful because languages have unique rules that govern what makes a syllable.

For example, when we consider the case of Indian Languages in general and Hindi, in particular, you’d notice that the concept of matras is very important when considering syllables. Here’s an example in Hindi:

This type of syllabification is known as Orthographic Syllabification. Let’s see how we can do this in Python:

	from indicnlp.syllable import syllabifier

	# Word to be broken into syllables
	w='जगदीशचंद्र'
	# Language code Hindi in this case
	lang='hi'

	# Break into syllables
	print(' '.join(syllabifier.orthographic_syllabify(w,lang)))


We have given the Hindi word ‘जगदीशचंद्र’ as input and here’s the output:

ज ग दी श च ंद्र

Notice how the various syllables have been identified! If you want to learn more about Orthographic Syllabification, you can read the paper “Orthographic Syllable as a basic unit for SMT between Related Languages”.
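The rule behind this kind of split can be sketched with a regular expression: a unit is a run of consonants joined by the virama, followed by an optional matra and optional nasal or visarga signs, or else an independent vowel with those signs. The sketch below is my own simplification for Devanagari, not the library’s code. Note that it attaches the anusvara to the unit on its left, so for this word it produces ‘चं’ and ‘द्र’, unlike the library output shown above. It also ignores the nukta and treats a word-final virama as a separate unit.

```python
import re

CONSONANTS = "\u0915-\u0939"   # क to ह
VOWELS = "\u0904-\u0914"       # independent vowels
MATRAS = "\u093E-\u094C"       # dependent vowel signs
MODIFIERS = "\u0901-\u0903"    # chandrabindu, anusvara, visarga
VIRAMA = "\u094D"

# (consonant + virama)* consonant [matra] modifiers*  |  vowel modifiers*  |  anything else
_UNIT = re.compile(
    rf"(?:[{CONSONANTS}]{VIRAMA})*[{CONSONANTS}][{MATRAS}]?[{MODIFIERS}]*"
    rf"|[{VOWELS}][{MODIFIERS}]*"
    rf"|\S"
)


def akshara_split(word):
    """Hypothetical helper: split a Devanagari word into akshara-like units."""
    return _UNIT.findall(word)


print(" ".join(akshara_split("जगदीशचंद्र")))
```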

Now that we have learned a fair number of NLP tasks that we can perform with Indian languages, let’s go to the next step with StanfordNLP.

3. StanfordNLP

StanfordNLP is an NLP library right from Stanford’s Research Group on Natural Language Processing.

The most striking feature of this library is that it supports around 53 human languages for text processing!

Out of these languages, StanfordNLP supports Hindi and Urdu that belong to the Indian Sub-Continent.

StanfordNLP is good for generating features of Computational Linguistics like Named Entity Recognition (NER), Part of Speech (POS) tags, Dependency Parsing, etc. Let’s see a glimpse of this library!

Installing StanfordNLP

1. Install the StanfordNLP library:

pip install stanfordnlp

2. We need to download a language’s specific model to work with it. Launch a Python shell and import StanfordNLP:

import stanfordnlp

3. Then download the language model for Hindi (“hi”):

stanfordnlp.download('hi')

This can take a while depending on your internet connection. These language models are pretty huge (the English one is 1.96GB).

Note: You need Python 3.6.8/3.7.2 or later to use StanfordNLP.

Extracting Part of Speech (POS) Tags for Hindi

StanfordNLP comes with built-in processors to perform five basic NLP tasks:

Tokenization
Multi-Word Token Expansion
Lemmatization
Parts of Speech Tagging
Dependency Parsing

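Let’s start by creating a text pipeline. We pass lang="hi" so that the Hindi models we just downloaded are loaded (the default language is English), and we include the tokenize processor because the POS tagger runs on tokenized text:

nlp = stanfordnlp.Pipeline(lang="hi", processors="tokenize,pos")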

Now, we will first take a piece of Hindi text and run the StanfordNLP pipeline on it:

hindi_doc = nlp("""केंद्र की मोदी सरकार ने शुक्रवार को अपना अंतरिम बजट पेश किया. कार्यवाहक वित्त मंत्री पीयूष गोयल ने अपने बजट में किसान, मजदूर, करदाता, महिला वर्ग समेत हर किसी के लिए बंपर ऐलान किए. हालांकि, बजट के बाद भी टैक्स को लेकर काफी कन्फ्यूजन बना रहा. केंद्र सरकार के इस अंतरिम बजट क्या खास रहा और किसको क्या मिला, आसान भाषा में यहां समझें""")

Once you have done this, StanfordNLP will return an object containing the POS tags of the input text. You can use the below code to extract the POS tags:

	#dictionary that contains pos tags and their explanations
	pos_dict = {
	'CC': 'coordinating conjunction','CD': 'cardinal digit','DT': 'determiner',
	'EX': 'existential there (like: \"there is\" ... think of it like \"there exists\")',
	'FW': 'foreign word','IN': 'preposition/subordinating conjunction','JJ': 'adjective \'big\'',
	'JJR': 'adjective, comparative \'bigger\'','JJS': 'adjective, superlative \'biggest\'',
	'LS': 'list marker 1)','MD': 'modal could, will','NN': 'noun, singular \'desk\'',
	'NNS': 'noun plural \'desks\'','NNP': 'proper noun, singular \'Harrison\'',
	'NNPS': 'proper noun, plural \'Americans\'','PDT': 'predeterminer \'all the kids\'',
	'POS': 'possessive ending parent\'s','PRP': 'personal pronoun I, he, she',
	'PRP$': 'possessive pronoun my, his, hers','RB': 'adverb very, silently,',
	'RBR': 'adverb, comparative better','RBS': 'adverb, superlative best',
	'RP': 'particle give up','TO': 'to go \'to\' the store.','UH': 'interjection errrrrrrrm',
	'VB': 'verb, base form take','VBD': 'verb, past tense took',
	'VBG': 'verb, gerund/present participle taking','VBN': 'verb, past participle taken',
	'VBP': 'verb, sing. present, non-3d take','VBZ': 'verb, 3rd person sing. present takes',
	'WDT': 'wh-determiner which','WP': 'wh-pronoun who, what','WP$': 'possessive wh-pronoun whose',
	'WRB': 'wh-abverb where, when','QF' : 'quantifier, bahut, thoda, kam (Hindi)','VM' : 'main verb',
	'PSP' : 'postposition, common in indian langs','DEM' : 'demonstrative, common in indian langs'
	}

	import pandas as pd

	#extract parts of speech
	def extract_pos(doc):
	    parsed_text = {'word':[], 'pos':[], 'exp':[]}
	    for sent in doc.sentences:
	        for wrd in sent.words:
	            # fall back to 'NA' for tags that are missing from pos_dict
	            pos_exp = pos_dict.get(wrd.pos, 'NA')
	            parsed_text['word'].append(wrd.text)
	            parsed_text['pos'].append(wrd.pos)
	            parsed_text['exp'].append(pos_exp)
	    #return a dataframe of pos and text
	    return pd.DataFrame(parsed_text)


Once we call the extract_pos(hindi_doc) function, we will be able to see the POS tags for each word in the input sequence along with their explanations:

An interesting fact about StanfordNLP is that its POS tagger performs accurately for a majority of words. It is even able to pick the tense of a word (past, present or future) and whether the word is in base or plural form.

If you want to read more about StanfordNLP and how you can use it for other tasks, feel free to read this article.

India nlp library in python

To recap, here are two excellent Python NLP libraries for Indian languages:

1. Indic NLP Library:

Focus: Built specifically for handling common text processing and NLP tasks in Indian languages.
Strengths:
- Wide range of functionalities, including text normalization, script identification, tokenization, word segmentation, script conversion (romanization, indicization, transliteration), and translation.
- Designed to leverage the commonalities between Indic languages for a general solution.
Installation: pip install indic-nlp-library
Documentation: While an official documentation website might not be readily available, you can find comprehensive information and examples in the project’s GitHub repository: https://github.com/anoopkunchukuttan/indic_nlp_library

2. iNLTK (Natural Language Toolkit for Indic Languages):

Focus: Inspired by the popular NLTK library, iNLTK provides features tailored for NLP tasks in Indian languages.
Strengths:
- Intuitive and easy-to-use API for tasks like text processing, tokenization, sentence similarity, and word embedding generation.
- Good fit for developers familiar with NLTK.
Installation: pip install inltk
Documentation: You can find detailed documentation for iNLTK online, though it might not be as actively maintained as the Indic NLP Library, so consider searching for up-to-date information if needed.

Choosing the Right Library:

The best choice depends on your specific needs:

If you require a comprehensive set of features specifically designed for Indic languages, the Indic NLP Library is a strong contender.
If you’re already comfortable with NLTK and prefer a familiar API, iNLTK could be a good option.

Conclusion

You may have noticed in this article that there are useful libraries for performing NLP on Indian languages, but they still have a long way to go in terms of functionality when compared with the likes of spaCy, NLTK and other NLP libraries that mainly support European languages. I hope this article has given you a solid starting point for NLP in Indian languages.

The good news is that research in multilingual NLP has only grown over the last couple of years, and in no time you should be able to see a plethora of options to choose from.

Have you worked with Indian languages before? Do you think there is a library that should be on this list? If yes, mention in the comments below!

Mohd Sanad Zaki Rizvi

A computer science graduate, I have previously worked as a Research Assistant at the University of Southern California (USC-ICT), where I employed NLP and ML to make better virtual STEM mentors. My research interests include using AI and its allied fields of NLP and Computer Vision for tackling real-world problems.


Raymond Doctor

Great post and useful to all working with Indic. One remark. The breakup of जगदीशचंद्र in orthographic syllabification is incorrect. ज ग दी श च ंद्र- The nasal should affix itself to च ज ग दी श चं द्र as per the rules of akshar or the Indic syllable. The split is not correct and may lead to errors if deployed in areas such as TTS.

Hi Raymond, Thanks for pointing it out, you are right. Authors of the Indic NLP Library do mention this as one of the two exceptions to the Orthographic Syllabification process: "The characters "anusvaara" and "chandrabindu" are part of the OS to the left if they represents nasalization of the vowel/consonant or start a new OS if they represent a nasal consonant."


Ravindar

Great effort in making such great stuff in understanding of Indian languages using NLP

Arihant

Awesome Article . Keep up the Good work Man !! Which version of Python you used for this ? I am facing couple of errors while trying this - setup('hi') RuntimeError: This event loop is already running output = get_similar_sentences('मैं आज बहुत खुश हूं', 5, 'hi') AttributeError: 'LSTM' object has no attribute '_flat_weights_names' I did have torch torch==1.4.0 version.
