NLP, or Natural Language Processing, is the science of processing, understanding, and generating human language by machines. With NLP, information can be extracted from unstructured data, models can be trained to generate responses to human queries, and text can be classified into appropriate categories. News articles, social media posts, and online reviews are some publicly available sources that are rich in information. NLP is used to derive meaningful insights from these sources, but training NLP algorithms directly on free-form text can introduce a lot of noise and add unnecessary complexity. To derive meaningful insights from such unstructured data, it first needs to be cleansed and brought to a level appropriate for analysis.
This article covers some of the widely used preprocessing steps and shows how to understand the structure and vocabulary of the text, along with the corresponding code in Python. The exact list of steps depends on the quality of the text, the objective of the study, and the NLP task to be performed.
Document 1 – Natural Language Processing (NLP) is a field within Artificial Intelligence (AI) that is concerned with how computers deal with human language. We are already interacting with such machines in our day-to-day life in the form of IVRs & chat-bots. But do Machines really understand human language, context, syntax, semantics, etc.? Yes, they can be trained to do so!
Document 2 – Google has trained its search engine to make autofill recommendations as text is typed using NLP. Google’s search engine has the capability of
understanding the meaning of words depending on the context in the search. Google’s “state-of-the-art” search engine is one of the most sophisticated examples of NLP.
Document 3 – Origination of Natural Language Processing dates back to the II world war when there was a need for machine translation between Russian & English. Today, NLP has expanded beyond these two languages and can deal with most languages, including sign language.
Document 4 – NLP is actively used for a variety of day-to-day activities like spam detection, recruitment, smart assistants, understanding customer behaviour & so on…… Usage and impact of NLP are growing exponentially, across a wide range of industries.
Document 5 – Acronym NLP is used for both Natural Language Processing & Neuro-Linguistic Programming, but, these are completely different fields of science. Neuro-Linguistic Programming deals with human-to-human interaction, whereas Natural Language Processing deals with human-to-computer interaction.
Loading the above documents into a pandas DataFrame in Python:
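The loading step itself is not shown in the extract; a minimal sketch (assuming the five documents are held in plain Python strings, abridged here, and that the column names match those used in the tokenization code later: ‘doc_id’ and ‘text’) could look like:

```python
import pandas as pd

# the five example documents, abridged for brevity
docs = [
    "Natural Language Processing (NLP) is a field within Artificial Intelligence (AI)...",
    "Google has trained it's search engine to make autofill recommendations...",
    "Origination of Natural Language Processing dates back to the II world war...",
    "NLP is actively used for a variety of day-to-day activities like spam detection...",
    "Acronym NLP is used for both Natural Language Processing & Neuro-Linguistic Programming...",
]

# one row per document, with a numeric document ID for reference
text_doc = pd.DataFrame({'doc_id': range(1, len(docs) + 1), 'text': docs})
print(text_doc)
```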
   Document ID                                               Text
0            1  Natural Language Processing (NLP) is a field w...
1            2  Google has trained it's search engine to make ...
2            3          Origination of Natural Language Proces...
3            4  NLP is actively used for a variety of day-to-d...
4            5     Acronym NLP is used for both Natural Languag...
Breaking paragraphs into sentences makes them more manageable, and the context within each sentence can be better understood.
In the first document, the first couple of sentences are informative, followed by a question in a rising tone. The tone of an individual sentence gets masked when looking at the complete extract, and cannot be deciphered once the text is further broken into tokens.
from nltk.tokenize import sent_tokenize

# breaking every document into sentences
doc_w_sent = [sent_tokenize(text) for text in text_doc.text]

# creating document ID & sentence ID for reference
doc_num_list = [[x] * y for x, y in zip(text_doc.doc_id, [len(doc) for doc in doc_w_sent])]
sentence_num_list = [list(range(1, len(doc) + 1)) for doc in doc_w_sent]

# un-nesting lists
doc_w_sent = [x for element in doc_w_sent for x in element]
doc_num_list = [x for element in doc_num_list for x in element]
sentence_num_list = [x for element in sentence_num_list for x in element]

# creating dataframe
text_data = pd.DataFrame({'Document ID': doc_num_list,
                          'Sentence ID': sentence_num_list,
                          'Text': doc_w_sent})
print(text_data)
NLTK is a Python package for NLP tasks. As the name suggests, ‘sent_tokenize’ breaks paragraphs into sentences based on end-of-sentence punctuation marks: the period, question mark, and exclamation mark.
Output:

    Document ID  Sentence ID                                               Text
0             1            1  Natural Language Processing (NLP) is a field w...
1             1            2  We are already interacting with such machines ...
2             1            3  But do Machines really understand human langua...
3             1            4                 Yes, they can be trained to do so!
4             2            1  Google has trained it's search engine to make ...
5             2            2   Google's search engine has the capability of ...
6             2            3  Google's "state-of-the-art" search engine is o...
7             3            1          Origination of Natural Language Proces...
8             3            2  Today, NLP has expanded beyond these two langu...
9             4            1  NLP is actively used for a variety of day-to-d...
10            4            2  Usage and impact of NLP is growing exponential...
11            5            1     Acronym NLP is used for both Natural Languag...
12            5            2  Neuro Linguistic Programming deals with human-...
A list is created with each sentence as an individual element. From the “Document ID” and “Sentence ID” columns, it can be inferred that documents 1 through 5 contain 4, 3, 2, 2, and 2 sentences respectively.
Part-of-speech tagging is the process of assigning a label (part of speech) to each word in a sentence based on the context in which it is used and the meaning of the word. It is critical for Named Entity Recognition (NER), understanding relationships between words, developing linguistic rules, and lemmatization.
import nltk

pos_tag = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in text_data.Text]
print(pos_tag)
NLTK provides part-of-speech tagging, which is run on every word in each sentence.
[[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('within', 'IN'), ('Artificial', 'JJ'), ('Intelligence', 'NNP'), ('(', '('), ('AI', 'NNP'), (')', ')'), ('that', 'WDT'), ('is', 'VBZ'), ('concerned', 'VBN'), ('with', 'IN'), ('how', 'WRB'), ('computers', 'NNS'), ('deal', 'VBP'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')], [('We', 'PRP'), ('are', 'VBP'), ('already', 'RB'), ('interacting', 'VBG'), ('with', 'IN'), ('such', 'JJ'), ('machines', 'NNS'), ('in', 'IN'), ('our', 'PRP$'), ('day-to-day', 'JJ'), ('life', 'NN'), ('in', 'IN'), ('the', 'DT'), ('form', 'NN'), ('of', 'IN'), ('IVRs', 'NNP'), ('&', 'CC'), ('chat-bots', 'NNS'), ('.', '.')], [('But', 'CC'), ('do', 'VBP'), ('Machines', 'NNS'), ('really', 'RB'), ('understand', 'VBP'), ('human', 'JJ'), ('language', 'NN'), (',', ','), ('context', 'NN'), (',', ','), ('syntax', 'NN'), (',', ','), ('semantics', 'NNS'), (',', ','), ('etc', 'FW'), ('.', '.'), ('?', '.')], [('Yes', 'UH'), (',', ','), ('they', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('trained', 'VBN'), ('to', 'TO'), ('do', 'VB'), ('so', 'RB'), ('!', '.')], [('Google', 'NNP'), ('has', 'VBZ'), ('trained', 'VBN'), ('it', 'PRP'), ("'s", 'VBZ'), ('search', 'JJ'), ('engine', 'NN'), ('to', 'TO'), ('make', 'VB'), ('autofill', 'JJ'), ('recommendations', 'NNS'), ('as', 'IN'), ('text', 'NN'), ('is', 'VBZ'), ('typed', 'VBN'), ('using', 'VBG'), ('NLP', 'NNP'), ('.', '.')], [('Google', 'NNP'), ("'s", 'POS'), ('search', 'NN'), ('engine', 'NN'), ('has', 'VBZ'), ('the', 'DT'), ('capability', 'NN'), ('of', 'IN'), ('understanding', 'VBG'), ('meaning', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('depending', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('context', 'NN'), ('in', 'IN'), ('the', 'DT'), ('search', 'NN'), ('.', '.')], [('Google', 'NNP'), ("'s", 'POS'), ('``', '``'), ('state-of-the-art', 'JJ'), ("''", "''"), ('search', 'NN'), ('engine', 'NN'), ('is', 'VBZ'), ('one', 
'CD'), ('of', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('sophisticated', 'JJ'), ('examples', 'NNS'), ('of', 'IN'), ('NLP', 'NNP'), ('.', '.')], [('Origination', 'NN'), ('of', 'IN'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('dates', 'VBZ'), ('back', 'RB'), ('to', 'TO'), ('II', 'NNP'), ('world', 'NN'), ('war', 'NN'), (',', ','), ('when', 'WRB'), ('there', 'EX'), ('was', 'VBD'), ('a', 'DT'), ('need', 'NN'), ('for', 'IN'), ('machine', 'NN'), ('translation', 'NN'), ('between', 'IN'), ('Russian', 'NNP'), ('&', 'CC'), ('English', 'NNP'), ('.', '.')], [('Today', 'NN'), (',', ','), ('NLP', 'NNP'), ('has', 'VBZ'), ('expanded', 'VBN'), ('beyond', 'IN'), ('these', 'DT'), ('two', 'CD'), ('languages', 'NNS'), ('and', 'CC'), ('can', 'MD'), ('deal', 'VB'), ('with', 'IN'), ('most', 'JJS'), ('languages', 'NNS'), (',', ','), ('including', 'VBG'), ('sign', 'JJ'), ('language', 'NN'), ('.', '.')], [('NLP', 'NNP'), ('is', 'VBZ'), ('actively', 'RB'), ('used', 'VBN'), ('for', 'IN'), ('a', 'DT'), ('variety', 'NN'), ('of', 'IN'), ('day-to-day', 'JJ'), ('activities', 'NNS'), ('like', 'IN'), ('spam', 'NN'), ('detection', 'NN'), (',', ','), ('recruitment', 'NN'), (',', ','), ('smart', 'JJ'), ('assistants', 'NNS'), (',', ','), ('understanding', 'VBG'), ('customer', 'NN'), ('behavior', 'NN'), ('&', 'CC'), ('so', 'RB'), ('on', 'IN'), ('.', '.')], [('Usage', 'NN'), ('and', 'CC'), ('impact', 'NN'), ('of', 'IN'), ('NLP', 'NNP'), ('is', 'VBZ'), ('growing', 'VBG'), ('exponentially', 'RB'), (',', ','), ('across', 'IN'), ('a', 'DT'), ('wide', 'JJ'), ('range', 'NN'), ('of', 'IN'), ('industries', 'NNS'), ('.', '.')], [('Acronym', 'NNP'), ('NLP', 'NNP'), ('is', 'VBZ'), ('used', 'VBN'), ('for', 'IN'), ('both', 'DT'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('&', 'CC'), ('Neuro', 'NNP'), ('Linguistic', 'NNP'), ('Programming', 'NNP'), (',', ','), ('but', 'CC'), (',', ','), ('these', 'DT'), ('are', 'VBP'), ('completely', 'RB'), ('different', 'JJ'), ('fields', 
'NNS'), ('of', 'IN'), ('science', 'NN'), ('.', '.')], [('Neuro', 'NNP'), ('Linguistic', 'NNP'), ('Programming', 'NNP'), ('deals', 'NNS'), ('with', 'IN'), ('human-to-human', 'JJ'), ('interaction', 'NN'), (',', ','), ('where', 'WRB'), ('as', 'IN'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('deals', 'NNS'), ('with', 'IN'), ('human-to-computer', 'JJ'), ('interaction', 'NN'), ('.', '.')]]
The output is a list of tuples, one per token in each sentence. In the first sentence of the first document, the word “Processing” is tagged as a proper noun (NNP). But the word “processing” can also be used as a verb, as in “I am processing the data.” For machines to fully understand the meaning of text, identifying the correct part of speech is important.
Special characters usually do not add value to the text. One way they can be utilized is to parse the text on the occurrence of a particular special character, or to indicate the need for word expansions. Once they have served their purpose, they can be removed so that no redundant information is processed by the NLP algorithms.
from string import punctuation
import re

text_data.Text = [re.sub('[' + punctuation + ']', ' ', sent) for sent in text_data.Text]
for sent in text_data.Text:
    print(sent)
The ‘string’ package’s ‘punctuation’ constant contains the following characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. This list can be modified by adding or removing characters.
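As a self-contained illustration on a single sentence (using re.escape, which avoids relying on character-class quirks of the raw punctuation string):

```python
import re
from string import punctuation

sample = 'Google\'s "state-of-the-art" search engine is one of the most sophisticated examples of NLP.'
# replace every punctuation character with a single space
cleaned = re.sub('[' + re.escape(punctuation) + ']', ' ', sample)
print(cleaned)
```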
Output:
Natural Language Processing NLP is a field within Artificial Intelligence AI that is concerned with how computers deal with human language
We are already interacting with such machines in our day to day life in the form of IVRs chat bots
But do Machines really understand human language context syntax semantics etc
Yes they can be trained to do so
Google has trained it s search engine to make autofill recommendations as text is typed using NLP
Google s search engine has the capability of understanding meaning of words depending on the context in the search
Google s state of the art search engine is one of the most sophisticated examples of NLP
Origination of Natural Language Processing dates back to II world war when there was a need for machine translation between Russian English
Today NLP has expanded beyond these two languages and can deal with most languages including sign language
NLP is actively used for a variety of day to day activities like spam detection recruitment smart assistants understanding customer behavior so on
Usage and impact of NLP is growing exponentially across a wide range of industries
Acronym NLP is used for both Natural Language Processing Neuro Linguistic Programming but these are completely different fields of science
Neuro Linguistic Programming deals with human to human interaction where as Natural Language Processing deals with human to computer interaction
Punctuation marks are replaced by a single space.
Like special characters, certain words do not add any value to the text. These are called stop words, and they can belong to any part of speech. There is usually a general list of stop words that works for any NLP task, but it can be modified depending on the text.
For instance, finding “with” as the most frequent token, or even a noun like “language” as the most common token in the above paragraphs, is not useful information. Such redundant words should be removed to generate truly valuable insights.
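To see why stop words should be matched as whole words, here is a small sketch with a hand-picked stop list (the full list used in the code below comes from NLTK); the \b word boundaries prevent, for example, “is” being removed from inside “this”:

```python
import re

# a tiny, hand-picked stop list for illustration only
stop_words = {'is', 'a', 'that', 'with', 'how'}
# \b ensures only whole words are matched
pattern = r'\b(?:' + '|'.join(stop_words) + r')\b'

sent = "NLP is a field that is concerned with how computers deal with human language"
print(re.sub(pattern, ' ', sent))
```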
from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('english'))
# match stop words as whole words only, so e.g. 'is' inside 'this' is untouched
stopwords_all = r'\b(?:' + '|'.join(re.escape(s) for s in stop_words) + r')\b'
text_data.Text = [re.sub(stopwords_all, ' ', sent) for sent in text_data.Text]
for sent in text_data.Text:
    print(sent)
NLTK contains a predefined set of stop words for various languages. This set can be altered to add/remove any word, depending on the context and quality of the text.
Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language
We already interacting machines day day life form IVRs chat bots
But Machines really understand human language context syntax semantics etc
Yes trained
Google trained search engine make autofill recommendations text typed using NLP
Google search engine capability understanding meaning words depending context search
Google state art search engine one sophisticated examples NLP
Origination Natural Language Processing dates back II world war need machine translation Russian English
Today NLP expanded beyond two languages deal languages including sign language
NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior
Usage impact NLP growing exponentially across wide range industries
Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science
Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction
Helping verbs such as “is”/“are”, determiners such as “that”/“these”, pronouns such as “they”, and prepositions such as “with” have been removed. Comparing the first sentence of the first document, “is a”, “that is”, “with”, and “how” have been dropped.
White-space characters are very commonly found in any text, and they can also be introduced as part of other cleansing steps. Extra white space is unnecessary and irrelevant; the code below removes it.
import re

text_data.Text = [re.sub(r'\s+', ' ', sent).strip() for sent in text_data.Text]
for sent in text_data.Text:
    print(sent)
‘\s’ matches any white-space character: space, tab (‘\t’), newline (‘\n’), carriage return (‘\r’), and form feed (‘\f’), so a single ‘\s+’ pattern covers them all. The ‘strip()’ function removes any leading and trailing spaces.
Output:
Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language
We already interacting machines day day life form IVRs chat bots
But Machines really understand human language context syntax semantics etc
Yes trained
Google trained search engine make autofill recommendations text typed using NLP
Google search engine capability understanding meaning words depending context search
Google state art search engine one sophisticated examples NLP
Origination Natural Language Processing dates back II world war need machine translation Russian English
Today NLP expanded beyond two languages deal languages including sign language
NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior
Usage impact NLP growing exponentially across wide range industries
Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science
Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction
Multiple white space characters are replaced by a single space and any leading and trailing blanks are removed.
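A quick self-contained check of the pattern:

```python
import re

# a string with repeated spaces, a tab, a newline, and leading/trailing blanks
messy = "  Natural   Language\tProcessing \n NLP  "
print(re.sub(r'\s+', ' ', messy).strip())  # Natural Language Processing NLP
```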
A Document-Term Matrix provides structure to unstructured data. It is a basic way of creating fixed-length input for machine-learning algorithms: every document is represented as a row and every token as a column. The two most common value sets are count and TF-IDF. As the name suggests, the count is the total number of occurrences of a token in a document. TF-IDF (Term Frequency – Inverse Document Frequency) is the term frequency (the count of a token in a document divided by the count of all tokens in that document) weighted by the inverse document frequency (the logarithm of the total number of documents divided by the number of documents containing the token). It assesses how common or rare a word is across the corpus, and is used to find tokens that can distinguish and classify documents.
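The classic formulation can be sketched in a few lines of plain Python (note that scikit-learn's TfidfVectorizer, used further below, applies a smoothed IDF and L2 normalization, so its numbers differ from this textbook version):

```python
import math

# toy corpus of pre-tokenized documents
docs = [["nlp", "is", "fun"], ["nlp", "nlp", "rocks"]]
N = len(docs)

def tf(token, doc):
    # proportion of the document's tokens that are `token`
    return doc.count(token) / len(doc)

def idf(token):
    # log of (total documents / documents containing the token)
    df = sum(token in doc for doc in docs)
    return math.log(N / df)

# "nlp" appears in every document, so its IDF (and TF-IDF) is zero
print(tf("nlp", docs[1]) * idf("nlp"))              # 0.0
# "rocks" appears in only one document, so it gets a positive weight
print(tf("rocks", docs[1]) * idf("rocks"))
```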
# roll up cleansed sentences at document level
text_doc_cleansed = text_data.groupby('Document ID')['Text'].apply(list)
text_doc_cleansed = [' '.join(doc) for doc in text_doc_cleansed]
print(text_doc_cleansed)
Rolled up sentences at the document level.
['Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language We already interacting machines day day life form IVRs chat bots But Machines really understand human language context syntax semantics etc Yes trained', 'Google trained search engine make autofill recommendations text typed using NLP Google search engine capability understanding meaning words depending context search Google state art search engine one sophisticated examples NLP', 'Origination Natural Language Processing dates back II world war need machine translation Russian English Today NLP expanded beyond two languages deal languages including sign language', 'NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior Usage impact NLP growing exponentially across wide range industries', 'Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction']
Count
from sklearn.feature_extraction.text import CountVectorizer

countvectorizer = CountVectorizer()
countvectors = countvectorizer.fit_transform(text_doc_cleansed)
countfeature_names = countvectorizer.get_feature_names_out()
count_df = pd.DataFrame(countvectors.toarray(), columns=countfeature_names)
print(count_df)
Output:
   acronym  across  actively  activities  ai  already  art  artificial  assistants  autofill  back  ...  usage  used  using  variety  war  we  wide  within  words  world  yes
0        0       0         0           0   1        1    0           1           0         0     0  ...      0     0      0        0    0   1     0       1      0      0    1
1        0       0         0           0   0        0    1           0           0         1     0  ...      0     0      1        0    0   0     0       0      1      0    0
2        0       0         0           0   0        0    0           0           0         0     1  ...      0     0      0        0    1   0     0       0      0      1    0
3        0       1         1           1   0        0    0           0           1         0     0  ...      1     1      0        1    0   0     1       0      0      0    0
4        1       0         0           0   0        0    0           0           0         0     0  ...      0     1      0        0    0   0     0       0      0      0    0
The output is a DataFrame with 5 rows, one per document, and 113 columns, one per token.
TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer()
tfidfvectors = tfidfvectorizer.fit_transform(text_doc_cleansed)
tfidffeature_names = tfidfvectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidfvectors.toarray(), columns=tfidffeature_names)
print(tfidf_df)
Output:
    acronym    across  actively  activities        ai   already       art  artificial  assistants  ...     using   variety       war        we      wide    within     words     world       yes
0  0.000000  0.000000  0.000000    0.000000  0.161541  0.161541  0.000000    0.161541    0.000000  ...  0.000000  0.000000  0.000000  0.161541  0.000000  0.161541  0.000000  0.000000  0.161541
1  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.138861    0.000000    0.000000  ...  0.138861  0.000000  0.000000  0.000000  0.000000  0.000000  0.138861  0.000000  0.000000
2  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000    0.000000    0.000000  ...  0.000000  0.000000  0.201746  0.000000  0.000000  0.000000  0.000000  0.201746  0.000000
3  0.000000  0.204921  0.204921    0.204921  0.000000  0.000000  0.000000    0.000000    0.204921  ...  0.000000  0.204921  0.000000  0.000000  0.204921  0.000000  0.000000  0.000000  0.000000
4  0.161969  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000    0.000000    0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
The dimensions of the TF-IDF data frame are the same as those of the count data frame. Comparing the two vectorizations: with counts, tokens that appear the same number of times, whether within a document or across documents, get the same value. With TF-IDF, tokens with the same number of occurrences can receive different values, depending on the length of the document and, as mentioned earlier, on the number of documents they appear in.
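The effect can be seen on a hypothetical two-document corpus (again, scikit-learn's TF-IDF uses a smoothed IDF and L2 normalization):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["nlp is fun", "nlp nlp rocks"]  # toy corpus
count = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# vocabulary is sorted alphabetically: fun, is, nlp, rocks
print(count)
print(tfidf.round(3))
# "nlp" occurs in both documents, so in document 0 TF-IDF gives it a
# lower weight than "fun", even though both have count 1 there
```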
A word cloud is a text-visualization technique in which tokens are printed with a size proportional to their frequency or importance in the text.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(background_color="white", width=3000, height=2000,
                      max_words=500).generate_from_frequencies(tfidf_df.T.sum(axis=1))
plt.imshow(wordcloud)
The above steps are not an exhaustive recipe for cleansing text and gaining a full understanding of its structure, syntax, and semantics. There are further tasks such as changing case, expanding contractions, harmonizing text (where the same entity is represented in more than one way), spell checking, dependency parsing, and so on. The steps to be performed depend on the NLP objective: for text summarization or classification, the text would be studied in its entirety, whereas for Named Entity Recognition or part-of-speech tagging, paragraphs might be broken into sentences or tokens. The quality of the text and the objective of the study play a huge role in determining the level of preprocessing.