A Guide to Perform 5 Important Steps of NLP Using Python

Aishwarya Last Updated : 17 Aug, 2021

7 min read

This article was published as a part of the Data Science Blogathon

Introduction

Natural Language Processing is a popular machine learning technique used to analyze text content. We see a lot of fancy reports around us and a lot of companies use business intelligence insights to drive their business. Most of these insights and reports are created using structured data. There are still some use cases for unstructured data. These could be in the form of text, tweets, images, etc. NLP focuses on bringing out meaningful insights from these text-based sources.

Some examples of NLP include sentiment analysis. So if you have a company and have newly launched a product, you can analyze the sentiments of the users via their tweets. Even product reviews on your website can be analyzed in the same way.

Challenges of NLP

So what seems to be the challenge here?

Let us take an example of a review: “The product is extraordinarily bad”
Extraordinary is usually referred to in a positive way. If we were to use a keyword-based approach and tag it using the word extraordinary, then it would be incorrect. This is where NLP comes in. These situations where oxymorons are used need to be handled carefully.
Another challenge is in terms of similar words as well as ambiguous meanings.
Irony and sarcasm is difficult for a machine to understand

Advantages of NLP

Some of the advantages of NLP include:

Can work with unstructured data.
More insights on the sentiments of a customer.
Chatbots and other such AI/ML-based devices/technologies are being improved upon.

Steps involved in NLP

Let us take a look at the basic steps involved in running a simple NLP algorithm using a news article dataset.

I have imported the required libraries for this data processing using NLP. Post that I have imported the file from my local system. You can see how the file looks like. Each row has a news article containing some text as well as an index value.

import gensim
import numpy
#numpy.numpy.random.bit_generator = numpy.numpy.random._bit_generator
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

import pandas as pd

data = pd.read_csv('C:\Users\ktk\Desktop\BBC News Test.csv', error_bad_lines=False);

data

data_text = data[['Text']]

data_text['index'] = data.ArticleId

documents = data_text

Tokenization

This is the first major step to be done to any data. So what does this step do? Imagine you have a 100-word document. You need to split the document into 100 separate words in order to identify the keywords and the major topics. This process is called tokenization. I have used an example where I have imported the data sets and used a gensim library for all the preprocessing steps.

This library has a preprocess function that helps tokenize the keywords. I have used a function called preprocess to help pick out the keywords. Different libraries have different functions for this process.

processed_docs = documents['Text'].map(preprocess)
processed_docs[:10]

You can also remove the punctuation in this same step. There are functions for the same as well. Since this particular dataset does not have any punctuation, I have not used the punctuation removal functions.

Stop Word Removal

You have a huge dataset or several articles. In these articles, you will find that a lot of words like, “is”, “was”, “were”, etc are present. These words do not technically add any value to the main topic. These are tagged as stop words. There are a number of stop word removal techniques that can be used to remove these stop words. This will help us to arrive at the topic of focus.

import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
stop_words = stopwords.words('english')
output = [w for w in processed_docs if not w in stop_words]
print("n"+str(output[0]))

I have used stop word function present in the NLTK library. The first list contains the list of stop words considered by the system. The second list contains the list of words after the stop words have been removed.

We will be left with only the keywords once the stop words are removed. This step is important for any NLP processing.

Stemming

Stemming means cutting out the other parts of a word and keeping only the stem (i.e. the important part of the word). In English, we add prefixes and suffixes to a word to form different words/tense forms of the same word.

For example, the root word stem can take the form of stemming or stems. The stemming process will remove the suffix to give out the word – stem. I have performed both the stemming as well as the lemmatization process explained in the next step together. The code snippet for both is attached together in the next step. I have attached an example for stemming in the code below. You can notice that the word “queens” has been stemmed to “queen“.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
a = doc_sample.split(' ')
for w in a:
print(w, " : ", ps.stem(w))

Another example is the word ecosystem. The root word for this is “eco” while the derived word is “ecosystem“. You do not need to be a grammar expert to perform stemming. Python has libraries that support the stemming process.

Lemmatization

Lemmatization is similar to stemming but is different in a complex way. Stemming simply cuts out the prefix or the suffix without thinking whether the remaining root word makes sense or not. Lemmatization on the other hand looks at the stemmed word to check whether it makes sense or not.

For example, the word “care” when stemmed will give out “car” but when lemmatized will give out “care”. The root word care is called a lemma.

So why is lemmatization very important?

Lemmatization helps in the disambiguation of words. It brings out the actual meaning of the word. So if you have multiple words which share a similar meaning, lemmatization can help sort this out. Hence, this is a very important step for your NLP process.

def lemmatize_stemming(text):
snow_stemmer = SnowballStemmer(language='english')
return snow_stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
result = []
for token in gensim.utils.simple_preprocess(text):
if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
result.append(lemmatize_stemming(token))
return result

doc_sample = documents[documents['index'] == 1018].values[0][0]

print('original document: ')

words = []

for word in doc_sample.split(' '):

words.append(word)

print(words)

print('nn tokenized and lemmatized document: ')

print(preprocess(doc_sample))

You can see the steps used to stem and lemmatize the same news article document. Here, I have used a snowball stemmer for this process.

Modelling

Modeling your text is very important if you want to find out the core idea of the text. I the case of supervised machine learning, we use logistic regression or linear regression, etc to model the data. In those cases, we have the output variable which we use to train the model. In this case, since we do not have the output variable, we rely on unsupervised techniques.

There are a lot of good algorithms to help model the text data. Two of the most commonly used are the SVD (Singular Value decomposition) and LDA (latent Dirichlet allocation). These are widely used across the industry and are pretty simple to understand and implement.

LDA is a probabilities algorithm that focuses on iteratively assigning the probability of a word belonging to a topic. I have used LDA here to identify the possible topics for an article.

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

Here, you can see the probabilities being listed out for each article. Each keyword has a value that states the likeliness of the word being the keyword.

Conclusion

What I have listed out are some of the key steps in NLP. NLP is a dimension onto itself. To fully understand the magnitude of it, we need to first understand how deep any language can be. Since, NLP focuses on text data based on language, irony, sarcasm, comedy, trauma, horror, and many more such things need to be considered.

On a parting note, I wish to bring to your attention that the possibilities using NLP are limitless. The industry has realized the value of the text data of late and has started exploring more on this. Even the automated chatbots which pass the turning test have some amount of NLP embedded in them.

About the Author

Hi there! This is Aishwarya Murali, currently working in the analytics division of Karnataka Bank’s – Digital Centre of Excellence in Bangalore. Some of my interesting projects include ML-based scorecards for loan journey automation, customer segmentation, and improving the market share via selective profiling of customers using some machine learning analytics.

I have a master’s in computer applications and have done certification in Business Analytics from IIM-K. Currently, I am working on R&D innovations at my workplace.

You can connect with me at

https://www.linkedin.com/in/aishwarya-murali-b710a978/

You can also mail me at

aishwarya.prabhamurali@gmail.com

Image source – https://www.forbes.com/sites/bernardmarr/2020/09/11/4-simple-ways-businesses-can-use-natural-language-processing/?sh=58bdfe13a5f3

https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html

https://www.blumeglobal.com/learning/natural-language-processing/

How Does Natural Language Processing (NLP) help Chatbots?

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Aishwarya

Hi there! This is Aishwarya Murali, currently working in the analytics division of HSBC in Bangalore. Some of my interesting projects include ML-based scorecards for loan journey automation, customer segmentation, and improving the market share via selective profiling of customers using some machine learning analytics.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

A Guide to Perform 5 Important Steps of NLP Using Python

Introduction

Challenges of NLP

Advantages of NLP

Steps involved in NLP

Tokenization

Stop Word Removal

Stemming

Lemmatization

Modelling

Conclusion

About the Author

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm