Natural Language Processing: Step by Step Guide

Amruta Last Updated : 26 Feb, 2024


Introduction

NLP stands for Natural Language Processing, a field at the intersection of computer science, linguistics, and artificial intelligence. It is the technology computers use to understand, analyze, manipulate, and interpret human languages. NLP algorithms, leveraged by data scientists and machine learning professionals, are used in areas such as Gmail spam filtering, search, games, and many more. These algorithms employ techniques such as neural networks to process and interpret text, enabling tasks like sentiment analysis, document classification, and information retrieval. Beyond that, today we have built complex deep learning architectures like transformers, which are used to build the language models at the core of GPT, Gemini, and the like.

Learning Objective

Basic understanding of Natural Language Processing.
Learn Various Techniques used for the implementation of NLP.
Understand how to use NLP for text mining.

This article was published as a part of the Data Science Blogathon

Why NLP is so important?
Components of NLP
- Natural Language Understanding
- Natural Language Generation
Phases of NLP
Implementation of NLP using Python
Advantages of NLP
Disadvantages of NLP
Everyday NLP examples
Frequently Asked Questions

Why NLP is so important?

Text data in a massive amount

NLP helps machines interact with humans in their own language and perform related tasks like reading text, understanding speech, and interpreting it in a well-structured format. Nowadays machines can analyze far more data than humans can, and more efficiently. Every day a huge amount of data is generated in fields such as the medical and pharma industries and on social media like Facebook and Instagram. This data is mostly unstructured, so processing it by hand is tedious; that’s why we need NLP. We need NLP for tasks like sentiment analysis, machine translation, part-of-speech (POS) tagging, named entity recognition, creating chatbots, comment segmentation, question answering, etc.

Unstructured data to structured

Supervised learning, unsupervised learning, and deep learning are now extensively used to process human language, and all of them need a proper understanding of the text. I am going to explain this understanding in this article. NLP is very important for getting exact or useful insights from text, so that meaningful information can be gathered.

Components of NLP

NLP is divided into two components.

Natural Language Understanding
Natural Language Generation

Natural Language Understanding

Natural Language Understanding (NLU) helps the machine to understand and analyze human language by extracting information such as keywords, emotions, relations, and semantics from large amounts of text.

Let’s see what challenges a machine faces.

For example:

He is looking for a match.

What do you understand by the word ‘match’? Does it mean a partner, a cricket or football match, or something else?

This is lexical ambiguity. It happens when a word has more than one meaning. Lexical ambiguity can sometimes be resolved with part-of-speech (POS) tagging, when the meanings belong to different parts of speech; otherwise the surrounding context is needed.

The fish is ready to eat.

What do you understand by the above example? Is the fish ready to eat its food, or is the fish ready for someone to eat? Confusing, right?

This is syntactic ambiguity, also called grammatical ambiguity: a sequence of words can be read with more than one meaning.
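To make lexical ambiguity concrete, here is a minimal sketch that tries to pick a sense of “match” by counting context words that hint at each sense. The cue-word sets and the function name `pick_sense` are made up for illustration and do not come from any library; real systems use far richer context.

```python
import re

# Hypothetical cue words for two senses of "match" (illustrative only)
SENSES = {
    "game": {"sports", "cricket", "football", "team", "play", "score"},
    "partner": {"partner", "marriage", "relationship", "spouse", "wedding"},
}

def pick_sense(sentence, senses=SENSES):
    """Return the sense with the most cue words in the sentence, or None on a tie."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {name: len(words & cues) for name, cues in senses.items()}
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None

print(pick_sense("He is looking for a match."))
print(pick_sense("He is looking for a match to watch his cricket team play"))
print(pick_sense("He is looking for a match for marriage"))
```

The bare sentence gives no cue for either sense, so the sketch returns None, which is exactly the ambiguity described above. Only extra context lets it choose.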

Natural Language Generation

Natural Language Generation (NLG) is the process of producing meaningful phrases and sentences in natural language from data or an internal representation.

It includes −

Text planning − It includes retrieving the relevant data from the domain.
Sentence planning − It is nothing but a selection of important words, meaningful phrases, or sentences.
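The two steps above can be sketched with a tiny template-based generator. Everything here is hypothetical: the record fields, the function names `plan_text` and `plan_sentence`, and the sentence template are assumptions chosen only to illustrate the idea.

```python
# Hypothetical domain data: a summary record for one product
record = {"product": "dress", "rating": 4.5, "reviews": 120, "warehouse": "B7"}

def plan_text(data):
    """Text planning: keep only the fields relevant to the message."""
    return {key: data[key] for key in ("product", "rating", "reviews")}

def plan_sentence(content):
    """Sentence planning: choose words and phrases for the selected content."""
    opinion = "well liked" if content["rating"] >= 4 else "not well liked"
    return (
        f"The {content['product']} is {opinion}, with an average rating of "
        f"{content['rating']} from {content['reviews']} reviews."
    )

sentence = plan_sentence(plan_text(record))
print(sentence)
```

Note that the warehouse field never reaches the sentence: dropping it is the text planning step, while turning the rating into “well liked” is the sentence planning step.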

Phases of NLP

Lexical Analysis

It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the text into paragraphs, sentences, and words. Because the same word can appear in many surface forms, we then need to perform lexicon normalization.

The most common lexicon normalization techniques are stemming and lemmatization:

Stemming: Stemming is the process of reducing derived words to their word stem, base, or root form, generally by stripping suffixes such as “ing”, “ly”, “es”, “s”, etc.
Lemmatization: Lemmatization is the process of reducing a group of words to their lemma or dictionary form. It takes into account things like the POS (part of speech), the meaning of the word in the sentence, the meaning of the word in the nearby sentences, etc. before reducing the word to its lemma.
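The difference between the two techniques is easier to see side by side. The sketch below is a toy: `toy_stem` only strips the suffixes listed above and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hand-written dictionary. Both names and the dictionary entries are assumptions for illustration; in practice you would use something like NLTK’s PorterStemmer and WordNetLemmatizer.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # the suffixes mentioned above

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma
LEMMAS = {
    ("better", "adj"): "good",
    ("ran", "verb"): "run",
    ("studies", "noun"): "study",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get((word, pos), word)

for w in ["playing", "quickly", "boxes", "studies"]:
    print(w, "->", toy_stem(w))
print(toy_lemmatize("studies", "noun"), toy_lemmatize("better", "adj"))
```

Stemming turns “studies” into “studi”, which is not a real word, while the lemma lookup returns “study”. That is the practical trade-off: stemming is fast and rule-based, lemmatization needs a dictionary and the part of speech.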

Syntactic Analysis

Syntactic analysis is used to check grammar, the arrangement of words, and the relationships between words.

Example: Mumbai goes to the Sara

Here “Mumbai goes to the Sara” does not make any sense, so this sentence is rejected by the syntactic analyzer.

Syntactic parsing involves analyzing the words in a sentence against the rules of grammar. Dependency grammar and part-of-speech (POS) tags are important attributes of a text’s syntax.
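As a rough illustration of how POS tags can drive a grammar check, the sketch below tags words from a tiny hand-written lexicon and applies a single toy rule: a determiner must be followed by a common noun. The lexicon, the tag names, and the rule are all assumptions made for this example; a real syntactic analyzer uses a full grammar or a trained parser.

```python
# Hypothetical mini-lexicon: word -> part-of-speech tag
LEXICON = {
    "mumbai": "PROPN", "sara": "PROPN", "goes": "VERB",
    "to": "PREP", "the": "DET", "market": "NOUN",
}

def tag(sentence):
    """Look up a tag for every word; unknown words get 'UNK'."""
    return [LEXICON.get(word, "UNK") for word in sentence.lower().split()]

def is_grammatical(sentence):
    """Apply one toy rule: every DET must be directly followed by a NOUN."""
    tags = tag(sentence)
    for current, following in zip(tags, tags[1:] + ["END"]):
        if current == "DET" and following != "NOUN":
            return False
    return True

print(tag("Mumbai goes to the Sara"))
print(is_grammatical("Mumbai goes to the Sara"))
print(is_grammatical("Sara goes to the market"))
```

The first sentence is rejected because “the” is followed by a proper noun, while the second passes the rule.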

Semantic Analysis

Semantic analysis retrieves the possible meanings of a sentence that is clear and semantically correct. It is the process of retrieving meaningful insights from text.

Discourse Integration

It is nothing but a sense of context: the meaning of a sentence or word depends on the sentences or words that come before it. A typical case is working out what a pronoun refers to.

For example, Ram wants it.

In the above statement, the word “it” does not make sense on its own, because it refers to something we don’t know. The meaning of “it” depends on the previous sentence, which is not given. Once we know what “it” refers to, we can easily resolve the reference.
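Here is a minimal sketch of that idea: it replaces “it” with the last known noun from the previous sentence, and leaves “it” alone when there is no context. The noun list and the function name `resolve_it` are hypothetical; real coreference resolution is much more involved.

```python
import re

# Hypothetical list of nouns the toy resolver knows about
KNOWN_NOUNS = {"book", "laptop", "ticket"}

def resolve_it(previous_sentence, sentence):
    """Replace 'it' with the last known noun mentioned in the previous sentence."""
    words = re.findall(r"[a-z]+", previous_sentence.lower())
    candidates = [word for word in words if word in KNOWN_NOUNS]
    if not candidates:
        return sentence  # no context, so "it" stays unresolved
    return re.sub(r"\bit\b", "the " + candidates[-1], sentence)

print(resolve_it("", "Ram wants it."))
print(resolve_it("Sita bought a new laptop.", "Ram wants it."))
```

Without the previous sentence the statement comes back unchanged; with it, “it” can be tied to “the laptop”.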

Pragmatic Analysis

It is the study of how meaning depends on context in a given language, and the process of extracting insights from the text. It includes the repetition of words, who said what to whom, etc.

It captures how people communicate with each other, the context in which they are talking, and many other aspects.

Okay! At this point, we have covered the basic concepts of NLP.

Now we will put these ideas into practice, so let’s move on!

Implementation of NLP using Python

I am going to show you how to perform NLP using Python. Python is simple and easy to read and understand.

First, we will import all the necessary libraries as shown below. We will be working with the NLTK library, but the spaCy library can also be used for this.

# Importing the libraries
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In the above code, we have imported pandas to work with data frames/datasets, re for regular expressions, and, from nltk (the Natural Language Toolkit), the stopwords corpus, which is a list of common words to filter out, and PorterStemmer to reduce words to their stems.

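df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', header=0, index_col=0)
df.head()
# Null entries per column
df.isna().sum()
# Assumption (not shown in the article): df_temp is df without empty reviews, with
# the assumed CSV headers 'Review Text' and 'Recommended IND' renamed for brevity.
df_temp = df.dropna(subset=['Review Text']).reset_index(drop=True)
df_temp = df_temp.rename(columns={'Review Text': 'Review', 'Recommended IND': 'Recommended'})

Here we have read the file named “Women’s Clothing E-Commerce Reviews” in CSV (comma-separated values) format and checked for null values. The rest of the code works on df_temp, a cleaned copy of df; its construction is not shown in the original, so the last two lines are one assumed way to build it.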

You can find this dataset on this link:

import matplotlib.pyplot as plt
import seaborn as sns
# Bar chart of how many reviews gave each rating
sns.countplot(x='Rating', data=df_temp)
plt.title("Distribution of Rating")
plt.show()

Here we perform some data visualization using the matplotlib and seaborn libraries, two of the most widely used visualization libraries in Python. I have drawn only one graph; you can draw more to explore your data!

import nltk
# Download the stop-word lists once, then load the English list
nltk.download('stopwords')
stops = stopwords.words("english")

From the nltk library, we have to download the stop words used for text cleaning.
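To see what stop-word removal does without downloading anything, here is a small sketch that uses a hand-written stop-word set. The set below is a made-up sample and is much shorter than NLTK’s English list; the function name `remove_stop_words` is also just for this example.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "to", "of"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

kept = remove_stop_words("This dress is a perfect fit and the fabric is soft")
print(kept)
```

Only the words that carry the content of the review survive, which is why this step usually comes before counting or modeling words.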

# Keep only the review text and the recommendation label
review = df_temp[['Review', 'Recommended']].copy()

def tokens(words):
    # Replace every non-letter with a space, lowercase, and collapse whitespace
    words = re.sub("[^a-zA-Z]", " ", words)
    text = words.lower().split()
    return " ".join(text)

review['Review_clear'] = review['Review'].apply(tokens)
review.head()

# Build a corpus of stemmed reviews with stop words removed
ps = PorterStemmer()
stop_set = set(stops)  # set lookup is faster than scanning the list
corpus = []
# Loop over every review instead of a hard-coded row count
for text in df_temp["Review"]:
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    stemmed = [ps.stem(word) for word in words if word not in stop_set]
    corpus.append(" ".join(stemmed))

Here we perform the data cleaning operations: removing non-letter characters, lowercasing, removing stop words, and stemming, to get clean data.
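It helps to run the `tokens` cleaning step on a single string to see exactly what it keeps. The sketch below repeats the same regular expression on a made-up review; the sample text is an assumption, not a row from the dataset.

```python
import re

def tokens(words):
    """Same cleaning as above: letters only, lowercased, single spaces."""
    words = re.sub("[^a-zA-Z]", " ", words)
    return " ".join(words.lower().split())

sample = "Love this dress!! It's sooo pretty, 10/10."  # made-up review
cleaned = tokens(sample)
print(cleaned)
```

Digits and punctuation disappear, and the apostrophe splits “It's” into “it” and “s”. Stray single letters like that “s” are one reason stop-word removal is usually applied afterwards.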

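# Assumption: Recommended == 1 marks a positive review ('positive' is not defined in the article)
positive = review[review['Recommended'] == 1]
# Join all cleaned positive reviews into one long string
positive_words = ' '.join(positive.Review_clear)
positive_words

Now it’s time to collect all the words that appear in the positive “Reviews” from the dataset by using the above code.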

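# Assumption: Recommended == 0 marks a negative review ('Negative' is not defined in the article)
negative = review[review['Recommended'] == 0]
# Join all cleaned negative reviews into one long string
negative_words = ' '.join(negative.Review_clear)
negative_words

Now it’s time to collect all the words that appear in the negative “Reviews” from the dataset by using the above code.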

# Library for WordCloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Draw at most 200 of the most frequent words from the positive reviews
wordcloud = WordCloud(background_color="white", max_words=200)
wordcloud.generate(positive_words)
plt.figure(figsize=(13,13))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

By using the above code, we can show a word cloud of the most common words in the positive reviews in the dataset.
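A word cloud sizes each word by how often it occurs, so the same information can be checked as plain numbers. The sketch below counts words with the standard library’s Counter; the short string stands in for positive_words and is made up for this example.

```python
from collections import Counter

# Made-up cleaned text standing in for positive_words
text = "love this dress love the fit great fit love"

counts = Counter(text.split())
top_two = counts.most_common(2)
print(top_two)
```

The words with the highest counts here are the ones that would be drawn largest in the cloud.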

So, finally, we have covered the concepts of NLP in both theory and Python implementation!

Advantages of NLP

Removes unnecessary information.
NLP helps computers to interact with humans in their languages

Disadvantages of NLP

NLP may not show full context.
NLP is unpredictable sometimes.

Everyday NLP examples

There are many common day-to-day applications of NLP. Apart from virtual assistants like Alexa or Siri, here are a few more examples.

Email filtering. Spam messages with malicious content are automatically filtered by Gmail and put into the spam folder.

Autocorrection of text using NLP techniques. In mobile chat applications or Google Search, our words or sentences sometimes get corrected automatically. This is because of NLP.
Text classification of tweets or reviews according to whether they talk positively or negatively.
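The last example, classifying reviews as positive or negative, can be sketched with a simple word-counting approach. The two word lists and the function name `classify` are hypothetical and far too small for real use; production systems usually train a model on labeled data instead.

```python
import re

# Hypothetical mini sentiment lexicon
POSITIVE = {"love", "great", "perfect", "comfortable"}
NEGATIVE = {"poor", "cheap", "tight", "disappointed"}

def classify(review):
    """Label a review by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z]+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Love the fabric, great fit!"))
print(classify("Poor stitching and it felt cheap."))
print(classify("It arrived on Monday."))
```

A review with no words from either list is labeled neutral, which shows the main weakness of this approach: it knows nothing about words outside its lexicon.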

Conclusion

In this tutorial for beginners, we saw that NLP, or Natural Language Processing, enables computers to understand human languages and powers tasks like sentiment analysis and document classification. Deep learning architectures like transformers power advanced language models such as ChatGPT. Proficiency in NLP, including handling challenges like lexical and syntactic ambiguity, is therefore valuable for innovation and for understanding customers.

Python, which is often used for NLP tasks, offers libraries like NLTK for preprocessing and cleaning text. Given its power, NLP is used in various applications like text summarization, open source language models, and text retrieval in search engines, demonstrating its pervasive impact on modern technology.

Key Takeaways

NLP (Natural Language Processing) revolutionizes human-computer interaction, enabling machines to understand and interpret human languages effectively.
NLP encompasses Natural Language Understanding (NLU) and Generation (NLG), addressing challenges like lexical and syntactic ambiguity for accurate interpretation and generation of text.
Python serves as a fundamental tool for NLP implementation, offering libraries like NLTK for text preprocessing and data cleaning.
NLP finds extensive real-world applications including email filtering, autocorrection, and text classification, driving innovation and automation across industries.

The media shown in this article on Natural Language Processing are not owned by Analytics Vidhya and are used at the Author’s discretion.

Frequently Asked Questions

Q1. What are the 5 steps of natural language processing?

A. Preprocessing cleans and tokenizes the text data. Word embedding converts words into numerical vectors. Dependency parsing analyzes grammatical structure. Modeling applies machine learning algorithms to predictive tasks. Evaluation assesses model performance using metrics such as accuracy, precision, and recall.

Q2. How do I start learning natural language processing?

A. To begin learning Natural Language Processing (NLP), start with foundational concepts like tokenization, part-of-speech tagging, and text classification. Utilize online courses, textbooks, and tutorials. Practice with small projects and explore NLP APIs for practical experience.

Q3. What does natural language processing do?

A. Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. It encompasses tasks such as sentiment analysis, language translation, information extraction, and chatbot development, leveraging techniques like word embedding and dependency parsing.

Amruta

I am a Software Engineer and data enthusiast, passionate about data and its potential to drive insights and solve problems, and keen to learn more about machine learning and artificial intelligence.
