Natural Language Processing, or NLP for short, is a branch of Artificial Intelligence that allows machines to comprehend, process, and manipulate human language. Breakthroughs in NLP have bridged the gap between humans and machines, paving the way for leading-edge technologies such as language translators, voice assistants like Siri, customer-service chatbots, and many more.
In this article, I’ll walk you through the fundamentals of text analysis using the powerful NLP library, Gensim.
Natural Language Processing is all about handling natural language, which can take the form of text, audio, or video. This article focuses on working with text data and discusses its building blocks.
Token: A token is a string with a known meaning, and a token may be a word, number or just characters like punctuation. “Hello”, “123”, and “-” are some examples of tokens.
Sentence: A sentence is a group of tokens that is complete in meaning. “The weather looks good” is an example of a sentence, and the tokens of the sentence are [“The”, “weather”, “looks”, “good”].
Paragraph: A paragraph is a collection of sentences or phrases, and a sentence can alternatively be viewed as a token of a paragraph.
Document: A document might be a sentence, a paragraph, or a set of paragraphs. A text message sent to an individual is an example of a document.
Corpus: A corpus is typically an extensive collection of documents. In Gensim, a corpus is usually stored as a Bag-of-Words, recording each word’s id and frequency count in each document. An example of a corpus is a collection of emails or text messages sent to a particular person.
Gensim is a well-known open-source Python library used for NLP and topic modeling. Its ability to handle vast quantities of text data and its speed in training vector embeddings set it apart from other NLP libraries. Moreover, Gensim provides popular topic-modeling algorithms such as LDA, making it a go-to library for many users.
Setting up Gensim is a pretty easy task. You can either install Gensim using the Pip installer or the Conda environment.
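For reference, either of the standard commands below installs the library (package names are the usual ones on PyPI and conda-forge):

```shell
# Install Gensim from PyPI
pip install gensim

# Or, inside a Conda environment:
#   conda install -c conda-forge gensim
```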
We can use Gensim to generate dictionaries from a list of sentences and text files. First, let’s look at making a dictionary out of a list of sentences.
You can see from the output that each token in the dictionary is assigned a unique id.
Now, let’s make a dictionary with tokens from a text file. Initially, we’ll preprocess the file using Gensim’s simple_preprocess() function to retrieve the list of tokens from the file.
We have now successfully created a dictionary from the text file.
We can also update an existing dictionary with tokens from a new document.
Creating a Bag-of-Words
We can use the Gensim function doc2bow to generate our Bag of Words from the created dictionary. For each document, doc2bow returns a list of tuples containing each token’s unique id and its number of occurrences in that document.
Saving and Loading a Gensim Dictionary and BOW
We can save both the dictionary and the BoW corpus to disk and load them back whenever we need them.
Creating TF-IDF
“Term Frequency – Inverse Document Frequency” (TF-IDF) is a technique for measuring the importance of each word in a document by computing the word’s weight.
In the TF-IDF vector, a word’s weight grows with its frequency in the document but shrinks with the number of documents that contain it, so common words that appear in nearly every document receive low weights.
Creating Bigrams and Trigrams
Some words usually appear together in the text of a large document. When these words occur together, they may act as a single entity and have a completely different meaning than when they occur separately.
Let me use the phrase “Gateway to India” as an example: the words together name a single entity, with a meaning quite different from the words taken separately. Such groups of words are called “N-grams”.
Bigrams are N-grams of two words, and trigrams are N-grams of three words.
We’ll create bigrams and trigrams for the “text8” dataset, which is available for download via the Gensim Downloader API. We’ll be using Gensim’s Phrases function for this purpose.
The Trigram model is generated by passing the previously obtained bigram model to the Phrases function.
Creating a Word2Vec model
A word embedding model represents each word as a numeric vector.
Word2Vec is a word embedding model available in Gensim that uses a shallow neural network to embed words in a lower-dimensional vector space. Gensim’s Word2Vec implementation supports both the Skip-gram and the Continuous Bag-of-Words (CBOW) architectures.
Let us initially train the Word2Vec model on the first 1000 words of the “text8” dataset.
The above output is the word vector of “Social” found through this model.
Using the most_similar function, we can retrieve the words most similar to a given word, in this case “social”.
You can also save your Word2Vec model and load it back.
Gensim also has a feature that enables you to update an existing Word2Vec model. We can update the model by calling the build_vocab function followed by the train function.
We’ve gone over several key NLP topics to help you become more acquainted with text data manipulation using Gensim and begin putting your NLP skills to use. I hope the above examples aid you in discovering the beauty of Natural Language Processing using Gensim.