Natural Language Processing, or NLP for short, is a branch of Artificial Intelligence that allows machines to comprehend, process, and manipulate human language. Breakthroughs in NLP have bridged the gap between humans and machines, paving the way for leading-edge technologies such as language translators, voice assistants like Siri, customer-service chatbots, and many more.
In this article, I’ll walk you through the fundamentals of text analysis using the powerful NLP library, Gensim.
Natural Language Processing is all about handling natural language, which can take the form of text, audio, or video. This article focuses on working with text data and discusses its building blocks.
Token: A token is a string with a known meaning, and a token may be a word, number or just characters like punctuation. “Hello”, “123”, and “-” are some examples of tokens.
Sentence: A sentence is a group of tokens that is complete in meaning. “The weather looks good” is an example of a sentence, and the tokens of the sentence are [“The”, “weather”, “looks”, “good”].
Paragraph: A paragraph is a collection of sentences or phrases, and a sentence can alternatively be viewed as a token of a paragraph.
Document: A document might be a sentence, a paragraph, or a set of paragraphs. A text message sent to an individual is an example of a document.
Corpus: A corpus is typically an extensive collection of documents. In Gensim, a corpus is usually stored as a Bag-of-Words, recording each word’s id and frequency count in each document. An example of a corpus is a collection of emails or text messages sent to a particular person.
Gensim is a well-known open-source Python library used for NLP and topic modeling. Its ability to handle vast quantities of text data and its speed in training vector embeddings set it apart from other NLP libraries. Moreover, Gensim provides popular topic-modeling algorithms such as LDA, making it a go-to library for many users.
Setting up Gensim is a pretty easy task. You can either install Gensim using the Pip installer or the Conda environment.
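For reference, either of the standard commands below installs the library (package names are the usual ones on PyPI and conda-forge):

```shell
# Install Gensim from PyPI
pip install gensim

# Or, inside a Conda environment:
#   conda install -c conda-forge gensim
```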
We can use Gensim to generate dictionaries from a list of sentences and text files. First, let’s look at making a dictionary out of a list of sentences.
You can see from the output that each token in the dictionary is assigned a unique id.
Now, let’s make a dictionary with tokens from a text file. Initially, we’ll preprocess the file using Gensim’s simple_preprocess() function to retrieve the list of tokens from the file.
We have now successfully created a dictionary from the text file.
We can also update an existing dictionary with tokens from a new document.
Creating a Bag-of-Words
We can use the Gensim function doc2bow to generate our Bag of Words from the created dictionary. For each document, doc2bow returns a list of tuples containing each token’s unique id and its number of occurrences in that document.
Saving and Loading a Gensim Dictionary and BOW
We can save both the dictionary and the BoW corpus to disk and load them back whenever we need them.
Creating TF-IDF
“Term Frequency – Inverse Document Frequency” (TF-IDF) is a technique for measuring the importance of each word in a document by computing the word’s weight.
In the TF-IDF vector, a word’s weight grows with its frequency in the document but shrinks with the number of documents that contain it, so common words that appear in nearly every document receive low weights.
Creating Bigrams and Trigrams
Some words usually appear together in the text of a large document. When these words occur together, they may act as a single entity and have a completely different meaning than when they occur separately.
Let me use the phrase “Gateway to India” as an example: the words together name a single entity, with a meaning quite different from the words taken separately. Such groups of words are called “N-grams”.
Bigrams are N-grams of two words, and trigrams are N-grams of three words.
We’ll create bigrams and trigrams for the “text8” dataset, which is available for download via the Gensim Downloader API. We’ll be using Gensim’s Phrases function for this purpose.
The Trigram model is generated by passing the previously obtained bigram model to the Phrases function.
Creating a Word2Vec model
A word embedding model represents each word as a numeric vector.
Word2Vec is a word embedding model available in Gensim that uses a shallow neural network to embed words in a lower-dimensional vector space. Gensim’s Word2Vec implementation supports both the Skip-gram and the Continuous Bag-of-Words (CBOW) architectures.
Let us initially train the Word2Vec model on the first 1000 words of the “text8” dataset.
The above output is the word vector of “Social” found through this model.
Using the most_similar function, we can retrieve the words most similar to a given word, in this case “social”.
You can also save your Word2Vec model and load it back.
Gensim also has a feature that enables you to update an existing Word2Vec model. We can update the model by calling the build_vocab function followed by the train function.
We’ve gone over several key NLP topics to help you become more acquainted with text data manipulation using Gensim and begin putting your NLP skills to use. I hope the above examples aid you in discovering the beauty of Natural Language Processing using Gensim.