When I began working at Office People a few months back, I became interested in language models and word embeddings, particularly Word2Vec. Being a native Python user, I naturally concentrated on Gensim’s Word2Vec implementation and looked for papers and tutorials online. As any good data scientist would, I applied and adapted code snippets from multiple sources, and when the results fell short of my expectations, I delved deeper to understand what went wrong with my method, reading through Stack Overflow conversations, Gensim’s Google Groups, and the library’s documentation.
However, I always felt that one of the most important aspects of creating a Word2Vec model was missing from those resources. During my experiments, I discovered that lemmatizing the sentences and detecting phrases/bigrams in them significantly impacted the results and performance of my models. Though the impact of preprocessing varies depending on the dataset and application, I decided to include the data preparation steps in this article and to use the fantastic spaCy library alongside Gensim.
A team of Google researchers introduced Word2Vec in two papers published between September and October 2013, and released their C implementation alongside them. Gensim added a Python implementation shortly after the first paper.
The underlying assumption of Word2Vec is that two words with similar contexts have similar meanings and, as a result, a similar vector representation from the model. For example, “dog,” “puppy,” and “pup” are frequently used in similar contexts, with similar surrounding words such as “good,” “fluffy,” or “cute,” and thus have a similar vector representation according to Word2Vec.
Based on this assumption, Word2Vec Gensim can be used to discover the relationships between words in a dataset, compute their similarity, or use the vector representation of those words as input for other applications like text classification or clustering.
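Before diving into Gensim, here is a minimal sketch, not part of the original pipeline and using made-up vectors, of the cosine similarity that underlies all of these comparisons:
import numpy as np

# Toy 3-dimensional "embeddings"; real Word2Vec vectors have hundreds of dimensions
dog = np.array([0.8, 0.3, 0.1])
puppy = np.array([0.75, 0.4, 0.05])

cosine = np.dot(dog, puppy) / (np.linalg.norm(dog) * np.linalg.norm(puppy))
print(cosine)  # close to 1.0: the two vectors point in very similar directions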
The idea behind Gensim Word2Vec is pretty simple. We’re making the assumption that the meaning of a word can be inferred from the company it keeps. This is analogous to the saying, “Show me your friends, and I’ll tell you who you are”. What follows is a step-by-step walkthrough using Gensim’s Word2Vec implementation.
python==3.6.3
Libraries used:
import re # For preprocessing
import pandas as pd # For data handling
from time import time # To time our operations
from collections import defaultdict # For word frequency
import spacy # For preprocessing
import logging # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s",
datefmt= '%H:%M:%S', level=logging.INFO)
This dataset contains information about the characters, locations, episode details, and script lines for over 600 Simpsons episodes dating back to 1989. It is available on Kaggle (~25 MB).
During preprocessing, we keep only two columns of the dataset: raw_character_text and spoken_words.
Because we want to do our own preprocessing, we don’t keep normalized_text.
df = pd.read_csv('../input/simpsons_dataset.csv')
df.shape
df.head()
The missing values are from a section of the script where something happens but there is no dialogue. “(Springfield Elementary School: EXT. ELEMENTARY – SCHOOL PLAYGROUND – AFTERNOON)” is an example.
df.isnull().sum()
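Since these rows carry no dialogue, one reasonable option (an extra step, not shown in the original snippet) is to drop them right away:
df = df.dropna().reset_index(drop=True)
df.shape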
For each line of dialogue, we are lemmatizing and removing stopwords and non-alphabetic characters.
nlp = spacy.load('en', disable=['ner', 'parser'])  # disabling NER and the parser speeds up the pipeline

def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spaCy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec learns from context words, so a sentence with only one or two
    # tokens left adds little signal; we drop it
    if len(txt) > 2:
        return ' '.join(txt)
Removes non-alphabetic characters:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['spoken_words'])
Using spaCy’s nlp.pipe() method to accelerate the cleaning process:
t = time()
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000, n_threads=-1)]
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))
To remove missing values and duplicates, place the results in a DataFrame:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape
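As an extra check, not in the original snippet, eyeballing a few cleaned lines confirms the lemmatization and stopword removal:
df_clean.head()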
Bigrams are a concept used in natural language processing and text analysis. They refer to consecutive pairs of words or characters appearing in a text sequence. By analyzing bigrams, we can gain insights into the relationships between words or characters in a given text.
Let’s take an example sentence: “I love ice cream”. To identify the bigrams in this sentence, we look at pairs of consecutive words:
“I love”
“love ice”
“ice cream”
Each of these pairs represents a bigram. Bigrams can be useful in various language processing tasks. For example, in language modeling, we can use bigrams to predict the next word in a sentence based on the previous word.
Bigrams can be extended to larger sequences called trigrams (consecutive triplets) or n-grams (consecutive sequences of n words or characters). The choice of n depends on the specific analysis or task at hand.
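As a toy illustration, separate from the Gensim workflow below, consecutive word pairs can be pulled out of a tokenized sentence with plain Python:
tokens = "I love ice cream".split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)  # [('I', 'love'), ('love', 'ice'), ('ice', 'cream')]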
The Gensim Phrases package is used to automatically detect common phrases (bigrams) from a list of sentences. https://radimrehurek.com/gensim/models/phrases.html
We do this primarily to capture words like “mr_burns” and “bart_simpson”!
from gensim.models.phrases import Phrases, Phraser
sent = [row.split() for row in df_clean['clean']]
The following phrases are generated from the list of sentences:
phrases = Phrases(sent, min_count=30, progress_per=10000)
The goal of Phraser() is to reduce Phrases() memory consumption by discarding model state that is not strictly required for the bigram detection task:
bigram = Phraser(phrases)
Transform the corpus based on the bigrams detected:
sentences = bigram[sent]
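To spot-check what the Phraser does, you can also apply it to a single preprocessed sentence and look for tokens joined by an underscore (the index 10 used here is arbitrary):
print(sent[10])          # original tokens
print(bigram[sent[10]])  # same tokens, with any detected bigrams joined by "_"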
The next step is mostly a sanity check of the effectiveness of the lemmatization, stopword removal, and bigram detection.
word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)
sorted(word_freq, key=word_freq.get, reverse=True)[:10]
For clarity and easier monitoring, I prefer to separate the training into three distinct steps: setting up the model and its parameters, building the vocabulary, and training the model.
import multiprocessing
from gensim.models import Word2Vec
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
w2v_model = Word2Vec(min_count=20,       # ignore words that appear fewer than 20 times
                     window=2,            # context window of 2 words on each side
                     size=300,            # dimensionality of the word vectors
                     sample=6e-5,         # downsampling threshold for very frequent words
                     alpha=0.03,          # initial learning rate
                     min_alpha=0.0007,    # learning rate decays linearly down to this value
                     negative=20,         # number of negative samples
                     workers=cores-1)     # use all cores but one
Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html
Word2Vec Gensim requires us to create the vocabulary table (by digesting all of the words, filtering out the unique words, and performing some basic counts on them):
t = time()
w2v_model.build_vocab(sentences, progress_per=10000)
print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))
The vocabulary table is crucial for encoding words as indices and looking up their corresponding word embeddings during training or inference. It forms the foundation for training Word2Vec Gensim models and enables efficient word representation in the continuous vector space.
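Once the vocabulary is built, it can be inspected directly. The attribute below assumes Gensim 3.x, where the vocabulary is exposed as w2v_model.wv.vocab (Gensim 4.x renamed it to key_to_index):
print(len(w2v_model.wv.vocab))        # number of unique words kept after min_count filtering
print(list(w2v_model.wv.vocab)[:10])  # a small sample of vocabulary entries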
Training a Gensim Word2Vec model means feeding the prepared corpus into the algorithm and optimizing the model’s parameters to learn the word embeddings. The train() call takes the corpus, total_examples (the sentence count, which the model already knows from building the vocabulary), and epochs (the number of iterations over the corpus):
t = time()
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))
We call init_sims() to make the model much more memory-efficient, since we do not intend to train it further: with replace=True it precomputes the L2-normalized vectors and discards the original ones.
w2v_model.init_sims(replace=True)
These hyperparameters, set when the model was created, control aspects such as the context window size, the trade-off between frequent and rare words, the learning rate, the training algorithm (CBOW or Skip-gram), and the number of negative samples for negative sampling. Adjusting them can impact the quality, efficiency, and memory requirements of the Word2Vec training process.
Once a Word2Vec model is trained, you can explore it to gain insights into the learned word embeddings and extract useful information. Here are some ways to explore the Word2Vec Gensim model:
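The most basic exploration is pulling the raw embedding of a single word; with the settings above, each vector has 300 dimensions (size=300):
vector = w2v_model.wv['homer']  # the learned embedding for "homer"
print(vector.shape)             # (300,)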
In Word2Vec, you can find the words most similar to a given word based on the learned word embeddings. The similarity is typically calculated using cosine similarity. Here’s an example of finding words most similar to a target word using Gensim Word2Vec:
Let’s see what we get for the show’s main character:
similar_words = w2v_model.wv.most_similar(positive=["homer"])
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
Just to be clear, when we look at the words that are most similar to “homer,” we do not necessarily get his family members, personality traits, or even his most memorable quotes.
Compare that to what the bigram “homer_simpson” returns:
w2v_model.wv.most_similar(positive=["homer_simpson"])
What about Marge now?
w2v_model.wv.most_similar(positive=["marge"])
Let’s check Bart now:
w2v_model.wv.most_similar(positive=["bart"])
Looks like it is making sense!
Here’s an example of calculating the cosine similarity between two words using Gensim Word2Vec:
w2v_model.wv.similarity("moe_'s", 'tavern')
Who could forget Moe’s tavern? Not Barney.
w2v_model.wv.similarity('maggie', 'baby')
Maggie is indeed the most renowned baby in The Simpsons!
w2v_model.wv.similarity('bart', 'nelson')
Bart and Nelson, though friends, are not that close; that makes sense!
Here, we ask our model to give us the word that does not belong to the list!
Between Jimbo, Milhouse, and Kearney, who is not a bully?
w2v_model.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])
What if we compared the friendship between Nelson, Bart, and Milhouse?
w2v_model.wv.doesnt_match(["nelson", "bart", "milhouse"])
Seems like Nelson is the odd one here!
Lastly, how is the relationship between Homer and his two sisters-in-law?
w2v_model.wv.doesnt_match(['homer', 'patty', 'selma'])
Damn, they do not like you, Homer!
Which word is to woman as homer is to marge?
w2v_model.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)
“man” comes in first position; that looks about right!
Which word is to woman as Bart is to man?
w2v_model.wv.most_similar(positive=["woman", "bart"], negative=["man"], topn=3)
Lisa is Bart’s sister, his female counterpart!
In conclusion, Word2Vec is a widely used algorithm in natural language processing (NLP) that learns word embeddings by representing words as dense vectors in a continuous vector space. It captures semantic and syntactic relationships between words based on their co-occurrence patterns in a large text corpus.
Word2Vec uses the Continuous Bag-of-Words (CBOW) or Skip-gram model, which are neural network architectures. Word embeddings, generated by Word2Vec Gensim, are dense vector representations of words that encode semantic and syntactic information. They allow for mathematical operations like word similarity calculation and can be used as features in various NLP tasks.
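In Gensim, the choice between the two architectures is a single constructor flag: sg=0 (the default) selects CBOW and sg=1 selects Skip-gram. A minimal sketch, reusing the sentences prepared above with otherwise illustrative settings:
# Skip-gram variant; passing the corpus directly builds the vocabulary and trains in one call
skipgram_model = Word2Vec(sentences, sg=1, min_count=20, window=2, size=300, workers=cores-1)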
While Word2Vec is a powerful algorithm, it has some limitations. A large amount of training data is required to learn accurate word embeddings. It treats each word as an atomic entity and does not capture word sense disambiguation. Out-of-vocabulary words may pose a challenge as they have no pre-existing embeddings.
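The out-of-vocabulary limitation shows up in practice as a KeyError when querying a word the model never saw (or saw fewer than min_count times). A simple guard, again assuming the Gensim 3.x vocab attribute and a hypothetical token:
word = "flanders_mobile"  # hypothetical token that may not be in the vocabulary
if word in w2v_model.wv.vocab:
    print(w2v_model.wv.most_similar(positive=[word]))
else:
    print(f"'{word}' is not in the vocabulary")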
Word2Vec has significantly contributed to advancements in NLP and continues to be a valuable tool for tasks such as information retrieval, sentiment analysis, machine translation, and more.
Q: What is Word2Vec?
A: Word2Vec is a popular algorithm for natural language processing (NLP) tasks. A shallow, two-layer neural network learns word embeddings by representing words as dense vectors in a continuous vector space. Word2Vec captures the semantic and syntactic relationships between words based on their co-occurrence patterns in a large text corpus.
Q: How does Word2Vec work?
A: Word2Vec uses a “distributed representation” technique to learn word embeddings. It employs one of two neural network architectures: the Continuous Bag-of-Words (CBOW) or the Skip-gram model. The CBOW model predicts the target word based on its context words, while the Skip-gram model predicts the context words given a target word. During training, the model adjusts the word vectors to maximize the likelihood of correctly predicting the target or context words.
Q: What are word embeddings?
A: Word embeddings are dense vector representations of words in a continuous vector space. They encode semantic and syntactic information about words, capturing their relationships based on their distributional properties in the training corpus. They enable mathematical operations like word similarity calculation and can be used as features in various NLP tasks, such as sentiment analysis and machine translation.