Building Language Models in NLP

Koushiki Last Updated : 16 Jan, 2025
Building language models in NLP means using probabilistic models to estimate how likely a word sequence is and to predict the next word from the words that precede it. These models are key to tasks like predictive text, speech recognition, machine translation, and spelling correction. They are estimated from training data and output probabilities for word sequences. Depending on the approach, a model can score each word on its own (unigram), condition on the previous word (bigram), on the previous two words (trigram), or more generally on the previous n-1 words (n-gram).

This article was published as a part of the Data Science Blogathon.

What Are Language Models?

Language models are a fundamental component of natural language processing (NLP) systems. A language model is a statistical model that assigns probabilities to sequences of words, allowing it to predict which word or sequence of words is most likely to occur next given the previous words.

Language models play a crucial role in many NLP applications:

  • Predictive text input (auto-complete)
  • Speech recognition
  • Machine translation
  • Spelling and grammar correction
  • Generating human-like text

Language models range from relatively simple n-gram models, which consider only the previous n-1 words, to advanced neural network models such as transformers, which can capture long-range contextual dependencies.

What sets language models apart is their ability to quantify linguistic knowledge in a statistical framework that computers can process. This allows NLP systems to generate, understand, and translate natural language with increasing fluency and accuracy.

Moreover, large language models pretrained on vast text data have emerged as a powerful basis for transfer learning to many downstream NLP tasks, substantially advancing the field’s capabilities.

Why Language Models?

Language models form the backbone of Natural Language Processing. They are a way of transforming qualitative information about text into quantitative information that machines can understand. They have applications in a wide range of industries such as tech, finance, healthcare and the military. All of us encounter language models daily, be it the predictive text input on our mobile phones or a simple Google search. Hence language models form an integral part of any natural language processing application.

Building Language Models in NLP

In this article, we will be learning how to build unigram, bigram and trigram language models on a raw text corpus and perform next word prediction using them.

Reading the Raw Text Corpus

We will begin by reading the text corpus which is an excerpt from Oliver Twist. You can download the text file from here. Once it is downloaded, read the text file and find the total number of characters in it.

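# Open the corpus with a context manager so the file handle is closed afterwards
with open("rawCorpus.txt", "r", encoding="utf-8") as file:
    rawReadCorpus = file.read()
print("Total no. of characters in read dataset: {}".format(len(rawReadCorpus)))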

We need to import the nltk library to perform some basic text processing tasks which we will do with the help of the following code :

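import nltk
# Download only the tokenizer models; recent NLTK releases may also need nltk.download('punkt_tab')
nltk.download('punkt')
from nltk.tokenize import word_tokenize,sent_tokenize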

Preprocessing the Raw Text 

Firstly, we need to remove all new lines and special characters from the text corpus. We do that by the following code :

import string
# Characters to strip: ASCII punctuation plus typographic quotes and dashes.
# The full stop is kept so that sent_tokenize can still find sentence boundaries.
punctuation = string.punctuation +'“'+'”'+'-'+'’'+'‘'+'—'
punctuation = punctuation.replace('.', '')
with open('rawCorpus.txt', encoding='utf-8') as file:
    rawText = file.read()
#preprocess data to remove newlines and special characters
file_new = rawText.replace("\n", " ")
preprocessedCorpus = "".join([char for char in file_new if char not in punctuation])
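
The same cleaning step can be tried on a short string before running it on the whole corpus. The following is a minimal sketch, not the article's original code: the helper name preprocess, the constants EXTRA and TO_REMOVE, and the sample string are all made up for illustration, and it uses str.translate instead of a character-by-character join.

```python
import string

# Characters to strip: ASCII punctuation plus typographic quotes and dashes,
# but keep '.' so sentence boundaries survive for later sentence splitting.
EXTRA = "“”’‘—"
TO_REMOVE = (string.punctuation + EXTRA).replace(".", "")

def preprocess(text):
    """Replace newlines with spaces and drop the characters in TO_REMOVE."""
    text = text.replace("\n", " ")
    # str.maketrans with a third argument maps each listed character to None
    return text.translate(str.maketrans("", "", TO_REMOVE))

# Illustrative sample string containing a newline and typographic quotes
sample = "“Please, sir,” replied Oliver,\n“I want some more.”"
print(preprocess(sample))  # Please sir replied Oliver I want some more.
```

Only the full stop survives, which is what the sentence tokenizer relies on in the next step.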

After removing newlines and special characters, we can break up the corpus to obtain the words and the sentences using sent_tokenize and word_tokenize from nltk.tokenize. Let us print the first 5 sentences and the first 5 words obtained from the corpus :

sentences = sent_tokenize(preprocessedCorpus)
print("1st 5 sentences of preprocessed corpus are : ")
print(sentences[0:5])
words = word_tokenize(preprocessedCorpus)
print("1st 5 words/tokens of preprocessed corpus are : ")
print(words[0:5])

[Figure: first 5 sentences and first 5 tokens of the preprocessed corpus]

We also need to remove stopwords from the corpus. Stopwords are commonly used words like ‘and’, ‘the’ and ‘at’ which do not add any special meaning or significance to a sentence. A list of stopwords is available with nltk, and they can be removed from the corpus using the following code :

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in words if w.lower() not in stop_words]
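
To see the effect of the filter without downloading anything, here is a small sketch on a hand-made example. The set toy_stop_words and the token list are assumptions made up for illustration; NLTK's English stopword list is much longer.

```python
# A tiny hand-written stopword set for illustration only (not NLTK's list)
toy_stop_words = {"and", "the", "at", "a", "of"}

tokens = ["The", "boy", "and", "the", "dog", "sat", "at", "home"]

# Compare in lowercase so that "The" is treated like "the"
filtered = [t for t in tokens if t.lower() not in toy_stop_words]
print(filtered)  # ['boy', 'dog', 'sat', 'home']
```

Storing the stopwords in a set keeps each membership test cheap, which is why the article wraps the NLTK list in set() as well.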

Creating Unigram, Bigram and Trigram Language Models

We can create n-grams using the ngrams function from nltk.util. An n-gram is a sequence of n consecutive words occurring in the corpus. For example, in the sentence “I love dogs”, ‘I’, ‘love’ and ‘dogs’ are unigrams, while ‘I love’ and ‘love dogs’ are bigrams. ‘I love dogs’ is itself a trigram, i.e. a contiguous sequence of three words. We obtain unigrams, bigrams and trigrams from the corpus using the following code :

from collections import Counter
from nltk.util import ngrams
unigrams=[]
bigrams=[]
trigrams=[]
for content in sentences:
    content = content.lower()
    # Tokenize the sentence and drop the full stops kept during preprocessing
    # (filtering into a new list avoids removing items from a list while iterating over it)
    content = [word for word in word_tokenize(content) if word != '.']
    unigrams.extend(content)
    # ngrams() yields tuples of n consecutive tokens
    bigrams.extend(ngrams(content,2))
    trigrams.extend(ngrams(content,3))
print ("Sample of n-grams:\n" + "-------------------------")
print ("--> UNIGRAMS: \n" + str(unigrams[:5]) + " ...\n")
print ("--> BIGRAMS: \n" + str(bigrams[:5]) + " ...\n")
print ("--> TRIGRAMS: \n" + str(trigrams[:5]) + " ...\n")

[Figure: sample of the extracted unigrams, bigrams and trigrams]
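
For intuition about what an n-gram extractor does, here is a pure-Python sketch applied to the “I love dogs” example. The helper make_ngrams is a hypothetical name introduced for illustration; it is not part of NLTK, although like nltk.util.ngrams it produces tuples.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    # zip over n staggered views of the token list
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "i love dogs".split()
print(make_ngrams(tokens, 1))  # [('i',), ('love',), ('dogs',)]
print(make_ngrams(tokens, 2))  # [('i', 'love'), ('love', 'dogs')]
print(make_ngrams(tokens, 3))  # [('i', 'love', 'dogs')]
```

Note that in the article's code the unigrams are stored as plain strings rather than 1-tuples, which is why later lookups index the unigram counts by word.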

Next, we filter the unigrams, bigrams and trigrams for stopwords like articles, prepositions or determiners. Unigrams that are stopwords, like ‘the’ and ‘a’, are removed, and so are bigrams and trigrams made up entirely of stopwords, like ‘in the’. An n-gram that contains at least one non-stopword is kept. We use the following code for the removal of stopwords from n-grams.

def stopwords_removal(n, a):
    # Unigrams are plain strings: keep the ones that are not stopwords
    if n == 1:
        return [word for word in a if word not in stop_words]
    # Bigrams and trigrams are tuples: keep an n-gram if at least one of its
    # words is not a stopword, i.e. drop only n-grams made up entirely of stopwords
    return [gram for gram in a if any(word not in stop_words for word in gram)]
unigrams_Processed = stopwords_removal(1,unigrams)
bigrams_Processed = stopwords_removal(2,bigrams)
trigrams_Processed = stopwords_removal(3,trigrams)
print ("Sample of n-grams after processing:\n" + "-------------------------")
print ("--> UNIGRAMS: \n" + str(unigrams_Processed[:5]) + " ...\n")
print ("--> BIGRAMS: \n" + str(bigrams_Processed[:5]) + " ...\n")
print ("--> TRIGRAMS: \n" + str(trigrams_Processed[:5]) + " ...\n")

[Figure: sample of unigrams, bigrams and trigrams after stopword removal]

We can obtain the count or frequency of each n-gram appearing in the corpus. This will be useful later when we need to calculate the probabilities of the next possible word based on previous n-grams. We write a function get_ngrams_freqDist which returns the frequency corresponding to each n-gram sent to it. We obtain the frequencies of all unigrams, bigrams and trigrams in this way.

def get_ngrams_freqDist(n, ngramList):
    ngram_freq_dict = {}
    for ngram in ngramList:
        if ngram in ngram_freq_dict:
            ngram_freq_dict[ngram] += 1
        else:
            ngram_freq_dict[ngram] = 1
    return ngram_freq_dict
unigrams_freqDist = get_ngrams_freqDist(1, unigrams)
unigrams_Processed_freqDist = get_ngrams_freqDist(1, unigrams_Processed)
bigrams_freqDist = get_ngrams_freqDist(2, bigrams)
bigrams_Processed_freqDist = get_ngrams_freqDist(2, bigrams_Processed)
trigrams_freqDist = get_ngrams_freqDist(3, trigrams)
trigrams_Processed_freqDist = get_ngrams_freqDist(3, trigrams_Processed)
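
The same frequency table can also be built with collections.Counter from the standard library. This is an alternative sketch, not the article's implementation; the list toy_bigrams is made up for illustration.

```python
from collections import Counter

toy_bigrams = [("i", "love"), ("love", "dogs"), ("i", "love"), ("love", "cats")]

# Counter builds the n-gram -> frequency mapping in one call
freq = Counter(toy_bigrams)
print(freq[("i", "love")])     # 2
print(freq[("dogs", "love")])  # 0: unseen n-grams count as zero instead of raising KeyError
print(freq.most_common(1))     # [(('i', 'love'), 2)]
```

A plain dictionary, as used by get_ngrams_freqDist, raises KeyError for an unseen n-gram, so the two approaches differ in how missing entries have to be handled later.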

Predicting Next Three Words using Bigram and Trigram Models

The chain rule is used to compute the probability of a sentence in a language model. Let w1 w2 … wn be a sentence, where w1, w2, …, wn are the individual words. Then the probability of the sentence occurring is given by the following formula :

$$P(w_1 w_2 \dots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \dots w_{n-1})$$

For example, the probability of the sentence “I love dogs” is given by :

P(I love dogs) = P(I)P(love | I)P(dogs | I love)

Now the individual probabilities can be obtained in the following way :

P(I) = Count(‘I’) / Total no. of words

P(love | I) = Count(‘I love’) / Count(‘I’)

P(dogs | I love) = Count(‘I love dogs’) / Count(‘I love’)

Note that Count(‘I’), Count(‘I love’) and Count(‘I love dogs’) are the frequencies of the respective unigram, bigram and trigram which we computed earlier using the get_ngrams_freqDist function.
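
The count-based estimates above can be checked by hand on a tiny corpus. The sketch below assumes a made-up three-sentence corpus (lowercased, so ‘I’ becomes ‘i’) and is meant only to illustrate the arithmetic.

```python
from collections import Counter

# Toy corpus, made up for illustration
sentences = ["i love dogs", "i love cats", "dogs love me"]

uni, bi, tri = Counter(), Counter(), Counter()
for s in sentences:
    toks = s.split()
    uni.update(toks)
    bi.update(zip(toks, toks[1:]))
    tri.update(zip(toks, toks[1:], toks[2:]))

total_words = sum(uni.values())                    # 9 tokens in the toy corpus
p_i = uni["i"] / total_words                       # Count('i') / total = 2/9
p_love_given_i = bi[("i", "love")] / uni["i"]      # Count('i love') / Count('i') = 2/2
p_dogs_given_i_love = tri[("i", "love", "dogs")] / bi[("i", "love")]  # 1/2

# Chain rule: multiply the conditional probabilities
p_sentence = p_i * p_love_given_i * p_dogs_given_i_love
print(p_sentence)  # 2/9 * 1 * 1/2 = 1/9
```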

Now, when we use a bigram model to compute the probabilities, the probability of each new word depends only on its previous word. That is, for the previous example, the probability of the sentence becomes :

P(I love dogs) = P(I)P(love | I)P(dogs | love)

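Similarly, for a trigram model, the probability of each new word depends on the previous two words, so the probability is given by :

P(I love dogs) = P(I)P(love | I)P(dogs | I love)

For a three-word sentence this coincides with the full chain rule. The two differ only for longer sentences, where the trigram model ignores everything before the previous two words.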

[Figure: Trigram modelling]
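
The difference between the bigram and trigram approximations can be made concrete with a short sketch. It assumes a made-up toy corpus and a hypothetical helper sentence_prob whose context_size argument is the number of previous words each word is conditioned on. No smoothing is applied here.

```python
from collections import Counter

# Toy corpus, made up for illustration
corpus = ["i love dogs", "i love cats", "dogs love me"]

counts = Counter()   # counts of all 1-, 2- and 3-grams, keyed by tuple
total_words = 0
for s in corpus:
    toks = s.split()
    total_words += len(toks)
    for n in (1, 2, 3):
        counts.update(zip(*(toks[i:] for i in range(n))))

def sentence_prob(tokens, context_size):
    """Unsmoothed sentence probability, conditioning each word on at most
    `context_size` previous words (1 = bigram model, 2 = trigram model)."""
    prob = 1.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[max(0, i - context_size):i])
        if not context:
            # First word: plain unigram probability
            prob *= counts[(word,)] / total_words
        else:
            if counts[context] == 0:
                return 0.0   # unseen context: treat the estimate as 0
            prob *= counts[context + (word,)] / counts[context]
    return prob

sent = "i love dogs".split()
print(sentence_prob(sent, 1))   # bigram model:  2/9 * 2/2 * 1/3
print(sentence_prob(sent, 2))   # trigram model: 2/9 * 2/2 * 1/2
```

The two models disagree on the last factor: the bigram model uses P(dogs | love), while the trigram model uses P(dogs | i love).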

However, there is a catch involved in this kind of modelling. Suppose some bigram does not appear in the training set but appears in the test set. The model then assigns that bigram a probability of 0, which makes the overall probability of the test sentence 0, and that is undesirable. Smoothing overcomes this problem: the parameters are smoothed (or regularized) to reassign some probability mass to unseen events.

One way of smoothing is add-one or Laplace smoothing, which we will be using in this article. For a bigram model, add-one smoothing adds 1 to every bigram count in the numerator and V (the number of unique words in the corpus) to the unigram count in the denominator :

$$P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1} w_i) + 1}{\text{Count}(w_{i-1}) + V}$$
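
A small numeric sketch shows how add-one smoothing moves probability mass to an unseen bigram. The token list and the helper name add_one_prob are assumptions made up for illustration.

```python
from collections import Counter

# Toy token stream, made up for illustration
tokens = "i love dogs and i love cats".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)   # vocabulary size: 5 unique words

def add_one_prob(prev, word):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(add_one_prob("i", "love"))   # seen bigram:   (2 + 1) / (2 + 5)
print(add_one_prob("i", "cats"))   # unseen bigram: (0 + 1) / (2 + 5), no longer zero

# For the context "i", the smoothed probabilities over the vocabulary sum to 1
print(sum(add_one_prob("i", w) for w in unigram_counts))
```

The seen bigram loses some of its mass (3/7 instead of the unsmoothed 2/2) so that every other word in the vocabulary can receive a small non-zero share.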

Now that we have understood what smoothed bigram and trigram models are, let us write the code to compute them. We will be using the unprocessed bigrams and trigrams (that is, with articles and determiners still included) for prediction.

smoothed_bigrams_probDist = {}
V = len(unigrams_freqDist)
for i in bigrams_freqDist:
    smoothed_bigrams_probDist[i] = (bigrams_freqDist[i] + 1)/(unigrams_freqDist[i[0]]+V)
smoothed_trigrams_probDist = {}
for i in trigrams_freqDist:
    smoothed_trigrams_probDist[i] = (trigrams_freqDist[i] + 1)/(bigrams_freqDist[i[0:2]]+V)

Next, we try to predict the next three words of three test sentences using the computed smoothed bigram and trigram language models.

testSent1 = "There was a sudden jerk, a terrific convulsion of the limbs; and there he"
testSent2 = "They made room for the stranger, but he sat down"
testSent3 = "The hungry and destitute situation of the infant orphan was duly reported by"

First, we tokenize the test sentences into component words and obtain the last unigrams and bigrams appearing in them.

token_1 = word_tokenize(testSent1)
token_2 = word_tokenize(testSent2)
token_3 = word_tokenize(testSent3)
ngram_1 = {1:[], 2:[]}    
ngram_2 = {1:[], 2:[]}
ngram_3 = {1:[], 2:[]}
for i in range(2):
    ngram_1[i+1] = list(ngrams(token_1, i+1))[-1]
    ngram_2[i+1] = list(ngrams(token_2, i+1))[-1]
    ngram_3[i+1] = list(ngrams(token_3, i+1))[-1]
print("Sentence 1: ", ngram_1,"\nSentence 2: ",ngram_2,"\nSentence 3: ",ngram_3)

Next, we write functions to predict the next word and the next 3 words respectively of the three test sentences using the smoothed bigram model.

def predict_next_word(last_word,probDist):
    next_word = {}
    for k in probDist:
        if k[0] == last_word[0]:
            next_word[k[1]] = probDist[k]
    k = Counter(next_word)
    high = k.most_common(1) 
    return high[0]
def predict_next_3_words(token,probDist):
    pred1 = []
    pred2 = []
    next_word = {}
    for i in probDist:
        if i[0] == token:
            next_word[i[1]] = probDist[i]
    k = Counter(next_word)
    high = k.most_common(2) 
    w1a = high[0]
    w1b = high[1]
    w2a = predict_next_word(w1a,probDist)
    w3a = predict_next_word(w2a,probDist)
    w2b = predict_next_word(w1b,probDist)
    w3b = predict_next_word(w2b,probDist)
    pred1.append(w1a)
    pred1.append(w2a)
    pred1.append(w3a)
    pred2.append(w1b)
    pred2.append(w2b)
    pred2.append(w3b)
    return pred1,pred2
print("Predicting next 3 possible word sequences with smoothed bigram model : ")
# '\033[1m' and '\033[0m' are ANSI escape codes that switch bold text on and off in the terminal
pred1,pred2 = predict_next_3_words(ngram_1[1][0],smoothed_bigrams_probDist)
print("1a)" +testSent1 +" "+ '\033[1m' + pred1[0][0]+" "+pred1[1][0]+" "+pred1[2][0] + '\033[0m')
print("1b)" +testSent1 +" "+ '\033[1m' + pred2[0][0]+" "+pred2[1][0]+" "+pred2[2][0] + '\033[0m')
pred1,pred2 = predict_next_3_words(ngram_2[1][0],smoothed_bigrams_probDist)
print("2a)" +testSent2 +" "+ '\033[1m' + pred1[0][0]+" "+pred1[1][0]+" "+pred1[2][0] + '\033[0m')
print("2b)" +testSent2 +" "+ '\033[1m' + pred2[0][0]+" "+pred2[1][0]+" "+pred2[2][0] + '\033[0m')
pred1,pred2 = predict_next_3_words(ngram_3[1][0],smoothed_bigrams_probDist)
print("3a)" +testSent3 +" "+ '\033[1m' + pred1[0][0]+" "+pred1[1][0]+" "+pred1[2][0] + '\033[0m')
print("3b)" +testSent3 +" "+ '\033[1m' + pred2[0][0]+" "+pred2[1][0]+" "+pred2[2][0] + '\033[0m')

[Figure: predictions from the smoothed bigram model]
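
The greedy prediction loop can be tried end to end on a toy corpus. The sketch below is a simplified variant, not the article's functions: the token stream and the helpers next_word and predict are made up for illustration, and unlike the code above it returns an empty result when a context was never seen instead of raising an IndexError.

```python
from collections import Counter

# Toy token stream, made up for illustration
tokens = "the boy sat down and the boy sat down and the dog ate".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

# Add-one smoothed probabilities for the bigrams seen in the toy corpus
prob = {bg: (c + 1) / (unigram_counts[bg[0]] + V) for bg, c in bigram_counts.items()}

def next_word(prev):
    """Most probable seen continuation of `prev`, or None if there is none."""
    candidates = {bg[1]: p for bg, p in prob.items() if bg[0] == prev}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def predict(prev, steps=3):
    """Greedily extend `prev` by up to `steps` words."""
    out = []
    for _ in range(steps):
        prev = next_word(prev)
        if prev is None:
            break
        out.append(prev)
    return out

print(predict("the"))      # ['boy', 'sat', 'down']
print(predict("unseen"))   # [] because the context never occurred
```

Greedy decoding always takes the single most probable continuation, so it can miss a three-word sequence that is more probable overall. The article's bigram function partly addresses this by also following the second-best first word.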

We obtain predictions from the smoothed trigram model similarly.

def predict_next_word(last_word,probDist):
    next_word = {}
    for k in probDist:
        if k[0:2] == last_word:
            next_word[k[2]] = probDist[k]
    k = Counter(next_word)
    high = k.most_common(1) 
    return high[0]
def predict_next_3_words(token,probDist):
    pred = []
    next_word = {}
    for i in probDist:
        if i[0:2] == token:
            next_word[i[2]] = probDist[i]
    k = Counter(next_word)
    high = k.most_common(2) 
    w1a = high[0]
    tup = (token[1],w1a[0])
    w2a = predict_next_word(tup,probDist)
    tup = (w1a[0],w2a[0])
    w3a = predict_next_word(tup,probDist)
    pred.append(w1a)
    pred.append(w2a)
    pred.append(w3a)
    return pred
print("Predicting next 3 possible word sequences with smoothed trigram model : ")
pred = predict_next_3_words(ngram_1[2],smoothed_trigrams_probDist)
print("1)" +testSent1 +" "+ '\033[1m' + pred[0][0]+" "+pred[1][0]+" "+pred[2][0] + '\033[0m')
pred = predict_next_3_words(ngram_2[2],smoothed_trigrams_probDist)
print("2)" +testSent2 +" "+ '\033[1m' + pred[0][0]+" "+pred[1][0]+" "+pred[2][0] + '\033[0m')
pred = predict_next_3_words(ngram_3[2],smoothed_trigrams_probDist)
print("3)" +testSent3 +" "+ '\033[1m' + pred[0][0]+" "+pred[1][0]+" "+pred[2][0] + '\033[0m')

[Figure: predictions from the smoothed trigram model]

Conclusion

Building Language Models in NLP involves creating tools to predict word sequences in natural language. This article explains how to build unigram, bigram, and trigram models from raw text and use smoothing techniques to handle unseen n-grams. While these models provide decent predictions, advanced neural models offer better performance by capturing longer-range dependencies. Understanding n-gram models is key to mastering more complex NLP techniques.

Frequently Asked Questions

Q1. What are language models in NLP?

A. Language models are probabilistic models that estimate the probability of a sequence of words occurring in a sentence or text, typically by predicting each word from the words before it. They are fundamental to many NLP tasks like predictive text, speech recognition, and machine translation.

Q2. How to build an NLP model?

A. To build an NLP model, you typically need to preprocess text data, extract relevant features (e.g. n-grams, word embeddings), choose an appropriate model architecture (e.g. neural networks, ensemble methods), train the model on labeled data, tune hyperparameters, and evaluate performance.

Q3. How to build a language model?

A. To build a language model, you tokenize the text into words/n-grams, count their frequencies, and estimate probabilities of word sequences using techniques like maximum likelihood estimation with smoothing. Neural language models use neural networks trained on text to model these probabilities.

Q4. How are language models built?

A. Language models are built by first preprocessing a text corpus, then extracting n-grams (sequences of n words) and counting their frequencies. Probabilities of word sequences are estimated from these counts, often using smoothing techniques to account for unseen n-grams. Advanced neural language models learn these probabilities automatically through training on text data.

I am a pre-final year undergraduate at IIT Kharagpur. I am highly interested in all things ML and DL, and try to find applications of them in areas like healthcare and biology. Feel free to connect with me on LinkedIn!
