Imagine if, instead of reading an entire article or research paper, you could read only its most important statements. This is possible through text summarization, a widely used technique in NLP. Text summarization takes a sequence of words (the input article) and returns a shorter sequence of words (the summary), making it a classic application of sequence-to-sequence models in machine learning. It is useful in domains like financial research, question-answering bots, media monitoring, and social media marketing. In this article, we will cover text summarization in detail, including its techniques and applications in NLP and text analytics.
In school, most of us had to condense long text articles into succinct summaries. The technique we used was to grasp the underlying idea of the text and reproduce a summary covering all the important points. This is the idea behind abstractive text summarization, in which the model outputs the main idea of the input text in its own words rather than in exact sentences from the input.
The second type is extractive summarization, in which the output is a subset of the input text that conveys its main idea. A personal analogy: you can think of extractive summarization as highlighting the important points of a reference paper you are trying to understand. It is a commonly used approach in NLP because it extracts the most relevant sentences from the original text while preserving their meaning.
As you may have guessed, extractive summarization is simpler to model than abstractive summarization. In abstractive summarization, the model is expected to understand language and its nuances in order to produce a valid summary. In extractive summarization, the model only has to score the input sentences using some scoring scheme (which we discuss in detail later in this article), threshold those scores, and output the most important sentences of the input itself.
Naturally, there is more research available on extractive summarization than on abstractive summarization. In this article, we will look into extractive summarization in further detail.
What do we mean by pre-trained models? These are models that have already been trained on large datasets. A model trained on huge amounts of data will naturally predict better; however, the difficulty of collecting that much data and the correspondingly long training time are reasons why, instead of training a model from scratch, we can benefit from a pre-trained one.
We will use the BBC News Summary dataset for this article and bert-extractive-summarizer as the pre-trained model.
The below code snippet installs the necessary libraries:
!pip install bert-extractive-summarizer
!pip install spacy
!pip install transformers # > 4.0.0
!pip install neuralcoref
!python -m spacy download en_core_web_md
After installing the above libraries and downloading the spaCy model, we can call the summarizer and pass it a sample text to view its output.
from summarizer import Summarizer

model = Summarizer()
text = "Learning NLP involves understanding basic principles of machine learning which then need to be customized for words. With the advent of using transfer learning for NLP I think it has made huge progress in terms of its research"
print(model(text))
As you can see in the output, the model provides an appropriate summary of our input text.
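The summarizer also lets you control how long the output is. The snippet below is a minimal sketch based on the ratio and num_sentences keyword arguments documented by bert-extractive-summarizer; if you are on an older version of the library, treat the exact argument names as an assumption to verify.

# keep roughly 20% of the input sentences (ratio kwarg per the library docs)
short_summary = model(text, ratio=0.2)

# or request a fixed number of sentences (num_sentences kwarg)
two_sentence_summary = model(text, num_sentences=2)
print(two_sentence_summary)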
Now let us use the same model on our BBC News dataset; the below snippet takes care of that. Since we have a total of 2225 input articles with an average length of 3000 words, to save execution time I have predicted summaries only for the first 10 articles.
from tqdm import tqdm

bert_predicted_summary = []
k = 0
for i in tqdm(df['text']):
    if k < 10:
        x = model(str(i))
        bert_predicted_summary.append(x)
        k += 1
Below is the output: the first summary is the one predicted by the pre-trained model, and the second is the actual summary provided in the dataset.
Using simple preprocessing techniques, like removing newline escape sequences (\n) or the byte-string prefixes (b') left over from reading files in binary mode, is always recommended. As the popular saying goes, garbage in, garbage out, so we need to clean our input before passing it to the model. I have used simple regular expressions for the preprocessing; the code snippet is below.
import os
import re

# lists to hold the cleaned article text and its category (folder name)
text, type_ = [], []

path = '/kaggle/input/bbc-news-summary/bbc news summary/BBC News Summary/News Articles/'
for i in os.listdir(path):
    for j in os.listdir(os.path.join(path, i)):
        with open(os.path.join(path, i, j), 'rb') as f:
            article = str(f.readlines())
        article = re.sub(r"b'|'", '', article)           # strip byte-string prefixes and quotes
        article = re.sub(r'\\n|\\t|[-/]', ' ', article)  # strip escaped newlines/tabs, dashes, slashes
        article = re.sub(r'\\xc2\\xa0', ' ', article)    # strip mis-encoded non-breaking spaces
        article = article.lower()
        text.append(article)
        type_.append(i)
For evaluating the output, the metric we use is the BLEU score; the next section of the article goes through it in detail.
BLEU stands for Bilingual Evaluation Understudy. It is a metric widely used for machine translation, text generation, and models whose output is a word sequence. Let us understand how it is calculated.
The range of BLEU scores is between 0 and 1, where 0 signifies no match between the expected output and the predicted output and 1 means a perfect match. BLEU can be considered as a modification to precision to handle sequence outputs.
Consider an example: suppose our predicted summary (or candidate) is “awesome awesome awesome” and our actual or expected summary (also known as the reference) is “NLP is awesome”. Since every word in our predicted output is present in the reference, the candidate has a precision of 1; however, we can all agree it is a pretty bad summary.
To overcome this, BLEU makes a simple modification: it clips the count of each word in the candidate to the maximum number of times that word appears in the reference. In our example the score becomes 1/3, as “awesome” is present only once in the reference.
Taking another example, let's say our reference is “I want to learn NLP” and our candidate is “NLP is what I want to learn”. Unigram precision ignores word order entirely: the scrambled candidate “NLP learn I want to” contains only words from the reference, so it gets a perfect unigram precision of 1, even though it is not grammatically correct.
This is why BLEU also considers n-grams (bigrams, trigrams, 4-grams). Accounting for bigrams in the same example, the bigrams of our candidate “NLP is what I want to learn” are “NLP is”, “is what”, “what I”, “I want”, “want to”, and “to learn”. Three of these six (“I want”, “want to”, “to learn”) also appear in the reference, so the bigram precision is 3/6. This shows how BLEU rewards exactly matching sequences of words between candidate and reference.
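To make the computation concrete, below is a minimal sketch of clipped n-gram precision written directly from the definition above; the function name and whitespace tokenization are my own choices rather than anything from a standard library.

from collections import Counter
from nltk.util import ngrams

def clipped_ngram_precision(candidate, reference, n):
    # each candidate n-gram's count is capped at its count in the reference
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# "awesome" occurs 3 times in the candidate but only once in the reference
print(clipped_ngram_precision("awesome awesome awesome", "NLP is awesome", 1))  # 0.333...

# 3 of the 6 candidate bigrams appear in the reference
print(clipped_ngram_precision("NLP is what I want to learn", "I want to learn NLP", 2))  # 0.5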
BLEU also penalizes candidates shorter than the reference. To understand why, we extend the original example: consider our candidate to be “NLP is” with reference “NLP is awesome”. Looking at bigrams alone, this candidate would receive a precision of 1. BLEU corrects for this with a brevity penalty, calculated as exp(1 − reference_length / candidate_length) whenever the candidate is shorter than the reference. Here the penalty is exp(1 − 3/2) ≈ 0.61, bringing our score down from 1 to roughly 0.61.
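The brevity penalty is easy to verify by hand; this short sketch simply restates the formula above (the helper name is mine):

import math

def brevity_penalty(candidate_len, reference_len):
    # 1 if the candidate is at least as long as the reference,
    # otherwise exp(1 - reference_len / candidate_len)
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# candidate "NLP is" (2 words) vs reference "NLP is awesome" (3 words)
print(round(brevity_penalty(2, 3), 2))  # 0.61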
We can now all agree on why BLEU is a widely used metric, but it does have some flaws; for example, it does not consider meaning. You can read further about the problems with BLEU to gain a better understanding of the metric.
We now look at the BLEU scores for the summaries generated by the pre-trained BERT model. The code snippet for the calculation:
from nltk.translate.bleu_score import corpus_bleu

def calculate_bleu_score(bert_predicted_summary, df):
    for i in range(len(bert_predicted_summary)):
        # split each summary into sentences, then into word tokens,
        # so BLEU operates on words rather than characters
        candidate = [s.split() for s in bert_predicted_summary[i].split('.') if s.strip()]
        reference = [[s.split()] for s in str(df['summary'][i]).split('.') if s.strip()]
        # corpus_bleu expects one list of references per candidate sentence
        n = min(len(candidate), len(reference))
        print(corpus_bleu(reference[:n], candidate[:n]))

calculate_bleu_score(bert_predicted_summary, df)
We can see that with basic preprocessing, and without fine-tuning the pre-trained model, each of the first 10 predicted summaries receives a good score, with an average BLEU of about 0.6.
Now let us dig deeper and create our own text summarizer in Python.
from nltk.tokenize import word_tokenize, sent_tokenize

def count_freq():
    # count how often each token appears across all cleaned articles
    res = {}
    for i in df['cleaned_text']:
        for k in word_tokenize(i):
            if k in res:
                res[k] += 1
            else:
                res[k] = 1
    return res

word_freq = count_freq()
def sentence_rank(text):
    # score each sentence by summing the corpus frequencies of its words
    weights = []
    sentences = sent_tokenize(text)
    for sentence in sentences:
        temp = 0
        words = word_tokenize(sentence)
        for word in words:
            temp += word_freq[word]
        weights.append(temp)
    return weights
import numpy as np

n = 14  # number of top-ranked sentences to keep per summary
for i in range(10):
    ranked_sentences = sentence_rank(df['cleaned_text'][i])
    sentences = sent_tokenize(df['cleaned_text'][i])
    # indices of the n highest-weighted sentences
    sort_list = np.argsort(ranked_sentences)[::-1][:n]
    result = ''
    for j in range(min(n, len(sentences))):  # j avoids shadowing the article index i
        result += '{} '.format(sentences[sort_list[j]])
    candidate = [s.split() for s in result.split('.') if s.strip()]
    reference = [[s.split()] for s in str(df['summary'][i]).split('.') if s.strip()]
    m = min(len(candidate), len(reference))
    print(corpus_bleu(reference[:m], candidate[:m]))
For improving the text summarizer, we could remove stopwords before counting frequencies, normalize each sentence's score by its length so long sentences are not unfairly favored, or weight words by TF-IDF instead of raw counts, as sketched below.
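As one illustration, here is a minimal sketch of the first two ideas, stopword removal and length normalization; the helper name is mine, and the rest follows the frequency-based scoring built above (NLTK's stopword list needs a one-time nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

def sentence_rank_normalized(text):
    # score sentences by the average frequency of their non-stopword tokens
    weights = []
    for sentence in sent_tokenize(text):
        words = [w for w in word_tokenize(sentence) if w not in stop_words]
        if words:
            weights.append(sum(word_freq.get(w, 0) for w in words) / len(words))
        else:
            weights.append(0)
    return weights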
In summary, this guide introduced you to creating short text summaries using NLP. We covered the different types of summarization, how to use a ready-made pre-trained model, and how to measure success with the BLEU score, and you got a taste of building your own summarizer in Python. Now you're all set to summarize text like a pro!
BERT is a pre-trained language model that can be used to summarize text. It learns from vast amounts of text and can then be fine-tuned to produce short, clear summaries, making quick and efficient summaries of long pieces of writing possible.
The goal of text summarization is to make text shorter while keeping the important information: a quick version that highlights the main ideas, making it easier and faster for people to understand.