We often don't have enough time to read and understand lengthy documents, research papers, or news articles, yet summarizing a large text while retaining its essential information is crucial in fields such as journalism, research, and business. This is where NLP text summarization comes into play: a technique that automatically generates a condensed version of a given text while preserving its essential meaning. This article explores the two main approaches to NLP text summarization, extractive and abstractive, and examines their applications, strengths, and weaknesses.
Objectives
Broadly, NLP text summarization can be divided into two main categories: extractive summarization and abstractive summarization.
Let's dive a little deeper into each of these categories.
So, what exactly happens in an extractive summarization model? It simply picks out the important sentences or phrases from the original text and joins them to form a summary.
The question is: on what basis are those sentences considered important? A ranking algorithm assigns a score to each sentence based on its relevance to the overall meaning of the document, and the most relevant sentences are then chosen for the summary.
Sentences can be ranked in various ways, for example:
TF-IDF (term frequency-inverse document frequency)
Graph-based methods such as TextRank (a minimal sketch appears right after this list)
Machine learning-based methods such as Support Vector Machines (SVM) and Random Forests.
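As referenced in the list above, here is a rough, hedged sketch of a TextRank-style ranker. It is only an illustration and is not used elsewhere in this article; it assumes networkx and scikit-learn are installed and that NLTK's punkt tokenizer has been downloaded. Sentences are turned into TF-IDF vectors, a similarity graph is built over them, and PageRank scores each sentence.

# Illustrative TextRank-style sketch (assumes networkx, scikit-learn, and NLTK's punkt tokenizer)
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text, n):
    # Split the text into sentences and represent each one as a TF-IDF vector
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)
    # Build a graph whose nodes are sentences and whose edge weights are cosine similarities
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    # PageRank gives higher scores to sentences that are similar to many other sentences
    scores = nx.pagerank(graph)
    top = sorted(range(len(sentences)), key=scores.get, reverse=True)[:n]
    # Return the top-ranked sentences in their original order
    return ' '.join(sentences[i] for i in sorted(top))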
The main goal of extractive summarization is to preserve the original meaning of the text. This method also works well when the input is already well structured, both visually and logically, such as newspaper articles.
Now, let's turn to the abstractive summarization method. As the name implies, it comes from the word "abstract", meaning an outline or summary that captures the basic idea of a longer text. Unlike extractive summarization, it doesn't simply pick out the important sentences. Instead, it analyzes the input text and generates new phrases or sentences that capture the essence of the original and convey the same meaning more concisely and coherently.
Again, how exactly is the summary generated with this method? In brief, the input text is analyzed by a neural network model trained on large amounts of text data; the model learns the relationships between words and sentences and generates new phrases and sentences that convey the same meaning as the original text in a more understandable manner.
This method uses advanced NLP techniques such as natural language generation (NLG) and deep learning to understand the context and generate the summary. The resulting summaries are usually shorter and more readable than the ones generated by the extractive summarization models, but they can sometimes contain errors or inaccuracies.
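To give a feel for what abstractive summarization looks like in practice, here is a hedged sketch using the Hugging Face transformers library. It is purely illustrative (this article sticks to extractive methods) and assumes the transformers package is installed and can download a default pretrained summarization model.

# Abstractive summarization sketch with Hugging Face transformers (illustrative only)
from transformers import pipeline

long_text = ("Weather is the day-to-day change in the atmosphere. It includes wind, storms, "
             "rain, hail, and snow, and it affects the clothes we wear and the food we eat.")

summarizer = pipeline("summarization")  # loads a default pretrained sequence-to-sequence model
result = summarizer(long_text, max_length=30, min_length=10, do_sample=False)
print(result[0]["summary_text"])        # a newly generated sentence, not a copied one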
Note that in this article, we'll only use the extractive summarization method. Let's focus on extractive summarization and understand it better with an example.
Here, we will implement the extractive summarization models using a Python library called NLTK (Natural Language Toolkit). NLTK provides a wide range of functionalities for natural language processing, including text tokenization, stopword removal, and sentence scoring.
Let’s take a look at the following code that demonstrates how to use NLTK to generate a summary from a given text:
# import the required libraries
import nltk
nltk.download('punkt') # punkt tokenizer for sentence tokenization
nltk.download('stopwords') # list of stop words, such as 'a', 'an', 'the', 'in', etc, which would be dropped
from collections import Counter # Imports the Counter class from the collections module, used for counting the frequency of words in a text.
from nltk.corpus import stopwords # Imports the stop words list from the NLTK corpus
# corpus is a large collection of text or speech data used for statistical analysis
from nltk.tokenize import sent_tokenize, word_tokenize # Imports the sentence tokenizer and word tokenizer from the NLTK tokenizer module.
# Sentence tokenizer is for splitting text into sentences
# word tokenizer is for splitting sentences into words
# This function takes two inputs: the text to summarize and n, the number of sentences the summary should contain
def generate_summary(text, n):
    # Tokenize the text into individual sentences
    sentences = sent_tokenize(text)
    # Build the set of English stop words
    stop_words = set(stopwords.words('english'))
    # Tokenize the full text into words using word_tokenize, drop stop words and
    # non-alphanumeric tokens, and convert everything to lowercase
    words = [word.lower() for word in word_tokenize(text) if word.lower() not in stop_words and word.isalnum()]
    # Compute the frequency of each word
    word_freq = Counter(words)
    # Score each sentence as the sum of the frequency counts of its constituent words.
    # Sentences with 20 or more content words are skipped; this threshold can be adjusted
    # to control how long the sentences allowed into the summary can be.
    sentence_scores = {}
    for sentence in sentences:
        sentence_words = [word.lower() for word in word_tokenize(sentence) if word.lower() not in stop_words and word.isalnum()]
        sentence_score = sum([word_freq[word] for word in sentence_words])
        if len(sentence_words) < 20:
            sentence_scores[sentence] = sentence_score
    # Select the top n sentences with the highest scores and join them into the summary
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
    summary = ' '.join(summary_sentences)
    return summary
text = '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
summary = generate_summary(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)
print(formatted_summary)
Output
The following output is the summary we get; it contains 5 sentences.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and more.
Wind speed, direction, temperature, and humidity are ways to measure weather.
So, the above code takes a text and a desired number of sentences as input and returns a summary generated using the extractive (frequency-based) approach. The method first tokenizes the text into individual sentences and then tokenizes each sentence into individual words. Stopwords are removed from the words, and then the frequency of each word is computed.
Then, the score for each sentence is computed based on the frequency of its words, and the top n sentences with the highest scores are selected to form the summary. Finally, the summary is generated by joining the selected sentences together.
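To make the scoring concrete, here is a tiny sketch with invented toy sentences (stop-word removal is skipped for brevity) that reproduces the per-sentence scoring used by the function above:

from collections import Counter

toy_sentences = ["cats love fish", "dogs love walks", "cats and dogs love food"]
# Count how often each word appears across the whole text
freq = Counter(word for s in toy_sentences for word in s.split())
# Score each sentence as the sum of its words' frequencies
scores = {s: sum(freq[w] for w in s.split()) for s in toy_sentences}
print(scores)  # the last sentence scores highest because it repeats the most frequent words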
In the next section, we will explore how the extractive summarization models can be further improved using advanced techniques such as TF-IDF.
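Before walking through the implementation, here is a small sketch (with made-up toy sentences) of what TfidfVectorizer produces: each sentence becomes one row of the matrix, each vocabulary word one column, and each cell holds that word's TF-IDF weight in that sentence.

from sklearn.feature_extraction.text import TfidfVectorizer

toy_sentences = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_sentences)
print(vectorizer.get_feature_names_out())  # one column per vocabulary word
print(matrix.toarray().round(2))           # one row of TF-IDF weights per sentence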
# importing the required libraries
# importing TfidfVectorizer class to convert a collection of raw documents to a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
# importing cosine_similarity function to compute the cosine similarity between two vectors.
from sklearn.metrics.pairwise import cosine_similarity
# importing nlargest to return the n largest elements from an iterable in descending order.
from heapq import nlargest
def generate_summary(text, n):
    # Tokenize the text into individual sentences (sent_tokenize was imported in the NLTK example above)
    sentences = sent_tokenize(text)
    # Create the TF-IDF matrix; the full document is appended as the last row so that
    # each sentence can be compared against the text as a whole
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences + [text])
    # Compute the cosine similarity between each sentence and the full document
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    # Select the top n sentences with the highest scores
    summary_sentences = nlargest(n, range(len(sentence_scores)), key=sentence_scores.__getitem__)
    # Join the selected sentences in their original order
    summary_tfidf = ' '.join([sentences[i] for i in sorted(summary_sentences)])
    return summary_tfidf
text = '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
summary = generate_summary(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)
print(formatted_summary)
The following output is the summary we get; it contains 5 sentences.
Energy from the Sun affects the weather, too.
Weather changes can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of the weather.
People try to use these measurements to make weather forecasts for the future.
The above code generates a summary for a given text using the TF-IDF approach. The generate_summary() function takes a text parameter and an n parameter (the number of sentences in the summary). It tokenizes the text into individual sentences, creates a TF-IDF matrix using the TfidfVectorizer class, and computes the cosine similarity between each sentence and the document using the cosine_similarity function.
Next, the function selects the top n sentences with the highest scores using the nlargest function from the heapq library and joins them into a string using the join method.
Okay, before proceeding, let’s quickly understand the cosine similarity. If you already know this, you can skip to the next part.
So, the cosine similarity considers the angle between the vectors of word frequencies for each document rather than just their magnitudes. This means that documents with similar word frequencies and distributions will have a smaller angle between their vectors and, thus, a higher cosine similarity score. Let’s understand this with a simple example.
We have two sentences: "I love cats and dogs." and "I love only cats."
To calculate the cosine similarity between these two sentences, we first need to convert each sentence into a vector representation.
We need to perform the following steps.
1. Break each sentence into individual words (tokenization):
"I love cats and dogs." -> ['I', 'love', 'cats', 'and', 'dogs', '.']
"I love only cats." -> ['I', 'love', 'only', 'cats', '.']
2. Now, create a vocabulary of unique words from both sentences:
[‘I’, ‘love’, ‘cats’, ‘and’, ‘dogs’, ‘.’, ‘only’]
3. Now convert each sentence into a binary vector of size equal to the vocabulary, where 1 represents the presence of the word in the sentence and 0 represents its absence.
“I love cats and dogs.” -> [1, 1, 1, 1, 1, 1, 0]
Explanation:
‘I’ is present, hence 1
‘love’ is present, hence 1
‘cats’ is present, hence 1
‘and’ is present, hence 1
‘dogs’ is present, hence 1
‘.’ is present, hence 1
‘only’ is absent, hence 0
“I love only cats.” -> [1, 1, 1, 0, 0, 1, 1]
Explanation:
‘I’ is present -> 1
‘love’ is present -> 1
‘cats’ is present -> 1
‘and’ is absent -> 0
‘dogs’ is absent -> 0
‘.’ is present -> 1
‘only’ is present -> 1
Each vector has seven elements, corresponding to the seven unique words in the vocabulary; a 1 means the word is present in the sentence and a 0 means it is absent.
Next, we could weight each word with TF-IDF. With only these two short "documents", words that appear in both sentences ('I', 'love', 'cats', '.') would receive an inverse document frequency (IDF) of zero, while words unique to one sentence ('and', 'dogs', 'only') would receive a positive weight. To keep the arithmetic simple, we compute the cosine similarity directly on the binary vectors above.
Finally, we compute the cosine similarity between the two vectors using the formula:
cosine_similarity = (v1 . v2) / (||v1|| * ||v2||)
where v1 and v2 are the vector representations of the sentences, and ‘.’ denotes the dot product of two vectors. ||v1|| and ||v2|| are the Euclidean norms of the two vectors.
Using the vector representations and the formula above, the cosine similarity between the two sentences is:
The dot product of the vectors [1, 1, 1, 1, 1, 1, 0] and [1, 1, 1, 0, 0, 1, 1] is:
1*1 + 1*1 + 1*1 + 1*0 + 1*0 + 1*1 + 0*1 = 4
The magnitude (or Euclidean norm) of the first vector [1, 1, 1, 1, 1, 1, 0] is:
sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2) = sqrt(6) ≈ 2.449
Similarly, the magnitude of the second vector [1, 1, 1, 0, 0, 1, 1] is:
sqrt(1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 1^2 + 1^2) = sqrt(5) ≈ 2.236
Therefore, the cosine similarity between the two sentences is:
cosine_similarity = 4 / (2.449 * 2.236) = 4 / 5.477 ≈ 0.73
This indicates that the two sentences are moderately similar but far from identical.
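As a quick sanity check of the arithmetic above, the following sketch computes the same cosine similarity with NumPy on the hand-built binary vectors (note that scikit-learn's default tokenizer would drop single-character tokens such as 'I' and the punctuation, so it would give a different value):

import numpy as np

v1 = np.array([1, 1, 1, 1, 1, 1, 0])  # "I love cats and dogs."
v2 = np.array([1, 1, 1, 0, 0, 1, 1])  # "I love only cats."
cosine = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(cosine, 2))  # -> 0.73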
Let's now check how well our approaches work on a longer text. The following passage is taken from Wikipedia.
Weather is the day-to-day or hour-to-hour change in the atmosphere. It includes wind, lightning, storms, hurricanes, tornadoes (twisters), rain, hail, snow, and lots more. Energy from the Sun also affects the weather. Climate tells us what kinds of weather usually happen in an area at different times of the year. Weather changes can affect our mood and life. We wear different clothes and do different things in different weather conditions. We choose different foods in different seasons.
Weather stations around the world measure different parts of the weather. Wind speed, direction, temperature, and humidity are ways to measure weather. People try to use these measurements to make weather forecasts for the future. These people are scientists who are called meteorologists. They use computers to build large mathematical models to follow weather trends.
How can we check the accuracy of a generated summary for the above text? One way is to use human evaluation as the ground truth. In this approach, we create summaries using each method (frequency-based and TF-IDF) and ask human evaluators to rate the quality of each summary based on criteria such as coherence, readability, and relevance to the original text. We then calculate the average score for each method from the evaluators' ratings, which gives us a quantitative measure of each method's performance.
Another approach is to use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a commonly used metric for evaluating text summarization models. ROUGE measures the overlap between the generated and reference summaries (i.e., the ground truth).
Let's first go with the human evaluation method. The frequency-based approach produced the following summary:
We wear different clothes and do different things in various weather conditions.
Weather stations around the world measure different parts of the weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Wind speed, direction, temperature, and humidity are ways to measure weather.
The TF-IDF approach produced the following summary:
Energy from the Sun affects the weather too.
Weather changes can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of the weather.
People try to use these measurements to make weather forecasts for the future.
On average, human evaluators rated the frequency-based approach 4/5 and the TF-IDF approach 3/5.
So, as per human evaluation, the frequency-based approach works better.
Now, let’s see how the machine evaluates.
Now let's see the evaluation using ROUGE. The code below defines a human-written reference summary, and we check how well the automatically generated summaries compare to it.
# in case the rouge package is not installed on your system
!pip install rouge
from rouge import Rouge
# evaluate_rouge takes two arguments, the reference text and the summary text,
# and uses the ROUGE metric to evaluate the quality of the summary compared to the reference.
# It returns the F1 score of the ROUGE-1 metric.
def evaluate_rouge(reference_text, summary_text):
    rouge = Rouge()
    # get_scores expects the generated summary (hypothesis) first, then the reference
    scores = rouge.get_scores(summary_text, reference_text)
    return scores[0]['rouge-1']['f']
# the following is a human generated summary
reference_summary = '''
Weather is a gradual slow change through days and hours in the atmosphere and can vary from wind to snow.
Climate tells a lot about the weather in an area.
The livelihood of people changes according to the change in weather.
Weather stations measure different parts of weather.
People who use measurements to make weather forecasts for the future are called meteorologists, and are scientists.'''
# the sample text from Wikipedia
text = '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
# Generate summary using frequency-based/TF-IDF approach
summary = generate_summary(text, 5)
# Evaluate the summary using ROUGE
rouge_score = evaluate_rouge(reference_summary, summary)
print(f"ROUGE score: {rouge_score}")
# For frequency based approach we are getting a score of 0.336
# For TF-IDF approach we are getting a score of 0.465
Here, a reference summary and a text are defined. Then, a summary is generated from the text using the frequency-based or the TF-IDF version of generate_summary(). Next, the generated summary's ROUGE score is evaluated against the reference summary using the evaluate_rouge() function. The ROUGE score measures the overlap between the generated and reference summaries; the higher the score, the more similar the two summaries are.
Now, here, for the frequency-based approach, we get a score of 0.336; using the TF-IDF approach, we get a score of 0.465. So, in this evaluation method, the TF-IDF approach works better.
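If you want to look beyond the ROUGE-1 F1 score, the rouge package also reports ROUGE-2 and ROUGE-L, each with recall, precision, and F1. A small sketch reusing the summary and reference_summary variables defined above:

# Inspect all ROUGE variants for the generated summary
scores = Rouge().get_scores(summary, reference_summary)[0]
for metric, values in scores.items():  # 'rouge-1', 'rouge-2', 'rouge-l'
    print(metric, {k: round(v, 3) for k, v in values.items()})  # keys: 'r', 'p', 'f'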
Aspect | Extractive Summarization | Abstractive Summarization
Language | Uses the same sentences as in the original text | Generates new text that differs from the original but captures its essence
Complexity | Less complex than the abstractive method | More complex; relies on NLG and deep learning
Accuracy | Tends to be more accurate, since it selects direct sentences from the text | Paraphrases the original and can sometimes contain errors or inaccuracies
Domain suitability | Suitable for domain-specific cases with less language variation | More suitable for general texts
This field keeps climbing the technology ladder as R&D teams explore new techniques every day, and advances in machine learning and NLP will gradually improve the quality and accuracy of generated summaries.
It also increasingly relies on deep learning models such as recurrent neural networks and transformers, which lead to a better understanding of a text's content; further advancements in language generation techniques will enable more sophisticated abstractive summarization methods.
These advanced solutions will help us save time, increase productivity, and make information more accessible and easily digestible.
Text summarization is a fast-growing field in natural language processing, and it has the potential to revolutionize the way we consume and process information. In this article, we covered the two main approaches to summarization, implemented extractive summarization with frequency-based and TF-IDF scoring, and evaluated the resulting summaries using human judgment and ROUGE.
Frequently Asked Questions
Q. What is extractive text summarization?
A. Extractive text summarization involves selecting key sentences or phrases directly from the source text to form a concise summary. It identifies important parts based on statistical or linguistic features without generating new sentences or altering the original content.
Q. Which is the best extractive summarizer?
A. Determining the "best" extractive summarizer depends on the application. However, popular tools include BERTSUM and Sumy. BERT-based models like BERTSUM achieve high accuracy by leveraging contextual embeddings.
Q. What is the difference between extractive and abstractive summarization?
A. Extractive summarization selects key sentences directly from the source, while abstractive summarization generates new sentences that paraphrase and condense the original content, providing more coherent and human-like summaries.
Q. What is text summarization in NLP?
A. Text summarization in NLP aims to create shorter versions of texts while retaining essential information. It includes two main types: extractive summarization (selecting key text segments) and abstractive summarization (generating new condensed text).