Imagine if, instead of reading an entire article or research paper, you could read only its most important statements. This is possible through text summarization, a widely used technique in NLP. Text summarization takes a sequence of words (the input article) and returns a shorter sequence of words (the summary), making it a classic application of sequence-to-sequence models in machine learning. It is useful in domains like financial research, question-answering bots, media monitoring, and social media marketing. In this article, we will cover text summarization in detail, including its techniques and applications in NLP and text analytics.
In school, most of us had to condense long text articles into succinct summaries. The technique we used was to grasp the underlying idea of the text and reproduce a summary covering all the important points. This is the idea behind abstractive text summarization, in which the model outputs the main idea of the input text in its own words rather than in exact sentences from the input.
The second type is extractive summarization, in which the output is a subset of the input text that conveys its main idea. A personal analogy: you can think of extractive summarization as highlighting the important points of a reference paper you are trying to understand. It is a commonly used approach in NLP because it extracts the most relevant sentences from the original text while preserving their meaning.
As you may have guessed, extractive summarization is simpler to model than abstractive summarization. In abstractive summarization, the model is expected to understand language and its nuances in order to produce a valid summary. In extractive summarization, the model only has to score the input sentences using some scoring scheme (which we discuss in detail later in this article), threshold those scores, and output the most important sentences of the input itself.
Naturally, there is more research available on extractive summarization than on abstractive summarization. In this article, we will look into extractive summarization in further detail.
What do we mean by pre-trained models? These are models that have already been trained on large datasets. A model trained on huge amounts of data will naturally predict better; however, the difficulty of collecting that much data and the correspondingly long training time are reasons why, instead of training a model from scratch, we can benefit from a pre-trained one.
We will use the BBC News Summary dataset for this article and bert-extractive-summarizer as the pre-trained model.
The below code snippet installs the necessary libraries:
!pip install bert-extractive-summarizer
!pip install spacy
!pip install transformers # > 4.0.0
!pip install neuralcoref
!python -m spacy download en_core_web_md
After installing the above libraries and downloading the spaCy model, we can call the summarizer and pass it a sample text to view its output.
from summarizer import Summarizer

model = Summarizer()
text = "Learning NLP involves understanding basic principles of machine learning which then need to be customized for words. With the advent of using transfer learning for NLP I think it has made huge progress in terms of its research"
print(model(text))
As you can see in the output, the model provides an appropriate summary of our input text.
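The summarizer also lets you control how long the output is. The snippet below is a minimal sketch based on the ratio and num_sentences keyword arguments documented by bert-extractive-summarizer; if you are on an older version of the library, treat the exact argument names as an assumption to verify.

# keep roughly 20% of the input sentences (ratio kwarg per the library docs)
short_summary = model(text, ratio=0.2)

# or request a fixed number of sentences (num_sentences kwarg)
two_sentence_summary = model(text, num_sentences=2)
print(two_sentence_summary)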
Now let us use the same model on our BBC News dataset; the below snippet takes care of that. Since we have a total of 2225 input articles with an average length of 3000 words, to save execution time I have predicted summaries only for the first 10 articles.
from tqdm import tqdm

bert_predicted_summary = []
k = 0
for i in tqdm(df['text']):
    if k < 10:
        x = model(str(i))
        bert_predicted_summary.append(x)
        k += 1
Below is the output: the first summary is the one predicted by the pre-trained model, and the second is the actual summary provided in the dataset.
Using simple preprocessing techniques, like removing newline escape sequences (\n) or the byte-string prefixes (b') left over from reading files in binary mode, is always recommended. As the popular saying goes, garbage in, garbage out, so we need to clean our input before passing it to the model. I have used simple regular expressions for the preprocessing; the code snippet is below.
import os
import re

# lists to hold the cleaned article text and its category (folder name)
text, type_ = [], []

path = '/kaggle/input/bbc-news-summary/bbc news summary/BBC News Summary/News Articles/'
for i in os.listdir(path):
    for j in os.listdir(os.path.join(path, i)):
        with open(os.path.join(path, i, j), 'rb') as f:
            article = str(f.readlines())
        article = re.sub(r"b'|'", '', article)           # strip byte-string prefixes and quotes
        article = re.sub(r'\\n|\\t|[-/]', ' ', article)  # strip escaped newlines/tabs, dashes, slashes
        article = re.sub(r'\\xc2\\xa0', ' ', article)    # strip mis-encoded non-breaking spaces
        article = article.lower()
        text.append(article)
        type_.append(i)
For evaluating the output, the metric we use is the BLEU score; the next section of the article goes through it in detail.
BLEU stands for Bilingual Evaluation Understudy. It is a metric widely used for machine translation, text generation, and models whose output is a word sequence. Let us understand how it is calculated.
The range of BLEU scores is between 0 and 1, where 0 signifies no match between the expected output and the predicted output and 1 means a perfect match. BLEU can be considered as a modification to precision to handle sequence outputs.
Consider an example: suppose our predicted summary (or candidate) is “awesome awesome awesome” and our actual or expected summary (also known as the reference) is “NLP is awesome”. Since every word in our predicted output is present in the reference, the candidate has a precision of 1; however, we can all agree it is a pretty bad summary.
To overcome this, BLEU makes a simple modification: it clips the count of each word in the candidate to the maximum number of times that word appears in the reference. In our example the score becomes 1/3, as “awesome” is present only once in the reference.
Taking another example, let's say our reference is “I want to learn NLP” and our candidate is “NLP is what I want to learn”. Unigram precision ignores word order entirely: the scrambled candidate “NLP learn I want to” contains only words from the reference, so it gets a perfect unigram precision of 1, even though it is not grammatically correct.
This is why BLEU also considers n-grams (bigrams, trigrams, 4-grams). Accounting for bigrams in the same example, the bigrams of our candidate “NLP is what I want to learn” are “NLP is”, “is what”, “what I”, “I want”, “want to”, and “to learn”. Three of these six (“I want”, “want to”, “to learn”) also appear in the reference, so the bigram precision is 3/6. This shows how BLEU rewards exactly matching sequences of words between candidate and reference.
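To make the computation concrete, below is a minimal sketch of clipped n-gram precision written directly from the definition above; the function name and whitespace tokenization are my own choices rather than anything from a standard library.

from collections import Counter
from nltk.util import ngrams

def clipped_ngram_precision(candidate, reference, n):
    # each candidate n-gram's count is capped at its count in the reference
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# "awesome" occurs 3 times in the candidate but only once in the reference
print(clipped_ngram_precision("awesome awesome awesome", "NLP is awesome", 1))  # 0.333...

# 3 of the 6 candidate bigrams appear in the reference
print(clipped_ngram_precision("NLP is what I want to learn", "I want to learn NLP", 2))  # 0.5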
BLEU also penalizes candidates shorter than the reference. To understand why, we extend the original example: consider our candidate to be “NLP is” with reference “NLP is awesome”. Looking at bigrams alone, this candidate would receive a precision of 1. BLEU corrects for this with a brevity penalty, calculated as exp(1 − reference_length / candidate_length) whenever the candidate is shorter than the reference. Here the penalty is exp(1 − 3/2) ≈ 0.61, bringing our score down from 1 to roughly 0.61.
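The brevity penalty is easy to verify by hand; this short sketch simply restates the formula above (the helper name is mine):

import math

def brevity_penalty(candidate_len, reference_len):
    # 1 if the candidate is at least as long as the reference,
    # otherwise exp(1 - reference_len / candidate_len)
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# candidate "NLP is" (2 words) vs reference "NLP is awesome" (3 words)
print(round(brevity_penalty(2, 3), 2))  # 0.61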
We can now all agree on why BLEU is a widely used metric, but it does have some flaws; for example, it does not consider meaning. You can read further about the problems with BLEU to gain a better understanding of the metric.
We now look at the BLEU scores for the summaries generated by the pre-trained BERT model. The code snippet for the calculation:
from nltk.translate.bleu_score import corpus_bleu

def calculate_bleu_score(bert_predicted_summary, df):
    for i in range(len(bert_predicted_summary)):
        # split each summary into sentences, then into word tokens,
        # so BLEU operates on words rather than characters
        candidate = [s.split() for s in bert_predicted_summary[i].split('.') if s.strip()]
        reference = [[s.split()] for s in str(df['summary'][i]).split('.') if s.strip()]
        # corpus_bleu expects one list of references per candidate sentence
        n = min(len(candidate), len(reference))
        print(corpus_bleu(reference[:n], candidate[:n]))

calculate_bleu_score(bert_predicted_summary, df)
We can see that with basic preprocessing, and without fine-tuning the pre-trained model, each of the first 10 predicted summaries receives a good score, with an average BLEU of about 0.6.
Now let us dig deeper and create our own text summarizer in Python.
from nltk.tokenize import word_tokenize, sent_tokenize

def count_freq():
    # count how often each token appears across all cleaned articles
    res = {}
    for i in df['cleaned_text']:
        for k in word_tokenize(i):
            if k in res:
                res[k] += 1
            else:
                res[k] = 1
    return res

word_freq = count_freq()
def sentence_rank(text):
    # score each sentence by summing the corpus frequencies of its words
    weights = []
    sentences = sent_tokenize(text)
    for sentence in sentences:
        temp = 0
        words = word_tokenize(sentence)
        for word in words:
            temp += word_freq[word]
        weights.append(temp)
    return weights
import numpy as np

n = 14  # number of top-ranked sentences to keep per summary
for i in range(10):
    ranked_sentences = sentence_rank(df['cleaned_text'][i])
    sentences = sent_tokenize(df['cleaned_text'][i])
    # indices of the n highest-weighted sentences
    sort_list = np.argsort(ranked_sentences)[::-1][:n]
    result = ''
    for j in range(min(n, len(sentences))):  # j avoids shadowing the article index i
        result += '{} '.format(sentences[sort_list[j]])
    candidate = [s.split() for s in result.split('.') if s.strip()]
    reference = [[s.split()] for s in str(df['summary'][i]).split('.') if s.strip()]
    m = min(len(candidate), len(reference))
    print(corpus_bleu(reference[:m], candidate[:m]))
For improving the text summarizer, we could remove stopwords before counting frequencies, normalize each sentence's score by its length so long sentences are not unfairly favored, or weight words by TF-IDF instead of raw counts, as sketched below.
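As one illustration, here is a minimal sketch of the first two ideas, stopword removal and length normalization; the helper name is mine, and the rest follows the frequency-based scoring built above (NLTK's stopword list needs a one-time nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

def sentence_rank_normalized(text):
    # score sentences by the average frequency of their non-stopword tokens
    weights = []
    for sentence in sent_tokenize(text):
        words = [w for w in word_tokenize(sentence) if w not in stop_words]
        if words:
            weights.append(sum(word_freq.get(w, 0) for w in words) / len(words))
        else:
            weights.append(0)
    return weights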
In summary, this guide introduced you to creating short text summaries using NLP. We covered the different types of summarization, how to use a ready-made pre-trained model, and how to measure success with the BLEU score, and you got a taste of building your own summarizer in Python. Now you're all set to summarize text like a pro!
BERT is a pre-trained language model that can be used to summarize text. It learns from vast amounts of text and can then be fine-tuned to produce short, clear summaries, making quick and efficient summaries of long pieces of writing possible.
The goal of text summarization is to make text shorter while keeping the important information: a quick version that highlights the main ideas, making it easier and faster for people to understand.