Revolutionizing Text Summarization: Exploring GPT-2 and XLNet Transformers

Amrutha Last Updated : 11 Aug, 2023

Introduction

We rarely have time to read and understand everything we come across. That is where text summarization comes into play: it condenses a long text into a shorter one so that we get the essential information without reading every detail. Text summarization is helpful in many situations. Imagine you are a student with an exam tomorrow, three chapters to cover, and only today to study. Don’t worry: a text summarizer can give you the main points of each chapter quickly, although it is no substitute for reading the material. Exciting, right? This article explores text summarization using the GPT-2 and XLNet transformer models.

Learning Objectives

In this article, we will learn

  • About Text Summarization and its types
  • How the transformer model emerged and how its architecture works
  • About transformer summarizers such as GPT-2 and XLNet
  • Finally, implementation using their different variants

This article was published as a part of the Data Science Blogathon.

What is Text Summarization?

Have you ever needed to go through a few pages of a book but could not bring yourself to do it? Even when the book is interesting, sometimes we just cannot keep turning pages. This is where text summarization helps: we can get the gist of the entire text without reading every line and page.

Text summarization converts a long text into a short one while preserving its main ideas and essential information. It is a fascinating field in natural language processing (NLP). In simple words, the goal is to capture the critical points of the original text so that readers can quickly grasp its content without reading all of it.

Source: Microsoft

Types of Summarization

There are two main types of text summarization approaches. They are:

  • Extractive
  • Abstractive

Let’s understand them in detail.

Extractive Summarization

It involves selecting and combining important sentences from the original text to form the summary. This type of summarization aims to extract the most relevant and informative sentences. These sentences should represent the main idea and context of the original text. The selected sentences directly form the summary without any modifications. Some standard techniques used in extractive summarization include:

  • Sentence Scoring: This is a score-based approach. The system selects sentences for the summary based on word frequency, sentence position, and the importance of keywords. It will choose sentences that score high for inclusion in the summary. In this way, all the high-scored sentences form the summary of the entire original text.
  • Graph-Based: In graph-based methods, we use a graph to represent the relationships between sentences. Each sentence is a node, and the edges represent the similarity or relatedness between sentences. A graph-ranking algorithm such as TextRank then identifies the most central sentences, and these form the summary.
Source: SpringerLink
  • Statistical Methods: These techniques use statistical tools and algorithms to evaluate the importance and relevance of individual sentences within the text. These methods aim to identify the most relevant and informative sentences by assigning scores and weights or utilizing optimization techniques. All the important sentences, in turn, form the summary of the text.
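
To make the sentence-scoring idea concrete, here is a minimal sketch in plain Python. It scores each sentence by the average frequency of its words and keeps the top-scoring sentences in their original order. The stop-word list, the naive sentence splitter, and the function names are assumptions made for illustration; real systems also use signals such as sentence position and keyword importance.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "and", "of", "to", "in", "it", "on"}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences=2):
    """Keep the num_sentences highest-scoring sentences, in original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


print(summarize("Cats sleep a lot. Cats chase mice. Dogs bark.", num_sentences=2))
```

Because the selected sentences are copied unchanged, the result is an extractive summary in the sense described above.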

Abstractive Summarization

Abstractive Summarization involves generating a summary that may contain rephrased sentences or even new sentences which are not part of the original text. It understands the content of the text and generates a summary to capture the main ideas. Unlike extractive summarization, where it directly adds original texts without any modifications to the summary, abstractive summarization is like a human generating the summary in their own words. Abstractive summarization techniques rely on advanced natural language generation models like neural networks or transformers. These advanced models can interpret and generate human-like language. Abstractive summarization has the advantage of producing more human-like summaries and can handle complex texts better. Some standard techniques used in abstractive summarization include:

Sequence-to-Sequence Models: Sequence-to-sequence (Seq2Seq) models use a neural encoder-decoder architecture that takes the source text as input and produces a summary as output. They are trained on pairs of source texts and reference summaries, so the model learns to map an input text to its summary by minimizing a loss function. The loss measures the difference between the generated summary and the reference summary. As the loss decreases, the generated summaries move closer to the references.
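
The article does not name a specific loss function, so the sketch below assumes token-level cross-entropy, which is a common choice for Seq2Seq training. The probabilities and the three-token vocabulary are made-up toy values; the point is only that a model which puts more probability on the reference tokens gets a lower loss.

```python
import math


def cross_entropy(predicted_probs, target_ids):
    """Average negative log-probability assigned to the reference tokens."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)


# Toy vocabulary of 3 tokens; each row is the model's distribution at one step.
reference = [0, 1]                                   # ids of the reference summary tokens
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]     # most mass on the right tokens
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]          # mass spread almost evenly

print(cross_entropy(confident, reference))
print(cross_entropy(unsure, reference))
```

The first value printed is smaller than the second, which is the signal the optimizer follows during training.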

Attention Mechanisms: Attention mechanisms are critical in many natural language processing (NLP) tasks, including text summarization. At each step of generating the summary, attention lets the model weight every part of the input sequence by its relevance, so the most relevant parts contribute the most. This helps the model focus on important information and produce more accurate and contextually appropriate summaries.
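
As a rough illustration of how such relevance weights can be computed, here is a small sketch of scaled dot-product attention, the form of attention used in transformers, written with plain Python lists. The query, key, and value vectors are toy numbers chosen for the example, and a real model would learn them.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights


out, w = attention(queries=[[1.0, 0.0]],
                   keys=[[1.0, 0.0], [0.0, 1.0]],
                   values=[[10.0], [20.0]])
print(w, out)
```

The query is more similar to the first key, so the first value receives the larger weight in the output.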

Reinforcement Learning: Reinforcement learning techniques fine-tune the summarization model by giving it rewards or penalties based on the quality of the generated summaries. Over many iterations, the model adjusts itself to produce summaries that earn higher rewards.
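
The article does not say how the quality of a summary is turned into a reward. One option, assumed here purely for illustration, is a word-overlap score in the style of ROUGE-1 recall: the fraction of reference words that also appear in the generated summary. The function name and the whitespace tokenization are hypothetical simplifications.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    """Fraction of reference words (with multiplicity) found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clip each word's count so repeated words cannot inflate the reward.
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


reward = unigram_recall("officer arrests the thief", "the officer arrests a thief")
print(reward)
```

A training loop would feed this number back to the model as the reward for the summary it generated.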

Transformer Models

Before transformers, sequence data was usually handled with Recurrent Neural Networks (RNNs), which process a sequence one element at a time while carrying a hidden state from step to step. RNNs have limitations: they are slow to train, and on long sequences the gradients can vanish. Long Short-Term Memory (LSTM) networks then came into the picture. LSTMs handle longer sequences than plain RNNs, but their more complex structure makes them even slower to train. In both cases the input has to be processed sequentially, which makes poor use of today’s GPUs, since those are designed for parallel computation. Then the transformers came into the picture.

Attention is All You Need

The paper “Attention Is All You Need” introduced a novel architecture called the Transformer. It uses an encoder-decoder architecture, like earlier RNN-based sequence-to-sequence models, but it processes all tokens of the input sequence in parallel. First, the input sequence is converted into input embeddings that represent the meaning of each token. Then, positional encodings are added to the embeddings so that the model knows each word’s position in the sentence, since parallel processing would otherwise lose the word order.
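
The paragraph above does not say how the positional values are computed. The sketch below assumes the fixed sinusoidal encoding from the original paper, where even dimensions use a sine and odd dimensions use a cosine of the position at different wavelengths. The function name and the small sizes are chosen only for the example.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position values."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=8)
# Each row is added element-wise to the embedding of the token at that position.
print(pe[0])
print(pe[1])
```

Every position gets a different row, so two identical words at different positions end up with different input vectors.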

The attention block computes an attention vector for each word. The problem is that a single attention vector may weigh a word’s relation with itself much higher than its relation with other words, whereas we need the interactions between different words. The original Transformer therefore computes eight attention vectors per word in parallel, one per attention head, and combines them into the final attention vector for that word by concatenating them and applying a learned linear projection. Because it uses multiple attention heads, this block is called the “multi-head attention block.” The attention vectors are then passed through a feed-forward network. The final output is a set of encoded vectors, one per word.

Source: Machine Learning Mastery

Decoder Network

Now let’s see the decoder network. First, we get the embeddings of the output sequence generated so far. Then we add positional encodings to retain word order. Next, the result passes through the first multi-head attention block, known as the masked multi-head attention block: while processing each word, all subsequent words are masked, so attention vectors are formed using only the current and preceding words. Following this, a second multi-head attention block attends over the encoder’s output, which connects the decoder to the input sequence. The result is then passed through a feed-forward layer. Finally, linear and softmax layers turn the output into a probability distribution over the vocabulary, from which the next word is predicted.
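
The masking step can be sketched in a few lines of plain Python. The helper names and the scores below are hypothetical; the idea is that masked positions are set to negative infinity before the softmax, so they receive exactly zero attention weight.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i is allowed to look at position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over the scores, giving zero weight to masked positions."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    highest = max(masked)
    exps = [math.exp(s - highest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
# Attention weights for the second word: the third word is hidden from it.
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])
print(mask)
print(weights)
```

Even though the third score is the largest, its weight is zero because that word lies in the future of the position being processed.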

GPT-2 for Text Summarization

GPT-2 is a transformer model that was pre-trained on a large corpus of English text in a self-supervised manner. GPT stands for Generative Pre-trained Transformer. The model is pre-trained on raw text without any labels, and it uses only a decoder-style stack with masked attention rather than the full encoder-decoder architecture. It was originally designed as a generative model that predicts the next word in a sentence, but it is a powerful language model that can also be used for text summarization.

During training, GPT-2 takes input sequences of a fixed length. The target sequences are the same as the input sequences but shifted by one token, so the model learns to predict the next token from the previous ones. GPT-2 uses masked self-attention to ensure that only previous tokens are used for each prediction; all future tokens are masked. Through its training process, GPT-2 learns how words and sentences fit together in the English language and develops an understanding of its patterns and structures. This knowledge is stored within the model and can be used to generate new text that sounds like a human wrote it.
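
The shifting of inputs and targets is easy to see on a toy example. The sketch below uses whole words as tokens for readability, which is an assumption made for the example; GPT-2 itself works on subword tokens.

```python
def next_token_pairs(tokens):
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return tokens[:-1], tokens[1:]


def prediction_contexts(tokens):
    """For each target token, list the previous tokens the model may look at."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["The", "store", "was", "robbed"]
inputs, targets = next_token_pairs(tokens)
print(inputs)
print(targets)
for context, target in prediction_contexts(tokens):
    print(context, "->", target)
```

Each target is predicted from the tokens to its left only, which is what the masked attention enforces inside the model.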

Variants of GPT-2

GPT-2 comes in several variants that differ in model size, that is, in their number of parameters. All the pre-trained GPT-2 models are available from the Hugging Face Model Hub, and we can fine-tune them as required. Now we will look at the variants of GPT-2 in detail.

1. GPT2-Small: Compact and Quick

This is the smallest version of GPT-2. It has fewer parameters than other versions and is faster to use. This version is suitable for tasks that need basic language understanding and generation. It may not be good with complex language patterns, but still, it generates sentences that make sense. GPT2-Small is great for situations where we don’t have a lot of time or powerful computers. It has 124M parameters. The model is named gpt2 on the official website of Hugging Face.

GPT2-Small can be fine-tuned on specific tasks or domains to enhance performance in targeted applications. Fine-tuning involves training the model on a smaller, task-specific dataset to adapt to a particular context. This process allows GPT2-Small to improve its performance in specific domains.

Due to its smaller size, it may struggle with generating highly detailed or contextually rich text. Sometimes it may generate unrelated texts, especially when faced with ambiguous prompts. But most of the time, it will generate meaningful and related texts.

2. GPT2-Medium: Finding a Balance

GPT2-Medium sits in the middle of the GPT-2 family, balancing model size against performance. Its capacity and capabilities fall between those of GPT2-Small and GPT2-Large, and it has 355M parameters in total, so it captures more complex patterns than GPT2-Small. These parameters are the internal learned weights that enable the model to understand and generate text.

Compared to GPT2-Small, GPT2-Medium can generate higher-quality text with improved coherence and fluency, which makes it a better choice when output quality matters. These enhanced capabilities come at the cost of more computational resources than GPT2-Small requires.

3. GPT2-Large: Advanced Language Skills

The GPT2-Large model takes text generation and understanding further. It has more parameters than GPT2-Medium, 774M in total, which improves its language modeling and enables it to generate contextually rich text. These parameters let the model capture many complex linguistic patterns, including long-range dependencies.

It can generate longer and more elaborate responses resembling human-generated content. Due to its larger size, GPT-2 Large requires more computational resources to train and utilize effectively. Some of its applications include advanced chatbots, creative content generation, and virtual assistants.

4. GPT2-XL: Supercharged Performance

GPT2-XL is the largest variant of GPT-2. Despite the similar name, it is not related to the XLNet architecture; “XL” simply indicates the extra-large model size. It has the highest number of parameters of all the GPT-2 variants, 1.5B in total, and uses the same architecture as the smaller variants with more layers and a wider hidden size. The additional capacity helps it capture complex relationships between words, even in long sequences, and it generally performs better than the smaller variants at language modeling.

You will see a wide range of applications where advanced language modeling is essential. By leveraging the power of GPT-2 XL, researchers and developers can unlock new possibilities in natural language processing.

Implementation Using GPT-2

Let’s use different GPT-2 variants for text summarization.

First, I took a short story as the input text. The goal is to generate a summary of the entire story.

text='''Once upon a time, in a small town called Willow Creek, a crime took place that left 
everyone startled. The local convenience store, a beloved gathering spot for the community,
was robbed. The news quickly spread, causing fear and concern among the townsfolk. 
Officer Sarah Johnson was assigned to investigate the crime. She carefully examined the store 
for any clues, hoping to uncover the truth. Footprints, fingerprints, and a broken lock were found, 
providing valuable evidence. The store's security cameras revealed a masked figure sneaking 
in late at night. Days turned into weeks, but there was no breakthrough in the case. 
The town's residents grew anxious, wondering who could commit such a crime. 
Officer Johnson was determined to crack the case, working tirelessly day and night. One evening, 
while patrolling the neighborhood, Johnson spotted a suspicious person acting strangely 
near the store. She discreetly followed the person, who turned out to be a young man named Alex. 
Alex confessed to the crime, explaining that he was facing financial difficulties and made a 
foolish choice out of desperation. Officer Johnson showed empathy towards Alex's situation, 
understanding the pressures he faced. She made sure he received the help he needed instead of 
harsh punishment. News of the arrest spread throughout the town, and the incident served as a 
reminder for the community to support each other during tough times. Officer Johnson's dedication 
and compassion were applauded, making her a respected figure in Willow Creek.'''

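For text summarization, we use the bert-extractive-summarizer library and import TransformerSummarizer from its summarizer module. Note that this library performs extractive summarization: it uses the chosen transformer model (here GPT-2 or XLNet) to embed each sentence and then selects representative sentences from the original text rather than generating new ones.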

# Shell command (prefix it with ! inside a notebook cell)
pip install bert-extractive-summarizer

# Python
from summarizer import TransformerSummarizer

First, we will use the GPT2-Small variant. So create an instance of the TransformerSummarizer class and assign it to the variable GPT2_model. It takes two parameters. The first is transformer_type, which specifies the transformer model we are using; here it is GPT2. The second is transformer_model_key, which specifies the variant; here it is gpt2. Then we print the generated summary, passing min_length=50 so that sentences shorter than 50 characters are not considered for the summary.

GPT2_model = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2")
Summary = ''.join(GPT2_model(text, min_length=50))
print(Summary) 

Once upon a time, in a small town called Willow Creek, a crime took place that left everyone startled. Officer Sarah Johnson was assigned to investigate the crime. The town’s residents grew anxious, wondering who could commit such a crime. Officer Johnson’s dedication and compassion were applauded, making her a respected figure in Willow Creek.
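
Notice that every sentence in this summary appears word for word in the story. That is expected: bert-extractive-summarizer embeds each sentence with the chosen model, clusters the embeddings, and keeps the sentence closest to each cluster centre. The sketch below imitates that idea on made-up two-dimensional “embeddings” with a tiny hand-written k-means. It is a simplified stand-in under those assumptions, not the library’s actual code, and the vectors and helper names are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=10):
    """Plain k-means with a deterministic start (the first k vectors)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: distance(v, centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def select_sentences(sentences, vectors, k):
    """Pick the sentence nearest to each centroid, in original order."""
    centroids = kmeans(vectors, k)
    picked = {min(range(len(vectors)), key=lambda i: distance(vectors[i], c))
              for c in centroids}
    return [sentences[i] for i in sorted(picked)]


sentences = ["A crime took place.", "The store was robbed.", "The town was afraid.",
             "Alex confessed.", "Alex received help.", "The officer helped Alex."]
# Made-up sentence vectors: the first three are close together, so are the last three.
vectors = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
           [10.0, 10.0], [10.0, 12.0], [10.0, 11.0]]
print(select_sentences(sentences, vectors, k=2))
```

With two clusters, one sentence is taken from each group, which is why the summaries in this article jump between different parts of the story.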

Now we will use the GPT2-Medium variant.

GPT2_medium_model = TransformerSummarizer(transformer_type="GPT2",
                                          transformer_model_key="gpt2-medium")
Summary = ''.join(GPT2_medium_model(text, min_length=50))
print(Summary)

Once upon a time, in a small town called Willow Creek, a crime took place that left everyone startled. The town’s residents grew anxious, wondering who could commit such a crime. Officer Johnson was determined to crack the case, working tirelessly day and night. One evening, while patrolling the neighborhood, Johnson spotted a suspicious person acting strangely near the store.

Now we will use the GPT2-Large variant.

GPT2_large_model = TransformerSummarizer(transformer_type="GPT2",transformer_model_key="gpt2-large")
Summary = ''.join(GPT2_large_model(text, min_length=50))
print(Summary)

Once upon a time, in a small town called Willow Creek, a crime took place that left everyone startled. She carefully examined the store for any clues, hoping to uncover the truth. Officer Johnson was determined to crack the case, working tirelessly day and night. News of the arrest spread throughout the town, and the incident served as a reminder for the community to support each other during tough times.

XLNet for Text Summarization

XLNet is a transformer-based language model that builds on the Transformer-XL architecture and achieved state-of-the-art results on many benchmarks when it was released. It was designed to overcome some limitations of previous models like BERT (Bidirectional Encoder Representations from Transformers). BERT learns bidirectional context by masking some input tokens and predicting them, so the artificial mask tokens it sees during pre-training never appear when the model is fine-tuned. XLNet is different: instead of masking, it trains over many possible permutations of the order in which the tokens are predicted. This is called permutation-based training. Because each token ends up being predicted from different subsets of the other tokens, XLNet captures context from both directions, which improves performance on various natural language processing tasks.

XLNet is also an “autoregressive” model. Autoregressive models predict the next token in a sequence based on the previous tokens, as GPT-2 does. XLNet combines this autoregressive objective with permutation-based training: within each sampled permutation, every token is predicted from the tokens that precede it in that permutation. The tokens themselves keep their original positions; only the prediction order changes.
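
A small sketch can show what a prediction order means here. For a given order, each position may only use the positions that were predicted before it. The function name and the example sentence are made up for illustration, and real training samples orders at random rather than enumerating them.

```python
import random


def visible_context(order):
    """Map each position to the positions predicted earlier in this order."""
    context = {}
    for step, position in enumerate(order):
        context[position] = sorted(order[:step])
    return context


tokens = ["New", "York", "is", "a", "city"]

# Left-to-right order: the usual autoregressive setup, as in GPT-2.
left_to_right = list(range(len(tokens)))
print(visible_context(left_to_right))

# A randomly sampled order: some tokens now see words to their right.
rng = random.Random(0)
sampled = left_to_right[:]
rng.shuffle(sampled)
for position, seen in visible_context(sampled).items():
    print(tokens[position], "is predicted from", [tokens[j] for j in seen])
```

Across many sampled orders, every token is eventually predicted from context on both its left and its right, without any mask token in the input.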

Variants of XLNet

XLNet has several variants that differ in model size and in the type of tokenization used (cased or uncased). We can choose the appropriate variant based on our task requirements and computational resources. All the pre-trained XLNet models are available from the Hugging Face Model Hub, and we can fine-tune them as required. Now we will see some of the popular variants of XLNet.

1. XLNet Base Cased: Preserves Case Sensitivity

XLNet Base Cased is a specific variant of the XLNet model that is trained on cased text. It means it retains the original capitalization of words in the training data. It has 110M parameters in total. The “Base” in XLNet Base Cased refers to its base architecture, which includes several layers, attention mechanisms, and other components.

The “Cased” aspect of XLNet Base Cased indicates that the model preserves the case information of words during training and inference. This means that uppercase and lowercase letters are treated differently. Also, the model can distinguish between them. This is very beneficial where capitalization carries semantic meaning. With its balanced architecture and case-preserved training, this model actually has a good balance between complexity, computational resources, and overall performance.

2. XLNet Large Cased: Large-scale Model with Case Sensitivity

XLNet Large Cased is another variant of XLNet with increased capacity, making it suitable for more complex tasks. It has more parameters than XLNet Base Cased, 340M in total. The architecture builds upon that of XLNet Base Cased, using more layers (24 instead of 12) and a larger hidden size, so it can capture more intricate language patterns.

Its increased capacity and performance come with high computational requirements: training and fine-tuning XLNet Large Cased may require powerful GPUs and hardware setups. XLNet Large Cased excels in various challenging NLP tasks, including machine translation, text classification, sentiment analysis, question answering, and document summarization.

3. XLNet Base Multilingual Cased: Multilingual Support for Cross-Lingual Tasks

XLNet Base Multilingual Cased is a variant of the XLNet model, specifically designed to support multilingual applications and cross-lingual tasks. It has enhanced capabilities and capacity to handle multiple languages. The model has undergone training on a large corpus of multilingual text. This helps the model to learn more representations and capture linguistic patterns. With multilingual training, the model can transfer knowledge and will be able to generalize well across different languages. This capability also aids in dealing with languages that were not utilized during the training process.

XLNet Base Multilingual Cased retains the case information of words during training and inference, making it case-sensitive. This means that it can distinguish between uppercase and lowercase letters. This has many applications and is particularly useful for cross-lingual tasks. These include machine translation, cross-lingual document classification, etc.

4. XLNet Base Cased IMDb

XLNet Base Cased IMDb is an XLNet variant fine-tuned on the IMDb dataset for sentiment analysis. The IMDb dataset is very popular in Natural Language Processing (NLP) and contains movie reviews labeled with positive or negative sentiment. The model was fine-tuned for 5 epochs using a batch size of 32 and a learning rate of 2e-05, with a maximum sequence length of 512. Since it was a classification task, the cross-entropy loss function was used to train the model. Although this model is primarily intended for classification, it can still be used for summarization, but the results may not be optimal.

Implementation Using XLNet

We use the same text that we used before for GPT-2. All the imported modules remain the same.

First, we will use the xlnet-base-cased model.

xlnet_base_cased_model = TransformerSummarizer(transformer_type="XLNet",
                                               transformer_model_key="xlnet-base-cased")
Summary = ''.join(xlnet_base_cased_model(text, min_length=50))
print(Summary)

Once upon a time, in a small town called Willow Creek, a crime took place that left everyone startled. Officer Sarah Johnson was assigned to investigate the crime. She carefully examined the store for any clues, hoping to uncover the truth. The store’s security cameras revealed a masked figure sneaking in late at night.

Let’s see the results using xlnet-large-cased model.

xlnet_large_cased_model = TransformerSummarizer(transformer_type="XLNet",
                                                transformer_model_key="xlnet-large-cased")
Summary = ''.join(xlnet_large_cased_model(text, min_length=50))
print(Summary)

Once upon a time, in a small town called Willow Creek, a crime took place that left everyone startled. The local convenience store, a beloved gathering spot for the community, was robbed. She made sure he received the help he needed instead of harsh punishment. News of the arrest spread throughout the town, and the incident served as a reminder for the community to support each other during tough times.

Finally, we will use the xlnet-base-cased-imdb model.

xlnet_base_cased_imdb_model = TransformerSummarizer(transformer_type="XLNet",
                              transformer_model_key="textattack/xlnet-base-cased-imdb")
Summary = ''.join(xlnet_base_cased_imdb_model(text, min_length=50))
print(Summary)

Once upon a time, in a small town called Willow Creek, a crime took place that left everyone startled. She carefully examined the store for any clues, hoping to uncover the truth. The store’s security cameras revealed a masked figure sneaking in late at night. News of the arrest spread throughout the town, and the incident served as a reminder for the community to support each other during tough times.

Conclusion

Text summarization simplifies the process of extracting key information from large texts and enables us to grasp the main points quickly. With transformer models like GPT-2 and XLNet, text summarization has improved considerably. In this article, we learned about text summarization and its types, how transformer models emerged, and how their architecture works. Then we explored different variants of GPT-2 and XLNet and used several of them to summarize a short story.

Key Takeaways

  • Text summarization is a useful technique for extracting important information from large texts.
  • Using text summaries, we can save time by understanding a text without reading all of it, and transformer models help make those summaries more precise and coherent.
  • It has many applications, including Social Media Analysis, Document summarization, News Aggregation, and many more.
  • GPT-2 and XLNet are powerful transformer models, and these models have made significant contributions to the field of text summarization.

Frequently Asked Questions

Q1. What is Transformer Summarizer?

A. Transformer summarizer is a text summarization technique that uses transformer models. It takes a long text as input and returns a short summary of the entire text as output. This helps readers understand the text without reading all of it.

Q2. What are the different approaches for text summarization?

A. Extractive and Abstractive Summarization are the two main text summarization approaches. Extractive summarization involves selecting and combining important sentences to summarize the entire text. Whereas abstractive summarization involves understanding the entire text and generating summaries where the sentences may or may not be present in the original text.

Q3. Can Transformer Summarizer handle multiple languages?

A. Yes, a transformer summarizer can handle multiple languages. By using multilingual transformer models trained on data in multiple languages, it can summarize texts written in various languages.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 

This is Amrutha, I am pursuing B.Tech in the Computer science Department. I am interested in developing ML Models with python and Data Analysis. And also I have an interest in Web Development. I hope my articles in Analytics Vidhya help you to learn better. Thank you!!
