Comparing LLMs for Text Summarization and Question Answering

Aadya Singh | Last Updated: 20 Nov, 2024
12 min read

Large language models like BERT, T5, BART, and DistilBERT are powerful tools in natural language processing, each designed with unique strengths for specific tasks such as summarization and question answering. These models vary in architecture, performance, and efficiency. In the code below we will compare them across two tasks: BART and T5 for text summarization, and DistilBERT and BERT for question answering. By comparing their performance on real-world datasets, we aim to determine which model excels at each task, helping optimize both results and resources for practical applications.

Learning Objectives

  • Understand the core differences between BERT, DistilBERT, BART, and T5 for NLP tasks like text summarization and question answering.
  • Understand the fundamentals of Text Summarization and Question Answering, and apply advanced NLP models to enhance performance.
  • Learn how to select and optimize models based on task-specific requirements like computational efficiency and result quality.
  • Explore practical implementations of text summarization using BART and T5, and question answering with BERT and DistilBERT.
  • Acquire hands-on experience with NLP pipelines and datasets like CNN/DailyMail and SQuAD to derive actionable insights.

This article was published as a part of the Data Science Blogathon.

Understanding Text Summarization

Summarization is the process of taking a passage of text and reducing its length while keeping its meaning intact. The models we will use for comparison are:

BART: Bidirectional and Auto-Regressive Transformers

BART is a combination of two model types. It first processes text bidirectionally to understand the context of words, then generates a summary in a left-to-right manner, combining the bidirectional nature of BERT with the autoregressive text-generation approach seen in GPT. Like T5, BART uses an encoder-decoder structure, but it is specifically designed for text generation tasks. For summarization, BART's encoder first reads the entire passage and captures the relationships between words in both directions. This deep contextual understanding allows it to focus on the key parts of the input text. The decoder then generates an abstractive summary from this input, producing new, shortened phrases rather than merely extracting sentences.
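To make the encoder-decoder flow concrete, below is a minimal sketch of BART summarization at the model level rather than through the pipeline helper used later in this article. It assumes the transformers library is installed; the demo text (taken from the dataset sample shown later) and the generation settings (num_beams, length limits) are illustrative choices, not values from the comparison code.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill."

# Encoder: read the whole passage bidirectionally (up to BART's 1024-token limit)
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

# Decoder: generate an abstractive summary left to right
summary_ids = model.generate(inputs["input_ids"], max_length=50, min_length=10, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))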

T5: The Text-to-Text Transfer Game-Changer

T5 is based on the Transformer architecture and frames every NLP task as text-to-text: the model receives a task prefix (such as "summarize:") along with the input and produces its output as plain text. Its summaries are abstractive rather than extractive; instead of copying phrases directly from the text, it often rephrases content to create a concise version.
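As a small illustration of that text-to-text framing, the sketch below calls t5-small directly with the "summarize:" task prefix; the pipeline helper used later adds this prefix automatically, but it must be supplied when calling the model yourself. The demo text and generation settings are assumptions for the example.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill."

# T5 treats every task as text-to-text, so the task is named in the input itself
input_ids = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True).input_ids
output_ids = model.generate(input_ids, max_length=50, min_length=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))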

Verdict: T5 tends to be faster and more computationally efficient than BART but BART might perform better in terms of natural language fluency in certain cases.

Exploring Question Answering Tasks

Question answering is when we ask a model a question, and it finds the answer in a given context or passage of text. Here’s how the two models for question answering work and how they compare:

BERT: Bidirectional Encoder Representations from Transformers

BERT is a large, powerful model that looks at words in both directions to understand their meaning based on context. When you provide BERT with a question and a passage of text, it looks for the span of the text most likely to answer the question. BERT is one of the most accurate models for question answering tasks. It performs very well because of its ability to understand the relationships between words in a passage and their context.
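As a quick illustration of that span extraction, the question-answering pipeline (used again in the code section below) returns not just the answer text but also a confidence score and the character positions of the span within the context. The question here is made up against the cruise-ship passage shown later in the dataset overview.

from transformers import pipeline

qa = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa(
    question="Where did the cruise ship dock?",
    context="An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday.",
)
# The output includes the extracted span and where it sits in the context,
# e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Rio de Janeiro'}
print(result)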

DistilBERT

DistilBERT is a smaller, lighter version of BERT. BERT was trained to understand language in both directions (left and right), making it very powerful for tasks like question answering. DistilBERT does the same thing with fewer parameters, which makes it faster at the cost of slightly lower accuracy compared to BERT. It can answer questions based on a given passage of text, and it is particularly useful for tasks that need less computational power or a quicker response time.

Verdict: BERT is more accurate and can handle more complex questions and texts, but it requires more computational power and takes longer to give results. DistilBERT, being a smaller model, is quicker but might not always perform as well on more complicated texts.

Code Implementation and Setup

Below we will go through the code implementation, along with the dataset overview and setup:


Dataset Overview

  • Dataset for summarization task: CNN/Daily Mail dataset
  • The CNN/DailyMail dataset is an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail.
  • Supported tasks: ‘summarization’. Versions 2.0.0 and 3.0.0 of the CNN/DailyMail dataset can be used to train a model for abstractive and extractive summarization.

Data fields:

  • id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
  • article: a string containing the body of the news article
  • highlights: a string containing the highlight of the article as written by the article author
  • Data instances: For each instance, there is a string for the article, a string for the highlights, and a string for the id.
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
 'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'}
  • Dataset for question answering task: SQuAD (Stanford Question Answering Dataset)
  • Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
  • Supported Tasks: ‘Question Answering’.

Data Items

  • id: a unique identifier for each sample in the dataset
  • title: The title of the article or document from which the question is derived.
  • context: The text passage (context) from which the answer to the question can be derived.
  • question: The question related to the provided context.
  • answers: a dictionary feature containing:
    • text: answer to the question extracted from the context
    • answer_start: indicates the starting position (index) of the answer in the context string
{
    "answers": {
        "answer_start": [1],
        "text": ["This is a test text"]
    },
    "context": "This is a test context.",
    "id": "1",
    "question": "Is this a test?",
    "title": "train test"
}
from transformers import pipeline
from datasets import load_dataset
import time
  • pipeline is a tool from Hugging Face’s transformers library that provides ready-to-use NLP pipelines for multiple tasks.
  • load_dataset enables easy loading of a variety of datasets directly from Hugging Face’s dataset hub.
  • time is used here to calculate how long each model takes to respond.

Loading Our Dataset

# Load our datasets
# CNN/Daily Mail for summarization
summarization_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")  # Use 1% of the training data

# SQuAD for question answering
qa_dataset = load_dataset("squad", split="validation[:1%]")  # Use 1% of the validation data
  • Next, load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]") loads the CNN/Daily Mail dataset, a large collection of news articles commonly used for summarization tasks. "3.0.0" specifies the dataset version, and split="train[:1%]" means we use only 1% of the training set to keep testing quick, so summarization_dataset holds a small subset of the original data.
  • load_dataset("squad", split="validation[:1%]") loads SQuAD (the Stanford Question Answering Dataset), a popular dataset for question answering. split="validation[:1%]" selects only 1% of the validation data. qa_dataset contains questions paired with context passages, where the answer to each question can be found within its corresponding passage. A quick inspection, shown below, confirms what was loaded.
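As a sanity check, the short sketch below (assuming the two datasets loaded above) prints the row counts and feature names, an easy way to verify that a split string did what you expected:

# Confirm the 1% splits loaded as expected
print(summarization_dataset)   # num_rows plus features: article, highlights, id
print(qa_dataset)              # num_rows plus features: id, title, context, question, answers
print(summarization_dataset[0]["article"][:200])  # peek at the first article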

Task 1: Text Summarization

# Task 1: Text Summarization
def summarize_with_bart(text):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    return summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"]

def summarize_with_t5(text):
    summarizer = pipeline("summarization", model="t5-small")
    return summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"]
  • In summarize_with_bart(text), pipeline("summarization", model="facebook/bart-large-cnn") creates a summarization pipeline using BART (Bidirectional and Auto-Regressive Transformers) with the facebook/bart-large-cnn checkpoint, a version of BART fine-tuned specifically for summarization tasks. summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"] calls the summarizer on the input text; do_sample=False ensures deterministic output, and [0]["summary_text"] extracts the generated summary text from the output.
  • In summarize_with_t5(text), pipeline("summarization", model="t5-small") creates a summarization pipeline using the T5 (Text-To-Text Transfer Transformer) model with the t5-small variant. As with BART, summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"] calls the summarization model on the input text and extracts the summary. Note that each call rebuilds its pipeline, which keeps the functions self-contained at the cost of reloading the model; a quick sanity check on one article follows below.
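Before running the timed comparison, a one-off call like the sketch below, using the first article from the 1% split loaded earlier, is a quick way to confirm both functions work:

# Quick sanity check on a single article
sample_article = summarization_dataset[0]["article"][:1024]
print("BART:", summarize_with_bart(sample_article))
print("T5:", summarize_with_t5(sample_article))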

Task 2: Question Answering

# Task 2: Question Answering
def answer_with_distilbert(question, context):
    qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
    return qa_pipeline(question=question, context=context)["answer"]

def answer_with_bert(question, context):
    qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
    return qa_pipeline(question=question, context=context)["answer"]
    
  • In the function answer_with_distilbert, pipeline("question-answering", model="distilbert-base-uncased-distilled-squad") initializes a question-answering pipeline using the DistilBERT model; the pipeline("question-answering") helper simplifies the process of asking questions about a given context. qa_pipeline(question=question, context=context)["answer"] processes the question and context to find the answer within the context; ["answer"] extracts the answer text from the pipeline's output, a dictionary that also contains the confidence score and span positions.
  • In the function answer_with_bert, pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad") initializes a question-answering pipeline using a large BERT model fine-tuned to answer questions from context. ‘uncased’ means the model lowercases all input text, and ‘whole-word-masking’ means entire words, rather than subword pieces, were masked during pretraining. qa_pipeline(question=question, context=context)["answer"] passes the question and context to the pipeline, which processes the text and returns an answer; as with the DistilBERT version, it extracts the answer from the output. A quick check on one SQuAD sample follows below.
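As with summarization, a single-sample check like the sketch below, using the first row of the SQuAD split loaded earlier, confirms both functions before the timed comparison:

# Quick sanity check on one question-context pair
sample = qa_dataset[0]
print("Question:", sample["question"])
print("DistilBERT:", answer_with_distilbert(sample["question"], sample["context"]))
print("BERT:", answer_with_bert(sample["question"], sample["context"]))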

Summarization Performance Analysis

Let us now write the code to compare the performance of summarization models:

# Function to compare summarization performance
def analyze_summarization_performance(models, dataset, num_samples=5, max_length=1024):
    results = {}
    for model_name, model_func in models.items():
        summaries = []
        times = []
        for i, sample in enumerate(dataset):
            if i >= num_samples:
                break
            # Truncate the input (slices characters, a rough proxy for the model's token limit)
            text = sample["article"][:max_length]
            start_time = time.time()
            summary = model_func(text)
            times.append(time.time() - start_time)
            summaries.append(summary)
        results[model_name] = {
            "summaries": summaries,
            "average_time": sum(times) / len(times)
        }
    return results
  • Since we are comparing model performance, the simplest approach is to write analysis functions for both summarization and question answering that take the models and the respective dataset as input parameters.
  • models is a dictionary whose keys are model names (like "bart" and "t5") and whose values are the corresponding summarization functions; dataset contains the articles to summarize. num_samples=5 sets the number of samples (articles) to summarize. max_length=1024 caps the length of the input text passed to each model; note that sample["article"][:max_length] slices characters, a rough proxy for staying under the model's token limit.
  • In the for loop, models.items() yields each model name and its associated summarization function; summaries stores the summaries generated by the current model, and times stores the time taken for each sample. The returned structure is sketched below.
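For reference, the returned results object is a nested dictionary shaped like the sketch below (the values shown are illustrative):

results = {
    "bart": {"summaries": ["...", "..."], "average_time": 19.74},
    "t5": {"summaries": ["...", "..."], "average_time": 4.02},
}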

Question Answering Performance Analysis

Below is the code to compare the performance of question-answering models:

# Function to compare question-answering performance
def analyze_qa_performance(models, dataset, num_samples=5):
    results = {}
    for model_name, model_func in models.items():
        answers = []
        times = []
        for i, sample in enumerate(dataset):
            if i >= num_samples:
                break
            start_time = time.time()
            answer = model_func(sample["question"], sample["context"])
            times.append(time.time() - start_time)
            answers.append(answer)
        results[model_name] = {
            "answers": answers,
            "average_time": sum(times) / len(times)
        }
    return results
  • models is a dictionary whose keys are model names (like "distilbert" and "bert") and whose values are the question-answering functions; dataset contains questions and their corresponding contexts. num_samples=5 sets the number of samples (questions) to process.
  • The for loop iterates over the dataset, with sample holding each question and its context. Once the number of processed samples reaches num_samples, it stops.
  • start_time = time.time() captures the current time to measure how long the model takes to generate an answer. answer = model_func(sample["question"], sample["context"]) calls the model's question-answering function with the current sample's question and context. times.append(time.time() - start_time) records the elapsed time, and answers.append(answer) appends the generated answer to the answers list.
  • After processing all samples for a given model, the answers list and the average_time (the sum of the recorded times divided by the number of samples) are stored in the results dictionary under the model's name.
# Define tasks to analyze
tasks = {
    "Summarization": {
        "bart": summarize_with_bart,
        "t5": summarize_with_t5
    },
    "Question Answering": {
        "distilbert": answer_with_distilbert,
        "bert": answer_with_bert
    }
}
  • For Summarization, the dictionary has two models: bart (using the summarize_with_bart function) and t5 (using the summarize_with_t5 function).
  • For Question Answering, the dictionary lists two models: distilbert (using the answer_with_distilbert function) and bert (using the answer_with_bert function).

Run Summarization Analysis

# Analyze summarization performance
print("Summarization Task Results:")
summarization_results = analyze_summarization_performance(tasks["Summarization"], summarization_dataset)
for model, result in summarization_results.items():
    print(f"\nModel: {model}")
    for i, summary in enumerate(result["summaries"], start=1):
        print(f"Sample {i} Summary: {summary}")
    print(f"Average Time Taken: {result['average_time']} seconds")

Run Question Answering Analysis

# Analyze question-answering performance
print("\nQuestion Answering Task Results:")
qa_results = analyze_qa_performance(tasks["Question Answering"], qa_dataset)
for model, result in qa_results.items():
    print(f"\nModel: {model}")
    for i, answer in enumerate(result["answers"], start=1):
        print(f"Sample {i} Answer: {answer}")
    print(f"Average Time Taken: {result['average_time']} seconds")

Output Interpretation 

Below we interpret the model outputs in detail:

Summarization Task

Results on the first five articles:

BART (average time taken: 19.74 seconds)

  • Sample 1: Harry Potter star Daniel Radcliffe turns 18 on Monday, gaining access to a £20 million fortune. He says he has no plans to waste his money on fast cars or drink.
  • Sample 2: Miami-Dade pretrial detention facility houses mentally ill inmates, often facing charges like drug offenses or assaulting an officer. Judge: Arrests stem from confrontations with police.
  • Sample 3: Survivor Gary Babineau describes falling 30-35 feet after the Mississippi bridge collapsed. “Cars were in the water,” he recalls.
  • Sample 4: Doctors removed five small polyps from President Bush’s colon. All were under one centimeter. Bush reclaimed presidential power after the procedure.
  • Sample 5: Atlanta Falcons quarterback Michael Vick was suspended after admitting to participating in a dogfighting ring.

T5 (average time taken: 4.02 seconds)

  • Sample 1: The young actor plans not to waste his wealth on fast cars or drink. He will be able to gamble in a casino and watch the horror film “Hostel: Part”.
  • Sample 2: Inmates with severe mental illnesses are detained until ready to appear in court. They typically face drug or assault charges. Mentally ill individuals become more paranoid.
  • Sample 3: Survivor recalls a 30-35 foot fall when the Mississippi bridge collapsed. He suffered back injuries but could still move. Several people were injured.
  • Sample 4: Polyps removed from Bush were sent for testing. Vice President Cheney assumed presidential power at 9:21 a.m.
  • Sample 5: The NFL suspended Michael Vick for admitting to involvement in a dogfighting ring, making a strong statement against such conduct.

Question Answering Task

Results on the first five questions:

DistilBERT (average time taken: 0.8554 seconds)

  • Sample 1: Denver Broncos
  • Sample 2: Carolina Panthers
  • Sample 3: Levi’s Stadium
  • Sample 4: Denver Broncos
  • Sample 5: gold

BERT (average time taken: 2.8684 seconds)

  • Sample 1: Denver Broncos
  • Sample 2: Carolina Panthers
  • Sample 3: Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California
  • Sample 4: Denver Broncos
  • Sample 5: gold

Key Insights

We will now explore key insights below:

  • Summarization Task: 
    • BART took a significantly longer time on average (19.74 seconds) compared to T5 (4.02 seconds).
    • BART generally provides more detailed summaries, while T5 tends to summarize in a more concise manner.
  • Question Answering Task:
    • Both DistilBERT and BERT models provided correct answers, but DistilBERT was significantly faster (0.86 seconds vs. 2.87 seconds).

The answers were quite similar across both models, with BERT providing a slightly more detailed answer (e.g., “Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California”).

Both tasks show that DistilBERT and T5 offer faster responses, while BART and BERT provide more thorough and detailed outputs at the cost of additional time.

Conclusion 

T5, or the Text-to-Text Transfer Transformer, represents a groundbreaking shift in natural language processing, simplifying diverse tasks into a unified text-to-text framework. By leveraging transfer learning and pretraining on a massive corpus, T5 showcases unparalleled versatility, from translation and summarization to sentiment analysis and beyond. Its innovative approach not only enhances model performance but also streamlines the development of NLP applications, making it a pivotal tool for researchers and developers. As advancements in language models continue, T5 stands as a testament to the potential of unifying diverse linguistic tasks into a single, cohesive architecture.

Key Takeaways

  • Lighter models like DistilBERT and T5 are faster and more efficient, providing quicker responses compared to larger models like BERT and BART.
  • While faster models provide reasonably good summaries, more complex models like BART and BERT offer higher-quality and more detailed outputs.
  • For applications requiring speed over detail, smaller models (DistilBERT, T5) are ideal, whereas tasks needing more nuanced responses can benefit from the more computationally expensive BERT and BART models.

Frequently Asked Questions

Q1. What is the difference between BERT and DistilBERT?

A. DistilBERT is a smaller, faster, and more efficient version of BERT. It retains 97% of BERT’s language understanding capabilities while being 60% smaller and 60% faster, making it ideal for real-time applications with limited computational resources.

Q2. Which model is best for summarization tasks?

A. For summarization tasks, BART generally performs better in terms of summary quality, producing more coherent and contextually rich summaries. However, T5 is also a strong contender, offering good quality summaries with faster processing times.

Q3. Why is BERT slower than DistilBERT?

A. BERT is a large, complex model with more parameters, which requires more computational resources and time to process input. DistilBERT is a distilled version of BERT, meaning it has fewer parameters and is optimized for speed, making it faster while maintaining much of BERT’s performance.
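A quick way to see this size gap yourself (a sketch, assuming the transformers library and the two checkpoints used in this article) is to count parameters directly; BERT-large has roughly five times as many parameters as DistilBERT-base:

from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased-distilled-squad")

# BERT-large is roughly 5x the size of DistilBERT-base (~336M vs ~66M parameters)
print(f"BERT-large: {bert.num_parameters():,} parameters")
print(f"DistilBERT: {distilbert.num_parameters():,} parameters")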

Q4. How do I choose the right model for my task?

A. For tasks requiring detailed understanding or context, BERT and BART are preferable due to their high accuracy. If speed is crucial, such as in real-time systems, smaller models like DistilBERT and T5 are better suited, balancing performance and efficiency.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Aadya Singh is a passionate and enthusiastic individual excited about sharing her knowledge and growing alongside the vibrant Analytics Vidhya Community. Armed with a Bachelor's degree in Bio-technology from MS Ramaiah Institute of Technology in Bangalore, India, she embarked on a journey that would lead her into the intriguing realms of Machine Learning (ML) and Natural Language Processing (NLP).

Aadya's fascination with technology and its potential began with a profound curiosity about how computers can replicate human intelligence. This curiosity served as the catalyst for her exploration of the dynamic fields of ML and NLP, where she has since been captivated by the immense possibilities for creating intelligent systems.

With her academic background in bio-technology, Aadya brings a unique perspective to the world of data science and artificial intelligence. Her interdisciplinary approach allows her to blend her scientific knowledge with the intricacies of ML and NLP, creating innovative and impactful solutions.

