Large language models like BERT, T5, BART, and DistilBERT are powerful tools in natural language processing, each designed with unique strengths for specific tasks such as summarization and question answering. These models vary in architecture, performance, and efficiency. In our code we will compare them across two tasks: BART and T5 for text summarization, and DistilBERT and BERT for question answering. By comparing their performance on real-world datasets, we aim to determine which model excels at each task, helping optimize both results and resources for practical applications.
Summarization is the process of reducing the length of a passage of text while keeping its meaning intact. The models we will use for comparison are:
BART is a combination of two model types. It first processes text bidirectionally to understand the context of words, then generates a summary left to right, combining the bidirectional nature of BERT with the autoregressive text generation approach seen in GPT. BART uses an encoder-decoder structure like T5 but is specifically designed for text generation tasks. For summarization, BART's encoder first reads the entire passage and captures the relationships between words in a bidirectional manner. This deep contextual understanding allows it to focus on the key parts of the input text.
The decoder then generates an abstractive summary from this input, producing new, shortened phrases rather than merely extracting sentences.
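To make this encode-then-decode flow concrete, here is a minimal sketch of roughly what the summarization pipeline does under the hood (the input string is an illustrative snippet, not a full article):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday..."

# The encoder reads the whole (truncated) passage bidirectionally...
inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")

# ...and generate() decodes the summary left to right, one token at a time.
summary_ids = model.generate(inputs["input_ids"], max_length=50, min_length=25, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))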
T5 is based on the Transformer architecture. It generates summaries that are abstractive rather than extractive. Instead of copying phrases directly from the text, it often rephrases content to create a concise version.
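A short sketch of T5's text-to-text interface (the inputs here are illustrative): the same checkpoint handles different tasks depending on a plain-text prefix.

from transformers import pipeline

# T5 casts every task as text-to-text; the prefix selects the task.
t5 = pipeline("text2text-generation", model="t5-small")

article = "Harry Potter star Daniel Radcliffe gains access to a reported £20 million fortune as he turns 18 on Monday."

# "summarize:" asks for an abstractive summary of whatever follows.
print(t5("summarize: " + article, max_length=50)[0]["generated_text"])

# A different prefix switches the same model to a different task, e.g. translation.
print(t5("translate English to German: The house is wonderful.")[0]["generated_text"])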
Verdict: T5 tends to be faster and more computationally efficient than BART, while BART may produce more fluent natural language in some cases.
Question answering is when we ask a model a question, and it finds the answer in a given context or passage of text. Here’s how the two models for question answering work and how they compare:
BERT is a large, powerful model that looks at words in both directions to understand their meaning in context. When you provide BERT with a question and a passage of text, it looks for the part of the passage most relevant to answering the question. BERT is one of the most accurate models for question answering, performing well thanks to its ability to understand the relationships between words in a passage and their context.
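Under the hood, extractive question answering predicts an answer span. Here is a minimal sketch of what the question-answering pipeline wraps (the question and context are illustrative):

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "Which NFL team won Super Bowl 50?"
context = "Super Bowl 50 was won by the Denver Broncos, who defeated the Carolina Panthers 24-10."

# Question and passage are encoded together as one sequence.
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model scores every token as a candidate start and end of the answer span;
# the highest-scoring span is the extracted answer.
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))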
DistilBERT is a smaller, lighter version of BERT. Like BERT, it was trained to understand language in both directions (left and right), which makes it effective for tasks like question answering. DistilBERT does the same thing with fewer parameters, which makes it faster at the cost of slightly lower accuracy compared to BERT. It can answer questions based on a given passage of text, and it is particularly useful when computational power is limited or a quick response time is needed.
Verdict: BERT is more accurate and can handle more complex questions and texts, but it requires more computational power and takes longer to give results. DistilBERT, being a smaller model, is quicker but might not always perform as well on more complicated texts.
Below, we walk through the code implementation along with the dataset overview and setup.
Data fields, shown with a sample record from the CNN/Daily Mail dataset:
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'}

And a sample record from the SQuAD question-answering dataset:
{
"answers": {
"answer_start": [1],
"text": ["This is a test text"]
},
"context": "This is a test context.",
"id": "1",
"question": "Is this a test?",
"title": "train test"
}
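One detail worth noting about the SQuAD schema: answer_start is a character offset into the context string, not a token index. A quick sketch to verify this on a real record:

from datasets import load_dataset

# Slicing the context at answer_start recovers the answer text exactly.
sample = load_dataset("squad", split="validation[:1]")[0]
start = sample["answers"]["answer_start"][0]
answer = sample["answers"]["text"][0]
assert sample["context"][start : start + len(answer)] == answer

With both datasets in view, we can now load them and define the model helpers: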
from transformers import pipeline
from datasets import load_dataset
import time
# Load our datasets
# CNN/Daily Mail for summarization
summarization_dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]") # Use 1% of the training data
# SQuAD for question answering
qa_dataset = load_dataset("squad", split="validation[:1%]") # Use 1% of the validation data
# Task 1: Text Summarization
def summarize_with_bart(text):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    return summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"]

def summarize_with_t5(text):
    summarizer = pipeline("summarization", model="t5-small")
    return summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"]

# Task 2: Question Answering
def answer_with_distilbert(question, context):
    qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
    return qa_pipeline(question=question, context=context)["answer"]

def answer_with_bert(question, context):
    qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")
    return qa_pipeline(question=question, context=context)["answer"]
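One caveat about these helpers: each call rebuilds its pipeline from scratch, so the per-sample timings reported later include model loading, not just inference. If you only care about inference speed, a sketch of a cached variant (the _cached name is ours, for illustration):

# Build the pipeline once; repeated calls then pay only for inference.
bart_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_with_bart_cached(text):
    return bart_summarizer(text, max_length=50, min_length=25, do_sample=False)[0]["summary_text"]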
Let us now write the code to compare the performance of summarization models:
# Function to compare summarization performance
def analyze_summarization_performance(models, dataset, num_samples=5, max_length=1024):
    results = {}
    for model_name, model_func in models.items():
        summaries = []
        times = []
        for i, sample in enumerate(dataset):
            if i >= num_samples:
                break
            # Truncate to the first max_length characters (a rough proxy for
            # the model's token limit) to keep inputs manageable
            text = sample["article"][:max_length]
            start_time = time.time()
            summary = model_func(text)
            times.append(time.time() - start_time)
            summaries.append(summary)
        results[model_name] = {
            "summaries": summaries,
            "average_time": sum(times) / len(times)
        }
    return results
Below is the code to compare the performance of question-answering models:
# Function to compare question-answering performance
def analyze_qa_performance(models, dataset, num_samples=5):
    results = {}
    for model_name, model_func in models.items():
        answers = []
        times = []
        for i, sample in enumerate(dataset):
            if i >= num_samples:
                break
            start_time = time.time()
            answer = model_func(sample["question"], sample["context"])
            times.append(time.time() - start_time)
            answers.append(answer)
        results[model_name] = {
            "answers": answers,
            "average_time": sum(times) / len(times)
        }
    return results
# Define tasks to analyze
tasks = {
    "Summarization": {
        "bart": summarize_with_bart,
        "t5": summarize_with_t5
    },
    "Question Answering": {
        "distilbert": answer_with_distilbert,
        "bert": answer_with_bert
    }
}
# Analyze summarization performance
print("Summarization Task Results:")
summarization_results = analyze_summarization_performance(tasks["Summarization"], summarization_dataset)
for model, result in summarization_results.items():
    print(f"\nModel: {model}")
    for i, summary in enumerate(result["summaries"], start=1):
        print(f"Sample {i} Summary: {summary}")
    print(f"Average Time Taken: {result['average_time']} seconds")
# Analyze question-answering performance
print("\nQuestion Answering Task Results:")
qa_results = analyze_qa_performance(tasks["Question Answering"], qa_dataset)
for model, result in qa_results.items():
    print(f"\nModel: {model}")
    for i, answer in enumerate(result["answers"], start=1):
        print(f"Sample {i} Answer: {answer}")
    print(f"Average Time Taken: {result['average_time']} seconds")
Let us now interpret the output in detail. First, the summarization results:
| Model | Sample 1 Summary | Sample 2 Summary | Sample 3 Summary | Sample 4 Summary | Sample 5 Summary | Average Time Taken (seconds) |
|---|---|---|---|---|---|---|
| BART | Harry Potter star Daniel Radcliffe turns 18 on Monday, gaining access to a £20 million fortune. He says he has no plans to waste his money on fast cars or drink. | Miami-Dade pretrial detention facility houses mentally ill inmates, often facing charges like drug offenses or assaulting an officer. Judge: Arrests stem from confrontations with police. | Survivor Gary Babineau describes falling 30-35 feet after the Mississippi bridge collapsed. “Cars were in the water,” he recalls. | Doctors removed five small polyps from President Bush’s colon. All were under one centimeter. Bush reclaimed presidential power after the procedure. | Atlanta Falcons quarterback Michael Vick was suspended after admitting to participating in a dogfighting ring. | 19.74 |
| T5 | The young actor plans not to waste his wealth on fast cars or drink. He will be able to gamble in a casino and watch the horror film “Hostel: Part”. | Inmates with severe mental illnesses are detained until ready to appear in court. They typically face drug or assault charges. Mentally ill individuals become more paranoid. | Survivor recalls a 30-35 foot fall when the Mississippi bridge collapsed. He suffered back injuries but could still move. Several people were injured. | Polyps removed from Bush were sent for testing. Vice President Cheney assumed presidential power at 9:21 a.m. | The NFL suspended Michael Vick for admitting to involvement in a dogfighting ring, making a strong statement against such conduct. | 4.0 |
Next, the question-answering results:

| Model | Sample 1 Answer | Sample 2 Answer | Sample 3 Answer | Sample 4 Answer | Sample 5 Answer | Average Time Taken (seconds) |
|---|---|---|---|---|---|---|
| DistilBERT | Denver Broncos | Carolina Panthers | Levi’s Stadium | Denver Broncos | gold | 0.8554 |
| BERT | Denver Broncos | Carolina Panthers | Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California | Denver Broncos | gold | 2.8684 |
Key insights from these results:
The answers were quite similar across both models, with BERT providing a slightly more detailed answer (e.g., “Levi’s Stadium in the San Francisco Bay Area at Santa Clara, California”).
Both tasks show that DistilBERT and T5 offer faster responses, while BART and BERT provide more thorough and detailed outputs at the cost of additional time.
T5, the Text-to-Text Transfer Transformer, represents a significant shift in natural language processing by simplifying diverse tasks into a unified text-to-text framework. By leveraging transfer learning and pretraining on a massive corpus, T5 proves remarkably versatile, handling translation, summarization, sentiment analysis, and more. This unified approach not only yields strong model performance but also streamlines the development of NLP applications, making it a valuable tool for researchers and developers. As language models continue to advance, T5 stands as a testament to the potential of folding diverse linguistic tasks into a single, cohesive architecture.
Q. What is DistilBERT, and how is it different from BERT?
A. DistilBERT is a smaller, faster, and more efficient version of BERT. It retains about 97% of BERT's language understanding capability while being roughly 40% smaller and 60% faster, making it well suited to real-time applications with limited computational resources.
Q. Which model is better for summarization: BART or T5?
A. For summarization tasks, BART generally produces higher-quality summaries that are more coherent and contextually rich. However, T5 is also a strong contender, offering good-quality summaries with faster processing times.
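To quantify summary quality rather than compare outputs by eye, ROUGE is the standard metric. A minimal sketch using the Hugging Face evaluate library (assumes the evaluate and rouge_score packages are installed; the prediction and reference strings are illustrative):

import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Daniel Radcliffe turns 18 and gains access to a £20 million fortune."],
    references=["Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday."],
)
print(scores)  # rouge1 / rouge2 / rougeL scores; higher is better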
Q. Why is DistilBERT faster than BERT?
A. BERT is a large, complex model with more parameters, so it requires more computational resources and time to process input. DistilBERT is a distilled version of BERT: it has fewer parameters and is optimized for speed, making it faster while retaining much of BERT's performance.
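You can check the size difference directly by comparing parameter counts of the base variants (DistilBERT was distilled from BERT-base); approximate counts are noted in the comment:

from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")
distil = AutoModel.from_pretrained("distilbert-base-uncased")

# DistilBERT keeps roughly 60% of BERT-base's parameters (~66M vs ~110M).
print(f"BERT-base:  {bert.num_parameters():,} parameters")
print(f"DistilBERT: {distil.num_parameters():,} parameters")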
Q. When should I prefer the larger models over the smaller ones?
A. For tasks requiring detailed understanding or rich context, BERT and BART are preferable due to their higher accuracy. If speed is crucial, as in real-time systems, the smaller DistilBERT and T5 are better suited, balancing performance and efficiency.