Guide to BART (Bidirectional & Autoregressive Transformer)

Jay Singh Last Updated : 18 Nov, 2024
14 min read

BART is truly one of a kind in the ever-changing realm of NLP: a strong model that has reshaped the way we think about text generation and understanding. BART, short for Bidirectional and Autoregressive Transformer, brings the best aspects of both families of transformer architectures together in a single model. In this article, we will dig deep into BART’s architecture, functionality, and practical implementation, in a way that is accessible to data science enthusiasts of all skill levels.


What is BART?

Understanding BART requires some context about its development. Facebook AI introduced BART in 2019 as a language model designed to combine flexibility with power. Its developers drew heavily on earlier transformer successes such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). BERT performed exceptionally at contextual understanding through bidirectional text analysis, while GPT did well at generating coherent text. BART integrates both approaches, yielding a model that performs strongly on contextual understanding and coherent text generation alike.

BART Architecture

Figure: The transformer architecture

At its core, BART is a sequence-to-sequence model that follows the encoder-decoder framework. This architectural configuration enables BART to take a sequence of text as input and produce a corresponding output sequence. What makes BART unique is the pairing of the encoder’s bidirectional properties with the autoregressive properties of the decoder. The encoder used in BART is quite similar to BERT: it examines the input sequence in both directions, capturing contextual information from both the left and the right of each word. This bidirectional approach ensures a thorough interpretation of the input text.

The Encoder 

The encoder of BART is responsible for understanding the input text. Like BERT, BART uses bidirectional encoding: it reads the entire input sequence at once and incorporates context from both the left and the right of every word. This allows BART to capture relationships between words in a sequence even when they are far apart.

From a mathematical point of view, the encoder used in BART is a stack of layers, each consisting of multi-head self-attention followed by a feed-forward neural network. The self-attention mechanism allows every word in the input sequence to attend to all other words, producing attention weights that reflect the interrelations between tokens. These attention outputs are combined with the input embeddings to form a new representation of the input sequence, and the process is repeated across multiple layers to build a progressively richer representation of the input text. Additionally, the encoder is designed to work with corrupted input during the pre-training phase. Denoising is very relevant here, as it requires the encoder to represent the original input even when some text segments are missing or shuffled. It is the multi-head attention that makes it possible for the model to capture different dimensions of relationships between words, from syntactic dependencies to semantic similarities.
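For a quick, concrete look at these dimensions, you can inspect the configuration of a pre-trained checkpoint directly. The sketch below assumes the facebook/bart-large checkpoint from the Hugging Face hub and only reads configuration fields, so the printed values depend on the checkpoint you choose.

from transformers import BartConfig

# Inspect architectural hyperparameters of a pre-trained BART checkpoint
# (facebook/bart-large is an illustrative choice).
config = BartConfig.from_pretrained('facebook/bart-large')

print("Encoder layers:", config.encoder_layers)
print("Attention heads per encoder layer:", config.encoder_attention_heads)
print("Hidden size (d_model):", config.d_model)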

The Decoder 

The decoder of BART is an autoregressive model, like GPT. In other words, it generates text one token at a time. During generation, the decoder uses the previously generated tokens as context to predict the next token, and the procedure continues until the whole output sequence has been produced. Mathematically, the autoregressive decoder chooses each next token y_t by maximizing its conditional likelihood given the tokens generated so far and the encoded input x:

P(y_t | y_1, …, y_{t-1}, x)

so that the probability of the complete output sequence is the product of these conditional probabilities over all positions t.

The decoder within BART incorporates a cross-attention mechanism that lets it focus on the encoder’s output. This ensures that the generated text stays aligned with the input sequence, making it more relevant and coherent. The combination of self-attention, which captures internal dependencies within the generated sequence, and cross-attention, which ties generation back to the input sequence, gives BART an edge in tasks like machine translation and text generation.
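To make this concrete, here is a minimal sketch of a single, manual decoding step using the Hugging Face implementation: the decoder receives only its start token and predicts the next token while cross-attending to the encoder output. The checkpoint and input sentence are illustrative; in practice you would simply call model.generate().

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

inputs = tokenizer(["BART pairs a bidirectional encoder with an autoregressive decoder."],
                   return_tensors='pt')

# Start the decoder from its start token; it attends to the encoder output.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])

with torch.no_grad():
    outputs = model(input_ids=inputs['input_ids'],
                    decoder_input_ids=decoder_input_ids)

# The logits at the last decoder position give the distribution over the next token.
next_token_id = outputs.logits[:, -1, :].argmax(dim=-1).item()
print(tokenizer.decode([next_token_id]))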

Figure: BART architecture

Bidirectional Encoder & Autoregressive Decoder

Figure: BART architecture (bidirectional encoder and autoregressive decoder)
  1. Bidirectional Encoder (Left):
    • This part of the architecture processes the input sequence in a bidirectional manner, meaning it considers context from both the left and right of each token.
    • The example input sequence is A _ B _ E, where some tokens have been masked or corrupted. The encoder takes this corrupted input and creates contextualized representations for each token.
  2. Autoregressive Decoder (Right):
    • The decoder is tasked with generating the original, uncorrupted sequence autoregressively (left-to-right).
    • In the example, the decoder starts with a special start token <s> and uses the context to predict the next tokens (A, B, C, D, E) step by step.
    • The decoder conditions on the output generated so far to predict the next token, producing the entire sequence from the hidden states passed from the encoder.

Functionality:

  • Encoding and Decoding: The encoder processes the input to capture bidirectional dependencies, while the decoder generates the sequence in a unidirectional, autoregressive manner, ensuring fluent and coherent output reconstruction.
  • Purpose: This combination allows BART to be flexible and effective for both text comprehension (due to the bidirectional encoder) and generation tasks (due to the autoregressive decoder).

This setup is powerful for tasks like text generation, text summarization, and other natural language processing tasks where understanding context and generating coherent sequences are essential. Also, BART is designed to combine the strengths of BERT’s bidirectional encoding and GPT’s autoregressive generation, enabling it to perform both comprehension and generation tasks.

The Pre-training Process 

One of the key innovations in BART is its pre-training process. Unlike models that use masked language modelling (like BERT) or autoregressive language modelling (like GPT) alone, BART introduces a more flexible approach called “text infilling.” In text infilling, the model is given a text where some spans (continuous sequences of words) are masked out. The task for BART is to reconstruct the original text. This process can involve: 

  1. Predicting missing tokens 
  2. Filling in longer masked spans 
  3. Correcting shuffled sentences 

This diverse set of tasks during pre-training allows BART to learn a wide range of language understanding and generation skills. It becomes adept at tasks like summarization, translation, and question-answering, as these all involve some form of text transformation.
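A quick way to see text infilling in action is to hand a pre-trained BART checkpoint a sentence containing its <mask> token and let it reconstruct the span. The sketch below assumes the generic facebook/bart-large checkpoint (not the summarization-tuned one used later) and an invented example sentence.

from transformers import BartForConditionalGeneration, BartTokenizer

# forced_bos_token_id=0 makes generation start from the BOS token,
# as in the Hugging Face mask-filling example for BART.
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large', forced_bos_token_id=0)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

text = "BART is pre-trained by reconstructing <mask> spans of text."
inputs = tokenizer([text], return_tensors='pt')

output_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))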

Fine-tuning for Specific Tasks 

After pre-training, BART can be fine-tuned for specific NLP tasks. This involves training on task-specific datasets so that its general language understanding is adapted to particular applications. Some of the more common tasks for which BART is fine-tuned include:

  1. Text Summarization: BART is capable of producing succinct summaries of extended texts, effectively encapsulating the critical information. 
  2. Machine Translation: BART can learn translation between languages by fine-tuning on parallel corpora.
  3. Question Answering: BART can be fine-tuned to understand questions and extract or generate relevant answers from a given context.
  4. Text Generation: From creative writing to speech generation, BART produces coherent and contextually appropriate text.
  5. Sentiment Analysis: BART can be further fine-tuned to understand and classify the sentiment of text passages.

How to Use BART with the Hugging Face Library?

To understand how BART works in practice, let’s take a simple example of using BART for text summarization. We’ll use the Hugging Face Transformers library, which provides a simplified interface for working with BART and other transformer models.

First, let’s set up the environment, import the required libraries, and load a pre-trained model and tokenizer:

from transformers import BartForConditionalGeneration, BartTokenizer

# Load the summarization-tuned BART checkpoint and its tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

input_text = """
The field of artificial intelligence has seen remarkable progress in recent years.
From self-driving cars to voice assistants, AI is becoming an integral part of our daily lives.
Machine learning algorithms are now capable of recognizing patterns and making decisions with
unprecedented accuracy."""

# Tokenize the input and generate a summary with beam search
inputs = tokenizer([input_text], max_length=1024, return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=100, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:", summary)

The current example uses a fine-tuned variant of BART specifically tailored for summarization. The model takes a passage about artificial intelligence as input and generates a summary as output. The `generate` function uses beam search (set via `num_beams=4`) to explore several candidate summaries and pick the most suitable one. With just a few lines of code, we can leverage a powerful language model to perform a complex task like summarization.
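If you want to see how the decoding strategy affects the output, a small experiment like the sketch below (reusing the model, tokenizer, and inputs from the example above) compares different beam widths; the particular values are illustrative.

# Reusing `model`, `tokenizer`, and `inputs` from the example above.
for beams in (1, 4, 8):  # num_beams=1 is greedy decoding
    ids = model.generate(inputs['input_ids'], num_beams=beams, max_length=100)
    print(f"num_beams={beams}:", tokenizer.decode(ids[0], skip_special_tokens=True))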

Understanding BART’s Internals

The BART (Bidirectional and Auto-Regressive Transformers) model is a sequence-to-sequence framework developed to handle a broad spectrum of natural language processing (NLP) tasks, including generation, translation, and comprehension. The key aspect of BART is its ability to serve as a denoising autoencoder, meaning it can reconstruct original sequences from corrupted inputs. This approach generalizes and extends concepts from other models like BERT and GPT by integrating both bidirectional and autoregressive components, thus enhancing its versatility for various tasks. 

Core Architecture 

BART employs a standard Transformer architecture consisting of a bidirectional encoder and an autoregressive decoder. This design allows it to function effectively as a denoising autoencoder by processing corrupted inputs and generating coherent outputs. The encoder reads corrupted text bi-directionally, whereas the decoder autoregressively generates text. This setup enables BART to manage diverse forms of corruption in the input, whether they involve missing or scrambled text, thus generalizing earlier models such as BERT, which uses only masked language modeling, or GPT, which handles left-to-right text generation. 

Pre-Training Mechanism 

The pre-training phase of BART is crucial to its functionality. During this phase, the model learns to predict and reconstruct original text from corrupted versions. Text corruption is achieved through several noising strategies, such as token masking (where random tokens are replaced with a [MASK] token), token deletion, text infilling (where a mask replaces continuous spans of text), sentence permutation (shuffling sentences within a document), and document rotation (randomly changing the starting point of a document). These strategies force BART to understand a wider context, ensuring it learns both local and global dependencies within the text. 

BART’s flexibility in applying these arbitrary noising functions sets it apart from models that focus on specific pre-training schemes. It allows the model to address different levels of text corruption, encouraging it to develop more robust contextual understanding. For instance, text infilling forces the model to estimate the appropriate length of missing spans, a task that extends beyond simple word prediction. 
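As a rough illustration of what these corruptions look like (a toy sketch, not BART’s actual pre-training code), here are simplified versions of token masking, token deletion, and sentence permutation:

import random

def token_masking(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Replace random tokens with a mask token."""
    return [mask_token if random.random() < mask_prob else t for t in tokens]

def token_deletion(tokens, delete_prob=0.15):
    """Drop random tokens entirely; the model must infer where they were."""
    return [t for t in tokens if random.random() >= delete_prob]

def sentence_permutation(sentences):
    """Shuffle the order of sentences in a document."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    return shuffled

tokens = "BART learns to reconstruct the original text".split()
print(token_masking(tokens))
print(token_deletion(tokens))
print(sentence_permutation(["First sentence.", "Second sentence.", "Third sentence."]))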

Fine-Tuning and Task Adaptability 

BART is particularly effective when fine-tuned for tasks requiring text generation, such as abstractive summarization and machine translation. During fine-tuning, the model adapts to specific datasets by learning to map complete inputs to the desired outputs. It excels in this context due to its pre-training setup, which familiarizes it with reconstructing complete sequences from incomplete or noisy inputs. BART’s autoregressive decoder architecture allows it to be applied directly to generative tasks, something models like BERT cannot do efficiently by design. Additionally, BART matches the performance of RoBERTa on discriminative tasks, highlighting its versatility. It leverages representations from both its encoder and decoder, allowing it to handle classification tasks effectively, where input texts need to be encoded in their entirety.
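To make the discriminative side concrete, the sketch below uses facebook/bart-large-mnli, a publicly available BART checkpoint fine-tuned on the MNLI dataset, through the zero-shot classification pipeline; the example sentence and candidate labels are arbitrary illustrations.

from transformers import pipeline

# facebook/bart-large-mnli is a BART checkpoint fine-tuned on natural language inference.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new phone's battery lasts two full days on a single charge.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])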

Integration and Practical Applications 

Beyond text generation and comprehension, BART can enhance machine translation. The model can effectively translate languages by combining additional encoder layers with a BART-based decoder. Experiments have shown improvements in translation accuracy over strong baselines. Notably, the BART model was trained using 160GB of text data, including news articles, books, and web content, ensuring broad language exposure during its pre-training phase. 

BART has delivered significant improvements across multiple benchmarks, including SQuAD for question answering, CNN/DailyMail for summarization, and WMT for translation tasks. Its success can be attributed to its comprehensive pre-training objectives, which allow it to generalize well across different tasks without compromising performance on any particular application.

BART vs Other Language Models 

Understanding where BART stands in comparison to other popular language models makes its distinctive features even clearer. Let’s dig into these comparisons:

BART vs BERT 

BERT was a landmark model that established the principle of bidirectional contextual comprehension, and BART significantly advances that principle. Unlike its predecessor, which functions purely as an encoder, BART uses an encoder-decoder architecture. This allows BART not only to understand text deeply but also to generate it efficiently. BERT relies on masked language modeling and next-sentence prediction as its primary pre-training methods; in contrast, BART employs a more flexible text infilling objective that exposes it to a wider range of textual transformations. At the fine-tuning stage, BERT can still hold the edge on classification and token-level tasks such as named entity recognition.

This encoder-decoder architecture makes BART much more flexible and well suited for generation. Both models show robust contextual understanding; however, BART’s ability to recover longer masked spans gives it a potential advantage when working with longer contexts. While both architectures have their merits, BART stands out for its wider range of capabilities, which can also increase complexity and compute requirements. In summary, the step from BERT to BART marks another significant leap in natural language processing across several application domains.

BART vs GPT 

The GPT models are well known for their extraordinary text-generation capabilities, but BART offers several unique benefits. One of the main differences is directionality: whereas GPT processes text from left to right, BART’s bidirectional encoder reads the input from both directions at once, giving it a more complete understanding of the input, which proves especially beneficial when a task demands deep comprehension.

Regarding pre-training methodologies, GPT employs an autoregressive language modeling technique, which involves forecasting the subsequent word predicated on the previously given words. Conversely, BART utilizes a more adaptable text infilling strategy, allowing it to assimilate a broader array of textual patterns and configurations. This increased adaptability frequently results in a more resilient comprehension of language and improves BART’s capacity to proficiently manage intricate or partially provided inputs. 

As for task suitability, GPT especially shines in open-ended text generation, which makes it well suited for tasks as diverse as creative writing and dialogue. BART, on the other hand, can also generate text but is at its best where both understanding and generation are needed, as in summarization or translation tasks. Its bidirectional encoding also helps it maintain longer-range context better than GPT.

BART vs T5 

T5 (Text-to-Text Transfer Transformer) is another powerful sequence-to-sequence model that stands out in its own way, but there are key differences when comparing it to BART. One of the main distinctions lies in the pre-training philosophy. T5 treats all NLP tasks as text-to-text problems, using a unified framework to approach everything from translation to question-answering in the same way. BART, though highly versatile, doesn’t explicitly frame every task this way, giving it a slightly different flexibility in certain areas. Both models share an encoder-decoder architecture, but T5 utilizes a modified version of the original transformer design, while BART remains closer to the original blueprint of the transformer architecture, keeping its design more conventional in some respects. 

When it comes to pre-training data, the difference becomes even more pronounced. T5 was trained on a cleaned version of the Common Crawl dataset (C4), a massive collection of web data, giving it broad coverage of general knowledge. In contrast, BART was pre-trained using a curated mixture of books, Wikipedia, and news articles, which might give it a stronger grasp of structured, well-organized information, potentially making it more effective in certain knowledge-driven tasks. This variation in their foundational data can influence how each model approaches different problems and the types of information they excel at processing. 

As for task performance, both models are strong contenders across a wide range of NLP benchmarks. T5 has achieved state-of-the-art results in many benchmarks, largely due to its unified text-to-text approach, which helps it generalize well across various tasks. However, BART often matches or even surpasses T5, especially in areas like summarization and translation, where its bidirectional encoding and decoder structure help it reconstruct meaning and generate more coherent outputs. This makes BART particularly adept at tasks requiring both deep understanding and precise generation, even though T5 is no slouch in those areas either. Both models are incredibly capable, but the slight differences in design and training data lead to nuanced strengths that make them each stand out in their own right. 

BART vs RoBERTa 

RoBERTa (Robustly Optimized BERT Approach) is an enhanced version of BERT designed to improve performance on various natural language understanding tasks, but it differs significantly from BART in several ways. One key distinction lies in their architecture— RoBERTa, like BERT, is an encoder-only model, meaning it specializes in understanding and analyzing text. BART, on the other hand, follows an encoder-decoder framework, making it more versatile, particularly in tasks that require both understanding and generation of text. This gives BART a distinct advantage in generative applications such as summarization and machine translation, where the ability to produce coherent text is critical. 

When it comes to pre-training, RoBERTa uses masked language modeling with dynamic masking, which introduces variations in which parts of the text are masked during training. This approach improves RoBERTa’s ability to generalize across a range of language understanding tasks. BART, however, employs a more flexible pre-training method, incorporating text infilling and denoising objectives. This allows BART to learn from a wider range of text transformations, equipping it with a stronger grasp of both understanding and generating coherent sequences of text, which can be particularly beneficial in more complex scenarios where partial or noisy inputs are present. 

In terms of performance, both models excel in natural language understanding tasks like question answering and sentiment analysis. However, BART’s additional generative capabilities give it a clear edge in tasks requiring text generation, where RoBERTa typically struggles unless task-specific architectures are added. Fine-tuning also highlights their differences: RoBERTa generally requires custom architectures to be built on top of it for generation tasks, while BART can be fine-tuned directly for both understanding and generation, making it more versatile and easier to adapt across different applications. This flexibility gives BART an upper hand in handling a broader variety of tasks without requiring significant modifications. 

In summary, BART combines many of the strengths of these other models. It has the bidirectional understanding of BERT and RoBERTa, the generative capabilities of GPT, and the sequence-to-sequence versatility of T5. This makes BART a highly flexible model suitable for a wide range of NLP tasks. You can check out some of the T5 and RoBERTa models on the SBERT page.

Essential Python Libraries for Working with BART 

When it comes to implementing and working with BART in Python, several libraries stand out as particularly useful. Let’s explore these essential tools: 

Hugging Face Transformers 

The Hugging Face Transformers library is arguably the most important tool for working with BART and other transformer-based models. It provides a high-level API for using pre-trained models, fine-tuning them on custom datasets, and deploying them in production environments. 

Key features: 

  • Easy access to pre-trained BART models 
  • Tools for tokenization and data preprocessing 
  • Functions for model training and evaluation 
  • Pipeline abstractions for common NLP tasks (see the example below)
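As an example of the pipeline abstraction mentioned above, the short sketch below builds a summarization pipeline around the facebook/bart-large-cnn checkpoint; the input text and generation settings are placeholders.

from transformers import pipeline

# The pipeline wraps model loading, tokenization, and generation in one call.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = ("Artificial intelligence systems are increasingly used in healthcare, "
           "finance, and transportation, raising both excitement about new "
           "capabilities and questions about safety and oversight.")
result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]['summary_text'])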

PyTorch 

While not specific to BART, PyTorch is the underlying framework used by many BART implementations, including the Hugging Face version. Understanding PyTorch can be crucial for tasks like: 

  • Customizing model architectures 
  • Implementing custom loss functions 
  • Optimizing model performance 
  • Handling GPU acceleration 

Example of using PyTorch with BART:

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model.to('cuda')  # Move model to GPU

# Tokenize some example input text and move the tensors to the GPU as well
inputs = tokenizer(["Artificial intelligence is transforming many industries."], return_tensors='pt')
inputs = inputs.to('cuda')  # Move inputs to GPU

with torch.no_grad():  # No gradients needed for inference
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Techniques for Fine-tuning BART 

Fine-tuning BART for specific tasks is crucial for leveraging its full potential. Here are some advanced techniques to consider: 

Gradient Accumulation 

When fine-tuning BART on limited GPU memory, gradient accumulation allows you to simulate larger batch sizes: 

from transformers import Trainer, TrainingArguments

# `model` and `train_dataset` are assumed to be defined earlier
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size = 1 x 16
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()

Learning Rate Scheduling 

Implementing a learning rate scheduler can significantly improve the fine-tuning process:

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# `model`, `train_dataloader`, and `num_epochs` are assumed to be defined earlier
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=len(train_dataloader) * num_epochs
)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()  # Update the learning rate after each optimizer step
        optimizer.zero_grad()

Optimizing BART for Production 

When deploying BART models in production environments, several optimization techniques can be crucial:

Model Quantization 

Quantization can significantly reduce model size and inference time:

import torch 

quantized_model = torch.quantization.quantize_dynamic( 
    model, {torch.nn.Linear}, dtype=torch.qint8 
) 
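To check what quantization actually buys you, one rough approach is to serialize both models and compare file sizes, as in the sketch below; the file names are arbitrary and this is only an approximate measure of memory footprint.

import os

# Save both models and compare the serialized sizes (rough, illustrative check)
torch.save(model.state_dict(), "bart_fp32.pt")
torch.save(quantized_model.state_dict(), "bart_int8.pt")
print(os.path.getsize("bart_fp32.pt") / 1e6, "MB vs",
      os.path.getsize("bart_int8.pt") / 1e6, "MB")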

Model Pruning 

Pruning can remove unnecessary weights, reducing model size without significant performance loss:

import torch.nn.utils.prune as prune 

for name, module in model.named_modules(): 
    if isinstance(module, torch.nn.Linear): 
        prune.l1_unstructured(module, name='weight', amount=0.3) 
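Note that l1_unstructured only attaches a pruning mask and a reparametrized weight. If you want the pruning made permanent before saving or exporting the model, PyTorch’s prune.remove can strip the reparametrization, roughly as follows:

# Make the pruning permanent by removing the reparametrization,
# leaving the pruned weights as plain tensors.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, 'weight')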

Conclusion 

BART is an incredibly versatile transformer model that combines the best of both worlds: bidirectional and autoregressive techniques. Because it can both understand and generate text, it has become an important tool in the modern NLP landscape. BART delivers state-of-the-art quality on summarization, translation, and question-answering tasks. With this progress, models like BART are paving the way toward language understanding systems that are more sophisticated and accurate. With its distinctive architecture and expanding range of applications, BART will undoubtedly continue to play a key role in NLP for years to come.


Frequently Asked Questions

Q1. What is BART, and how does it stand out in NLP?

Ans. BART is a unique model in NLP that integrates the strengths of bidirectional encoding (like BERT) and autoregressive decoding (like GPT). This combination enables BART to excel in both understanding and generating coherent text, making it suitable for diverse tasks such as summarization, translation, and question-answering.

Q2. What type of architecture does BART use?

Ans. BART employs a sequence-to-sequence encoder-decoder architecture. The encoder reads the input text bi-directionally to capture full context, while the autoregressive decoder generates the output one token at a time. This structure allows BART to handle complex input-output text transformations effectively.

Q3. How is BART pre-trained?

Ans. BART’s pre-training involves a denoising autoencoder approach called “text infilling,” where spans of text are masked, and the model learns to reconstruct the original sequence. It also uses techniques like token deletion, sentence permutation, and document rotation, which train it to handle noisy or incomplete inputs.

Q4. What tasks can BART be fine-tuned for?

Ans. BART can be fine-tuned for various tasks, including text summarization, machine translation, question answering, text generation, and sentiment analysis. Fine-tuning adapts BART’s general language capabilities to specific task requirements.

Q5. How can BART be implemented using Python?

Ans. BART can be easily used with Python through the Hugging Face Transformers library. By importing BartForConditionalGeneration and BartTokenizer, you can load pre-trained models for tasks like summarization and generate results with a few lines of code.

Hey there, I am a final-year student at IIT Kharagpur. I’m a data enthusiast who has been working in Machine Learning / Data Science for the past 3 years, turning complex problems into actionable solutions using AI/ML.
You can reach me at: [email protected]
Let’s go data!!
