The world of language models has undergone a dramatic evolution in the past few years, particularly with the emergence of Large Language Models (LLMs). These models, equipped with billions of parameters and a profound understanding of natural language, have been instrumental in transforming the field of artificial intelligence. Today, we’ll explore this revolution, emphasizing the transition from closed-source to open-source LLMs, the significance of fine-tuning, and the development of efficient fine-tuning techniques that have recently emerged.
Learning Objectives:

- Understand the trade-offs between closed-source and open-source LLMs.
- Learn how fine-tuning adapts pre-trained models to domain-specific tasks.
- Explore parameter-efficient fine-tuning techniques such as LoRA and QLoRA.
- Fine-tune Llama-2-7b on a single T4 GPU using Ludwig’s declarative configurations.
The landscape of language models has witnessed a dichotomy between closed-source models provided by companies like OpenAI and open-source variants offered by institutions such as Meta, Google, and various research labs. Closed-source LLMs like ChatGPT, GPT-3.5, and GPT-4 present a compelling starting point due to their managed infrastructure and rapid proof-of-concept capabilities. These models are pre-trained on high-quality datasets and require no infrastructure setup, making them an easy entry point for those exploring the capabilities of LLMs.
However, despite their accessibility, closed-source LLMs have fundamental limitations. They offer no model ownership and only minimal customization, making them less suitable for long-term investment, especially in sectors where data privacy and model control are paramount. In contrast, open-source LLMs offer a promising alternative: they enable complete model ownership and customization, and provide immediate access to innovations in the open-source space. The trade-off is the cost and challenge of self-hosting these models.
Fine-tuning emerges as a critical process for maximizing the potential of LLMs, especially when considering domain-specific tasks. Closed-source models often lack the flexibility required for fine-tuning, whereas open-source models offer complete control over this process. Fine-tuning allows the adaptation of pre-trained LLMs to a specific task by updating model weights, leading to performance improvements. It’s a means to personalize these general models for specialized applications, optimizing performance for unique tasks.
The debate between fine-tuning and approaches like Retrieval Augmented Generation (RAG) revolves around whether a task calls for a model tailored to it or for general-purpose intelligence augmented with retrieved context. The open-source nature of LLMs allows the customization and efficient fine-tuning needed to achieve superior task-specific performance.
Traditional fine-tuning involves updating all model parameters, a process that is resource-intensive, time-consuming, and does not always yield optimal task-specific performance. Recent innovations in parameter-efficient fine-tuning offer a breakthrough: by freezing the pre-trained LLM and training only a very small set of task-specific layers—less than 1% of the total model weights—efficient fine-tuning proves to be both resource-friendly and often more effective.
The shift towards parameter-efficient fine-tuning has significantly impacted how LLMs are adapted to specific tasks. By focusing on training only a minimal set of task-specific layers, the process becomes more cost-effective and time-efficient. This innovative approach facilitates optimal task-specific performance, even with smaller datasets, showcasing the potential of open-source LLMs over closed-source models.
Research such as Meta’s LIMA paper supports the idea that fine-tuning on smaller, carefully curated datasets can rival—and in some cases surpass—the performance of closed-source models like GPT-4. This concept of doing more with less data underlines the efficiency and effectiveness of open-source LLMs when fine-tuned appropriately.
In the realm of leveraging pre-trained models for specific tasks, LoRA (Low-Rank Adaptation) and QLoRA (Quantized Low-Rank Adaptation) have emerged as innovative methodologies that effectively fine-tune large language models (LLMs). These methods are instrumental in tailoring pre-trained models for specialized tasks while adding only a minimal number of trainable parameters.
LoRA’s architecture involves a low-rank decomposition, which breaks down the large weight matrices within the transformer architecture into smaller ones. In the context of transformers, LoRA focuses on the query, key, and value linear projections.
Typically, these linear projections have large weight matrices, such as 1024 × 1024, which LoRA decomposes into two much smaller matrices, such as 1024 × 8 and 8 × 1024. Multiplying the two smaller matrices reproduces the original dimensions. This decomposition drastically reduces the number of fine-tunable parameters, to roughly 0.5–1% of the total LLM parameters.
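To make the compression concrete, here is the arithmetic for the 1024 × 1024 example above (a quick sketch; actual savings depend on the chosen rank and on which layers receive adapters):
# Parameter count of a full 1024 x 1024 projection vs. its rank-8 LoRA factors.
d, r = 1024, 8
full_params = d * d          # 1,048,576 weights if fine-tuned directly
lora_params = d * r + r * d  # 16,384 weights across the 1024 x 8 and 8 x 1024 factors
print(f"Full matrix:  {full_params:,}")
print(f"LoRA factors: {lora_params:,} ({lora_params / full_params:.2%} of the original)")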
In the context of the transformer architecture, LoRA inserts adapter modules into the attention projection layers (e.g., query and key). These adapters, constructed via the low-rank decomposition, preserve the original output shapes, allowing insertion into the transformer layers without altering the architecture. The base layers remain frozen; only the adapter weights are trainable, as the sketch below illustrates.
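As a rough illustration of this frozen-base-plus-adapter pattern, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. This is not Ludwig’s internal implementation; the rank, alpha, and zero-initialization of the second factor follow the conventions of the original LoRA paper:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha / r) * B(A(x))."""
    def __init__(self, base_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad_(False)  # the pre-trained weights stay frozen
        self.lora_a = nn.Linear(base_layer.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base_layer.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # the adapter starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))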
QLoRA, an extension of LoRA, represents the frozen base model’s weights in 4-bit precision instead of the usual 16- or 32-bit floating-point representation. Compressing each parameter to 4 bits reduces the model size significantly, allowing fine-tuning of even colossal models on less memory-intensive platforms, such as Colab.
This quantization approach drastically minimizes the memory required for fine-tuning, making it possible to fine-tune large models even with limited computational resources, such as T4 GPUs.
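A back-of-the-envelope calculation shows why this matters on a 16 GB T4 (a rough sketch that ignores activations, optimizer state, and quantization overhead):
# Approximate memory needed just to hold a 7B-parameter model's weights.
params = 7e9
for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("4-bit", 0.5)]:
    print(f"{label:>5}: ~{params * bytes_per_param / 1024**3:.1f} GB")
# fp32: ~26.1 GB, fp16: ~13.0 GB, 4-bit: ~3.3 GB -- only the last fits comfortably on a T4.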
LoRA and QLoRA offer complementary pathways for fine-tuning large language models. LoRA operates via low-rank decomposition, enabling effective modification of pre-trained models with a small number of trainable parameters. QLoRA extends LoRA by quantizing the frozen base weights, significantly reducing the model’s memory footprint. Both are pivotal techniques in Parameter-Efficient Fine-Tuning (PEFT) for LLMs.
In exploring the landscape of open-source LLMs, Ludwig emerges as a prominent player. Ludwig offers a declarative approach to machine learning, providing an accessible interface to control and customize models without extensive coding. Its YAML-based configurations empower users to manage different input features and output tasks efficiently. Ludwig’s multimodal capability enables it to handle diverse data types, making it a versatile and user-friendly tool in the domain of LLMs.
By combining the ease of AutoML with the flexibility of low-level APIs, Ludwig bridges the gap, offering customizable models without the need for extensive coding. Its modular architecture makes deep learning experimentation easier and more accessible, providing users with a convenient platform to explore the potential of LLMs.
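To give a flavor of this declarative style, a Ludwig config can be as short as a model type, a base model, and the input and output features (a toy sketch; the full fine-tuning config used in this article appears further below):
import yaml
from ludwig.api import LudwigModel

# A minimal declarative config: you describe what to model, not how to train it.
minimal_config = yaml.safe_load("""
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
input_features:
  - name: instruction
    type: text
output_features:
  - name: output
    type: text
""")
model = LudwigModel(config=minimal_config)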
Leveraging LoRA involves integrating adapter modules into the transformer layers, allowing for specific fine-tuning while keeping the base layers frozen. LoRA’s low-rank decomposition compresses the fine-tunable parameters to a small fraction of the original LLM size. This method aids in adapting pre-trained models to suit custom tasks without altering the base architecture extensively.
Ludwig introduces an accessible approach for configuring the LoRA-based fine-tuning of language models. By utilizing Ludwig, users can set up the model architecture, define the input and output features, and apply LoRA or QLoRA configurations through YAML-based configurations.
These configurations streamline the process of implementing LoRA-based fine-tuning, such as model type (LLM), base model selection, and specification of input and output features for the intended task.
# Install Ludwig and Ludwig's LLM related dependencies.
!pip uninstall -y tensorflow --quiet
!pip install ludwig --quiet
!pip install ludwig[llm] --quiet
# Enable text wrapping so we don't have to scroll horizontally and create a function to flush CUDA cache.
from IPython.display import HTML, display
def set_css():
    display(HTML('''
    <style>
    pre {
        white-space: pre-wrap;
    }
    </style>
    '''))

get_ipython().events.register('pre_run_cell', set_css)

def clear_cache():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
# Setup Your HuggingFace Token
import getpass
import locale; locale.getpreferredencoding = lambda: "UTF-8"
import logging
import os
import torch
import yaml
from ludwig.api import LudwigModel
os.environ["HUGGING_FACE_HUB_TOKEN"] = getpass.getpass("Token:")
assert os.environ["HUGGING_FACE_HUB_TOKEN"]
# Import The Code Generation Dataset
from google.colab import data_table; data_table.enable_dataframe_formatter()
import numpy as np; np.random.seed(123)
import pandas as pd
df = pd.read_json("https://raw.githubusercontent.com/sahil280114/codealpaca/master/data/code_alpaca_20k.json")
# We're going to create a new column called `split` where:
# 90% will be assigned a value of 0 -> train set
# 5% will be assigned a value of 1 -> validation set
# 5% will be assigned a value of 2 -> test set
# Calculate the number of rows for each split value
total_rows = len(df)
split_0_count = int(total_rows * 0.9)
split_1_count = int(total_rows * 0.05)
split_2_count = total_rows - split_0_count - split_1_count
# Create an array with split values based on the counts
split_values = np.concatenate([
    np.zeros(split_0_count),
    np.ones(split_1_count),
    np.full(split_2_count, 2)
])
# Shuffle the array to ensure randomness
np.random.shuffle(split_values)
# Add the 'split' column to the DataFrame
df['split'] = split_values
df['split'] = df['split'].astype(int)
# For this webinar, we will use just the first 1,000 rows of this dataset.
df = df.head(n=1000)
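A quick sanity check confirms that the split column roughly matches the intended 90/5/5 proportions before training:
# Verify the split proportions on the subsampled DataFrame.
print(df['split'].value_counts(normalize=True).sort_index())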
In our journey exploring Ludwig and its capabilities in natural language processing tasks, we’ve grasped the essence of fine-tuning and the significant impact it has on models. Now, let’s delve deeper into the nitty-gritty of advanced configuration and the fine-tuning parameters that Ludwig offers.
The true power of a model emerges not just from its architecture but from the fine-tuning process that molds it to suit our needs. As mentioned before, the effectiveness of fine-tuning hinges on steering the model in the right direction. One way to achieve this is by providing specific prompts and data wrapped within these prompts.
Imagine a world where we feed the model a prompt, pair it with specific instructions and context, and let the magic happen. The prompt acts as a guide, steering the model’s understanding of the task at hand. And this is where Ludwig’s advanced features come into play.
Code:
qlora_fine_tuning_config = yaml.safe_load(
    """
model_type: llm
base_model: meta-llama/Llama-2-7b-hf

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

prompt:
  template: >-
    Below is an instruction that describes a task, paired with an input
    that may provide further context. Write a response that appropriately
    completes the request.

    ### Instruction: {instruction}

    ### Input: {input}

    ### Response:

generation:
  temperature: 0.1
  max_new_tokens: 512

adapter:
  type: lora

quantization:
  bits: 4

preprocessing:
  global_max_sequence_length: 512
  split:
    type: random
    probabilities:
      - 0.9   # train
      - 0.05  # validation
      - 0.05  # test

trainer:
  type: finetune
  epochs: 1
  batch_size: 1
  eval_batch_size: 2
  gradient_accumulation_steps: 16
  learning_rate: 0.0004
  learning_rate_scheduler:
    warmup_fraction: 0.03
"""
)
model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
results = model.train(dataset=df)
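Once training completes, the fine-tuned artifacts (the LoRA adapter weights plus the configuration) can be saved and reloaded in a later session; the directory name below is arbitrary:
# Persist the fine-tuned model so it can be reloaded without retraining.
model.save("fine_tuned_llama2_codealpaca")
# Later, or in a fresh session:
loaded_model = LudwigModel.load("fine_tuned_llama2_codealpaca")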
Check out the full code here – Ludwig: Fine-Tune Llama-2-7b
After meticulous fine-tuning, it’s time to witness the model in action and observe its inference capabilities. This phase is where the rubber meets the road, where the model churns out outputs based on the training it received.
By setting parameters, using YAML configurations, and defining key aspects such as adapters, quantization, and training-related specifics, Ludwig provides a user-friendly yet robust environment to mold the model to one’s liking.
Moreover, the significance of monitoring the fine-tuning process and understanding the memory implications cannot be overstated. For instance, using the LoRA adapter along with quantization significantly reduces memory usage, making the process more efficient and practical.
Post fine-tuning, inference becomes the focal point. The model, now primed for its assigned task, generates outputs based on the provided prompts. One intricacy is that these models are auto-regressive: they produce one token at a time, and each token requires a full forward pass. Inference is therefore relatively slow, but it provides a glimpse of the model’s capabilities.
The inference outputs may not be perfect, especially if the fine-tuning epochs are limited. However, by tweaking parameters like generation configuration (temperature, maximum new tokens, etc.), the outputs can be altered, thereby refining the model’s responses.
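To see this in practice, Ludwig’s predict API can be pointed at a handful of held-out rows; it returns a DataFrame of generated outputs along with an output directory (a short sketch using the split column created earlier):
# Run inference on a few examples from the test split (split == 2).
test_examples = df[df["split"] == 2].head(5)
predictions, output_directory = model.predict(dataset=test_examples)
print(predictions.head())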
The LLM evolution from closed-source to open-source models underscores the role of fine-tuning—and of tools like Ludwig—in shaping adaptable, efficient language models. Even with limited datasets, the future holds promise for diverse, customized LLMs. As we continue exploring the realm of language models, open-source advancements will not only shape AI’s future but also open groundbreaking opportunities across industries, setting the stage for innovative applications and tailored solutions. Collaboration and open-source contributions will pave the way for a more comprehensive, accessible, and efficient approach to language modeling.
Key Takeaways:
- Fine-tuning adapts pre-trained language models to specific tasks by updating model weights, enhancing performance for specialized applications.
- Ludwig’s YAML configurations, including adapters and training specifics, allow users to mold models to their preferences for optimal performance.
- LoRA’s low-rank decomposition compresses the trainable parameters, and QLoRA adds quantization, reducing memory usage and enhancing fine-tuning efficiency.
Arnav Garg, a Senior Machine Learning Engineer at Predibase, is your guide on this journey. He’s a master of applied machine learning and large-scale training, with a keen focus on fine-tuning optimizations. Arnav’s expertise extends to scaling distributed training and building reliability mechanisms for cost-effective and efficient training.
DataHour Page: https://community.analyticsvidhya.com/c/datahour/efficient-fine-tuning-of-llms-on-single-t4-gpu-using-ludwig
LinkedIn: https://www.linkedin.com/in/arnavgarg/