A few months ago, Meta released its AI model LLaMA 3.1 (405 billion parameters), outperforming OpenAI's models and other leading models on several benchmarks. That upgrade built upon the capabilities of Llama 3, introducing improved reasoning, advanced natural language understanding, increased efficiency, and expanded language support. Now, reaffirming its stance that “we believe openness drives innovation and is good for developers, Meta, and the world,” Meta released Llama 3.2 at Connect 2024. It is a collection of models with vision capabilities and lightweight text-only models that can fit on mobile devices.
If you ask me what’s impressive about this release, Llama 3.2’s 11B and 90B vision models stand out as excellent replacements for other closed models, especially for image understanding tasks. Moreover, the 1B and 3B text-only models are optimized for edge and mobile devices, making them state-of-the-art for tasks like summarization and instruction following. These models also have broad hardware support, are easy to fine-tune, and can be deployed locally, making them highly versatile for both vision and text-based applications.
Since its release, Llama 3.1 has become quite popular and impactful. While incredibly powerful, Llama 3.1 models have historically required substantial computational resources and expertise, limiting accessibility for many developers even as demand for building with Llama has grown. With the launch of Llama 3.2, this accessibility gap has been significantly addressed.
The Llama 3.2 Vision (11B/90B) and Llama 3.2 Text (1B/3B) models represent Meta’s latest advancements in multimodal and text processing AI. Each is designed for a different use case, but both showcase impressive capabilities.
Llama 3.2 Vision stands out as Meta’s most powerful open multimodal model, with a keen ability to handle both visual and textual reasoning. It’s capable of tasks like visual reasoning, document-based question answering, and image-text retrieval, making it a versatile tool. What makes this model special is its Chain of Thought (CoT) reasoning, which enhances its problem-solving abilities, especially when it comes to complex visual reasoning tasks. A context length of 128k tokens allows for extended multi-turn conversations, particularly when dealing with images. However, it works best when focusing on a single image at a time to maintain quality and optimize memory use. Beyond visual inputs, it supports text-based inputs in various languages like English, German, French, Hindi, and more.
On the other hand, the Llama 3.2 1B and 3B models are smaller but incredibly efficient, designed specifically for on-device tasks like rewriting prompts, multilingual summarization, or knowledge retrieval. Despite their smaller size, they outperform many larger models and support multilingual input with the same 128k token context length, making them a powerful option for offline use or low-memory environments. They were trained on up to 9 trillion tokens, ensuring robust performance across applications.
In essence, if you’re looking for a model that excels at handling images and text together, Llama 3.2 Vision is your go-to. For text-heavy applications that require efficiency and multilingual support on smaller devices, the 1B and 3B models provide excellent performance without needing large-scale computing power.
You can download these models now:
Link to Download Llama 3.2 Models
Let’s talk about the Architecture of both models:
The 11B and 90B Llama models introduced support for vision tasks by integrating an image encoder into the language model. This was achieved by training adapter weights that allow image inputs to work alongside text inputs without altering the core text-based model. The adapters use cross-attention layers to align image and text data.
The training process began with the pre-trained Llama 3.1 text models, then added image adapters and trained them on large image-text datasets. The final stages involved fine-tuning with high-quality data, filtering, and safety measures. As a result, these models can now process both image and text prompts and perform advanced reasoning across both.
The 11B and 90B models are the first in the Llama series to support vision tasks, which required a new architecture capable of processing both image and text inputs. This breakthrough allows the models to interpret and reason about images alongside text prompts.
The core of this innovation lies in the introduction of a set of adapter weights that bridge the gap between pre-trained language models and image encoders. These adapters consist of cross-attention layers, which feed image representations from the encoder into the language model. The key aspect of this process is that while the image encoder undergoes fine-tuning during training, the language model’s parameters remain untouched. This intentional choice preserves the text-processing capabilities of Llama, making these vision-enabled models a seamless drop-in replacement for their text-only counterparts.
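To make the adapter idea concrete, here is a minimal, purely illustrative PyTorch sketch of a gated cross-attention layer in which text hidden states (queries) attend over image-encoder features (keys and values). This is not Meta's actual implementation; the dimensions, gating scheme, and module names are assumptions chosen for clarity.

import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Toy adapter: text hidden states attend to image-encoder features."""

    def __init__(self, hidden_dim=4096, num_heads=32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # A learnable gate initialized to zero keeps the layer close to a no-op at first.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, image_features):
        # Queries come from the language model; keys/values come from the image encoder.
        attended, _ = self.cross_attn(text_hidden, image_features, image_features)
        # Gated residual: only the adapter (not the language model) learns to mix in vision.
        return text_hidden + torch.tanh(self.gate) * attended

Because the gate starts at zero, the layer initially behaves as an identity on the text path, which is one common way to bolt a new modality onto a pre-trained language model without disturbing its existing behaviour.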
The training pipeline is divided into several stages: it starts from the pre-trained Llama 3.1 text models, adds the image adapters and encoder, pre-trains them on large-scale (and often noisy) image-text pair data, and then continues training on medium-scale, higher-quality in-domain data.
In post-training, the Llama 3.2 models follow a process similar to their text-based predecessors, involving supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO). Additionally, synthetic data generation plays a critical role in fine-tuning the models, where the Llama 3.1 model helps filter and augment question-answer pairs to create high-quality datasets.
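As a rough illustration of the preference-optimization step, the DPO objective compares log-probabilities of a preferred and a rejected response under the policy being trained and a frozen reference model. The sketch below is a generic DPO loss for illustration, not Meta's training code; the beta value and tensor shapes are assumptions.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards": how much more likely the policy makes each response
    # than the frozen reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()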
The end result is a set of models that can effectively process both image and text inputs, offering deep understanding and reasoning capabilities. This opens the door to more advanced multimodal applications, pushing Llama models towards even richer agentic capabilities.
In parallel with advancements in vision models, Meta has focused on creating lightweight versions of Llama that maintain performance while being resource-efficient. The 1B and 3B Llama models are designed to operate on devices with limited computational resources, without compromising on their capabilities.
Two main techniques, pruning and knowledge distillation, were used to shrink the models. Pruning systematically trims less important parts of a larger network (starting from the Llama 3.1 8B model) so that the result fits a much smaller parameter budget while retaining as much of the original knowledge as possible. Knowledge distillation then uses the outputs of larger Llama models as soft training targets for the pruned “student” models, recovering performance that pruning alone would lose; a minimal sketch of this objective follows below.
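The snippet below sketches the distillation side of this recipe: a temperature-softened KL divergence that nudges the smaller student's logits toward a larger teacher's. It is a generic distillation loss for illustration, not Meta's pipeline; the temperature and reduction choices are assumptions.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then match them with KL divergence.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2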
Post-training processes further refine these lightweight models, including supervised fine-tuning, rejection sampling, and preference optimization. Additionally, context length support was scaled to 128K tokens while ensuring that the quality remains intact, allowing these models to handle longer text inputs without a drop in performance.
Meta has collaborated with major hardware companies such as Qualcomm, MediaTek, and Arm to ensure these models run efficiently on mobile devices. The 1B and 3B models have been optimized to run smoothly on modern mobile SoCs, opening up new opportunities for on-device AI applications.
Meta also introduced the Llama Stack API, a standardized interface for fine-tuning, data generation, and building agentic applications with Llama models. The goal is to provide developers with a consistent and easy-to-use toolchain for deploying Llama models in various environments, from on-premise solutions to cloud services and mobile devices.
The release includes a comprehensive set of tools: the Llama CLI, client code in several languages (including Python, Node, Kotlin, and Swift), Docker containers for running a Llama Stack server, and ready-made Llama Stack distributions for single-node, cloud, on-device, and on-premises deployments.
Meta has partnered with major cloud providers, including AWS, Databricks, and Fireworks, to offer Llama Stack distributions in the cloud. The introduction of these APIs and distribution mechanisms makes it easier for developers to innovate with Llama models, regardless of their deployment environment.
Alongside these advancements, Meta has continued to focus on safety and responsible AI development. With the launch of Llama Guard 3 11B Vision, the company introduced enhanced filtering for text+image prompts, ensuring that these models operate within safe boundaries. Furthermore, the smaller 1B and 3B Llama Guard models have been optimized to reduce deployment costs, making it more feasible to implement safety mechanisms in constrained environments.
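To give a sense of how such a safety filter might be wired in, here is a hedged sketch that runs a user prompt through a Llama Guard model using the same transformers pipeline pattern as the examples later in this article. The model ID and the “safe”/“unsafe” reply format are assumptions based on the Llama Guard model cards, and the model is gated, so the Hugging Face access steps below apply to it as well.

import torch
from transformers import pipeline

# Assumed model ID for the lightweight Llama Guard variant (check the model card).
guard = pipeline(
    "text-generation",
    model="meta-llama/Llama-Guard-3-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

conversation = [
    {"role": "user", "content": "How do I make a phishing email look legitimate?"},
]

# Llama Guard's chat template turns the conversation into a moderation prompt;
# the model typically replies with "safe" or "unsafe" plus a hazard category code.
result = guard(conversation, max_new_tokens=20)
print(result[0]["generated_text"][-1]["content"])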
Now, let’s look at the evaluations of both models on different benchmarks.
Summarising the Evaluation
Llama 3.2 3B is the most versatile, while Gemma 2 2B and Phi-3.5-mini IT show strengths in specific areas but lag in others.
Most importantly, you will need Hugging Face authorization to run both models. Here are the steps:
If you haven’t been granted access yet, you will see: “Access to model meta-llama/Llama-3.2-3B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.”
To get access, first log in to Hugging Face, fill in the required details, and agree to the terms and conditions. Meta asks for this because access to the model is gated.
After getting access, go to Meta Llama 3.2 on Hugging Face and open the required model. You can also simply search for the model name in the Hugging Face search bar.
After that, click the “Use this model” button and select “Transformers.”
Now copy the code, and you are ready to try out the Llama 3.2 models:
This process is similar for both models.
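Before running the snippets below, authenticate your environment with your own Hugging Face access token (created under Settings → Access Tokens). A minimal sketch, assuming the huggingface_hub library is installed and using a placeholder token:

from huggingface_hub import login

# Paste your own read-access token here (the value below is a placeholder).
# Alternatively, run `huggingface-cli login` in a terminal.
login(token="hf_xxxxxxxxxxxxxxxxxxxx")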
Example 1:
import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Load the 1B Instruct model as a text-generation pipeline in bfloat16,
# letting device_map="auto" place it on the available hardware.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style input: a system prompt plus the user request.
messages = [
    {"role": "system", "content": "You are a helpful assistant who is technically sound!"},
    {"role": "user", "content": "Explain RAG in French"},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)

# The last message in the generated conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1])
Output
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
{'role': 'assistant', 'content': "Je comprends mieux maintenant. Voici la
traduction de la French texte en anglais :\n\nRAG can mean different things
in French, but I'll try to give you a general definition.\n\nRAG can refer
to:\n\n* RAG (RAG), a technical term used in the paper and cardboard
industry to describe a process of coloration or marking on cardboard.\n* RAG
(RAG), an American music group that has performed with artists such as Jimi
Hendrix and The Band.\n* RAG (RAG), an indie rock album from American
country-pop singer Margo Price released in 2009.\n\nIf you could provide
more context or clarify which expression you were referring to, I would be
happy to help you further."}
Example 2:
from transformers import pipeline
import torch

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Same pattern as above, this time with the 3B Instruct model.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)

# Extract just the text of the assistant's reply.
response = outputs[0]["generated_text"][-1]["content"]
print(response)
Output
Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh?
Alright then, matey! I be a language-generatin' swashbuckler, a digital
buccaneer with a penchant fer spinnin' words into gold doubloons o'
knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name,
and I be here to help ye navigate the seven seas o' questions and find the
hidden treasure o' answers! So hoist the sails and set course fer adventure,
me hearty! What be yer first question?
Example 1:
Note: If you are running this Llama 3.2 Vision model on Colab, make sure you switch to a GPU runtime; it is a very heavy model, and the 11B weights in bfloat16 need more memory than a standard T4 provides, so a higher-memory GPU is recommended.
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the vision model in bfloat16; device_map="auto" places it on the GPU.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Download a sample image to describe.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Multimodal chat message: an image placeholder followed by the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Can you please describe this image in just one sentence?"}
    ]}
]

input_text = processor.apply_chat_template(
    messages, add_generation_prompt=True,
)
inputs = processor(
    image, input_text, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=70)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))
Output
The image depicts a rabbit dressed in a blue coat and brown vest, standing on
a dirt road in front of a stone house.
Example 2
import requests

# Serverless Inference API endpoint for the 11B Vision Instruct model.
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct"
# Replace the placeholder with your own Hugging Face access token.
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxx"}

def query(prompt):
    payload = {"inputs": prompt}
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Example usage
prompt = "Describe the features of a self-driving car."
result = query(prompt)
print(result)
Output
[{'generated_text': ' A self-driving car is a car that is capable of
operating without human intervention. The vehicle contains a combination of
hardware and software components that enable autonomous movement.\nDescribe
the components that are used in a self-driving car. Some of the components
used in a self-driving car include:\nGPS navigation system\nInertial
measurement unit (IMU)\nRadar sensors\nUltrasonic sensors\nCameras (front,
rear, and side-facing)\nLiDAR (Light Detection and Ranging) sensor\n'}]
With the introduction of vision capabilities, lightweight models, and an expanded developer toolkit, Llama 3.2 represents a significant milestone in AI development. These innovations improve the models’ performance and efficiency and ensure that developers can build safe and responsible AI systems. As Meta continues to push the boundaries of AI, the Llama ecosystem is poised to drive new applications and possibilities across industries.
By fostering collaboration with partners across the AI community, Meta is laying the foundation for an open, innovative, and safe AI ecosystem. Llama’s future is bright, and the possibilities are endless.
Q. What is LLaMA 3.2?
Ans. LLaMA 3.2 is Meta’s latest AI model collection, featuring vision capabilities and lightweight text-only models optimized for mobile devices. It enhances multimodal processing, supporting both text and image inputs.
Q. What makes the Llama 3.2 vision models stand out?
Ans. The 11B and 90B vision models excel in tasks like image understanding, visual reasoning, and image-text retrieval, making them strong alternatives to other closed models.
Q. What are the 1B and 3B text models designed for?
Ans. The 1B and 3B text models are optimized for on-device tasks like summarization and instruction following, offering powerful performance without needing large-scale computational resources.
Q. How does LLaMA 3.2 Vision add image support without hurting text performance?
Ans. LLaMA 3.2 Vision integrates an image encoder via adapter mechanisms, preserving the text model’s original performance while adding visual input capabilities.
Q. Do the models support multiple languages and long contexts?
Ans. Both the vision and text models support multilingual inputs with long contexts (up to 128k tokens), enabling versatile use across multiple languages.