Discover the latest milestone in AI language models with Meta’s Llama 3 family. From advancements like an expanded vocabulary to practical implementations using open-source tools, this article dives into the technical details and benchmarks of Llama 3. Learn how to deploy and run these models locally, unlocking their potential on consumer hardware.
Meta’s Llama 3 is a family of large language models (LLMs) released in 2024. We will summarize what makes it special shortly; first, a quick look at one of the tools we will use to run it: Ollama.
Ollama is an open-source framework designed to make working with Large Language Models (LLMs) easier. It allows you to run these powerful AI models directly on your own computer.
Key features of Ollama include a simple command-line interface for downloading and running models, a curated library of ready-to-use models (Llama 3, Mistral, Gemma, and more), a built-in REST API for integrating models into applications, and fully local execution, so your data never leaves your machine.
Overall, Ollama is a valuable tool for developers, data scientists, and researchers who want to work with LLMs on their local machines. It simplifies the process and offers a secure environment for experimentation and development.
Introducing the Llama 3 family: a new era in language models. With pre-trained base and instruction-tuned chat models available in 8B and 70B sizes, it brings significant advancements. These include an expanded vocabulary of 128K tokens, improving token-encoding efficiency and enabling better multilingual text generation. It also implements Grouped Query Attention (GQA) across all model sizes, which shrinks the key/value cache during inference and helps the models produce more coherent, extended responses than their predecessors.
Furthermore, Meta’s rigorous training regimen, utilizing 15 trillion tokens for the 8B model alone, signifies a commitment to pushing the boundaries of natural language processing. With plans for multi-modal models and even larger 400B+ models on the horizon, the Llama 3 series heralds a new era of AI language modeling, poised to revolutionize various applications across industries.
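To make the GQA idea above concrete, here is a toy sketch in PyTorch. It is illustrative only, with made-up head counts and dimensions; the point is that several query heads share a single key/value head, shrinking the KV cache without changing the attention math.
import torch

# Toy grouped-query attention (GQA). Sizes are made up for illustration.
batch, seq_len, n_q_heads, n_kv_heads, head_dim = 1, 4, 8, 2, 16
group = n_q_heads // n_kv_heads  # query heads per shared KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # far fewer KV heads
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Each KV head is shared by `group` query heads: repeat it to line them up.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim**0.5
out = torch.softmax(scores, dim=-1) @ v  # (batch, n_q_heads, seq_len, head_dim)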
The models are available through Meta’s official Llama page and on the Hugging Face Hub.
With these performance characteristics, Llama 3 is a strong candidate for running locally. Thanks to advances in model quantization, we can now run LLMs on consumer hardware. There are different ways to run these models locally depending on your hardware specification. If your system has enough GPU memory (~48 GB), you can comfortably run the 8B model at full precision and a 4-bit quantized 70B model, though output may be on the slower side. You can also use cloud instances for inference. Here, we will use the free-tier Colab with a 16 GB T4 GPU to run a quantized 8B model. The 4-bit quantized model requires ~5.7 GB of GPU memory, which fits comfortably on a T4.
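As a rough sanity check on those numbers, you can estimate the weight memory yourself: parameter count times bytes per parameter. This back-of-the-envelope sketch ignores activations, the KV cache, and framework overhead, which is why the real footprint (~5.7 GB for the 4-bit 8B model) is higher than the raw weight size.
# Rough VRAM needed just for the weights of an 8B-parameter model.
params = 8e9
for precision, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB")
# fp16: ~14.9 GiB, int8: ~7.5 GiB, 4-bit: ~3.7 GiB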
Several open-source tools can run these models locally; in this article we will use two of the most popular: the HuggingFace Transformers library and Ollama.
HuggingFace has already rolled out support for Llama 3 models. We can easily pull them from the HuggingFace Hub with the Transformers library, loading either the full-precision weights or a 4-bit quantized variant. This is an example of running it on the Colab free tier.
First, install the accelerate and bitsandbytes libraries and upgrade transformers.
!pip install -U "transformers==4.40.0"
!pip install accelerate bitsandbytes
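Optionally, confirm that the Colab runtime actually has a GPU attached before loading the model (Runtime > Change runtime type > T4 GPU):
import torch
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on the free tier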
Now we will load the model and start querying it.
import transformers
import torch

# 4-bit quantized Llama 3 8B Instruct; small enough for a 16 GB T4 GPU
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)
Now send queries to the model for inference.
messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": """Generate an approximately fifteen-word sentence
that describes all this data:
Midsummer House eatType restaurant;
Midsummer House food Chinese;
Midsummer House priceRange moderate;
Midsummer House customer rating 3 out of 5;
Midsummer House near All Bar One"""},
]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop generation at either the EOS token or Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    prompt,
    max_new_tokens=256,  # raise this if responses get cut off mid-answer
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
Output of the query: “Here is a 15-word sentence that summarizes the data:
Midsummer House is a moderate-priced Chinese eatery with a 3-star rating near All Bar One.”
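If you are curious what apply_chat_template actually built, print the prompt. For Llama 3 Instruct models it uses Meta’s header-token format, which is also why <|eot_id|> (end of turn) is included in the terminators above. The rendered text looks roughly like this (shown from memory; inspect your own output to confirm):
print(prompt)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# You are a helpful assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>
#
# Generate an approximately fifteen-word sentence ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>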
You can wrap this in a Gradio app for an interactive chat interface. Install Gradio and run the code below.
!pip install -q gradio
import gradio as gr

messages = []  # running chat history in the format apply_chat_template expects

def add_text(history, text):
    """Append the user's message to both the UI history and the model history."""
    global messages
    history = history + [[text, ""]]  # use a list, not a tuple, so it can be updated later
    messages = messages + [{"role": "user", "content": text}]
    return history, ""  # second value clears the textbox

def generate(history):
    """Generate a response and stream it to the chatbot character by character."""
    global messages
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    outputs = pipeline(
        prompt,
        max_new_tokens=256,  # raise this if responses get cut off mid-answer
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]
    # Keep the assistant's reply in the history so follow-up turns have context.
    messages.append({"role": "assistant", "content": response_msg})
    for char in response_msg:
        history[-1][1] += char
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )
    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot], outputs=chatbot,
    )

demo.queue()
demo.launch(debug=True)
Here is a demo of the Gradio app and Llama 3 in action.
Ollama is another open-source tool for running LLMs locally. To use it, first download and install the software from the official site.
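On Linux, one way to install it is Ollama’s documented install script (macOS and Windows have installers on the download page):
curl -fsSL https://ollama.com/install.sh | sh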
Once installed, use one of the following commands to pull the model (on first run) and start an interactive session.
ollama run llama3:instruct      # 8B instruct model
ollama run llama3:70b-instruct  # 70B instruct model
ollama run llama3               # 8B pre-trained model
ollama run llama3:70b           # 70B pre-trained model
Ollama also exposes a REST API on port 11434. With the server running, you can query it with curl:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}'
You will receive a JSON response similar to this:
{
  "model": "llama3",
  "created_at": "2024-04-19T19:22:45.499127Z",
  "response": "The sky is blue because it is the color of the sky.",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}
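The same endpoint can be called from Python as well. Here is a minimal sketch using the requests library, mirroring the curl call above:
import requests

# Mirror of the curl example: one-shot generation, no streaming.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])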
We have covered not just the advances in language modeling that Llama 3 brings but also practical strategies for running it. Running Llama 3 locally is now possible thanks to tools like HuggingFace Transformers and Ollama, which opens up a wide range of applications across industries. Looking ahead, Llama 3’s open-source design encourages innovation and accessibility, paving the way for advanced language models to be in the hands of developers everywhere.
Q. What is Llama 3?
A. Llama 3 is a family of large language models from Meta AI. It comes in two sizes, 8B and 70B, each with a pre-trained base model and an instruction-tuned model for chat applications.
Q. Is Llama 3 open-source?
A. Yes, it is open-source. The models can be deployed commercially and further fine-tuned on custom datasets.
Q. Is Llama 3 multi-modal?
A. The first batch of these models is not multi-modal, but Meta has confirmed the future release of multi-modal models.
Q. Is Llama 3 better than GPT-4?
A. The Llama 3 70B model is better than GPT-3.5, but it is still not better than GPT-4.
Q. Should I use GPT-4 or Llama 3?
A. GPT-4 is generally considered more advanced, but Llama 3 excels in specific tasks like coding and summarization. Choose based on your needs and preferences.