How to Run Llama 3 Locally?

Sunil Kumar Last Updated: 21 Oct, 2024
6 min read

Introduction

Discover the latest milestone in AI language models with Meta’s Llama 3 family. From advancements like an expanded vocabulary to practical implementations using open-source tools, this article dives into the technical details and benchmarks of Llama 3. Learn how to deploy and run these models locally, unlocking their potential on consumer hardware.


Learning Objectives

  • Understand the key advancements and benchmarks of the Llama 3 family of models, including their performance compared to previous iterations and other models in the field.
  • Learn how to deploy and run Llama 3 models locally using open-source tools like HuggingFace Transformers and Ollama, enabling hands-on experience with large language models.
  • Explore the technical enhancements in Llama 3, such as the increased vocabulary size and implementation of Grouped Query Attention, and understand their implications for text generation tasks.
  • Gain insights into the potential applications and future developments of Llama 3 models, including their open-source nature, multi-modal capabilities, and ongoing advancements in fine-tuning and performance.

This article was published as a part of the Data Science Blogathon.

Meta’s Llama 3

Meta’s Llama 3 is a large language model (LLM) that they released in 2024. Here’s a summary of what makes it special:

  • Most Capable Open-Source LLM: Meta claims Llama 3 outperforms other similarly sized open-source models on benchmarks. [1]
  • Powers Meta AI Assistant: This AI assistant is integrated into Facebook, Messenger, WhatsApp, and Instagram and can help with tasks, learning, and content creation.
  • Easy to Access: You can try Llama 3 through Meta AI or through platforms like Hugging Face.

What is Ollama?

Ollama is an open-source framework designed to make working with Large Language Models (LLMs) easier. It allows you to run these powerful AI models directly on your own computer.

Here are some key features of Ollama:

  • Run LLMs locally:  Ollama lets you bypass cloud-based services and run LLMs on your local machine. This can be beneficial for privacy reasons and when dealing with sensitive data.
  • Simple API:  Ollama provides an easy-to-use interface for creating, running, and managing LLMs.
  • Pre-built models:  Ollama comes with a library of pre-built models that you can use right away for various tasks.
  • Customization: Although it offers pre-built models, Ollama also allows you to import your own custom models for even greater flexibility.

Overall, Ollama is a valuable tool for developers, data scientists, and researchers who want to work with LLMs on their local machines. It simplifies the process and offers a secure environment for experimentation and development.
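For instance, once Ollama is installed and its server is running (covered later in this article), you can call a local model from Python via the ollama package. A minimal sketch, assuming you have run pip install ollama and already pulled the llama3 model:

import ollama  # assumes the Ollama server is running locally

# Chat with a locally served Llama 3 model
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(response["message"]["content"])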

Introducing Llama 3

Introducing the Llama 3 family: a new era in language models. With pre-trained base and instruction-tuned chat models available in 8B and 70B sizes, it brings significant advancements. These include an expanded vocabulary, now at 128K tokens, which encodes text more efficiently and improves multilingual text generation. Additionally, all Llama 3 models now implement Grouped Query Attention (GQA), making inference more memory-efficient, especially for longer responses, compared to their predecessors.
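As a quick sanity check on the vocabulary claim, you can load the tokenizer and inspect its size. A minimal sketch, assuming you have accepted the Llama 3 license for the gated meta-llama repository on the Hugging Face Hub:

from transformers import AutoTokenizer

# Requires access to the gated meta-llama repository on the Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(len(tokenizer))  # ~128k entries, up from ~32k in Llama 2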

Furthermore, Meta’s rigorous training regimen, utilizing over 15 trillion tokens of training data, signifies a commitment to pushing the boundaries of natural language processing. With plans for multi-modal models and even larger 400B+ models on the horizon, the Llama 3 series heralds a new era of AI language modeling, poised to revolutionize various applications across industries.

The models are available through the Meta Llama website and the Hugging Face Hub.

Performance Highlights

  • Llama 3 models excel in various tasks like creative writing, coding, and brainstorming, setting new performance benchmarks.
  • The 8B Llama 3 model outperforms previous models of its size by significant margins, approaching the performance of the Llama 2 70B model.
  • Notably, the Llama 3 70B model surpasses closed models like Gemini Pro 1.5 and Claude 3 Sonnet on several benchmarks.
  • Open-source nature allows for easy access, fine-tuning, and commercial use, with models offering liberal licensing.

Running Llama 3 Locally

With these performance numbers, Llama 3 is a strong candidate for running locally. Thanks to advances in model quantization, we can run LLMs on consumer hardware. There are different ways to run these models locally depending on your hardware specifications. If your system has enough GPU memory (~48 GB), you can comfortably run the 8B model at full precision or a 4-bit quantized 70B model, though output might be on the slower side. You may also use cloud instances for inference. Here, we will use the free-tier Colab with a 16 GB T4 GPU to run a quantized 8B model. The 4-bit quantized model requires ~5.7 GB of GPU memory, which fits comfortably on the T4.
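A back-of-the-envelope way to estimate how much GPU memory a model needs is parameters × bytes per parameter, plus overhead for activations and the KV cache. A rough sketch (the ~5.7 GB figure above is the 4-bit weights plus this overhead):

# Rough VRAM needed for the weights alone (no activations or KV cache)
def weight_memory_gb(n_params_billion, bits_per_param):
    return n_params_billion * 1e9 * bits_per_param / 8 / 1024**3

print(f"8B  @ 16-bit: ~{weight_memory_gb(8, 16):.1f} GB")  # ~14.9 GB
print(f"8B  @  4-bit: ~{weight_memory_gb(8, 4):.1f} GB")   # ~3.7 GB
print(f"70B @  4-bit: ~{weight_memory_gb(70, 4):.1f} GB")  # ~32.6 GB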

To run these models, we can use various open-source tools. Here are a few of them.

Using HuggingFace

HuggingFace has already rolled out support for Llama 3 models. We can easily pull the models from the HuggingFace Hub with the Transformers library. You can load either the full-precision models or the 4-bit quantized ones. Here is an example of running it on the Colab free tier.

Step 1: Install Libraries

Install the accelerate and bitsandbytes libraries and upgrade the transformers library.

!pip install -U "transformers==4.40.0"
!pip install accelerate bitsandbytes

Step 2: Load the Model

Now we will download the model and start querying it.

import transformers
import torch

# Community-quantized 4-bit Llama 3 8B Instruct weights
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,                   # compute dtype for inference
        "quantization_config": {"load_in_4bit": True},  # 4-bit loading via bitsandbytes
        "low_cpu_mem_usage": True,                      # avoid RAM spikes while loading
    },
)
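Passing a plain dict for quantization_config works here, but you can also construct an explicit BitsAndBytesConfig if you want finer control over the quantization settings. A minimal sketch of the equivalent setup:

from transformers import BitsAndBytesConfig

# Explicit 4-bit config, equivalent in spirit to the dict above
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls at inference
)
# Then pass model_kwargs={"quantization_config": quant_config, ...} to the pipeline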

Step 3: Send Queries

Now send queries to the model for inference.

messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": """Generate an approximately fifteen-word sentence
                                   that describes all this data:
                                   Midsummer House eatType restaurant;
                                   Midsummer House food Chinese;
                                   Midsummer House priceRange moderate;
                                   Midsummer House customer rating 3 out of 5;
                                   Midsummer House near All Bar One"""},
]

# Render the chat history into Llama 3's prompt format
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop on either the EOS token or Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The pipeline returns prompt + completion; slice off the prompt
print(outputs[0]["generated_text"][len(prompt):])

Output of the query: “Here is a 15-word sentence that summarizes the data:

Midsummer House is a moderate-priced Chinese eatery with a 3-star rating near All Bar One.”
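If you are wondering why <|eot_id|> appears in the terminators list, print the rendered prompt: Llama 3’s chat template ends each turn with <|eot_id|>, so generation should stop on either it or the EOS token. The template text in the comments below is indicative:

print(prompt)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# You are a helpful assistant!<|eot_id|><|start_header_id|>user<|end_header_id|>
#
# Generate an approximately fifteen-word sentence ...<|eot_id|><|start_header_id|>assistant<|end_header_id|>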

Step 4: Install Gradio and Run the Code

You can wrap this inside a Gradio app to get an interactive chat interface. Install Gradio (pip install gradio) and run the code below.

import gradio as gr

messages = []  # running chat history in the format apply_chat_template expects

def add_text(history, text):
    global messages
    # Use lists instead of tuples so the reply can be appended to in place
    history = history + [[text, ""]]
    messages = messages + [{"role": "user", "content": text}]
    return history, ""  # the empty string clears the textbox

def generate(history):
    global messages
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]
    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]
    # Keep the assistant's reply in the history so follow-up turns have context
    messages = messages + [{"role": "assistant", "content": response_msg}]
    # Stream the reply character by character into the chat window
    for char in response_msg:
        history[-1][1] += char
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )

    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot], outputs=chatbot,
    )

demo.queue()
demo.launch(debug=True)

Here is a demo of the Gradio app and Llama 3 in action.


Using Ollama

Ollama is another open-source tool for running LLMs locally. To use Ollama, download and install the application from the official website.

Step 1: Start the Local Server

Once installed, use one of these commands to pull a model and start an interactive session. The Ollama server runs in the background and listens on port 11434.

ollama run llama3:instruct  # 8B instruct model

ollama run llama3:70b-instruct  # 70B instruct model

ollama run llama3:text  # 8B pre-trained base model

ollama run llama3:70b-text  # 70B pre-trained base model

Step 2: Query Through the API

With the server running, you can query the model through its REST API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
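If you prefer Python to curl, the same call with the requests library looks like this:

import requests

# Same request as the curl command above
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])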

Step 3: JSON Response

You will receive a JSON response (the context field is truncated here for brevity):

{
  "model": "llama3",
  "created_at": "2024-04-19T19:22:45.499127Z",
  "response": "The sky is blue because it is the color of the sky.",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}
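The duration fields are reported in nanoseconds, so the generation speed can be derived directly from the response:

# Tokens per second from the response above (durations are in nanoseconds)
eval_count = 290
eval_duration = 4709213000
print(f"{eval_count / eval_duration * 1e9:.1f} tokens/s")  # ~61.6 tokens/s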

Conclusion

We have explored not just advances in language modeling but also practical implementation strategies for Llama 3. Running Llama 3 locally is now possible thanks to tools like HuggingFace Transformers and Ollama, which opens up a wide range of applications across industries. Looking ahead, Llama 3’s open-source design encourages innovation and accessibility, opening the door to a future where advanced language models are accessible to developers everywhere.

Key Takeaways

  • Meta has unveiled the Llama 3 family of models: 8B and 70B sizes, each with pre-trained and instruction-tuned variants.
  • The models have performed exceedingly well across multiple benchmarks in their respective weight categories.
  • Llama 3 uses a different tokenizer than Llama 2, with an increased vocabulary size, and all models are now equipped with Grouped Query Attention (GQA) for more efficient text generation.
  • While the models are big, it is possible to run them on consumer hardware using quantization, with open-source tools like Ollama and HuggingFace Transformers.

Frequently Asked Questions

Q1. What is Llama 3?

A. Llama 3 is a family of large language models from Meta AI. It comes in two sizes, 8B and 70B, each with a pre-trained base model and an instruction-tuned model for chat applications.

Q2. Is Llama 3 open-source?

A. Yes, the models are openly released. They can be deployed commercially and further fine-tuned on custom datasets.

Q3. Is Llama 3 multi-modal?

A. The first batch of these models is not multi-modal, but Meta has confirmed plans to release multi-modal versions in the future.

Q4. Is Llama 3 better than ChatGPT?

A. The Llama 3 70B model is better than GPT-3.5, but it is still not better than GPT-4.

Q5. Is Llama 3 better than GPT-4?

A. GPT-4 is generally considered more advanced, but Llama 3 excels in specific tasks like coding and summarization. Choose based on your needs and preferences.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Meet your author: Sunil Kumar Dash, a developer and a writer. He has diverse interests in tech, pop culture, wellness, philosophy, and anime. Exploring underrated music is his hobby, and he loves to doomscroll Twitter when bored.

Responses From Readers


Sujal Luhar

Hello sunil, thanks for this. But its giving incomplete answers for long texts. EXAMPLE Prompt: ``` messages = [ {"role": "system", "content": "You are a digital marketer who is expert in writing awesome enthusiastic blogs!"}, {"role": "user", "content": """Generate an 15 points long blog that describes about solid pathway to quickly become Machine Learning Engineer """}, ] ``` Response: ``` **Unlock the Power of Machine Learning: A 15-Point Path to Becoming a Machine Learning Engineer** Are you ready to unlock the secrets of machine learning and become a sought-after expert in this high-demand field? Look no further! In this blog, we'll outline a solid pathway to quickly become a machine learning engineer, covering the essential skills, tools, and best practices to get you started. **Point 1: Start with the Basics** Begin by learning the fundamentals of machine learning, including supervised and unsupervised learning, regression, classification, and clustering. **Point 2: Get Familiar with Python** Python is the de facto language for machine learning. Learn the basics of Python programming and get comfortable with popular libraries like NumPy, Pandas, and scikit-learn. **Point 3: Learn Linear Algebra and Calculus** Linear algebra and calculus are crucial for understanding machine learning concepts. Brush up on your math skills and learn to apply them to real-world problems. **Point 4: Dive into Machine Learning Fundamentals** Study the basics of machine learning, including regression, classification, clustering, and dimensionality reduction. Practice implementing these concepts using Python libraries. **Point 5: Experiment with Real-World Datasets** Work with real .....<<>> ``` Can you help me with it?
