The world of AI just got a whole lot more exciting with the release of Llama3! This powerful open-source language model, created by Meta, is shaking things up. Llama3, available in 8B and 70B pretrained and instruction-tuned variants, offers a wide range of applications. In this guide, we will explore the capabilities of Llama3 and how to access Llama3 with Flask, focusing on its potential to revolutionize Generative AI.
Llama3 is an auto-regressive language model that leverages an optimized transformer architecture: the regular transformer, but with an upgraded approach. The tuned versions employ supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. The model was pretrained on an extensive corpus of over 15 trillion tokens of data from publicly available sources, with a knowledge cutoff of March 2023 for the 8B model and December 2023 for the 70B model. The fine-tuning data incorporates publicly available instruction datasets, as well as over 10 million human-annotated examples.
As we previously noted, Llama3 has an optimized transformer design and comes in two sizes, 8B and 70B parameters, in both pre-trained and instruction-tuned versions. The model's tokenizer has a 128K-token vocabulary, and the models were trained on sequences of 8,192 tokens. Llama3 has proven remarkably capable across tasks such as instruction following, reasoning, and code generation.
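If you want to verify the 128K-token vocabulary mentioned above, a quick check with the tokenizer looks like the minimal sketch below (it assumes you have accepted the license for the gated meta-llama repository on Hugging Face; the exact count includes a block of reserved special tokens):

from transformers import AutoTokenizer

# Minimal check of the Llama3 tokenizer's vocabulary size (assumes access to the gated repo)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(len(tokenizer))  # roughly 128K entries, including reserved special tokens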
The most significant advantage of Llama3 is its open-source and free nature, making it accessible to developers without breaking the bank.
As mentioned earlier, Llama3 offers two major variants, 8B and 70B parameters, each catering to different use cases: the 8B model is light enough to serve on more modest hardware, while the 70B model trades higher compute requirements for stronger performance.
As covered above, Llama3 was pre-trained on an extensive corpus of over 15 trillion tokens of publicly available data and fine-tuned on publicly available instruction datasets plus over 10 million human-annotated examples (you heard that right!). The model has achieved impressive results on standard automatic benchmarks, including MMLU, AGIEval English, CommonSenseQA, and more.
Llama3 can be used like the other Llama-family models, which makes getting started easy. We basically need to install transformers and accelerate. We will see a wrapper script in this section. You can find the entire code snippets and the notebook to run with a GPU here. I have added the notebook, a Flask app, and an interactive-mode script to test the behavior of the model. Here's an example of using Llama3 with the pipeline API:
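The snippet below is a minimal sketch of that pipeline usage; it mirrors the calls the Flask app later in this guide makes, and the sampling parameters are illustrative defaults rather than requirements:

import transformers
import torch

# Minimal pipeline sketch: load the instruct model and chat with it via the chat template
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipe = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9)
print(outputs[0]["generated_text"][len(prompt):])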
Let us now explore the steps to access Llama3 with Flask.
Create a virtual environment (optional but recommended):
$ python -m venv env
$ source env/bin/activate # On Windows use `.\env\Scripts\activate`
Install necessary packages:
We install transformers and accelerate, but since Llama3 is new, we install transformers directly from GitHub.
(env) $ pip install -q git+https://github.com/huggingface/transformers.git
(env) $ pip install -q flask transformers torch accelerate # datasets peft bitsandbytes
Create a new Python file called main.py. Inside it, paste the following code.
from flask import Flask, request, jsonify
import transformers
import torch
app = Flask(__name__)
# Initialize the model and pipeline outside of the function to avoid unnecessary reloading
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json()
    user_message = data.get('message')
    if not user_message:
        return jsonify({'error': 'No message provided.'}), 400

    # Create system message
    messages = [{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"}]
    # Add user message
    messages.append({"role": "user", "content": user_message})

    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )

    generated_text = outputs[0]['generated_text'][len(prompt):].strip()
    response = {
        'message': generated_text
    }
    return jsonify(response), 200

if __name__ == '__main__':
    app.run(debug=True)
The above code initializes a Flask web server with a single route, /generate, responsible for receiving and processing user messages and returning AI-generated responses.
Run the Flask app by executing the following command:
(env) $ export FLASK_APP=main.py
(env) $ flask run --port=5000
Now, you should have the Flask app running at http://localhost:5000. You can test the API with tools like Postman or curl, or even write a simple HTML frontend page.
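For a quick sanity check from Python, a hypothetical test of the /generate endpoint could look like this (assuming the app is running locally on port 5000):

import requests

# Send a test message to the local /generate endpoint
resp = requests.post(
    "http://localhost:5000/generate",
    json={"message": "Who are you?"},
)
print(resp.status_code)  # 200 on success
print(resp.json())       # {'message': "Arrrr, me hearty! ..."} -- the exact text will vary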
Interactive Mode Using Transformers AutoModelForCausalLM
To interactively query the model from a Jupyter Notebook, paste the following into a cell and run it:
import requests
import sys
sys.path.insert(0,'..')
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_NAME ='meta-llama/Meta-Llama-3-8B-Instruct'
class InteractivePirateChatbot:
    def __init__(self):
        self._tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side='left')
        self._tokenizer.pad_token = self._tokenizer.eos_token
        self._model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto", offload_buffers=True)

    def _prepare_inputs(self, messages):
        try:
            # Render the chat through the Llama3 chat template so the model sees
            # the system and user turns as a single, properly formatted prompt
            prompt = self._tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            inputs = self._tokenizer(prompt, truncation=True, max_length=512, return_tensors='pt')
            input_ids = inputs.input_ids.to(self._model.device)
            attention_mask = inputs.attention_mask.to(self._model.device)
            return {'input_ids': input_ids, 'attention_mask': attention_mask}
        except Exception as e:
            print(f"Error preparing inputs: {e}")
            return None

    def ask(self, question):
        try:
            messages = [
                {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
                {"role": "user", "content": question}
            ]

            prepared_data = self._prepare_inputs(messages)
            if prepared_data is None:
                print("Error preparing inputs. Skipping...")
                return

            output = self._model.generate(**prepared_data, max_length=512, num_beams=5, early_stopping=True)
            # Decode only the newly generated tokens, not the prompt
            answer = self._tokenizer.decode(output[0][prepared_data['input_ids'].shape[-1]:], skip_special_tokens=True)
            print("Pirate:", answer)
        except Exception as e:
            print(f"Error generating response: {e}")

generator = InteractivePirateChatbot()
while True:
    question = input("User: ")
    generator.ask(question)
The above code lets you quickly interact with the model and see how it behaves. Find the entire code here.
User: "Who are you?"
Pirate: "Arrrr, me hearty! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas! I be here to swab yer decks with me clever responses and me trusty parrot, Polly, perched on me shoulder. So hoist the colors, me matey, and let's set sail fer a swashbucklin' good time!"
Now that we have seen how the model works, let's look at some safety and responsibility guidelines.
Meta has taken a series of steps to ensure responsible AI development, including implementing safety best practices, providing resources such as the Meta Llama Guard 2 and Code Shield safeguards, and updating the Responsible Use Guide. Developers are encouraged to tune and deploy these safeguards according to their needs, weighing the benefits of alignment and helpfulness for their specific use case and audience. All of these links are available in the Hugging Face repository for Llama3.
While Llama3 is a powerful tool, it's essential to acknowledge its limitations and potential risks. The model may produce inaccurate, biased, or objectionable responses to user prompts, so developers should perform safety testing and tuning tailored to their specific applications. Meta recommends incorporating Purple Llama solutions into your workflows, specifically Llama Guard, which provides a base model for filtering input and output prompts, layering system-level safety on top of model-level safety.
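As a rough illustration of that idea, a moderation helper built around Llama Guard 2 could look like the sketch below. It assumes access to the gated meta-llama/Meta-Llama-Guard-2-8B checkpoint and follows the chat-template style of usage from its model card; check the model card for the exact output format before relying on it:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Sketch: classify a user/assistant exchange with Llama Guard 2 (assumed checkpoint).
# The guard model replies with "safe" or "unsafe" plus a violated-category code.
guard_id = "meta-llama/Meta-Llama-Guard-2-8B"
guard_tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard_model = AutoModelForCausalLM.from_pretrained(guard_id, torch_dtype=torch.bfloat16, device_map="auto")

def moderate(chat):
    input_ids = guard_tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard_model.device)
    output = guard_model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return guard_tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

verdict = moderate([
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "Arrrr, me hearty! Me name be Captain Chat..."},
])
print(verdict)  # e.g. "safe"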
Meta has reshaped the landscape of artificial intelligence with the introduction of Llama3, a potent open-source language model. With its availability in both 8B and 70B pretrained and instruction-tuned versions, Llama3 presents a multitude of possibilities for innovation. This guide has provided an in-depth exploration of Llama3's capabilities and how to access Llama3 with Flask, emphasizing its potential to redefine Generative AI.
Frequently Asked Questions
Q1. Who developed Llama3?
A. Meta developed Llama3, a powerful open-source language model available in both 8B and 70B pre-trained and instruction-tuned versions.
Q2. What are Llama3's key capabilities?
A. Llama3 has demonstrated impressive capabilities, including enhanced accuracy, adaptability, and scalability. It posts strong results on standard benchmarks such as MMLU, AGIEval English, and CommonSenseQA, and in practice delivers relevant, context-aware responses.
Q3. Is Llama3 free to use?
A. Yes, Llama3 is open-source and free, including for commercial use, making it accessible to developers without breaking the bank. However, we recommend reviewing the licensing terms and conditions to ensure compliance with any applicable restrictions.
Q4. Can Llama3 be fine-tuned for specific use cases?
A. Yes, Llama3 can be fine-tuned for specific use cases by adjusting the hyperparameters and training data (see the sketch below). This can help improve the model's performance on specific tasks and datasets.
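For example, a parameter-efficient fine-tuning setup using the peft library (one of the packages commented out in the pip install step earlier) might start roughly like the sketch below; the LoRA hyperparameters shown are purely illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
import torch

# Illustrative sketch: wrap Llama3 with LoRA adapters so only a small fraction of
# parameters is trained; pair this with your own dataset and a Trainer of your choice.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only a small fraction of weights are trainable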
Q5. How does Llama3 compare to models like BERT and RoBERTa?
A. Llama3, a more advanced language model trained on a much larger dataset, outperforms BERT and RoBERTa on a wide range of natural language processing tasks.