The field of artificial intelligence has seen remarkable advancements in recent years, particularly in the area of large language models (LLMs). LLMs can generate human-like text, summarize documents, and write software code. Mistral-7B is a recent large language model that supports English text and code generation, and it can be used for tasks such as text summarization, classification, text completion, and code completion.
What sets Mistral-7B-Instruct apart is its ability to deliver strong performance with relatively few parameters, making it a cost-effective option. The model gained popularity after benchmark results showed that it outperforms all 7B models on MT-Bench and competes favorably with 13B chat models. In this blog, we will explore the features and capabilities of Mistral 7B, including its use cases, performance, and a hands-on guide to fine-tuning the model.
This article was published as a part of the Data Science Blogathon.
Large language models are built on the transformer architecture, which uses attention mechanisms to capture long-range dependencies in data. A model stacks multiple transformer blocks, each containing multi-head self-attention and a feed-forward neural network. These models are pre-trained on text data, learning to predict the next word in a sequence and thereby capturing the patterns of language. The pre-trained weights can then be fine-tuned on specific tasks. We will look specifically at the architecture of the Mistral 7B LLM and what makes it stand out.
The Mistral 7B transformer architecture balances high performance with memory usage, using attention mechanisms and caching strategies to outperform larger models in speed and quality. It uses Sliding Window Attention (SWA) with a window of 4096 tokens: each token attends only to a fixed-size window of preceding tokens rather than to the entire sequence, which keeps the cost of attention manageable on longer sequences.
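To make the windowing idea concrete, here is a minimal sketch of a sliding-window causal mask in NumPy. The function name `sliding_window_causal_mask` and the toy sizes are illustrative, and the exact boundary convention (the current token plus the `window - 1` tokens before it) is an assumption made for the example, not a statement about Mistral's actual implementation.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if query i may attend to key j."""
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    causal = j <= i                  # never look at future tokens
    in_window = (i - j) < window     # only the most recent `window` tokens
    return causal & in_window

# toy example: 6 tokens, window of 3 (Mistral 7B uses a window of 4096)
mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Each row of the printed matrix has at most three ones, so the attention work per token is bounded by the window size instead of growing with the sequence length.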
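Because the layers are stacked, a hidden state in a given layer can still draw on tokens from the input layer that lie well outside its own window: the reachable distance is determined by the window size and the layer depth, roughly the window size multiplied by the number of layers. The model also integrates modifications to Flash Attention and xFormers, which reportedly doubles the speed over a traditional attention implementation. Additionally, a Rolling Buffer Cache mechanism maintains a fixed cache size for efficient memory usage.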
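The rolling buffer idea can be sketched in a few lines of plain Python. This is a simplified illustration, not Mistral's code: the class name `RollingBufferCache`, the helper `receptive_field`, and the use of strings in place of key/value tensors are all assumptions made for the example. The layer count of 32 used below is taken from the Mistral 7B paper, not from this article.

```python
class RollingBufferCache:
    """Fixed-size cache: the entry for position i is stored in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window
        self.next_pos = 0  # number of tokens seen so far

    def append(self, item) -> None:
        # overwrite the oldest entry once the buffer is full
        self.slots[self.next_pos % self.window] = item
        self.next_pos += 1

    def contents(self) -> list:
        """Return cached items ordered from oldest to newest."""
        start = max(0, self.next_pos - self.window)
        return [self.slots[p % self.window] for p in range(start, self.next_pos)]


def receptive_field(window: int, n_layers: int) -> int:
    """Upper bound on how far back information can travel through stacked layers."""
    return window * n_layers


cache = RollingBufferCache(window=4)
for token in ["t0", "t1", "t2", "t3", "t4", "t5"]:
    cache.append(token)

print(cache.contents())           # ['t2', 't3', 't4', 't5']
print(receptive_field(4096, 32))  # 131072
```

The cache never holds more than `window` entries, which is why memory stays constant however long the generated sequence becomes.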
Let’s dive into the code and run inference with the Mistral 7B model in Google Colab. We will use the free version with a single T4 GPU and load the model from Hugging Face.
1. Install and import the ctransformers library in Colab.
#install ctransformers with CUDA support
!pip install ctransformers[cuda]
#import
from ctransformers import AutoModelForCausalLM
2. Initialize the model object from Hugging Face and set the necessary parameters. We will use a quantized GGUF version of the model, since the original model from Mistral AI can be too large to load entirely into memory on Google Colab.
#load the model from huggingface with 50 gpu layers
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
)
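A rough back-of-the-envelope calculation shows why a quantized file is easier to fit on a free Colab instance. The sketch below counts weight storage only, ignoring activations and the cache. The parameter count of about 7.2 billion is an assumption, and a 4-bit format such as Q4_K_M typically spends somewhat more than 4 bits per weight on average, so treat the last figure as a lower bound.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

N_PARAMS = 7.2e9  # assumption: rough parameter count for Mistral 7B

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the half-precision weights alone come close to the 16 GB of a T4, while the 4-bit version leaves comfortable headroom.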
3. Define a function that wraps long model output to a fixed line width so it is readable in Google Colab. This step can be skipped or modified if the code is run in a different environment.
#function to print the model output in colab in a readable manner
def colab_print(text, max_width=120):
    words = text.split()
    line = ""
    for word in words:
        if len(line) + len(word) + 1 > max_width:
            print(line)
            line = ""
        line += word + " "
    print(line)
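As an alternative, the same kind of wrapping can be done with Python's standard `textwrap` module. The helper name `wrap_print` is hypothetical and its line breaks may differ slightly from `colab_print`. It is shown only as a sketch of a shorter route to the same result.

```python
import textwrap

def wrap_print(text: str, max_width: int = 120) -> str:
    """Wrap text to max_width columns, print it, and return the wrapped string."""
    # collapse runs of whitespace first, then let textwrap choose the line breaks
    wrapped = textwrap.fill(" ".join(text.split()), width=max_width)
    print(wrapped)
    return wrapped

sample_text = "Mistral 7B is a compact language model. " * 5
wrapped = wrap_print(sample_text, max_width=40)
```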
4. Generate text using the model and see the results. Modify the parameters to change the quality of the generated text.
#generate text
colab_print(llm('''Give me a well-written paragraph in 5 sentences about a Senior Data
Scientist (name - Suvojit) who writes blogs on LLMs on Analytics Vidhya. He studied
Masters in AIML in BITS Pilani and works at AZ Company, with a total of 4 years of
experience. Start the sentence with - Suvojit is a''',
max_new_tokens = 2048, temperature = 0.9, top_k = 55, top_p = 0.93,
repetition_penalty = 1.2))
Model Response: Suvojit is a Senior Data Scientist who has been working for 4 years at AZ company as part of their team focusing on the design, implementation, and improvement of consumer behavior predictive models in their client’s brands and business lines using Limited Memory Machine Learning techniques. He writes about LLMs on Analytics Vidhya which helps him to stay up to date with the latest trends in Data Science. He has a master’s degree in AIML from BITS Pilani, where he studied machine learning algorithms and their applications. Suvojit is passionate about using his knowledge of data science to help businesses make better decisions and improve their performance.
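To give a feel for what `temperature`, `top_k`, and `top_p` do, here is a minimal NumPy sketch of one common way to combine them when choosing the next token. It is not the ctransformers implementation: the function name `sample_next_token`, the order of the filtering steps, and the toy logits are assumptions, and the repetition penalty is left out for brevity.

```python
import numpy as np

def sample_next_token(logits, temperature=0.9, top_k=55, top_p=0.93, rng=None):
    """Pick one token id from raw logits using temperature, top-k and top-p."""
    rng = rng or np.random.default_rng()
    # temperature < 1 sharpens the distribution, > 1 flattens it
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # top-k: keep only the k highest-scoring tokens, best first
    order = np.argsort(logits)[::-1][:top_k]
    probs = np.exp(logits[order] - logits[order].max())
    probs /= probs.sum()

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(np.cumsum(probs), top_p)) + 1
    order, probs = order[:cutoff], probs[:cutoff] / probs[:cutoff].sum()

    return int(rng.choice(order, p=probs))

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = np.random.default_rng(0)
samples = [sample_next_token(toy_logits, top_k=3, rng=rng) for _ in range(200)]
print(sorted(set(samples)))  # only ids from the three highest-scoring tokens
```

Lower values of any of the three parameters make the output more predictable. Higher values allow less likely tokens through and make the text more varied.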
Let’s look at some of the use cases of the Mistral 7B in more detail.
Mistral 7B supports multiple languages for translation. Let’s look at English to French and Spanish translations and check the accuracy.
#translate English to French
colab_print(llm('''Translate this text from English to French:
"Suvojit is a Senior Data Scientist who writes blogs on Analytics Vidhya."''',
max_new_tokens = 2048, temperature = 0.9, top_k = 55, top_p = 0.93,
repetition_penalty = 1.2))
Model Response: The translation would be: “Suvojit est un scientifique des données sénior qui écrit des blogues sur Analytics Vidhya.”
#translate English to Spanish
colab_print(llm('''Translate this text from English to Spanish:
"Suvojit is a Senior Data Scientist who writes blogs on Analytics Vidhya."''',
max_new_tokens = 2048, temperature = 0.9, top_k = 55, top_p = 0.93,
repetition_penalty = 1.2))
Model Response: The translation would be: “Suvojit es un Senior Data Científico que escribe blogs en Analytics Vidhya.”
Use Mistral 7B to summarize documents into a shorter version using appropriate summarization instructions.
#Define the prompt with instructions
prompt = '''[INST] You are a helpful code assistant. Your task is to summarize text
into exactly two short sentences:
Transformers have revolutionized the field of large language models with their ability
to capture long-range dependencies and intricate patterns in text. Originating
from the paper "Attention Is All You Need" by Vaswani et al. in 2017, transformers
introduced the self-attention mechanism, which weighs input tokens differently based
on their relevance to a given token. This mechanism enables transformers to process
information more flexibly than traditional RNNs or LSTMs. Consequently, models
like GPT, BERT, and their subsequent iterations have been built on the transformer
architecture, leading to breakthroughs in numerous natural language processing tasks.
Their parallel processing capabilities further make them conducive to training on
vast amounts of data. However, as these models grow in size, concerns about their
environmental impact and computational demands also emerge. Despite these challenges,
the transformative power of transformers in language models remains undeniable.
Just summarize the text in exactly 2 short sentences without explanations:
[/INST]'''
#Summarize
colab_print(llm(prompt,
max_new_tokens = 2048, temperature = 0.9, top_k = 55, top_p = 0.93,
repetition_penalty = 1.2))
Model Response: Transformers have revolutionized large language models by capturing long-range dependencies and intricate patterns, introduced a self-attention mechanism that makes them process information flexibly and subsequent implementations include breakthroughs like GPT & BERT. However, as these models grow in size concerns about their environmental impact and computational demands arise, despite these challenges they remain undeniably transformative in language modeling.
We can wrap the user input in [INST] and [/INST] tags so the instruct model knows where the instruction begins and ends, which helps steer it toward a particular kind of response. For example, we can generate a JSON object from a text description.
prompt = '''[INST] You are a helpful code assistant. Your task is to generate a valid
JSON object based on the given information:
My name is Suvojit Hore, working in company AB and my address is AZ Street NY.
Just generate the JSON object without explanations:
[/INST]
'''
colab_print(llm(prompt,
max_new_tokens = 2048, temperature = 0.9, top_k = 55, top_p = 0.93,
repetition_penalty = 1.2))
Model Response: ```json { "name": "Suvojit Hore", "company": "AB", "address": "AZ Street NY" } ```
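Since the model tends to wrap its JSON in a Markdown code fence, the response usually needs a little cleanup before it can be used in a program. The sketch below shows one way to do that. The helper names `build_instruct_prompt` and `extract_json` are hypothetical, and the regular expression assumes the response contains a single JSON object.

```python
import json
import re

def build_instruct_prompt(instruction: str) -> str:
    """Wrap a user instruction in the [INST] ... [/INST] tags used above."""
    return f"[INST] {instruction.strip()} [/INST]"

def extract_json(response: str) -> dict:
    """Parse the JSON object in a response, ignoring any surrounding code fence."""
    # greedy match from the first "{" to the last "}" (assumes a single object)
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model response")
    return json.loads(match.group(0))

FENCE = "`" * 3  # three backticks, as emitted around the model's JSON
reply = (FENCE + 'json { "name": "Suvojit Hore", "company": "AB", '
         '"address": "AZ Street NY" } ' + FENCE)

print(build_instruct_prompt("Generate a valid JSON object."))
print(extract_json(reply)["company"])  # AB
```

`json.loads` raises an error on malformed output, which is a useful signal to retry the generation with a lower temperature or a stricter instruction.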
Let’s look at how we can fine-tune the model using a single GPU on Google Colab. We will use a dataset that converts few-word descriptions about images to detailed and highly descriptive text. These results can be used in Midjourney to generate the specific image. The goal is to train the LLM to act as a prompt engineer for image generation.
Set up the environment and import the necessary libraries in Google Colab:
# Install the necessary libraries
!pip install pandas autotrain-advanced -q
!autotrain setup --update-torch
!pip install -q peft accelerate bitsandbytes safetensors
#import the necessary libraries
import pandas as pd
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
from huggingface_hub import notebook_login
Log in to Hugging Face from a browser and copy the access token. Use this token to log in to Hugging Face in the notebook.
notebook_login()
Upload the dataset to Colab session storage. We will use the Midjourney dataset.
df = pd.read_csv("prompt_engineering.csv")
df.head(5)
Train the model using Autotrain with appropriate parameters. Modify the command below to run it with your own Hugging Face repo and user access token.
!autotrain llm --train --project_name mistral-7b-sh-finetuned \
--model username/Mistral-7B-Instruct-v0.1-sharded --token hf_yiguyfTFtufTFYUTUfuytfuys \
--data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 12 \
--num_train_epochs 3 --trainer sft --target_modules q_proj,v_proj --push_to_hub \
--repo_id username/mistral-7b-sh-finetuned
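The `--use_peft` and `--target_modules q_proj,v_proj` flags ask for parameter-efficient fine-tuning of only the query and value projections. Assuming this means LoRA-style adapters (an assumption about what Autotrain does internally), the idea can be sketched with NumPy: the pre-trained weight stays frozen and only two small low-rank matrices are trained. The sizes, rank `r`, and scaling `alpha` below are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 16  # toy sizes; r and alpha are illustrative assumptions

W = rng.normal(size=(d, d))              # frozen pre-trained weight (e.g. q_proj)
A = rng.normal(scale=0.01, size=(r, d))  # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init so training starts at W

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x): frozen path plus a low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
trainable = A.size + B.size
print(f"trainable parameters: {trainable} of {W.size} ({trainable / W.size:.1%})")
```

Because only `A` and `B` receive gradients, the optimizer state and the saved adapter are a small fraction of the full model, which is what makes fine-tuning on a single Colab GPU feasible.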
Now let’s use the fine-tuned model to run inference and generate some detailed descriptions of the images.
#adapter and model
adapters_name = "suvz47/mistral-7b-sh-finetuned"
model_name = "bn22/Mistral-7B-Instruct-v0.1-sharded"
device = "cuda"
#set the config
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
#initialize the model (4-bit loading is already requested through bnb_config)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map='auto'
)
Load the finetuned model and tokenizer.
#load the model and tokenizer
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1
stop_token_ids = [0]
Generate a detailed and descriptive Midjourney prompt with just a few words.
#prompt
text = ("[INST] generate a midjourney prompt in less than 20 words for A computer "
        "with an emotional chip [/INST]")
#encode the prompt and move the input tensors to the GPU
encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
model_input = encoded.to(device)
#generate and decode (the model is already placed on the GPU by device_map='auto')
generated_ids = model.generate(**model_input, max_new_tokens=200, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print('\n\n')
print(decoded[0])
Model Response: As the computer with an emotional chip begins to process its emotions, it starts to question its existence and purpose, leading to a journey of self-discovery and self-improvement.
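The decoded string normally contains the original prompt followed by the generated text, so it can be handy to strip the instruction block before passing the result on. Here is a small sketch. The helper name `extract_completion` is hypothetical, and the `</s>` end-of-sequence marker is an assumption about how the tokenizer renders the stop token.

```python
def extract_completion(decoded: str, end_tag: str = "[/INST]", eos: str = "</s>") -> str:
    """Return only the text generated after the instruction block."""
    # keep what follows the last [/INST]; fall back to the whole string if absent
    completion = decoded.rsplit(end_tag, 1)[-1]
    return completion.replace(eos, "").strip()

sample = "[INST] generate a midjourney prompt for A rainbow [/INST] A rainbow over a desert.</s>"
print(extract_completion(sample))  # A rainbow over a desert.
```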
#prompt
text = ("[INST] generate a midjourney prompt in less than 20 words for A rainbow "
        "chasing its colors [/INST]")
#encode the prompt and move the input tensors to the GPU
encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
model_input = encoded.to(device)
#generate and decode (the model is already placed on the GPU by device_map='auto')
generated_ids = model.generate(**model_input, max_new_tokens=200, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print('\n\n')
print(decoded[0])
Model Response: A rainbow chasing colors finds itself in a desert where the sky is a sea of endless blue, and the colors of the rainbow are scattered in the sand.
Mistral 7B has proved to be a significant advancement in the field of Large Language Models. Its efficient architecture, combined with its superior performance, showcases its potential to be a staple for various NLP tasks in the future. This blog provides insights into the model’s architecture, its application, and how one can harness its power for specific tasks like translation, summarization, and fine-tuning for other applications. With the right guidance and experimentation, Mistral 7B could redefine the boundaries of what’s possible with LLMs.
A. Mistral-7B is designed for efficiency and performance. While it has fewer parameters than some other models, its architectural advancements, such as the Sliding Window Attention, allow it to deliver outstanding results, even outperforming larger models in specific tasks.
A. Yes, Mistral-7B can be fine-tuned for various tasks. The guide provides an example of fine-tuning the model to convert short text descriptions into detailed prompts for image generation.
A. The Sliding Window Attention (SWA) allows the model to handle longer sequences efficiently. With a window size of 4096, SWA optimizes attention operations, enabling Mistral-7B to process lengthy texts without compromising on speed or accuracy.
A. Yes, when running Mistral-7B inference, we recommend using the ctransformers library, especially when working within Google Colab. You can also load the model directly from Hugging Face for added convenience.
A. It’s crucial to craft detailed instructions in the input prompt. Mistral-7B’s versatility enables it to understand and follow these detailed instructions, ensuring accurate and desired outputs. Proper prompt engineering can significantly enhance the model’s performance.