The Phi model from Microsoft has been at the forefront of many open-source Large Language Models. The Phi architecture has given rise to many of the popular small open-source models we see today, including Phixtral, Phi-DPO, and others. The Phi family has taken LLM architecture a step forward by introducing Small Language Models and showing that they are capable of handling a wide range of tasks. Now Microsoft has unveiled Phi-3, the next generation of Phi models, which improves further on the previous generation. In this article, we will go through Phi-3 and test it with different prompts.
This article was published as a part of the Data Science Blogathon.
Microsoft recently released Phi-3, showcasing its commitment to open source in the field of Artificial Intelligence. Microsoft has released two variants of Phi-3: one with a 4k context length and the other with a 128k context length. Both share the same architecture and have 3.8 billion parameters; this model is called Phi-3-mini. Microsoft has also announced two larger variants, a 7 billion parameter Phi-3-small and a 14 billion parameter Phi-3-medium, though these are still in training. All Phi-3 models come with instruct versions and are therefore ready to be deployed in chat applications.
Coming to the benchmarks, Phi-3-mini, i.e. the 3.8 billion parameter model, has overtaken Google's Gemma 7B. It scores 68.8 on MMLU and 76.7 on HellaSwag, exceeding Gemma 7B (63.6 on MMLU and 49.8 on HellaSwag) and even the Mistral 7B model (61.7 on MMLU and 58.5 on HellaSwag). Phi-3-mini has even surpassed the recently released Llama 3 8B model on both of these benchmarks.
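To make the gaps in the paragraph above concrete, the short Python sketch below tabulates the MMLU and HellaSwag scores quoted there and computes Phi-3-mini's lead over each model. The `SCORES` dictionary and the `margins` helper are illustrative names, not part of any library, and only the numbers cited in this article are included.

```python
# Benchmark scores as quoted in this article (MMLU, HellaSwag).
SCORES = {
    "Phi-3-mini (3.8B)": {"MMLU": 68.8, "HellaSwag": 76.7},
    "Gemma 7B": {"MMLU": 63.6, "HellaSwag": 49.8},
    "Mistral 7B": {"MMLU": 61.7, "HellaSwag": 58.5},
}


def margins(reference: str, scores: dict) -> dict:
    """Return how many points `reference` leads each other model by, per benchmark."""
    ref = scores[reference]
    return {
        model: {bench: round(ref[bench] - value, 1) for bench, value in results.items()}
        for model, results in scores.items()
        if model != reference
    }


if __name__ == "__main__":
    for model, diff in margins("Phi-3-mini (3.8B)", SCORES).items():
        print(f"Phi-3-mini vs {model}: {diff}")
```

The largest gap is on HellaSwag, where Phi-3-mini leads Gemma 7B by roughly 27 points.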
It also compares favorably with these and other models on other popular evaluations such as WinoGrande, TruthfulQA, and HumanEval. In the table below, we can compare the scores of the Phi-3 family with other popular open-source large language models.
To get started with Phi-3, we need to follow a few steps. Let us go through each step in detail.
Let’s start by installing the following libraries.
!pip install -q transformers huggingface_hub bitsandbytes accelerate
Now, before we download the model, we need to define our quantization config. This is because the full-precision model does not fit within the GPU memory of the free Google Colab tier, and even if it did, inference would be slow. So, we will quantize the model to 4-bit precision and work with the quantized model.
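To see why quantization helps, here is a back-of-the-envelope estimate of how much memory the 3.8 billion weights need at different precisions. This is a rough sketch: it counts only the weights and ignores activations, the KV cache, and quantization overhead, and the helper name `weight_memory_gb` is hypothetical. For comparison, the free Colab tier typically provides a T4 GPU with 16 GB of memory.

```python
# Rough weight-memory estimate for a 3.8B-parameter model at different precisions.
# Only the weights are counted; activations, the KV cache, and quantization
# metadata (e.g. per-block scales) are ignored, so real usage is higher.

NUM_PARAMS = 3.8e9  # Phi-3-mini parameter count quoted in this article

BITS_PER_PARAM = {
    "fp32 (full precision)": 32,
    "fp16 / bf16": 16,
    "4-bit": 4,
}


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name:>22}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

At roughly 15 GB, the full-precision weights alone would nearly exhaust such a GPU, while the 4-bit version needs only about 2 GB.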
The configuration for this quantization can be seen below:
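import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 (NF4) data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16   # run computations in bfloat16
)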
Running this code will create our quantization configuration.
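For intuition about what 4-bit quantization does to the weights, below is a simplified sketch of block-wise absmax quantization in plain Python. It is an illustration under stated assumptions, not how bitsandbytes implements NF4: NF4 maps each weight to one of 16 fixed levels derived from a normal distribution, and double quantization additionally compresses the per-block scales, both of which this sketch skips. The block size of 4 and the function names `quantize` and `dequantize` are illustrative choices.

```python
# Simplified block-wise absmax quantization to signed 4-bit integer codes.
# NOTE: this is NOT the NF4 scheme bitsandbytes uses. It only shows the shared
# idea: split weights into blocks, keep one scale per block, and store each
# weight as a small integer code.

BLOCK_SIZE = 4  # illustrative; real 4-bit schemes use larger blocks (e.g. 64)
QMAX = 7        # codes are restricted to the symmetric range [-7, 7]


def quantize(weights, block_size=BLOCK_SIZE):
    """Return (codes, scales): one integer code per weight and one scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scale = absmax / QMAX
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize(codes, scales, block_size=BLOCK_SIZE):
    """Reconstruct approximate weights by multiplying codes by their block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


if __name__ == "__main__":
    weights = [0.12, -0.50, 0.33, 0.05, 1.40, -0.70, 0.00, 0.21]
    codes, scales = quantize(weights)
    restored = dequantize(codes, scales)
    print("codes:   ", codes)
    print("restored:", [round(w, 3) for w in restored])
```

Each weight is recovered to within half a scale step of its original value, which is the precision traded away for roughly a 4x to 8x reduction in weight memory.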
Now we are ready to download the model and quantize it with the quantization configuration we just defined. The code for this is:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the random seed for reproducible sampling
torch.random.manual_seed(0)

# Download Phi-3-mini (4k context, instruct) and quantize it to 4-bit on load
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    quantization_config=config
)
# Download the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
Running this code downloads the Phi-3-mini 4k-context instruct model and quantizes it to 4-bit precision based on the configuration we provided. The tokenizer is downloaded as well.
Now let’s test Phi-3-mini. The code for this is:
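# Multi-turn conversation: the assistant turn gives the model a worked example
messages = [
    {"role": "user", "content": "A clock shows 12:00 p.m. now. How many "
     "degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "The minute hand moves 360 degrees "
     "in one hour (60 minutes). Therefore, in 15 minutes, it will move "
     "(15/60) * 360 degrees = 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand "
     "move in 15 minutes?"}
]

# Format the messages with Phi-3's chat template and convert them to token IDs;
# add_generation_prompt=True asks the template to end with the assistant's turn
model_inputs = tokenizer.apply_chat_template(messages,
                                             add_generation_prompt=True,
                                             return_tensors="pt").to("cuda")

# Sample up to 1000 new tokens
output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

# Convert the generated token IDs back into text
decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])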
This code takes the list of messages, formats them with the chat template, converts them into tokens, passes those tokens to the generate function to produce a response, and finally decodes the generated tokens back into English text.
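To show roughly what "applying the chat template" means, here is a sketch that builds a Phi-3-style prompt string by hand. It assumes the `<|user|>`, `<|assistant|>`, and `<|end|>` markers that Phi-3 uses; the exact template lives in the model's tokenizer configuration and may differ in details (for example, a leading BOS token), so `format_phi3_prompt` is a hypothetical helper for illustration only. In practice, always call `tokenizer.apply_chat_template`.

```python
# Approximate illustration of the prompt string a Phi-3-style chat template builds.
# ASSUMPTION: the real template is defined in the tokenizer config and may add
# or omit tokens; this hypothetical helper is for illustration only.


def format_phi3_prompt(messages, add_generation_prompt=True):
    """Join chat messages using <|role|> ... <|end|> markers."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Signal that the assistant should speak next
        parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [{"role": "user", "content": "How many smartphones can a human eat?"}]
    print(format_phi3_prompt(demo))
```

The trailing assistant marker is what tells the model that the next tokens it generates should be the assistant's reply.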
Running this code produced the following output.
Looking at the output, the model has answered the question correctly. It takes a detailed approach similar to chain-of-thought reasoning: it starts by describing how far the minute hand and the hour hand move per hour, then calculates the necessary intermediate result, and from there solves the actual user question.
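The arithmetic the model needs here is short enough to check directly. The minute hand completes a full 360-degree turn every 60 minutes, while the hour hand takes 12 hours, so the hour hand moves 0.5 degrees per minute and therefore 7.5 degrees in 15 minutes. The Python sketch below encodes those two rates; the constant and function names are illustrative.

```python
# Worked check of the clock question: hand speeds in degrees per minute.
MINUTE_HAND_DEG_PER_MIN = 360 / 60        # one full turn every 60 minutes
HOUR_HAND_DEG_PER_MIN = 360 / (12 * 60)   # one full turn every 12 hours


def degrees_moved(deg_per_min: float, minutes: float) -> float:
    """Angle swept by a hand moving at deg_per_min for the given minutes."""
    return deg_per_min * minutes


if __name__ == "__main__":
    print("minute hand:", degrees_moved(MINUTE_HAND_DEG_PER_MIN, 15))  # 90.0
    print("hour hand:  ", degrees_moved(HOUR_HAND_DEG_PER_MIN, 15))    # 7.5
```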
Now let’s try another question.
messages = [
    {"role": "user", "content": "If a plane crashes on the border of the "
     "United States and Canada, where do they bury the survivors?"},
]

# Same steps as before: apply the chat template, generate, and decode
model_inputs = tokenizer.apply_chat_template(messages,
                                             add_generation_prompt=True,
                                             return_tensors="pt").to("cuda")
output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)
decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])
In the above example, we asked Phi-3 a tricky question, and it was able to give a pretty convincing answer. The model spotted the catch: survivors are alive, so there is no one to bury. Let’s try another tricky question and check the generated output.
messages = [
    {"role": "user", "content": "How many smartphones can a human eat?"},
]

# Same steps as before: apply the chat template, generate, and decode
model_inputs = tokenizer.apply_chat_template(messages,
                                             add_generation_prompt=True,
                                             return_tensors="pt").to("cuda")
output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)
decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])
Here we asked Phi-3-mini another tricky question: how many smartphones can a human eat? This tests the model’s common-sense ability. Phi-3 caught the trick and pointed out that the question rests on a misunderstanding. This suggests that Phi-3-mini was trained on a high-quality dataset containing a good mix of common sense, reasoning, and math.
Phi-3 represents Microsoft’s next generation of Phi models, bringing significant advancements over Phi-2. It boasts a drastically increased context length, reaching up to 128k tokens with minimal performance impact. Additionally, Phi-3 is trained on a much larger and more comprehensive dataset compared to its predecessor. Benchmarks indicate that Phi-3 outperforms other popular models in various tasks, demonstrating its effectiveness. With its capability to handle complex questions and incorporate common sense reasoning, Phi-3 holds great promise for various applications.
A. Phi-3 models are trained on data in a specific chat template format, so it is recommended to use the same format when providing prompts or questions to the model. This template can be applied by calling the tokenizer’s apply_chat_template method.
A. Phi-3 is the next generation of Phi models from Microsoft, a family that includes Phi-3-mini, Phi-3-small, and Phi-3-medium. The mini version is a 3.8 billion parameter model, the small version a 7 billion parameter model, and the medium version a 14 billion parameter model.
A. Yes, Phi-3 models are available for free through the Hugging Face platform. Right now, only Phi-3-mini, i.e. the 3.8 billion parameter model, is available on Hugging Face. This model can also be used for commercial applications under its license.
A. Phi 3 shows promising results with common-sense reasoning. The provided examples demonstrate that Phi 3 can answer tricky questions that involve humor or logic.
A. Yes. While Phi-3-mini still uses the regular Llama 2 tokenizer with a vocabulary size of 32k, the new Phi-3-small model gets a new tokenizer whose vocabulary is extended to 100k tokens.