Phi 3 – Small Yet Powerful Models from Microsoft

Ajay Last Updated : 13 Nov, 2024


Introduction

Microsoft's Phi models have been at the forefront of open-source Large Language Models. The Phi architecture underlies several popular small open-source models, including Phixtral and Phi-DPO. The Phi family took LLM design a step forward by introducing Small Language Models and showing that they are sufficient for many tasks. Microsoft has now unveiled Phi 3, the next generation of Phi models, which improves on the previous generation. In this article, we will go through Phi 3 and test it with different prompts.

Learning Objectives

Understand the advancements in the Phi 3 model compared to previous iterations.
Learn about the different variants of the Phi 3 model.
Explore the improvements in context length and performance achieved by Phi 3.
Recognize the benchmarks where Phi 3 surpasses other popular language models.
Understand how to download, initialize, and use the Phi 3 mini model.

This article was published as a part of the Data Science Blogathon.

Phi 3 – The Next Iteration of Phi Family

Microsoft recently released Phi 3, showing its commitment to open source in the field of Artificial Intelligence. It has released two variants: one with a 4k context size and one with a 128k context size. Both share the same architecture and a size of 3.8 Billion Parameters, and together they are called the Phi 3 mini. Microsoft has also announced two larger variants, a 7 Billion version called the Phi 3 Small and a 14 Billion version called the Phi 3 Medium, though they are still in the training phase. All the Phi 3 models come in an instruct version and are thus ready to be deployed in chat applications.
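Choosing between the two mini variants mostly comes down to how many tokens the prompt and the generated answer need. The sketch below is a minimal illustration: the helper name pick_phi3_mini is hypothetical, and the exact window sizes (4,096 and 131,072 tokens) are an assumed reading of "4k" and "128k".

```python
# Hugging Face model IDs for the two Phi 3 mini context-length variants.
# The window sizes are assumptions: "4k" read as 4,096 and "128k" as 131,072 tokens.
PHI3_MINI_VARIANTS = {
    4_096: "microsoft/Phi-3-mini-4k-instruct",
    131_072: "microsoft/Phi-3-mini-128k-instruct",
}


def pick_phi3_mini(prompt_tokens: int, max_new_tokens: int = 1000) -> str:
    """Return the smallest variant whose window fits the prompt plus the generation."""
    needed = prompt_tokens + max_new_tokens
    for window in sorted(PHI3_MINI_VARIANTS):
        if needed <= window:
            return PHI3_MINI_VARIANTS[window]
    raise ValueError(f"{needed} tokens do not fit in the largest context window")


print(pick_phi3_mini(prompt_tokens=500))
print(pick_phi3_mini(prompt_tokens=20_000))
```

Short chat prompts fit comfortably in the 4k variant, which is the one used in the rest of this article.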

Unique Features

Extended Context Length: Phi 3 extends the context length from the 2k of Phi 2 up to 128k using the LongRope technique, and the default context length is doubled to 4k.
Training Data Size and Quality: Phi 3 is trained on 3.3 Trillion tokens, using larger and more heavily curated datasets than Phi 2.
Model Variants:
- Phi 3 Mini: Trained on 3.3 Trillion tokens and uses the Llama 2 tokenizer with a 32k vocabulary size.
- Phi 3 Small (7B Version): Default context length of 8k, uses the tiktoken tokenizer with a vocabulary size of 100k, and uses Grouped Query Attention with 4 Queries sharing 1 Key to reduce the memory footprint.
Model Architecture and Training: Phi 3 Small incorporates Grouped Query Attention to optimize memory usage. Training starts with Pretraining, moves to Supervised Fine-Tuning, and ends with alignment through Direct Preference Optimization to steer the model toward responsible outputs.
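To see why sharing one key/value head among four query heads reduces memory, it helps to estimate the size of the key/value cache. The sketch below is a back-of-the-envelope calculation only: the layer count, head count, head dimension, and sequence length are hypothetical values chosen for illustration, not the published Phi 3 Small configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: every layer stores one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions, for illustration only.
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

# Standard multi-head attention: one key/value head per query head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped Query Attention: 4 query heads share 1 key/value head.
gqa = kv_cache_bytes(LAYERS, QUERY_HEADS // 4, HEAD_DIM, SEQ_LEN)

print(f"Multi-head cache:    {mha / 2**30:.2f} GiB")
print(f"Grouped-query cache: {gqa / 2**30:.2f} GiB")
print(f"Reduction:           {mha // gqa}x")
```

Whatever the real dimensions are, the cache shrinks by the same factor as the number of key/value heads, which is 4 in the grouping described above.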

Benchmarks – Phi 3

Coming to the benchmarks, the Phi 3 mini, i.e. the 3.8 Billion Parameter model, has overtaken Gemma 7B from Google. It scores 68.8 on MMLU and 76.7 on HellaSwag, exceeding Gemma 7B, which scores 63.6 on MMLU and 49.8 on HellaSwag, and Mistral 7B, which scores 61.7 on MMLU and 58.5 on HellaSwag. Phi 3 mini has even surpassed the recently released Llama 3 8B model on both of these benchmarks.
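The margins implied by these numbers can be worked out with a few lines of Python. This is only a convenience sketch that uses the scores quoted in the paragraph above; it does not add any new benchmark results.

```python
# Scores as quoted in the paragraph above: (MMLU, HellaSwag).
scores = {
    "Phi 3 mini (3.8B)": (68.8, 76.7),
    "Gemma 7B": (63.6, 49.8),
    "Mistral 7B": (61.7, 58.5),
}

baseline = scores["Phi 3 mini (3.8B)"]

# Difference between Phi 3 mini and each other model, rounded to one decimal.
margins = {
    name: (round(baseline[0] - mmlu, 1), round(baseline[1] - hella, 1))
    for name, (mmlu, hella) in scores.items()
    if name != "Phi 3 mini (3.8B)"
}

for name, (mmlu_gap, hella_gap) in margins.items():
    print(f"vs {name}: +{mmlu_gap} MMLU, +{hella_gap} HellaSwag")
```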

It also surpasses these and other models on other popular evaluation tests such as WinoGrande, TruthfulQA, and HumanEval. The table below compares the scores of the Phi 3 family of models with other popular open-source large language models.

Getting Started with Phi 3

To get started with Phi-3, we need to follow a few steps. Let us dive deeper into each step.

Step1: Downloading Libraries

Let’s start by installing the following libraries.

!pip install -q transformers huggingface_hub bitsandbytes accelerate

transformers – We need this library to download the Large Language Models and work with them
huggingface_hub – We need this to log in to Hugging Face so that we can work with the official Hugging Face models
bitsandbytes – Running the 3.8 Billion model at full precision is impractical on the free GPU instance of Colab, hence we need this library to quantize the LLM to 4-bit
accelerate – We need this to place the model on the GPU when loading it and to run inference efficiently

Now, before we start downloading the model, we need to define our quantization config. This is because we cannot load the entire full precision model within the free Google Colab GPU and even if we fit it, the inference will be slow. So, we will quantize our model to 4-bit precision and then work with the model.
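A rough estimate shows how much quantization helps. The sketch below counts only the memory for the weights, using the 3.8 Billion parameter count quoted earlier; it ignores activations, the key/value cache, and the small overhead of the quantization constants, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


PHI3_MINI_PARAMS = 3.8e9  # parameter count quoted in this article

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    gib = weight_memory_gib(PHI3_MINI_PARAMS, bits)
    print(f"{label:>17}: {gib:.1f} GiB")
```

Going from 32-bit to 4-bit weights cuts the weight memory by a factor of 8, which leaves much more room on a small GPU for the rest of the inference workload.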

Step2: Defining the Quantization Config

The configuration for this quantization can be seen below:

import torch
from transformers import BitsAndBytesConfig


config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

Here we start by importing torch and the BitsAndBytesConfig class from the transformers library.
Then we create an instance of the BitsAndBytesConfig class and save it to a variable called config.
While creating this instance, we give it the following parameters.
load_in_4bit: This specifies that we want to quantize our model to 4-bit precision. This greatly reduces the size of the model.
bnb_4bit_quant_type: This specifies the type of 4-bit quantization we wish to work with. Here we go with 4-bit NormalFloat, called nf4, which generally gives better results than plain 4-bit formats.
bnb_4bit_use_double_quant: Setting this to True also quantizes the quantization constants that are internal to BitsAndBytes, which further reduces the size of the model.
bnb_4bit_compute_dtype: This specifies the datatype used when computing the forward pass through the model. For Colab, we can set it to brain float16, called bfloat16, which tends to provide better results than the regular float16.

Running this code will create our quantization configuration.
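To build some intuition for what 4-bit quantization does to the weights, here is a toy sketch in plain Python. Note the assumption: it uses simple absmax integer quantization on one small block of made-up weights, not the NF4 scheme that bitsandbytes actually applies, and the function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes using one shared scale for the block."""
    levels = 2 ** (bits - 1) - 1            # 7 for 4-bit: codes range from -7 to 7
    largest = max(abs(v) for v in values)
    scale = largest / levels if largest > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.10, 0.0, -0.70, 0.21]  # made-up example weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
```

Each weight is stored as a small integer plus one shared scale per block. That per-block scale is the kind of quantization constant that the double quantization option compresses further.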

Step3: Download the Model

Now, we are ready to download the model and quantize it with the quantization configuration defined above. The code for this will be:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the seed so that sampled generations are repeatable.
torch.random.manual_seed(0)

model_name = "microsoft/Phi-3-mini-4k-instruct"

# Download the weights and quantize them to 4-bit while loading.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    quantization_config=config,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Here we start by importing AutoModelForCausalLM and AutoTokenizer from the transformers library
Now we create a variable named model_name and assign it the name of the model that we will work with, which here is the Phi-3-mini Instruct version
Then we call AutoModelForCausalLM.from_pretrained() and pass it the model name, the device map, which places the model on the GPU, and the quantization config that we have just created
In a similar way, we create a tokenizer object with the same model name

Running this code will download the Phi-3 mini 4k context instruct LLM and quantize it to 4-bit based on the configuration that we have provided. The tokenizer is downloaded as well.

Step4: Testing Phi-3-mini

Now we will test the Phi-3-mini. For this, the code will be:

messages = [
    {"role": "user", "content": "A clock shows 12:00 p.m. now. How many "
     "degrees will the minute hand move in 15 minutes?"},
    {"role": "assistant", "content": "The minute hand moves 360 degrees "
     "in one hour (60 minutes). Therefore, in 15 minutes, it will move "
     "(15/60) * 360 degrees = 90 degrees."},
    {"role": "user", "content": "How many degrees does the hour hand "
     "move in 15 minutes?"}
]

# Format the conversation with the chat template and tokenize it.
# add_generation_prompt=True appends the assistant marker so that the
# model answers the last user message.
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])

First, we create a list of messages. This is a list of dictionaries, each containing two key-value pairs, where the keys are role and content.
The role tells whether the message is from the user or the assistant, and the content is the actual message.
Here we create a conversation about the angles swept by the hands of a clock. In the last message from the user, we ask a question about the angle covered by the hour hand.
Then we apply a chat template to this conversation. The chat template is necessary because the instruct data the model was trained on uses the same chat formatting.
We request the corresponding tensors for this conversation and move them to CUDA for faster processing.
Now model_inputs holds the token IDs for the formatted conversation.
These model_inputs are passed to the model.generate() function along with some additional parameters: max_new_tokens, the maximum number of new tokens to generate, which we set to 1000, and do_sample, which makes the model sample from the high-probability tokens instead of always picking the most likely one.
Finally, we decode the output generated by the Large Language Model to convert the tokens back to English text.
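To make the chat template step less abstract, the sketch below builds the prompt string by hand. It is an approximation that assumes the layout shown on the Phi-3 mini model card (role markers such as <|user|> and an <|end|> terminator); the helper name format_phi3_prompt is hypothetical, and the tokenizer's own apply_chat_template remains the authoritative source, since it may add other special tokens.

```python
def format_phi3_prompt(messages, add_generation_prompt=True):
    """Approximate the Phi 3 chat layout for a list of role/content messages."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Cue the model that the assistant should speak next.
        parts.append("<|assistant|>\n")
    return "".join(parts)


demo = [
    {"role": "user",
     "content": "How many degrees does the hour hand move in 15 minutes?"},
]

print(format_phi3_prompt(demo))
```

Seeing the markers spelled out makes it clearer why a raw, unformatted question can confuse an instruct model: it was trained to expect these role boundaries.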

Hence, when we run this code, it takes the list of messages, formats them by applying the chat template, converts them into tokens, passes them to the generate function to produce a response, and finally decodes the generated tokens back into English text.
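The effect of the do_sample flag used in the generate call can be illustrated with a toy example. The sketch below is not how transformers implements generation internally; it only contrasts greedy selection with random sampling over a made-up three-token vocabulary, and the function names and scores are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, do_sample, rng):
    """Pick the next token index, either by sampling or greedily."""
    probs = softmax(logits)
    if do_sample:
        # Draw one index at random, weighted by its probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy decoding: always take the most likely index.
    return max(range(len(probs)), key=probs.__getitem__)


toy_logits = [2.0, 1.5, 0.1]  # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)

greedy = [pick_token(toy_logits, False, rng) for _ in range(5)]
sampled = [pick_token(toy_logits, True, rng) for _ in range(200)]

print("greedy picks: ", greedy)
print("sampled counts:", {i: sampled.count(i) for i in range(3)})
```

Greedy decoding returns the same token every time, while sampling spreads its choices according to the probabilities. This is why running the article's code twice with do_sample=True can give differently worded answers.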

Output

Running this code produced the following output.

Looking at the generated output, the model has answered the question correctly. We see a very detailed approach similar to a chain of thought. The model starts by describing how the minute hand and the hour hand move per hour. From there, it calculates the necessary intermediate result and then goes on to solve the actual user question.
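The answer to the clock question can be checked independently with a little arithmetic. The short sketch below uses a hypothetical helper, hand_degrees, and relies only on the facts that the minute hand completes a full turn in 60 minutes and the hour hand in 12 hours.

```python
def hand_degrees(minutes_elapsed: float, minutes_per_revolution: float) -> float:
    """Degrees swept by a clock hand that covers 360 degrees per revolution."""
    return 360 * minutes_elapsed / minutes_per_revolution


minute_hand = hand_degrees(15, 60)       # one revolution per hour
hour_hand = hand_degrees(15, 12 * 60)    # one revolution per 12 hours

print(f"Minute hand: {minute_hand} degrees")
print(f"Hour hand:   {hour_hand} degrees")
```

The minute hand sweeps 90 degrees, matching the assistant turn in the conversation, and the hour hand sweeps 7.5 degrees, which is the value a correct model response should arrive at.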

Implementation with Another Question

Now let’s try with another question.

messages = [
    {"role": "user", "content": "If a plane crashes on the border of the "
     "United States and Canada, where do they bury the survivors?"},
]

# add_generation_prompt=True appends the assistant marker so that the
# model answers the user message.
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])

In the example above, we asked Phi 3 a trick question, and it gave a convincing answer. The model spotted the catch: survivors are alive, so there is no one to bury. Let’s try another trick question and check the generated output.

messages = [
    {"role": "user", "content": "How many smartphones can a human eat?"},
]

# add_generation_prompt=True appends the assistant marker so that the
# model answers the user message.
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output = model.generate(model_inputs,
                        max_new_tokens=1000,
                        do_sample=True)

decoded_output = tokenizer.batch_decode(output,
                                        skip_special_tokens=True)
print(decoded_output[0])

Here we asked Phi-3-mini another trick question: how many smartphones a human can eat. This tests the Large Language Model’s common-sense ability. Phi-3 caught the trick, responding that the question rests on a misunderstanding. This suggests that Phi-3-mini was trained on a quality dataset containing a good mixture of common sense, reasoning, and maths.

Conclusion

Phi-3 represents Microsoft’s next generation of Phi models, bringing significant advancements over Phi-2. It boasts a drastically increased context length, reaching up to 128k tokens with minimal performance impact. Additionally, Phi-3 is trained on a much larger and more comprehensive dataset compared to its predecessor. Benchmarks indicate that Phi-3 outperforms other popular models in various tasks, demonstrating its effectiveness. With its capability to handle complex questions and incorporate common sense reasoning, Phi-3 holds great promise for various applications.

Key Takeaways

Phi 3 performs well in practical scenarios, handling tricky and ambiguous questions effectively.
Model Variants: Different versions of Phi 3 include Mini (3.8B), Small (7B), and Medium (14B), providing options for various use cases.
Phi 3 surpasses other open-source models in key benchmarks like MMLU and HellaSwag.
Compared to the previous model, Phi 2, the default context size of Phi 3 is doubled to 4k, and with the LongRope method the context length is extended further to 128k with very little degradation in performance.
Phi 3 is trained on 3.3 Trillion tokens from highly curated datasets. It was then supervised fine-tuned and aligned with Direct Preference Optimization.

Frequently Asked Questions

Q1. What kind of prompts can I use with Phi 3?

A. Phi 3 models are trained on data with a specific chat template format. So, it’s recommended to use the same format when providing prompts or questions to the model. This template can be applied by calling the tokenizer’s apply_chat_template() method.

Q2. What is Phi 3 and what models are part of its family?

A. Phi 3 is the next generation of Phi models from Microsoft, a family that includes Phi 3 mini, Small, and Medium. The mini version is a 3.8 Billion Parameter model, the Small is a 7 Billion Parameter model, and the Medium is a 14 Billion Parameter model.

Q3. Can I use Phi 3 for free?

A. Yes, Phi 3 models are available for free through the Hugging Face platform. Right now only the Phi 3 mini i.e. the 3.8 Billion Parameter model is available on HuggingFace. This model can be worked with for commercial applications too, based on the given license.

Q4. How well does Phi 3 handle tricky questions?

A. Phi 3 shows promising results with common-sense reasoning. The provided examples demonstrate that Phi 3 can answer tricky questions that involve humor or logic.

Q5. Are there any changes for the tokenizers in the new Phi family of models?

A. Yes. While the Phi 3 Mini still works with the regular Llama 2 tokenizer, which has a vocabulary size of 32k, the new Phi 3 Small model gets a new tokenizer whose vocabulary size is extended to 100k tokens.

Ajay

I work as a Developer in the field of Data Science. I constantly spend time learning new things, whether related to AI, Data Science, or Cyber Security. Deep learning and machine learning are two topics that I find particularly fascinating, and Python is my preferred language for programming. Cyber Security is another field that I have started exploring recently. I have experience with large-scale data analysis, and I have a solid grasp of a variety of deep learning and machine learning approaches, including neural networks, regression models, and natural language processing. I'm eager to take on new challenges and make a meaningful contribution to the industry, so I'm constantly seeking ways to broaden and deepen my knowledge and skills in the subject.
