Phi-4: Redefining Language Models with Synthetic Data

Ketan Kumar Last Updated : 17 Dec, 2024

11 min read

The landscape of AI is evolving rapidly, and language models, particularly those designed for reasoning and problem-solving tasks, are at the heart of this revolution. One such breakthrough in AI is Phi-4, a 14-billion parameter model developed by Microsoft Research. What sets Phi-4 apart from its predecessors and other models is its innovative approach to training—especially its use of synthetic data. By prioritizing the quality of data over sheer quantity, Phi-4 demonstrates remarkable improvements in reasoning capabilities, STEM-focused question answering, and coding tasks.

In this blog, we will explore Phi-4 in detail, analyzing every component of its architecture, training process, and post-training innovations. We’ll break down its key strengths, discuss areas of improvement, and explain how it outperforms many other language models—even those much larger in size. By the end of this deep dive, you’ll understand why Phi-4 isn’t just another model, but a true leap forward in the field of natural language processing (NLP).

Learning Objectives

Learn why synthetic data is crucial for phi-4’s development and how it boosts performance in long-context tasks.
Learn how the team trains Phi-4 using diverse data sources, including synthetic and non-synthetic data, across three training stages.
Discover how phi-4’s context length increases from 4K to 16K tokens in midtraining and its impact on performance.
See how Phi-4 undergoes evaluation on real-world tasks like question answering, summarization, and retrieval-augmented generation, and compare its performance.
Get a guide on running phi-4 locally, covering technical setup, system requirements, and challenges like overfitting and data contamination.

This article was published as a part of the Data Science Blogathon.

Why Synthetic Data Matters?
Why Synthetic Data is Key for Phi-4?
How Phi-4 was Trained?
Insights from the Mid-Training Phase
Outcomes and Reflections from Post-Training
Performance on Key Benchmarks
How to Run Phi-4 Locally
Challenges: Dealing with Overfitting and Data Contamination
Conclusion
Frequently Asked Questions

Why Synthetic Data Matters?

At its core, Phi-4 is a 14-billion parameter language model developed by Microsoft Research. The model builds on the successes of previous iterations in the Phi family, such as Phi-3, but introduces several key innovations that significantly enhance its performance on reasoning-heavy tasks. Unlike many other large language models (LLMs) that rely primarily on massive amounts of organic data (like web content, books, and code repositories), Phi-4 strategically incorporates a large amount of synthetic data in its training pipeline. This focus on synthetic data, combined with other training innovations, allows Phi-4 to achieve better performance in key areas—particularly STEM-related question answering and complex problem-solving.

Why Synthetic Data is Key for Phi-4?

In the AI community, data is the lifeblood of training models. Typically, LLMs are trained using massive datasets scraped from the web or curated from books and papers. While this organic data is useful, it often contains inconsistencies, irrelevant information, or a lack of structured challenges that would push the model’s reasoning abilities. This is where synthetic data comes in.

Role of Synthetic Data in Phi-4

The team artificially generates synthetic data to meet specific training objectives, making it a highly effective tool for guiding the model’s learning process. For Phi-4, synthetic data helps build high-quality datasets that encourage strong reasoning and problem-solving abilities.

Structured Learning: Unlike organic data, which often requires models to decipher complex, indirect relationships between tokens, synthetic data allows Phi-4 to learn more systematically. For example, in math or coding tasks, the synthetic data provides clear step-by-step reasoning, making it easier for the model to follow logical progressions.
Diversity in Challenges: Synthetic data can be generated to cover a wide range of topics and skills, ensuring the model encounters various challenges. For example, Phi-4’s synthetic datasets include complex math problems, coding challenges, and scientific reasoning tasks—each designed to stretch the model’s cognitive abilities.
Alignment with Inference Contexts: One key advantage of synthetic data is that it can be generated in formats that align closely with the types of outputs the model is expected to produce during real-world interactions. This helps Phi-4 generate responses that are contextually appropriate and more aligned with user queries.

Synthetic Data Techniques in Phi-4

Phi-4’s synthetic data isn’t just randomly generated—it’s carefully crafted using a combination of advanced techniques:

Multi-agent prompting: Multiple agents (models) generate different solutions to the same problem, which are then filtered for quality and consistency. This generates diverse and nuanced examples that challenge the model’s problem-solving abilities.
Self-revision workflows: The model initially generates answers, and then critiques and refines them through iterative feedback loops. This helps improve the accuracy and reasoning in the generated responses.
Instruction reversal: For coding tasks, Phi-4 uses instruction reversal techniques. It transforms existing code snippets into problem descriptions, helping the model generate solutions effectively.

By prioritizing such techniques, Phi-4 learns to solve problems more intelligently, while also reducing biases that may arise from purely organic datasets.

How Phi-4 was Trained?

Phi-4’s impressive performance doesn’t come solely from the use of synthetic data. The model’s training curriculum is also crucial to its success. Phi-4’s creators designed a sophisticated training process that incorporates a balanced mixture of data types, including organic sources and synthetic data.

Pretraining with a Mixture of Data Sources

The phi-4 model utilizes a decoder-only transformer architecture with 14 billion parameters and initially operates with a context length of 4096 tokens. This context length is later increased to 16K tokens during a subsequent midtraining phase. The architecture shares many similarities with the phi-3-medium model but introduces several enhancements. Notably, phi-4 adopts the tiktoken tokenizer, which improves multilingual support, and has a vocabulary size of 100,352 tokens, including unused tokens. Additionally, phi-4 employs full attention across the 4K context length, a departure from the 2K sliding window approach used in phi-3-medium.

The team pretrained the model using approximately 10 trillion tokens, following a linear warm-up and decay schedule. They set the peak learning rate to 0.0003, applied a constant weight decay of 0.1, and used a global batch size of 5760. They fine-tuned hyperparameters by interpolating from shorter-duration runs and stress testing the learning rate warm-up phase to ensure model stability. After pretraining, the model underwent a brief midtraining stage to extend the original 4K context length to 16K tokens.

Since pre-trained models typically do not perform well on instruction-following tasks, the researchers chose not to rely on 0-shot evaluations, such as SIMPLE-EVALS, which require answers in a particular format. Instead, they developed a custom evaluation approach for pretraining, which combines log-likelihood assessments and few-shot prompts for various tasks. For instance, the team used log-likelihood evaluations for tasks like MMLU (5-shot), MMLU-pro, and ARCC (1-shot). Additionally, they trained the model using 1, 3, 4, and 8 few-shot examples for tasks such as TriviaQA (TQA), MBPP, MATH, and GSM8k, helping it follow the required answer formats and extract correct solutions.

Insights from the Mid-Training Phase

In the midtraining phase of phi-4, the context length is extended from the original 4K tokens to 16K tokens. During this stage, the researchers conduct a series of ablation studies to investigate how different types of data impact the model’s performance with long contexts. They compare data sources that naturally have longer contexts with synthetic data, where shorter sequences are padded to create longer ones. The results show that the model performs better when trained on data that inherently has long contexts.

The team refines their dataset by filtering out high-quality, non-synthetic data like academic papers, books, and code. They isolate samples longer than 8K tokens and give more weight to those 16K tokens or longer. New synthetic datasets are created with sequences longer than 4K tokens. The final dataset mixture contains 30% long-context data and 70% recall tokens from pretraining. To accommodate the increased context length, the team sets the rotary position encoding (RoPE) base frequency to 250K. They reduce the maximum learning rate by a factor of 10 and train the model with 250 billion tokens.

To evaluate phi-4’s ability to handle long contexts, the researchers emphasize a diverse set of real-world tasks, rather than relying solely on synthetic benchmarks like needle-in-a-haystack or RULER, which are simpler but less reflective of practical scenarios. The team selects these tasks from the HELMET [YGH+24] evaluation suite and averages the results across five runs for each category.

Evaluation Framework

The evaluation framework includes the following tasks:

Recall: The model retrieves a specific value from a randomly generated long JSON file based on a given key, measured using the SubEM metric.
RAG (Retrieval-Augmented Generation): The model answers questions based on multiple retrieved and shuffled Wikipedia documents, with datasets such as NaturalQuestions, HotpotQA, and PopQA. The final results are averaged across all datasets, evaluated with the SubEM metric.
Re-rank: In this task, the model re-ranks the top-10 documents retrieved for a given query, using the MSMARCO dataset. Performance is measured with nDCG@10.
ICL (In-Context Learning): This task tests the model’s ability to perform many-shot in-context learning on datasets like TREC coarse, TREC fine, Banking77, NLU, and CLINC150. The results are averaged across all datasets, with performance measured by the F1 score.
QA (Question Answering): The model answers questions based on lengthy documents from the NarrativeQAv2 dataset, with performance evaluated using GPT-4o scoring.
Summ (Summarization): The task involves summarizing long legal documents from the Multi-LexSum dataset, with results evaluated using GPT-4o scoring.

This comprehensive evaluation strategy thoroughly tests Phi-4’s long-context capabilities across various practical tasks. It reflects the model’s real-world applicability.

Outcomes and Reflections from Post-Training

Post-training is aimed at transforming the pretrained language model into an AI assistant that users can
safely interact with. Phi-4 align the pretrained model with one round of SFT, one round of DPO on data from our pivotal token search method and one round of DPO on full length preference pairs. The model undergoes chat fine-tuning using the standard ChatML format. An example usage template for two rounds of conversation is as follows:

Innovative Post-Training Techniques

Once pretraining is complete, Phi-4 enters a post-training phase where further fine-tuning takes place. This stage focuses on refining the model’s reasoning abilities and improving the quality of its outputs. Several post-training innovations contribute to Phi-4’s impressive performance:

Supervised Fine-Tuning: In this phase, researchers fine-tune the pretrained model with a learning rate of 10−6on a variety of data generated from high-quality data across diverse domains, including math, coding, reasoning, conversation, model identity, and safety. They also added multilingual data for 40 languages. They use around 8B tokens of data in this phase, all formatted in the chatml format.
Direct Preference Optimization: Researchers use DPO to align the model with human preferences, and also to steer the model away from unwanted behavior through pairs of desired and undesired outputs. DPO data covers chat format data, reasoning, and Responsible AI (RAI) data and improves the model in math, coding, reasoning, robustness, and safety. They did two rounds of DPO on the SFT model.
Pivotal Token Search (PTS): A novel technique developed for Phi-4, PTS identifies key tokens in a response that have a significant impact on the overall success of the model’s output. This allows the model to focus on improving specific, critical tokens in its responses, ensuring greater accuracy and robustness.

Performance on Key Benchmarks

To assess Phi-4’s capabilities, it’s essential to examine its performance on standard benchmarks. Phi-4 consistently outperforms its predecessors and many larger models across several critical tasks.

STEM and Reasoning Tasks

Phi-4 shines particularly in STEM-focused question answering (such as GPQA for graduate-level questions) and mathematics competitions (MATH). Despite being smaller than models like Llama-3, Phi-4 achieves comparable or superior results on these reasoning-heavy tasks. This is a testament to the model’s effective use of synthetic data and its focus on structured, logical problem-solving.

For example, Phi-4 outperforms its teacher model, GPT-4, on many reasoning benchmarks such as GPQA and MATH, despite being a smaller model. The incorporation of high-quality synthetic data and innovative training techniques has allowed Phi-4 to surpass the capabilities of much larger models in these areas.

Coding and Technical Tasks

In coding tasks, Phi-4 also excels, outperforming models such as GPT-4 mini and Qwen 2.5. Whether it’s solving algorithmic problems in HumanEval or tackling more complex programming challenges, Phi-4’s ability to reason and apply logic effectively makes it one of the top performers in the coding space.

Safety

Phi-4 demonstrates robust safeguards against generating harmful or biased content, ensuring ethical and responsible AI interactions during benchmarking.

How to Run Phi-4 Locally

Running Phi-4 locally allows you to interact with this advanced AI model directly from your system, offering convenience and flexibility for testing or application development. Follow the steps below to set it up:

Install Ollama

Ollama is a tool that facilitates running and interacting with AI models like Phi-4. Begin by installing Ollama on your system. You can find detailed installation instructions on Ollama’s official website.

Run Phi-4 in the Command Line

Once Ollama is installed, you can run the Phi-4 model with a single command in your terminal or PowerShell:

ollama run vanilj/Phi-4

This command initializes the Phi-4 model and allows you to interact with it directly in your CLI. You can start chatting or asking questions immediately.

Integrate Phi-4 with LangChain

For more advanced use cases, such as integrating Phi-4 into a workflow or application, you can use LangChain with Ollama. LangChain provides tools for working with language models programmatically.

Install the LangChain-Ollama library:

%pip install -U langchain-ollama

Use the following Python script to run Phi-4 via LangChain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="vanilj/Phi-4")
chain = prompt | model
print(chain.invoke({"question": "Write a poem on AI?"}))

Challenges: Dealing with Overfitting and Data Contamination

No model is perfect, and Phi-4 has its own set of challenges. Overfitting is a common concern in AI development. It happens when a model becomes too specialized to training data, hurting generalization. Phi-4 tackles this by using a data decontamination process. This ensures no test data is included in training, reducing overfitting risk.

Overfitting Mitigation

By using fresh datasets, such as the November 2024 AMC-10 and AMC-12 math competitions, Phi-4 has shown that it can generalize well beyond its training set and perform excellently on new tasks. This is crucial for ensuring that Phi-4 remains a robust and reliable tool for real-world applications.

Weaknesses

Instruction Following: While Phi-4 performs well in reasoning tasks, it struggles with strict instruction-following. Tasks requiring specific formatting or complex stylistic instructions can sometimes cause the model to veer off course.
Factual Hallucinations: Phi-4 still struggles with factual accuracy in some cases, particularly in generating information about non-existent or hypothetical individuals.

Conclusion

Phi-4 is a game-changer in the world of language models. Its combination of innovative synthetic data generation, cutting-edge training techniques, and post-training refinements sets it apart from many other models. Phi-4 demonstrates that with the right approach to training, quality can trump quantity—achieving superior performance in reasoning-heavy tasks, STEM Q&A, and coding challenges, despite being smaller than many contemporary models.

Phi-4 is not without its challenges, particularly around instruction-following and factual accuracy. However, its remarkable abilities in logical reasoning and problem-solving make it a significant step forward in the AI space. As AI evolves, Phi-4’s use of synthetic data sets a model for future developments in the field. It helps push the boundaries of what’s possible with language models.

Key Takeaways

Phi-4 leverages synthetic data to prioritize quality over quantity, enhancing its reasoning, STEM question answering, and coding capabilities.
Synthetic data in Phi-4 introduces structured learning, diverse challenges, and better alignment with real-world inference contexts.
Phi-4’s training includes pretraining, midtraining with extended context lengths, and innovative post-training techniques for fine-tuning.
Midtraining expands Phi-4’s context length from 4K to 16K tokens, optimizing it for long-context tasks.
Evaluation of Phi-4 emphasizes real-world tasks like RAG, summarization, and in-context learning for practical insights.
Post-training innovations, including Supervised Fine-Tuning and Direct Preference Optimization, refine Phi-4’s reasoning and safety.
Phi-4’s architecture, coupled with advanced datasets and training techniques, sets a new benchmark in NLP for handling complex problem-solving tasks.

Frequently Asked Questions

Q1. What is phi-4 and how is it different from previous models?

A. Phi-4 is a large-scale, state-of-the-art AI model based on a decoder-only transformer architecture. Phi-4 builds on models like Phi-3-medium by increasing the context length to 16K tokens. It also introduces improved data preprocessing techniques, including tiktoken, for better multilingual support.

Q2. Why is synthetic data important for training phi-4?

A. Synthetic data plays a key role in training phi-4, as it helps the model handle long-context tasks more effectively. By combining real-world data with synthetically generated sequences, Phi-4 generalizes better across diverse scenarios. This improves its performance on tasks requiring reasoning across large datasets.

Q3. What are the key stages of phi-4’s training process?

A. Phi-4’s training involves three stages. Pretraining uses diverse data sources. Midtraining expands context length from 4K to 16K tokens. Posttraining includes fine-tuning techniques like SFT, reinforcement learning with DPO, and token sampling (PTS) from the pretraining stage.

Q4. How does phi-4 perform on real-world tasks?

A. Phi-4 excels on a wide range of real-world benchmarks, including question answering, summarization, and retrieval-augmented generation. Phi-4 excels in reasoning tasks over lengthy documents, evaluated using diverse datasets from the HELM evaluation suite.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Ketan Kumar

I'm a Data Scientist at Syngene International Limited. I have completed my Master's in Data Science from VIT AP and I have a burning passion for Generative AI. My expertise lies in building robust machine learning and NLP models for innovative projects. Currently, I'm putting this knowledge to work in drug discovery research at Syngene, exploring the potential of LLMs. Always eager to learn and delve deeper into the ever-evolving world of data science and AI!

Advanced Generative AI

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

Phi-4: Redefining Language Models with Synthetic Data

Learning Objectives

Table of contents

Why Synthetic Data Matters?

Why Synthetic Data is Key for Phi-4?

Role of Synthetic Data in Phi-4

Synthetic Data Techniques in Phi-4

How Phi-4 was Trained?

Pretraining with a Mixture of Data Sources

Insights from the Mid-Training Phase

Evaluation Framework

Outcomes and Reflections from Post-Training

Innovative Post-Training Techniques

Performance on Key Benchmarks

STEM and Reasoning Tasks

Coding and Technical Tasks

Safety

How to Run Phi-4 Locally

Install Ollama

Run Phi-4 in the Command Line

Integrate Phi-4 with LangChain

Challenges: Dealing with Overfitting and Data Contamination

Overfitting Mitigation

Weaknesses

Conclusion

Key Takeaways

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID