Tülu 3: Advancing Open Language Model Post-Training

Himanshu Ranjan | Last Updated: 10 Feb, 2025

The field of natural language processing (NLP) has seen significant advancements in the past few years, with post-training techniques playing a crucial role in refining language models. While proprietary models like OpenAI’s GPT-4 and Anthropic’s Claude lead the market, open-source alternatives often lag due to limited access to post-training data and methodologies. Tülu 3 addresses this gap by introducing a fully open-source, state-of-the-art post-training framework, incorporating novel techniques and rigorous evaluation methods. In this article, we will learn all about the Tülu 3 405B model, including its training process and how to access the chatbot.

Learning Objectives

  • Get familiar with the new open-source model – Tülu 3.
  • Understand how the model works.
  • Explore the four-stage post-training pipeline that Tülu 3 follows.
  • Learn how to access the Tülu 3 405b AI chatbot.
  • See how Tülu 3 performs in comparison to other existing models such as Llama 3.1 8B-Instruct.

This article was published as a part of the Data Science Blogathon.

What is Tülu 3?

Tülu 3 is the result of a collaboration between the Allen Institute for AI and the University of Washington, and it comes with complete transparency: the post-training datasets, methodologies, and evaluation frameworks are all released openly. Built on Llama 3.1 base models, Tülu 3 surpasses the performance of other instruct-tuned open models and even competes with closed models like GPT-4o-mini and Claude 3.5 Haiku.

Tülu 3 is designed to refine the capabilities of open-source language models across multiple skill areas, including:

  • Knowledge recall (e.g., MMLU benchmarks)
  • Reasoning (e.g., BigBenchHard, DROP)
  • Mathematics (e.g., GSM8K, MATH dataset)
  • Coding (e.g., HumanEval, CodeAlpaca)
  • Instruction following (e.g., IFEval, AlpacaEval 2)
  • Safety & compliance (e.g., Tülu 3 Safety suite)

Tülu 3 Data

Data plays a critical role in training and refining language models. Tülu 3 introduces a diverse and well-curated dataset that combines publicly available sources with synthetically generated data.

Data Sources

The dataset includes:

  • Publicly available datasets (e.g., FLAN v2, Open Assistant, No Robots, WildChat)
  • Skill-specific datasets (e.g., NuminaMath, SciRIFF, OpenMathInstruct)
  • Synthetically generated datasets using a persona-driven approach for skills like math, coding, and instruction following
  • Noncompliance & safety data (e.g., WildJailbreak, CoCoNot, WildGuardMix)

Prompt Decontamination

A crucial step in ensuring model integrity is decontaminating training datasets to prevent test set contamination. The decontamination process involves 8-gram matching, ensuring that evaluation data does not overlap with training data. Several datasets (e.g., Evol CodeAlpaca, WildChat) were filtered and re-released with decontaminated samples.
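
For intuition, here is a minimal sketch of how 8-gram overlap matching can work; the helper names, whitespace tokenization, and sample prompts are illustrative assumptions, not the actual Tülu 3 decontamination tooling:

# Illustrative 8-gram decontamination sketch (not the actual Tülu 3 tooling).
# A training prompt is dropped if any of its 8-grams also appears in an evaluation prompt.

def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_prompt, eval_index, n=8):
    return bool(ngrams(train_prompt, n) & eval_index)

# Stand-in evaluation and training prompts (hypothetical examples).
eval_prompts = ["A train travels 60 miles per hour for 3 hours, so how far does it travel in total?"]
train_prompts = [
    "Explain how gradient descent updates model weights.",
    "A train travels 60 miles per hour for 3 hours, so how far does it travel in total? Show your work.",
]

eval_index = set().union(*(ngrams(p) for p in eval_prompts))
clean = [p for p in train_prompts if not is_contaminated(p, eval_index)]
print(clean)  # only the non-overlapping prompt survives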

Training Process

Tülu 3 follows a four-stage post-training pipeline:

  1. Data Curation: Prompts are curated from various datasets and synthetically generated for specific skills. A strict decontamination process is applied to prevent contamination in evaluation benchmarks.
  2. Supervised Finetuning (SFT): SFT trains the model using high-quality instruction-following data. Data mixing experiments were conducted to optimize performance across different tasks while maintaining generalization.
  3. Preference Finetuning (DPO): DPO is applied to fine-tune models using pairwise preference data. On-policy data is generated by comparing Tülu 3 completions against outputs from other models.
  4. Reinforcement Learning with Verifiable Rewards (RLVR): A novel RL-based approach, RLVR optimizes model performance by granting a reward only when an answer can be verified as correct. This is particularly effective for tasks like math problem-solving and precise instruction following; a minimal sketch of such a reward check is shown after this list.
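
Below is a minimal sketch of what a verifiable reward check can look like for math-style prompts; the answer-extraction heuristic and the binary reward value are illustrative assumptions, not the actual Tülu 3 implementation:

import re

# Illustrative verifiable-reward sketch for RLVR-style training (assumed structure,
# not the Tülu 3 code). The reward is binary: 1.0 if the completion's final answer
# matches the ground truth, otherwise 0.0.

def extract_final_answer(completion):
    # Take the last number in the completion, a common heuristic for math benchmarks.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def verifiable_reward(completion, ground_truth):
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

print(verifiable_reward("18 - 5 = 13, so the answer is 13.", "13"))  # 1.0
print(verifiable_reward("I think the answer is 12.", "13"))          # 0.0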

Evaluation Process

Tülu 3 introduces Tülu 3 Eval, a standardized and transparent evaluation framework. The evaluation suite consists of:

  • Development evaluations – Used to guide model improvement during training.
  • Unseen evaluations – Held-out tests to measure overfitting and generalization.
  • Safety evaluations – Assess compliance and robustness to adversarial prompts.

The evaluation suite is based on benchmarks like MMLU, GSM8K, BigBenchHard, HumanEval, and AlpacaEval 2. All evaluations and decontamination tools are open-sourced for reproducibility.

How to Get Started with Llama-3.1-Tulu-3-405B

Tülu 3 is an advanced instruction-following model family. Below are steps to start using the Llama-3.1-Tulu-3-405B model:

Step 1. Loading the Model with Hugging Face

To load the model with Hugging Face Transformers, use the following Python snippet:

from transformers import AutoModelForCausalLM
tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-405B")
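
The snippet above only loads the weights. To generate text you also need the matching tokenizer; a minimal sketch follows, where the prompt, sampling settings, and device placement are illustrative (in practice the 405B model requires multi-GPU or heavily quantized inference):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-405B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs accelerate; 405B spans many GPUs

messages = [{"role": "user", "content": "How are you doing?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))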

Step 2. Running with vLLM

Because it is built on a Llama base model, it can be served directly with vLLM:

vllm serve allenai/Llama-3.1-Tulu-3-405B --max_model_len=8192
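
Once the server is up, it exposes an OpenAI-compatible endpoint (on port 8000 by default), so any OpenAI-style client can query it; the prompt below is only an example:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="allenai/Llama-3.1-Tulu-3-405B",
    messages=[{"role": "user", "content": "How are you doing?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)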

Step 3. Using the Chat Template

The chat template for the model follows this format:

<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>

Or with expanded new lines:

<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
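
Rather than assembling this string by hand, you can let the tokenizer apply the template for you; here is a short sketch, assuming the chat template ships with the tokenizer on Hugging Face:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-405B")
messages = [{"role": "user", "content": "How are you doing?"}]

# tokenize=False returns the formatted prompt string instead of token ids.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)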

Results & Comparisons

Tülu 3 achieves state-of-the-art results among open-weight models, outperforming models like Llama 3.1 Instruct, Mistral, and Qwen 2.5 Instruct. At the 70B model scale, Tülu 3 even rivals Claude 3.5 Haiku and GPT-4o-mini. Key results include:

  • Tülu 3-70B surpasses Llama 3.1 70B Instruct and Nous Hermes 3
  • Tülu 3-8B outperforms Qwen 2.5 7B and Mistral 8B
  • Tülu 3-405B competes with DeepSeek V3 and GPT-4o (11-24)

Key Contributions of Tülu 3

Tülu 3 represents a major advancement in open language model post-training by introducing:

  • Open-source datasets, code, and training recipes, enabling full transparency and reproducibility.
  • Advanced decontamination strategies to prevent data leakage and ensure fair evaluations.
  • Scalable preference tuning methodology, leveraging on-policy data for better alignment.
  • Reinforcement Learning with Verifiable Rewards (RLVR), a novel RL training method that ensures correctness in verifiable tasks.
  • Robust evaluation framework, providing reproducible benchmarks and safety assessments.

Conclusion

Tülu 3 establishes a new benchmark for open-weight language models, demonstrating that open-source models can rival proprietary solutions. With full access to model weights, training code, evaluation tools, and datasets, Tülu 3 lays the foundation for future advancements in post-training research.

Future work includes scaling the methodology to larger models, improving multimodal capabilities, and further optimizing RLVR techniques. The Tülu 3 release marks a significant milestone in the open AI community, enabling further innovation and research in large-scale language model post-training.

Key Takeaways

  • Tülu 3 is an open-source post-training framework competing with proprietary models like GPT-4o-mini and Claude 3.5 Haiku.
  • It follows a four-stage post-training pipeline: Data Curation, Supervised Fine-Tuning (SFT), Preference Fine-Tuning (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR).
  • The model is trained using diverse datasets, including public sources, skill-specific data, and synthetic persona-driven data, with strict decontamination to prevent test contamination.
  • Tülu 3 outperforms several open-weight models, with the 70B version surpassing Llama 3.1 70B Instruct and Nous Hermes 3, and the 405B version competing with DeepSeek V3 and GPT-4o.
  • The project promotes full transparency by open-sourcing datasets, training code, and evaluation tools, laying the foundation for future research in open-source AI.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is Tülu 3?

A. Tülu 3 is an open-source post-training framework designed to enhance language models through supervised finetuning, preference tuning, and reinforcement learning.

Q2. How does RLVR improve model performance?

A. Reinforcement Learning with Verifiable Rewards (RLVR) optimizes models using rewards granted only for verifiably correct outputs, improving accuracy in structured tasks like mathematics and instruction-following.

Q3. Can I fine-tune Tülu 3 for my use case?

A. Yes, all datasets, model weights, and training recipes are open-source, allowing users to fine-tune Tülu 3 for specific needs.

Q4. How does Tülu 3 compare to GPT-4?

A. Tülu 3 competes closely with proprietary models like GPT-4o-mini and Claude 3.5-Haiku, achieving strong performance in various benchmarks.

Q5. Where can I access Tülu 3 models and code?

A. You can find Tülu 3 models, code, and datasets on Hugging Face and GitHub.

Hi there! I’m Himanshu, a Data Scientist at KPMG, and I have a deep passion for data, everything from crunching numbers to finding patterns that tell a story. For me, data is more than just numbers on a screen; it’s a tool for discovery and insight. I’m always excited by the possibility of what data can reveal and how it can solve real-world problems.

But it’s not just data that grabs my attention. I love exploring new things, whether that’s learning a new skill, experimenting with new technologies, or diving into topics outside my comfort zone. Curiosity drives me, and I’m always looking for fresh challenges that push me to think differently and grow. At heart, I believe there’s always more to learn, and I’m on a constant journey to expand my knowledge and perspective.
