Chinese AI labs are making steady progress in the AI race. Models like DeepSeek-V3 and Qwen 2.5 are giving tough competition to GPT-4o, Claude, and Grok. What sets these Chinese models apart? Cost efficiency, openness, and strong performance. Many are open source under commercially permissive licenses, which puts them within reach of a wide range of developers and businesses.
MiniMax-Text-01 is the latest addition to this wave of Chinese LLMs. With a 4-million-token context length, far beyond the 128K-256K tokens typical of most frontier models, it sets a new benchmark for long-context tasks. Its Hybrid Attention architecture keeps inference efficient at that scale, and its open-source, commercially permissive license lets teams build on it without the burden of hefty costs.
Let’s explore MiniMax-Text-01!
MiniMax-Text-01 combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE) to achieve a balance between efficiency and performance.
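To make the hybrid idea concrete, here is a minimal, illustrative sketch of how such a stack can be laid out. This is not MiniMax's actual implementation; the layer count and the ratio of Lightning Attention to Softmax Attention layers below are assumptions used purely for illustration.

# Illustrative layer plan for a hybrid attention stack (assumed values, not MiniMax's code)
NUM_LAYERS = 80        # hypothetical depth
SOFTMAX_EVERY = 8      # assume one softmax-attention layer per block of 8 layers

def attention_type(layer_idx: int) -> str:
    # Lightning (linear) attention scales near-linearly with sequence length,
    # so most layers use it; the periodic softmax layers keep precise global recall.
    return "softmax" if (layer_idx + 1) % SOFTMAX_EVERY == 0 else "lightning"

layer_plan = [attention_type(i) for i in range(NUM_LAYERS)]
print(layer_plan[:8])  # ['lightning', 'lightning', ..., 'softmax']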
MiniMax-Text-01 also employs a Mixture-of-Experts design that differs from models like DeepSeek-V3. The model packs 456 billion total parameters, but only around 45.9 billion are activated per token: a lightweight router sends each token to a small number of relatively large experts, whereas DeepSeek-V3 spreads its computation over many fine-grained experts. A generic routing sketch follows below.
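The gist of MoE routing is that a gate picks a handful of experts for every token, so only a fraction of the weights do any work. The snippet below is a generic top-k routing sketch, not MiniMax's router; the expert count and top_k value are illustrative.

# Generic top-k MoE routing sketch (illustrative, not MiniMax's implementation)
import numpy as np

def route_tokens(router_logits, top_k=2):
    """router_logits: (num_tokens, num_experts). Returns per-token expert ids and weights."""
    top_experts = np.argsort(router_logits, axis=-1)[:, -top_k:]             # pick the top-k experts per token
    picked = np.take_along_axis(router_logits, top_experts, axis=-1)
    weights = np.exp(picked) / np.exp(picked).sum(axis=-1, keepdims=True)    # softmax over the chosen experts
    return top_experts, weights

# Example: route 4 tokens across 8 hypothetical experts
rng = np.random.default_rng(0)
expert_ids, expert_weights = route_tokens(rng.normal(size=(4, 8)), top_k=2)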
On general knowledge and instruction-following benchmarks, MiniMax-Text-01 is competitive with the strongest proprietary and open models:

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
On reasoning benchmarks, it sits in the middle of the pack:

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| GPQA* | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
Mathematics and coding are its relatively weaker areas:

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
You can check out the other evaluation results here.
The following script sets up and runs MiniMax-Text-01 with the Hugging Face transformers library. It configures the model for a multi-GPU environment, applies int8 quantization for efficiency, and generates a response to a user-provided prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, GenerationConfig

# QuantoConfig ships with recent transformers releases and relies on the
# optimum-quanto package; fail with a clear message if it is unavailable
try:
    from transformers import QuantoConfig
except ImportError as err:
    raise ImportError(
        "QuantoConfig is not available. Upgrade transformers and install "
        "optimum-quanto to use int8 quantization."
    ) from err

# Load the Hugging Face config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)

# Quantization config (int8 recommended); the output head, embeddings,
# lightning-attention coefficients, and MoE router gates stay unquantized
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
      + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# Set device map for a multi-GPU setup
world_size = 8  # assume 8 GPUs
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}',
}
# Spread the transformer layers evenly across the GPUs
layers_per_device = hf_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01")

# Prepare the input prompt in chat format
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
if hasattr(tokenizer, 'apply_chat_template'):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    raise NotImplementedError("The tokenizer does not support 'apply_chat_template'. Update your transformers version.")

# Tokenize and move to the first device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Load the model in bfloat16 and apply the int8 quantization config across the device map
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# Generate a response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
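Because the model advertises a 4-million-token context window, the same pipeline can in principle digest very long inputs. The snippet below is a hypothetical follow-up that reuses the quantized_model and tokenizer loaded above; the file name, character cap, and prompt wording are placeholders, not from the original setup.

# Hypothetical long-context usage, reusing the model and tokenizer loaded above
# ("report.txt" and the 500_000-character cap are illustrative placeholders)
with open("report.txt", "r", encoding="utf-8") as f:
    long_text = f.read()[:500_000]

long_messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": f"Summarize the key points of this document:\n\n{long_text}"}]},
]
long_prompt = tokenizer.apply_chat_template(long_messages, tokenize=False, add_generation_prompt=True)
long_inputs = tokenizer(long_prompt, return_tensors="pt").to("cuda")

# Allow a longer completion for the summary
summary_ids = quantized_model.generate(
    **long_inputs,
    generation_config=GenerationConfig(max_new_tokens=512, eos_token_id=200020, use_cache=True),
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])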
MiniMax-Text-01 is a highly capable model with state-of-the-art performance in long-context and general-purpose tasks. While it has some areas for improvement, its open-source nature, cost efficiency, and innovative architecture make it a strong contender in the AI landscape. It’s particularly well-suited for applications requiring extensive memory and complex reasoning, but may need further refinement for coding-specific tasks.
Stay tuned to Analytics Vidhya News for more such insightful content!