4M Tokens? MiniMax-Text-01 Raises the Bar, Beating DeepSeek V3

Nitika Sharma | Last Updated: 16 Jan, 2025
5 min read

Chinese AI labs are making steady progress in the AI race. Models like DeepSeek-V3 and Qwen 2.5 are giving tough competition to GPT-4o, Claude, and Grok. What sets these Chinese models apart? Cost efficiency, openness, and strong performance. Many are open-source and available under commercially permissive licenses, making them accessible to a wide range of developers and businesses.

MiniMax-Text-01 is the latest addition to the Chinese LLMs. With a 4 million token context length—far exceeding industry standards of 128K-256K tokens—it sets a new benchmark in handling long-context tasks. The model’s Hybrid Attention architecture ensures operational efficiency, and its open-source, commercially permissive license empowers innovation without the burden of hefty costs.

Let’s explore MiniMax-Text-01!

Hybrid Architecture

MiniMax-Text-01 combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE) to achieve a balance between efficiency and performance.

Source: MiniMax-Text-01
  • 7/8 Linear Attention (Lightning Attention-2):
    • Lightning Attention is a linear attention mechanism that reduces computational complexity from O(n²d) to O(d²n), making it highly efficient for long-context tasks.
    • The mechanism involves three steps (sketched in code after this list):
      1. Input transformation using SiLU activation.
      2. Matrix operations to compute attention scores.
      3. Normalization and scaling using RMSNorm and sigmoid.
  • 1/8 Softmax Attention:
    • Traditional attention with RoPE (Rotary Position Embedding) applied to half the attention head dimension, enabling length extrapolation without performance degradation.
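
To make the three steps above concrete, here is a minimal, self-contained PyTorch sketch of a linear-attention block in this spirit. It is not the official Lightning Attention kernel, which is block-wise, I/O-aware, and causal, and which the model interleaves with one softmax-attention layer for every seven linear-attention layers; the sigmoid output gate, the RMSNorm placement, and the non-causal form below are simplifying assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Standard RMSNorm, defined inline so the sketch has no version dependencies."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SimplifiedLinearAttention(nn.Module):
    """Illustrative linear attention following the three steps listed above.

    NOT the official MiniMax implementation: the real kernel is block-wise and
    causal; the gate and norm placement here are assumptions for illustration.
    """
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.norm = RMSNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Step 1: input transformation with SiLU.
        q, k = F.silu(q), F.silu(k)
        shape = (b, n, self.n_heads, self.d_head)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # Step 2: linear attention. K^T V is (d_head x d_head), so the cost is
        # O(n * d^2) rather than the O(n^2 * d) of softmax attention.
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        out = torch.einsum("bhnd,bhde->bhne", q, kv)
        out = out.transpose(1, 2).reshape(b, n, d)
        # Step 3: normalization plus a sigmoid gate on the output.
        out = self.norm(out) * torch.sigmoid(self.gate(x))
        return self.out(out)


# Quick shape check.
attn = SimplifiedLinearAttention(d_model=256, n_heads=8)
print(attn(torch.randn(2, 1024, 256)).shape)  # torch.Size([2, 1024, 256])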

Mixture-of-Experts (MoE) Strategy

MiniMax-Text-01 employs an MoE architecture that differs from models like DeepSeek-V3 in several respects (a minimal routing sketch follows this list):

Source: MiniMax-Text-01
  • Token Drop Strategy: Uses an auxiliary loss to balance token distribution across experts, unlike DeepSeek’s dropless strategy.
  • Global Router: Optimizes token allocation to ensure balanced workloads across expert groups.
  • Top-k Routing: Selects top-2 experts per token, compared to DeepSeek’s top-8 + 1 shared expert.
  • Expert Configuration:
    • 32 experts (vs. DeepSeek’s 256 + 1 shared).
    • Expert Hidden Dimension: 9216 (vs. DeepSeek’s 2048).
    • Total Activated Expert Hidden Dimension per Layer: 18,432 (2 × 9,216; DeepSeek reaches the same 18,432 via 9 × 2,048).
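
To ground these numbers, here is a toy top-2 router with a Switch/GShard-style auxiliary load-balancing loss. It mirrors the figures quoted in the list (32 experts, top-2 routing), but it is only a sketch: the global router, expert capacity, and token-drop bookkeeping of the real model are omitted, and the dimensions in the usage lines are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


def top2_route(x: torch.Tensor, gate: nn.Linear, n_experts: int = 32, top_k: int = 2):
    """Toy top-k routing with an auxiliary balance loss (illustrative only)."""
    logits = gate(x)                                  # (tokens, n_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # chosen experts per token

    # Auxiliary load-balancing loss: pushes the fraction of tokens routed to
    # each expert (f) and the mean router probability (p) toward uniform.
    with torch.no_grad():
        dispatch = F.one_hot(topk_idx[..., 0], n_experts).float()  # primary expert only
    f = dispatch.mean(dim=0)   # fraction of tokens per expert
    p = probs.mean(dim=0)      # mean gate probability per expert
    aux_loss = n_experts * torch.sum(f * p)

    return topk_idx, topk_probs, aux_loss


# Usage with hypothetical dimensions.
d_model, n_tokens = 1024, 4096
gate = nn.Linear(d_model, 32, bias=False)
tokens = torch.randn(n_tokens, d_model)
idx, weights, aux = top2_route(tokens, gate)
print(idx.shape, weights.shape, aux.item())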

Training and Scaling Strategies

  • Training Infrastructure:
    • Trained on ~2000 H100 GPUs using advanced parallelism techniques like Expert Tensor Parallelism (ETP) and Linear Attention Sequence Parallelism Plus (LASP+).
    • Optimized for 8-bit quantization, ensuring efficient inference on 8x80GB H100 nodes.
  • Training Data:
    • Trained on ~12 trillion tokens with a WSD-like (warmup-stable-decay) learning rate schedule.
    • Data includes a mix of high-quality and low-quality sources, with global deduplication and 4x repetition for high-quality data.
  • Long-Context Training:
    • Main training uses an 8k context length with a RoPE base of 10k, followed by three context-extension phases (written out as a schedule after this list):
      1. Phase 1: 128k context length, 5M RoPE base, 30% short and 70% medium sequences.
      2. Phase 2: 512k context length, 10M RoPE base, 35% short, 35% medium, and 30% long sequences.
      3. Phase 3: 1M context length, 10M RoPE base, 30% short, 30% medium, and 40% long sequences.
    • Linear Interpolation: Mitigates distribution shifts during context length scaling.
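
For reference, the same curriculum can be written out as a plain schedule. The values are taken directly from the phases above; the list-of-dicts structure is just one convenient way to express them, not anything from the official training code.

# Long-context curriculum described above. Context lengths are the nominal
# figures from the report (8k/128k/512k/1M); "mix" is the short/medium/long
# sequence ratio used at each stage.
LONG_CONTEXT_SCHEDULE = [
    {"stage": "main",    "context": "8k",   "rope_base": 10_000,     "mix": None},
    {"stage": "phase_1", "context": "128k", "rope_base": 5_000_000,  "mix": (0.30, 0.70, 0.00)},
    {"stage": "phase_2", "context": "512k", "rope_base": 10_000_000, "mix": (0.35, 0.35, 0.30)},
    {"stage": "phase_3", "context": "1M",   "rope_base": 10_000_000, "mix": (0.30, 0.30, 0.40)},
]

for s in LONG_CONTEXT_SCHEDULE:
    print(f"{s['stage']:>7}: context={s['context']:>4}, rope_base={s['rope_base']:,}, mix(short/med/long)={s['mix']}")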

Post-Training Optimization

  • Iterative Fine-Tuning:
    • Combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in cycles.
    • RL uses Offline DPO and Online GRPO for alignment (a generic DPO loss sketch follows this list).
  • Long-Context Fine-Tuning:
    • Short-Context SFT → Long-Context SFT → Short-Context RL → Long-Context RL.
    • This phased approach is critical for achieving superior long-context performance.
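
As a concrete reference point for the offline stage, here is the textbook DPO objective on per-response log-probabilities. This is the published DPO loss in general form, not MiniMax's exact recipe; the beta value and the dummy inputs in the usage lines are placeholders.

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard offline DPO loss over summed token log-probs per response."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Usage with dummy log-probabilities for a batch of 4 preference pairs.
lp = lambda: torch.randn(4)
loss = dpo_loss(lp(), lp(), lp(), lp())
print(loss.item())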

Key Innovations

  • DeepNorm: A post-norm architecture that scales residual connections and improves stability during training (see the sketch after this list).
  • Batch Size Warmup: Gradually increases batch size from 16M to 128M tokens to optimize training dynamics.
  • Efficient Parallelism:
    • Ring Attention: Reduces memory overhead for long sequences.
    • Padding Optimization: Minimizes wasted computation during training.
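
The DeepNorm idea is compact enough to show directly: normalize after a scaled residual sum. This is the generic published rule rather than MiniMax's configuration; the alpha value and toy sublayer below are placeholders, and the full method also scales sublayer weights at initialization, which is omitted here.

import torch
import torch.nn as nn


class DeepNormResidual(nn.Module):
    """Post-norm residual block in the DeepNorm style: LayerNorm(alpha * x + sublayer(x))."""
    def __init__(self, d_model: int, sublayer: nn.Module, alpha: float = 1.0):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.alpha = alpha  # placeholder; DeepNorm derives alpha from model depth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-norm: normalize AFTER the scaled residual sum.
        return self.norm(self.alpha * x + self.sublayer(x))


# Usage with a toy feed-forward sublayer.
d = 64
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
block = DeepNormResidual(d, ffn, alpha=1.5)
print(block(torch.randn(2, 10, d)).shape)  # torch.Size([2, 10, 64])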

Core Academic Benchmarks

Source: MiniMax-Text-01

General Tasks Benchmarks

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |

Reasoning Tasks Benchmarks

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| GPQA* | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |

Mathematics & Coding Tasks Benchmarks

| Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
Source: MiniMax-Text-01

You can check out the other evaluation results here.

Let’s Get Started with MiniMax-Text-01

This script sets up and runs the MiniMax-Text-01 language model using the Hugging Face transformers library. It includes steps to configure the model for multi-GPU environments, apply quantization for efficiency, and generate responses from a user-provided input prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, GenerationConfig

# QuantoConfig requires a recent transformers release and the quanto backend
# (pip install optimum-quanto); a hand-rolled stand-in would silently skip quantization.
from transformers import QuantoConfig

# Load Hugging Face config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)

# Quantization config (int8 recommended)
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# Set device map for multi-GPU setup
world_size = 8  # Assume 8 GPUs
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
layers_per_device = hf_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01")

# Prepare input prompt
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
if hasattr(tokenizer, 'apply_chat_template'):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    raise NotImplementedError("The tokenizer does not support 'apply_chat_template'. Check the documentation or update the tokenizer version.")

# Tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Load model with the int8 quantization config and the device map defined above
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,  # without this, the config above is never applied
    trust_remote_code=True,
    offload_buffers=True,
)

# Generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

End Note

MiniMax-Text-01 is a highly capable model with state-of-the-art performance in long-context and general-purpose tasks. While it has some areas for improvement, its open-source nature, cost efficiency, and innovative architecture make it a strong contender in the AI landscape. It’s particularly well-suited for applications requiring extensive memory and complex reasoning, but may need further refinement for coding-specific tasks.

Stay tuned to Analytics Vidhya News for more such insightful content!
