From RL to LLMs: Optimizing AI with GRPO, PPO, and DPO for Better Fine-Tuning

Neil D · Last Updated: 18 Feb, 2025

For decades, Reinforcement Learning (RL) has been the driving force behind breakthroughs in robotics, game-playing AI (AlphaGo, OpenAI Five), and control systems. RL’s strength lies in its ability to optimize decision-making by maximizing long-term rewards, making it ideal for problems that require sequential reasoning. Large language models (LLMs), however, initially relied on supervised learning, where models were fine-tuned on static datasets. This approach lacked adaptability: while LLMs could mimic human text, they struggled to align with nuanced human preferences, leading to inconsistencies in conversational AI. The introduction of Reinforcement Learning from Human Feedback (RLHF) changed everything. By integrating RL into LLM fine-tuning, models like ChatGPT, DeepSeek, Gemini, and Claude could optimize their responses based on user feedback.

However, standard PPO-based RLHF had inefficiencies, requiring expensive reward modeling and iterative training. Enter DeepSeek’s Group Relative Policy Optimization (GRPO): a breakthrough that eliminates the need for explicit reward modeling by directly optimizing preference rankings. To fully grasp the significance of GRPO, we must first explore the fundamental policy optimization techniques that power modern reinforcement learning.


Learning Objectives 

  • Understand why RL-based techniques are crucial for optimizing LLMs like ChatGPT, DeepSeek, Claude, and Gemini.
  • Learn the fundamentals of policy optimization, including PG, TRPO, and PPO.
  • Explore DPO and GRPO for preference-based LLM training without explicit reward models.
  • Compare PG, TRPO, PPO, DPO, and GRPO to determine the best approach for RL and LLM fine-tuning.
  • Gain hands-on experience with Python implementations of policy optimization algorithms.
  • Evaluate fine-tuning impact using training loss curves and probability distributions.
  • Apply DPO and GRPO to enhance LLM safety, alignment, and reliability.

This article was published as a part of the Data Science Blogathon.

Primer on Policy Optimization Techniques

Before diving into DeepSeek’s GRPO, it’s crucial to understand the policy optimization techniques that form the foundation of reinforcement learning (RL) in both traditional control tasks and LLM fine-tuning. Policy optimization refers to the process of improving an AI agent’s decision-making strategy (policy) to maximize expected rewards. While early methods like vanilla policy gradient (PG) laid the groundwork, more sophisticated techniques such as TRPO, PPO, DPO, and GRPO evolved to address issues like stability, efficiency, and preference alignment.

What is Policy Optimization?

  At its core, policy optimization is about learning the optimal policy π_θ(a∣s), which maps a state s to an action a while maximizing long-term rewards. The objective function in RL is typically formulated as:  

J(θ) = E_{τ ∼ π_θ} [ R(τ) ]

Where R(τ) is the total reward collected in a trajectory τ, and the expectation is taken over all possible trajectories following policy π_θ.

There are three major approaches to policy optimization:

1. Gradient-Based Optimization (Policy Gradient Methods)

  • These methods directly compute gradients of expected reward and update policy parameters using gradient ascent.
  • Example: REINFORCE algorithm (Vanilla Policy Gradient).
  • Pros: Simple, works with continuous and discrete actions.
  • Cons: High variance, requires tricks like baseline subtraction.

2. Trust-Region Optimization (TRPO, PPO)

  • Introduces constraints (KL divergence) to ensure policy updates are stable and not too drastic.
  • Example: TRPO ensures updates stay within a “trust region”; PPO simplifies this with clipping.
  • Pros: More stable than raw policy gradients.
  • Cons: Computationally expensive (TRPO), hyperparameter-sensitive (PPO).

3. Preference-Based Optimization (DPO, GRPO)

  • Optimizes directly from ranked human preferences instead of rewards.
  • Example: DPO learns from preferred vs. rejected responses; GRPO generalizes to groups.
  • Pros: Eliminates the need for reward models and better aligns LLMs with human intent.
  • Cons: Requires high-quality preference data.

Mathematical Foundations (Required for All Methods)

A. Markov Decision Process (MDP)

RL is typically formulated as a Markov Decision Process (MDP), represented as:

M = (S, A, P, R, γ)

where:

  • S is the state space,
  • A is the action space,
  • P(s′∣s,a) is the transition probability to state s′,
  • R(s,a) is the reward function,
  • γ is the discount factor (how much future rewards are valued).
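
To make the tuple concrete, below is a minimal sketch of a two-state toy MDP written as plain Python dictionaries. The state names, actions, probabilities, and rewards are purely illustrative.

# A toy two-state MDP: states, actions, transition probabilities, rewards, and discount factor
toy_mdp = {
    "states": ["s0", "s1"],
    "actions": ["left", "right"],
    # P[s][a] -> list of (next_state, probability) pairs
    "P": {
        "s0": {"left": [("s0", 1.0)], "right": [("s1", 0.8), ("s0", 0.2)]},
        "s1": {"left": [("s0", 1.0)], "right": [("s1", 1.0)]},
    },
    # R[s][a] -> immediate reward
    "R": {
        "s0": {"left": 0.0, "right": 1.0},
        "s1": {"left": 0.0, "right": 2.0},
    },
    "gamma": 0.99,  # discount factor
}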

  B. Expected Return J(θ)

  The Expected Return (ER) measures how much cumulative reward we expect from following policy π_θ:  

J(θ) = E_{π_θ} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ]

  where γ (0 ≤ γ ≤ 1) determines how much future rewards contribute.  
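
As a quick worked example, with γ = 0.9 and rewards r_0 = r_1 = r_2 = 1, the discounted return is G = 1 + 0.9 + 0.81 = 2.71, so each additional step into the future contributes progressively less to the objective.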

C. Policy Gradient Theorem

Policy gradient (PG) methods update the policy using gradients of expected rewards. The key equation:

∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(a∣s) · A(s,a) ]

where:

  • A(s,a) is the advantage function (how good action a is compared to average actions in state s).
  • log π_θ(a∣s) ensures we increase the probabilities of better actions.

D. Advantage Function A(s,a)

To reduce variance in gradient estimates, we use the advantage function:

A(s,a) = Q(s,a) − V(s)

where:

  • Q(s,a) is the expected return for taking action a at state s.
  • V(s) is the expected return following policy π from s.

Using A(s,a) helps make updates more stable and efficient.
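
As a minimal sketch (the numbers below are made-up estimates, not the output of a trained critic), the advantage is simply the gap between each action’s Q-value and the state value used as a baseline:

import torch

# Hypothetical Q(s, a) estimates for three actions in a single state s
q_values = torch.tensor([1.2, 0.7, 2.1])   # Q(s, a) for each action a
state_value = q_values.mean()              # a crude baseline estimate of V(s)

advantages = q_values - state_value        # A(s, a) = Q(s, a) - V(s)
print(advantages)                          # positive entries mark better-than-average actions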

Policy Gradient (PG) – The Foundation

The Policy Gradient (PG) method is the most fundamental approach to reinforcement learning. Instead of learning a value function, PG directly parameterizes the policy π_θ(a∣s) and updates it using gradient ascent. This allows learning in continuous action spaces, making it effective for tasks like robotics, game AI, and LLM fine-tuning.

However, PG methods suffer from high variance due to their reliance on sampling full trajectories. More advanced methods like TRPO, PPO, and GRPO build upon PG to improve stability.

The Policy Gradient Theorem

  The goal of policy optimization is to find policy parameters θ that maximize expected return:  

J(θ) = E_{τ ∼ π_θ} [ R(τ) ]

Using the log-derivative trick, we obtain the Policy Gradient Theorem:

∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(a∣s) · A(s,a) ]

where:

  • ∇_θ log π_θ(a∣s) is the gradient of the log-probability of taking action a.
  • A(s,a) (the advantage function) determines how much better action a is compared to others.
  • We perform gradient ascent to increase the probability of good actions.

Code Example: REINFORCE Algorithm

The REINFORCE algorithm is the simplest form of PG. It samples trajectories, computes rewards, and updates the policy parameters. Below is the main training loop (only the key function is shown to limit the scope; the full notebook is linked).

def train_policy_gradient(env, policy, optimizer, num_episodes=500, gamma=0.99):
    """Train a policy using the REINFORCE algorithm"""
    reward_history = []

    for episode in range(num_episodes):
        state, _ = env.reset()
        log_probs = []
        rewards = []
        done = False

        while not done:
            state = torch.FloatTensor(state).unsqueeze(0)
            action_probs = policy(state)
            action_dist = torch.distributions.Categorical(action_probs)
            action = action_dist.sample()

            log_probs.append(action_dist.log_prob(action))
            next_state, reward, done, _, _ = env.step(action.item())
            rewards.append(reward)
            state = next_state

        # Compute discounted rewards
        returns = []
        G = 0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)  # Normalize for stability

        # Compute policy gradient loss
        loss = []
        for log_prob, G in zip(log_probs, returns):
            loss.append(-log_prob * G)  # Gradient ascent on expected return
        loss = torch.stack(loss).sum()

        # Optimize policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        reward_history.append(sum(rewards))

    return reward_history

  🔗 Full implementation available here 
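
For context, a typical way to call this function could look like the sketch below. The PolicyNetwork class, the CartPole-v1 environment, and the hyperparameters are assumptions made here for illustration; the linked notebook contains the actual definitions.

import gymnasium as gym   # assumes the Gymnasium API (reset returns (state, info), step returns 5 values)
import torch
import torch.nn as nn
import torch.optim as optim

# A minimal policy network sketch: state -> action probabilities
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v1")
policy = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

reward_history = train_policy_gradient(env, policy, optimizer, num_episodes=500)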

Code Explanation

The train_policy_gradient function implements the REINFORCE algorithm, which optimizes policy parameters using Monte Carlo updates. Training begins by initializing the environment and iterating over multiple episodes, collecting state-action-reward trajectories. At each step of an episode, an action is sampled from the policy, executed in the environment, and its reward is stored. After the episode ends, the discounted returns are computed by iterating backwards through the reward list, so that future rewards contribute appropriately to policy updates. These returns are then normalized to reduce variance, making training more stable. The policy loss is the sum of the negative log probabilities of the taken actions, each weighted by its discounted return. Finally, the optimizer performs gradient descent on this loss, which is equivalent to gradient ascent on the expected return, reinforcing actions that led to higher rewards.

Expected Outcomes & Justification

The training plot demonstrates how the total episode rewards evolve over 500 episodes. Initially, the agent performs poorly, as seen in the low reward values in early episodes (e.g., Episode 50: 20.0). However, as training progresses, the agent learns more effective strategies, leading to higher rewards (Episode 100: 134.0, Episode 150: 229.0). The performance peaks when the agent successfully balances the pole for the maximum time, reaching 500 rewards per episode (Episode 200, 350, and 450). However, instability is evident, as seen in the sharp reward drop in Episode 250 (26.0) and Episode 500 (9.0). This behaviour arises due to the high variance of PG methods, where updates can occasionally lead to suboptimal policies before stabilizing.

Policy Gradient (REINFORCE): total reward per episode over training

The overall trend shows increasing average rewards, indicating that the policy is improving. However, fluctuations in rewards highlight the limitation of vanilla PG methods, which motivates the need for more stable techniques like TRPO and PPO.

Trust Region Policy Optimization (TRPO) 

While Policy Gradient (PG) methods like REINFORCE are effective, they suffer from high variance and instability in updates. One bad update can drastically collapse the learned policy. TRPO (Trust Region Policy Optimization) improves upon PG by ensuring updates are constrained within a trust region, preventing abrupt changes that could harm performance.

Instead of using vanilla gradient descent, TRPO solves a constrained optimization problem:

max_θ  E_t [ (π_θ(a_t∣s_t) / π_θ_old(a_t∣s_t)) · A_t ]    subject to    E_t [ KL( π_θ_old(·∣s_t) ∥ π_θ(·∣s_t) ) ] ≤ δ

This KL-divergence constraint ensures that the new policy is not too far from the previous policy, leading to more stable updates.

TRPO Algorithm & Key Mathematical Concepts

TRPO optimizes the policy using Generalized Advantage Estimation (GAE) and Conjugate Gradient Descent.

1. Generalized Advantage Estimation (GAE): Computes an advantage function to estimate how much better an action is compared to the expected return.

A_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

  where δ_t is the TD error:  

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

2. Trust Region Constraint: Ensures updates stay within a safe region using KL-divergence:

E_t [ KL( π_θ_old(·∣s_t) ∥ π_θ(·∣s_t) ) ] ≤ δ

where δ is the maximum step size.

3.  Conjugate Gradient Optimization: Instead of directly computing the inverse Hessian, TRPO uses a conjugate gradient to find the optimal update direction efficiently.

Code Example: TRPO Training Loop

Below is the main TRPO training function, where we apply trust region updates and compute the discounted rewards and advantages. (Only the key function is shown; the full notebook is linked.)

def train_trpo(env, policy, num_episodes=500, gamma=0.99):
    reward_history = []

    for episode in range(num_episodes):
        state = env.reset()
        if isinstance(state, tuple):
            state = state[0]  # Handle Gym versions that return (state, info)

        log_probs = []
        states = []
        actions = []
        rewards = []

        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            probs = policy(state_tensor)
            action_dist = torch.distributions.Categorical(probs)
            action = action_dist.sample()

            step_result = env.step(action.item())

            if len(step_result) == 5:
                next_state, reward, terminated, truncated, _ = step_result
                done = terminated or truncated  # New Gym API
            else:
                next_state, reward, done, _ = step_result  # Old Gym API

            log_probs.append(action_dist.log_prob(action))
            states.append(state_tensor)
            actions.append(action)
            rewards.append(reward)

            state = next_state

        # Compute discounted rewards and advantages
        discounted_rewards = compute_discounted_rewards(rewards, gamma)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

        # Convert lists to tensors
        states = torch.cat(states)
        actions = torch.tensor(actions)
        advantages = discounted_rewards

        # Copy old policy before updating
        old_policy = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
        old_policy.load_state_dict(policy.state_dict())

        # Apply TRPO update
        trpo_step(policy, old_policy, states, actions, advantages)

        total_episode_reward = sum(rewards)
        reward_history.append(total_episode_reward)

        if (episode + 1) % 50 == 0:
            print(f"Episode {episode+1}, Total Reward: {total_episode_reward}")

    return reward_history

  🔗 Full implementation available here 
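
The helpers referenced above (compute_discounted_rewards, PolicyNetwork, and trpo_step) live in the linked notebook. As a rough idea of the first one, a minimal sketch could look like the following; the actual trpo_step, which performs the conjugate-gradient update under the KL constraint, is considerably longer and is omitted here.

import torch

def compute_discounted_rewards(rewards, gamma=0.99):
    """Sketch: returns a tensor of discounted returns G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):   # accumulate backwards through the episode
        G = r + gamma * G
        returns.insert(0, G)
    return torch.tensor(returns, dtype=torch.float32)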

Code Explanation

The train_trpo function implements the Trust Region Policy Optimization update. The training loop initializes the environment and runs 500 episodes, collecting states, actions, and rewards for each step. The key difference from Policy Gradient (PG) is that TRPO maintains an old policy copy and updates the new policy while ensuring the update remains within a KL-divergence bound.

The advantages are computed using discounted rewards and normalized to reduce variance. Finally, conjugate gradient descent is used to determine the optimal policy step direction. Unlike standard gradient updates, TRPO restricts step size to prevent drastic policy changes, leading to more stable performance.

Expected Outcomes & Justification

The training curve for TRPO exhibits significant reward fluctuations, and the numerical results indicate that the policy does not consistently improve over time as shown below.

TRPO: total reward per episode over training

Unlike Policy Gradient (PG), which showed steady learning progress, TRPO struggles to maintain consistent improvements. Despite its theoretical advantages (trust region constraint preventing catastrophic updates), the actual results show high instability. The total rewards oscillate between low values (9-20), indicating that the agent fails to learn an optimal strategy efficiently.

This is a known issue with TRPO—it requires careful tuning of KL divergence constraints, and in many cases, the update process is computationally expensive and prone to suboptimal convergence. The reward fluctuations suggest that the agent isn’t exploiting learned knowledge effectively, reinforcing the need for a more practical and robust policy optimization method. PPO simplifies TRPO by approximating the trust region constraint using a clipped objective function, leading to faster and more efficient training. 

Proximal Policy Optimization (PPO)

TRPO ensures stable policy updates but is computationally expensive due to solving a constrained optimization problem at each step. PPO (Proximal Policy Optimization) simplifies this process by using a clipped objective function to restrict updates without requiring second-order optimization.

Instead of solving:

max_θ  E_t [ (π_θ(a_t∣s_t) / π_θ_old(a_t∣s_t)) · A_t ]    subject to    E_t [ KL( π_θ_old(·∣s_t) ∥ π_θ(·∣s_t) ) ] ≤ δ

PPO modifies the objective function by introducing a clipped surrogate loss:

L^CLIP(θ) = E_t [ min( r_t(θ) A_t,  clip( r_t(θ), 1−ϵ, 1+ϵ ) A_t ) ],    where r_t(θ) = π_θ(a_t∣s_t) / π_θ_old(a_t∣s_t)

where:

  • r_t​(θ) is the probability ratio between new and old policies.
  • A_t​ is the advantage estimate.
  • ϵ is a small constant (e.g., 0.2) that limits excessive policy updates.

This prevents overshooting updates, making PPO more computationally efficient while retaining TRPO’s stability.

PPO Algorithm & Key Mathematical Concept

1. Advantage Estimation using GAE: PPO improves TRPO by using Generalized Advantage Estimation (GAE) to compute stable gradients:  

A_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}

  where δ_t = r_t + γ V(s_{t+1}) − V(s_t).

2. Clipped Objective Function: Unlike TRPO, which enforces a strict KL constraint, PPO approximates the constraint using clipping:

L^CLIP(θ) = E_t [ min( r_t(θ) A_t,  clip( r_t(θ), 1−ϵ, 1+ϵ ) A_t ) ]

This ensures that the update does not move too far, preventing policy collapse.

3. Mini-Batch Training: Instead of updating the policy after each episode, PPO trains using mini-batches over multiple epochs, improving sample efficiency.

Code Example: PPO Training Loop

Below is the main PPO training function, where we compute advantages, apply clipped policy updates, and use mini-batches for stable learning. (Only the key function is shown; the full notebook is linked.)

def train_ppo(env, policy, optimizer, num_episodes=500, gamma=0.99, lambda_=0.95, epsilon=0.2, batch_size=32, epochs=5):
    reward_history = []

    for episode in range(num_episodes):
        state = env.reset()
        if isinstance(state, tuple):
            state = state[0]  # Handle Gym versions returning (state, info)

        log_probs = []
        values = []
        states = []
        actions = []
        rewards = []

        done = False
        while not done:
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            probs = policy(state_tensor)
            action_dist = torch.distributions.Categorical(probs)
            action = action_dist.sample()

            step_result = env.step(action.item())

            # Handle different Gym API versions
            if len(step_result) == 5:
                next_state, reward, terminated, truncated, _ = step_result
                done = terminated or truncated  # New API
            else:
                next_state, reward, done, _ = step_result  # Old API

            log_probs.append(action_dist.log_prob(action))
            states.append(state_tensor)
            actions.append(action)
            rewards.append(reward)

            state = next_state

        # Compute advantages
        values = [0] * len(rewards)  # Placeholder for value estimates (since we use policy-only PPO)
        advantages = compute_advantages(rewards, values, gamma, lambda_)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-9)  # Normalize advantages

        # Convert lists to tensors
        states = torch.cat(states)
        actions = torch.tensor(actions)
        old_log_probs = torch.tensor(log_probs)

        # PPO Training Loop
        for _ in range(epochs):
            for i in range(0, len(states), batch_size):
                batch_indices = slice(i, i + batch_size)

                new_probs = policy(states[batch_indices])
                new_action_dist = torch.distributions.Categorical(new_probs)
                new_log_probs = new_action_dist.log_prob(actions[batch_indices])

                loss = ppo_loss(old_log_probs[batch_indices], new_log_probs, advantages[batch_indices], epsilon)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        total_episode_reward = sum(rewards)
        reward_history.append(total_episode_reward)

        if (episode + 1) % 50 == 0:
            print(f"Episode {episode+1}, Total Reward: {total_episode_reward}")

    return reward_history

🔗 Full implementation available here 
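
The compute_advantages and ppo_loss helpers called above are defined in the linked notebook. A minimal sketch of what they could look like, under the same policy-only simplification used in the training loop (value estimates passed in as zeros), is shown below.

import torch

def compute_advantages(rewards, values, gamma=0.99, lambda_=0.95):
    """Sketch of GAE-style advantages; `values` may be all zeros in the policy-only variant."""
    advantages = []
    gae = 0.0
    values = list(values) + [0.0]   # bootstrap value for the terminal state
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lambda_ * gae
        advantages.insert(0, gae)
    return torch.tensor(advantages, dtype=torch.float32)

def ppo_loss(old_log_probs, new_log_probs, advantages, epsilon=0.2):
    """Sketch of the clipped surrogate objective."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta)
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)  # clip(r_t, 1-eps, 1+eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()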

Code Explanation

The train_ppo function implements Proximal Policy Optimization (PPO) using a clipped surrogate loss and mini-batch updates. Unlike TRPO, which computes trust region constraints, PPO approximates them by clipping policy updates, making it much more efficient.

  • The function begins by collecting episode trajectories (states, actions, log probabilities, and rewards).
  • Advantage estimation is computed using Generalized Advantage Estimation (GAE).
  • Mini-batches are used to update the policy over multiple epochs, improving sample efficiency.
  • Instead of a strict KL divergence constraint, PPO applies a clipped loss function to prevent destructive updates.

Expected Outcomes for PPO

The PPO training curve and numerical results show a clear improvement in policy learning over time:

PPO training curve

Key Observations:

  • Stable Improvement: The early rewards (Ep 50-100) are low, indicating the agent is still exploring.
  • Steady Progress: By Episode 200, the total reward surpasses 200, showing the agent is learning a structured policy.
  • Fluctuations Exist, But Recovery is Fast: Between Ep 300-400, rewards drop, but PPO stabilizes and quickly rebounds to peak performance (500).
  •  Final Convergence: The model reaches 500 rewards (max score) by Ep 500, confirming PPO effectively learns an optimal strategy.

Compared to TRPO, PPO exhibits:

  • Less noisy training
  • Faster convergence
  • More efficient sample utilization

These improvements validate PPO’s clipped updates and mini-batch training as a superior approach to policy learning.

PPO is excellent for reward-based learning, but it struggles with preference-based fine-tuning in applications like LLMs (e.g., ChatGPT, DeepSeek, Claude, Gemini). DPO (Direct Preference Optimization) improves upon PPO by directly learning from human preference data instead of optimizing pure rewards.

Direct Preference Optimization (DPO) – Preference Learning for LLMs

Traditional reinforcement learning (RL) techniques are designed to optimize numerical reward-based objectives. However, Large Language Models (LLMs) like ChatGPT, DeepSeek, Claude, and Gemini require fine-tuning that aligns with human preferences rather than just maximizing a reward function. This is where Direct Preference Optimization (DPO) plays a crucial role. Unlike RL-based methods like PPO, which rely on an explicitly trained reward model, DPO optimizes models directly using human feedback. By leveraging preference pairs (where one response is preferred over another), DPO enables models to learn human-like responses efficiently.

DPO eliminates the need for a separate reward model, making it a simpler and more data-driven approach compared to Reinforcement Learning from Human Feedback (RLHF). Instead of reward-based fine-tuning, DPO updates the model parameters to increase the probability of preferred responses while decreasing the probability of rejected responses. This makes the training process more stable and avoids the complexities of RL algorithms like PPO, which involve constrained policy updates and KL penalties.

The significance of DPO lies in its ability to fine-tune LLMs in a way that ensures better response alignment with human expectations. By removing explicit reward models, it prevents the instability often associated with RL-based fine-tuning. Moreover, DPO reduces the risk of harmful, misleading, or biased outputs, making LLMs safer and more reliable. This streamlined optimization process makes it a practical alternative to RL-based fine-tuning, especially when human preference data is available at scale.

The DPO Training Dataset

For DPO, we use human preference data, where each prompt has a preferred response and a rejected response.

Example Preference Dataset (Used for Fine-Tuning)

preference_data = [
    {"prompt": "What is the capital of France?",
     "preferred": "The capital of France is Paris.",
     "rejected": "France is a country in Europe."},

    {"prompt": "Who wrote Hamlet?",
     "preferred": "Hamlet was written by William Shakespeare.",
     "rejected": "Hamlet is an old book."},

    {"prompt": "Tell me a joke.",
     "preferred": "Why did the scarecrow win an award? Because he was outstanding in his field!",
     "rejected": "I don’t know any jokes."},

    {"prompt": "What is artificial intelligence?",
     "preferred": "Artificial intelligence is the simulation of human intelligence in machines.",
     "rejected": "AI is just robots."},

    {"prompt": "How to stay motivated?",
     "preferred": "Set clear goals, track progress, and reward yourself for achievements.",
     "rejected": "Just be motivated."},
]

The preferred responses are accurate, informative, and well-structured, while the rejected responses are vague, incorrect, or unhelpful.

The DPO Loss Function

DPO is formulated as a pairwise ranking problem between a preferred response and a rejected response for the same prompt. The goal is to increase the log probability of preferred responses while decreasing the probability of rejected ones.

Mathematically, the DPO objective is:

L_DPO(θ) = − E_{(x, y^+, y^−)} [ log σ( β ( log P_θ(y^+∣x) − log P_θ(y^−∣x) ) ) ]

Where:

  • y^+ is the preferred response
  • y^- is the rejected response
  • β is a scaling hyperparameter controlling preference strength
  • P_θ(y∣x) is the model’s probability of generating response y given prompt x

This is similar to logistic regression, where the model maximizes separation between preferred and rejected responses.

Code Example: Direct Preference Optimization (DPO)

DPO fine-tunes LLMs by training on human-labeled preference pairs. The core logic of DPO training involves optimizing model weights based on preferred vs. rejected responses. The function below trains a transformer-based model to increase the likelihood of preferred responses while decreasing the likelihood of rejected ones. Below is the key function for computing the DPO loss and updating the model (only the main function is shown for scope; full notebook is linked).

def dpo_loss(preferred_log_probs, rejected_log_probs, beta=0.1):
    """Computes the DPO loss: -log(sigmoid(beta * (log p_preferred - log p_rejected)))"""
    return -torch.mean(torch.nn.functional.logsigmoid(beta * (preferred_log_probs - rejected_log_probs)))

def encode_text(prompt, response):
    """Encodes the prompt + response into tokenized format with proper padding"""
    tokenizer.pad_token = tokenizer.eos_token  # Fix padding issue
    input_text = f"User: {prompt}\nAssistant: {response}"

    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        padding=True,         # Enable padding
        truncation=True,      # Truncate if too long
        max_length=512        # Set max length for safety
    )

    return inputs["input_ids"], inputs["attention_mask"]

loss_history = []  # Store loss values

optimizer = optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(10):  # Train for 10 epochs
    total_loss = 0

    for data in preference_data:
        prompt, preferred, rejected = data["prompt"], data["preferred"], data["rejected"]

        # Encode preferred and rejected responses
        pref_input_ids, pref_attention_mask = encode_text(prompt, preferred)
        rej_input_ids, rej_attention_mask = encode_text(prompt, rejected)

        # Get log probabilities from the model
        preferred_logits = model(pref_input_ids, attention_mask=pref_attention_mask).logits[:, -1, :]
        rejected_logits = model(rej_input_ids, attention_mask=rej_attention_mask).logits[:, -1, :]

        preferred_log_probs = preferred_logits.log_softmax(dim=-1)
        rejected_log_probs = rejected_logits.log_softmax(dim=-1)

        # Compute DPO loss
        loss = dpo_loss(preferred_log_probs, rejected_log_probs, beta=0.5)

        # Optimize the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    loss_history.append(total_loss)  # Store loss for visualization
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")

🔗 Full implementation available here 
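
The snippet above assumes that a causal language model, its tokenizer, and the torch/optim imports are already in scope. A minimal setup sketch, using GPT-2 purely as a placeholder (the article does not state which base model the notebook uses), might look like this:

import torch
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder checkpoint; swap in the model used in the notebook
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default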

Expected Output & Analysis

The outcomes of Direct Preference Optimization (DPO) can be analyzed from multiple angles: loss convergence, probability shifts, and qualitative response improvements. The training loss curve shows a sharp drop in the initial epochs, followed by stabilization, indicating that the model quickly learns to align with human preferences. The plateau in loss suggests that further optimization yields diminishing improvements, confirming effective preference-based fine-tuning.

DPO training loss curve

The probability shift visualization reveals that preferred responses consistently achieve higher log probabilities than rejected ones. This confirms that DPO successfully adjusts the model’s behaviour, reinforcing the correct responses while suppressing undesired ones. Some variance in probability shifts suggests that certain prompts may still require fine-tuning for optimal alignment.

DPO probability shift visualization

A direct comparison of model responses before and after DPO fine-tuning highlights clear improvements. Initially, the model fails to generate a joke, instead providing an irrelevant response. After fine-tuning, it attempts humor but still lacks coherence. This demonstrates that while DPO enhances preference alignment, additional refinements or complementary techniques may be required to generate high-quality, structured responses.

Model responses before vs. after DPO fine-tuning

Although DPO effectively tunes LLMs without an explicit reward function, it lacks the structured policy learning of reinforcement learning-based methods. This is where Group Relative Policy Optimization (GRPO) by DeepSeek comes in, combining the strengths of DPO and PPO to enhance LLM fine-tuning further. The next section explores how GRPO refines policy optimization for large-scale models.

GRPO – Group Relative Policy Optimization (DeepSeek’s Approach)

DeepSeek’s Group Relative Policy Optimization (GRPO) is an advanced preference optimization technique that extends Direct Preference Optimization (DPO) while incorporating elements from Proximal Policy Optimization (PPO). Unlike traditional policy optimization methods that operate on single preference pairs, GRPO leverages group-wise preference ranking, enabling better alignment with human feedback in large-scale LLM fine-tuning.

Traditional preference-based optimization methods, such as DPO (Direct Preference Optimization), operate on pairwise comparisons—one preferred and one rejected response. However, this approach fails to scale efficiently when optimizing on large datasets where multiple responses per prompt are ranked in order of preference. To address this limitation, DeepSeek introduced Group Relative Policy Optimization (GRPO), which allows group-based preference ranking rather than just single-pair preference updates. Instead of comparing two responses at a time, GRPO compares all ranked responses within a batch and optimizes the policy accordingly.

Mathematically, GRPO extends DPO’s reward-free optimization by defining an ordered preference ranking among multiple completions and optimizing their relative likelihoods accordingly.

Mathematical Foundation of GRPO

Since GRPO is the main focus of this blog, we will dive deep into its mathematics.

1. Expected Return in Preference Optimization

In standard reinforcement learning, the expected return of a policy π_θ is:

J(θ) = E_{π_θ} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ]

where R(s_t,a_t) is the reward at timestep t.

However, LLM fine-tuning does not operate in traditional reward-based RL. Instead, we optimize over human preferences, meaning that reward models are unnecessary.

Instead of learning a reward function, GRPO directly optimizes the model parameters to increase the likelihood of higher-ranked responses over lower-ranked ones.

2. Ranking-Based Probability Optimization

Given a set of responses r_1,r_2, …, r_n ranked in order of preference, we define a likelihood ratio:

π_θ(r_i ∣ x) / π_θ(r_j ∣ x)    for every pair (i, j) with r_i ranked above r_j

where x is the input prompt and π_θ represents the policy (LLM) parameterized by θ. The key objective is to maximize the probability of higher-ranked responses while suppressing the probability of lower-ranked ones.

To enforce relative preference constraints, GRPO optimizes the following pairwise ranking loss across all response pairs:

L_rank(θ) = − Σ_{(i, j): rank(r_i) < rank(r_j)} log σ( β ( log π_θ(r_i∣x) − log π_θ(r_j∣x) ) )

where:

  • σ(x) is the sigmoid function ensuring probability normalization
  • β is a temperature scaling parameter controlling gradient magnitude.
  • π_θ​ is the policy (LLM).
  • The sum iterates over all pairs (i, j) where r_i is ranked higher than r_j.

The KL-regularized version of GRPO adds a penalty term to prevent drastic shifts in model behaviour:

L_GRPO(θ) = L_rank(θ) + λ · D_KL( π_θ(·∣x) ∥ π_θ_old(·∣x) )

where D_KL ensures conservative updates to prevent overfitting, and λ is the KL penalty coefficient (kl_penalty in the code below).

Data for GRPO Fine-Tuning

Below is an example dataset used to fine-tune an LLM using ranked preferences:

grpo_preference_data = [
    {"prompt": "What is the capital of France?",
     "responses": [
         {"text": "The capital of France is Paris.", "rank": 1},
         {"text": "Paris is the largest city in France.", "rank": 2},
         {"text": "Paris is in France.", "rank": 3},
         {"text": "France is a country in Europe.", "rank": 4}
     ]},

    {"prompt": "Tell me a joke.",
     "responses": [
         {"text": "Why did the scarecrow win an award? Because he was outstanding 
         in his field!", "rank": 1},
         {"text": "Why did the chicken cross the road? To get to the other side.",
          "rank": 2},
         {"text": "Jokes are funny.", "rank": 3},
         {"text": "I don’t know any jokes.", "rank": 4}
     ]}
]

Each prompt has multiple responses with assigned ranks. The model learns to increase the probability of higher-ranked responses while reducing the probability of lower-ranked ones.

Code Implementation: Group-Based Preference Optimization

Below is the key function for computing the GRPO loss and updating the model (only the main function is shown for scope; the full notebook is linked). The GRPO training function processes multiple ranked responses per prompt, optimizing log-likelihood differences while enforcing KL constraints.

def deepseek_grpo_loss(log_probs, rankings, input_ids, beta=1.0, kl_penalty=0.02, epsilon=1e-6):
    """Computes DeepSeek GRPO loss with pairwise ranking and KL regularization."""
    loss_terms = []
    num_pairs = 0

    log_probs = torch.clamp(log_probs, min=-10, max=10)  # Prevent extreme values

    for i in range(len(rankings)):
        for j in range(i + 1, len(rankings)):
            if rankings[i] < rankings[j]:  # Higher-ranked response should be preferred
                prob_diff = log_probs[i] - log_probs[j]
                pairwise_loss = -torch.log(torch.sigmoid(beta * prob_diff) + epsilon)  # Avoid log(0)
                loss_terms.append(pairwise_loss)
                num_pairs += 1

    loss = torch.stack(loss_terms).mean() if num_pairs > 0 else torch.tensor(0.0, device=log_probs.device)

    # KL regularization against the frozen base model to prevent policy divergence
    with torch.no_grad():
        old_logits = base_model(input_ids).logits[:, -1, :]
        old_log_probs = old_logits.log_softmax(dim=-1)

    kl_div = torch.nn.functional.kl_div(log_probs, old_log_probs, reduction="batchmean", log_target=True)

    return loss + kl_penalty * kl_div  # Single scalar loss

Training Loop for GRPO

The training loop processes ranked responses, computes loss, and updates the model while enforcing stability constraints.

loss_history = []
num_epochs = 15

for epoch in range(num_epochs):
    total_loss = 0

    for data in grpo_preference_data:
        prompt, responses = data["prompt"], data["responses"]

        input_ids, rankings = encode_text(prompt, responses)

        logits = model(input_ids).logits[:, -1, :]
        log_probs = logits.log_softmax(dim=-1)

        loss = deepseek_grpo_loss(log_probs, rankings, input_ids)

        if torch.isnan(loss):
            print(f"Skipping update at epoch {epoch} due to NaN loss.")
            continue

        optimizer.zero_grad()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    loss_history.append(total_loss)
    scheduler.step()
    print(f"Epoch {epoch + 1}, Loss: {total_loss:.4f}")

🔗 Full implementation available here 
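
The encode_text used in this loop differs from the DPO version: it has to tokenize every ranked response for a prompt and return the rankings alongside the input IDs, and the loop also expects a frozen base_model plus an optimizer and scheduler. The real definitions are in the linked notebook; a rough sketch under those assumptions (again with GPT-2 as a placeholder checkpoint) could look like this:

import torch
import torch.optim as optim
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; the notebook may use a different checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)         # policy being fine-tuned
base_model = AutoModelForCausalLM.from_pretrained(model_name)    # frozen reference for the KL term
base_model.eval()

optimizer = optim.AdamW(model.parameters(), lr=5e-5)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

def encode_text(prompt, responses):
    """Sketch: tokenize all ranked responses for one prompt; returns (input_ids, rankings)."""
    texts = [f"User: {prompt}\nAssistant: {r['text']}" for r in responses]
    rankings = [r["rank"] for r in responses]
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    return inputs["input_ids"], rankings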

Expected Outcome and Results

The expected outcomes of GRPO fine-tuning on the LLM, based on the provided outputs, highlight improvements in model optimization and preference-based ranking.

The training loss curve shows a gradual and stable decline over 15 epochs, indicating that the model is learning effectively. Unlike conventional policy optimization methods, GRPO ensures that ranked responses improve without drastic fluctuations, suggesting smooth convergence.

DeepSeek GRPO training loss curve

The loss value distribution over epochs presents a histogram where most values concentrate around a decreasing trend, showing that GRPO efficiently optimizes the model while maintaining stable loss updates. This distribution further indicates that loss values do not exhibit large variations, preventing instability in preference ranking.

Distribution of loss values over epochs

The log probability distribution before vs. after fine-tuning provides crucial insights into the model’s response generation. The shift in probability distribution suggests that after fine-tuning, the model assigns higher confidence to preferred responses. This shift results in responses that align better with human expectations and rankings.

Log Probability Distribution before vs. after fine-tuning

Overall, the expected outcome of GRPO fine-tuning is a well-optimized model capable of generating high-quality responses ranked effectively based on preference learning. This demonstrates why GRPO is an effective alternative to traditional RL methods like PPO or DPO, offering a structured approach to optimizing LLMs without explicit reward models.

Final Model Insights: Why GRPO Excels in LLM Fine-Tuning

Unlike pairwise DPO and trust-region PPO, GRPO allows LLMs to learn from multiple ranked completions per prompt, significantly improving response quality, stability, and human alignment.

  • More scalable than pairwise methods → Learns from multiple ranked completions rather than just binary comparisons.
  • No explicit reward modeling → Unlike RLHF, GRPO fine-tunes without requiring a trained reward model.
  • KL regularization stabilizes updates → Prevents catastrophic shifts in response distribution.
  • Better generalization across prompts → Ensures the LLM produces high-quality, human-aligned responses.

With reinforcement learning playing an increasingly central role in fine-tuning LLMs, GRPO stands out as the next step in AI preference learning, setting a new standard for human-aligned language modeling.

Conclusion

Policy optimization techniques play a critical role in reinforcement learning and LLM fine-tuning. Each method—Policy Gradient (PG), Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO)—offers unique advantages and trade-offs. PG serves as the foundation but suffers from high variance, while TRPO provides stability at the cost of computational complexity. PPO, being a refined version of TRPO, balances efficiency and robustness, making it widely used in RL applications. DPO, on the other hand, optimizes LLMs directly using preference data, eliminating the need for a reward model. Finally, GRPO, as introduced by DeepSeek, enhances preference-based fine-tuning by leveraging relative ranking in a structured manner.

Below is a comparison of these LLM Optimization methods based on key aspects such as variance, stability, sample efficiency, and suitability for reinforcement learning versus LLM fine-tuning:

| Method | Variance | Stability | Sample Efficiency | Best for | Limitations |
|---|---|---|---|---|---|
| PG (REINFORCE) | High | Low | Inefficient | Simple RL problems | High variance, slow convergence |
| TRPO | Low | High | Moderate | High-stability RL tasks | Complex second-order updates, expensive |
| PPO | Medium | High | Efficient | General RL tasks, robotics, games | May require careful hyperparameter tuning |
| DPO | Low | High | High | LLM fine-tuning with human preferences | Lacks explicit reinforcement learning framework |
| GRPO | Low | High | High | Preference-based LLM fine-tuning | Newer method, requires further empirical validation |

For practitioners, the choice depends on the task at hand. If optimizing reinforcement learning agents in games or robotics, PPO is the best choice due to its balance of efficiency and performance. If high-stability optimization is required, TRPO is preferred despite its computational cost. DPO and GRPO, however, are better suited for LLM fine-tuning, with GRPO providing an even stronger optimization framework based on relative preference ranking rather than just binary preference signals.

Key Takeaways

Reinforcement learning (RL) plays a crucial role in both game-playing agents and LLM fine-tuning, but the optimization techniques vary significantly.

  • PG, TRPO, and PPO are fundamental in RL, with PPO being the most practical choice for its efficiency and performance balance.
  • DPO introduced a major shift in LLM fine-tuning by eliminating explicit reward models, making human preference alignment easier and more efficient.
  • GRPO, pioneered by DeepSeek, further refines LLM fine-tuning by optimizing for relative ranking rather than just binary comparisons, improving preference-based alignment.
  • For RL tasks, PPO remains the dominant method, while for LLM fine-tuning, DPO and GRPO are superior choices due to their ability to fine-tune models using direct preference data without RL instability.

This blog highlights how reinforcement learning and preference-based fine-tuning are converging, with new techniques like GRPO bridging the gap between structured optimization and real-world deployment of large-scale AI systems.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is the difference between PPO and DPO?

Ans. PPO (Proximal Policy Optimization) is an RL-based optimization method that improves policies while maintaining stability using a clipping mechanism. It is widely used in reinforcement learning tasks such as robotics and game-playing AI. DPO (Direct Preference Optimization), on the other hand, is designed specifically for LLM fine-tuning, directly optimizing the model based on human preferences without requiring an explicit reward model. DPO is simpler and more efficient for aligning language models with human intent.

Q2. Why is GRPO better than DPO for preference-based fine-tuning?

Ans. GRPO (Group Relative Policy Optimization) improves upon DPO by optimizing preferences in a ranked manner instead of binary preference signals. While DPO only differentiates between “preferred” and “rejected” responses, GRPO assigns relative rankings across multiple responses, capturing nuanced differences in preference. This allows LLMs to learn more refined distinctions and align better with human feedback.

Q3. When should I use TRPO over PPO?

Ans. TRPO (Trust Region Policy Optimization) should be used when strict stability constraints are required, such as in high-stakes RL environments (e.g., robotics, autonomous driving). However, it is computationally expensive due to second-order optimization. PPO (Proximal Policy Optimization) provides a more efficient and scalable alternative by approximating TRPO’s constraints using a clipping mechanism, making it the preferred choice in most RL scenarios.

Q4. Why do LLMs need preference optimization techniques like DPO and GRPO?

Ans. Traditional RL methods focus on maximizing numerical rewards, which do not always align with human expectations in language models. DPO and GRPO fine-tune LLMs based on human preference data, ensuring responses are helpful, honest, and harmless. Unlike Reinforcement Learning with Human Feedback (RLHF), these methods eliminate the need for a separate reward model, making fine-tuning more efficient and reducing potential biases from reward misalignment.

