In this post, we explore the research paper titled “DAPO: An Open-Source LLM Reinforcement Learning System at Scale”, which uncovers key insights the researchers gained while attempting to replicate the DeepSeek-R1 method.

Introduction
Large Language Models (LLMs) have made remarkable progress in complex reasoning tasks like competitive math and coding. Models such as OpenAI’s o1 and o3-mini have demonstrated impressive performance with test-time scaling. However, these models are closed-source, and OpenAI has not shared enough details to enable the research community to replicate their training methods.
This sparked a huge wave of research aimed at developing test-time scaling reasoning in LLMs, culminating in the release of DeepSeek-R1. DeepSeek released DeepSeek-R1 as open-source, accompanied by a detailed research paper explaining its training process. In particular, the DeepSeek-R1 paper democratized the idea of using large-scale reinforcement learning (RL) to enhance long-form reasoning capabilities in LLMs. However, key technical details behind DeepSeek-R1’s RL training remain elusive, making it difficult for the research community to reproduce its results.
This is where DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) steps in. The researchers behind DAPO have not only developed a high-performing RL system but also fully open-sourced the training code, dataset, and algorithm, creating a reproducible framework for large-scale RL in LLMs. DAPO achieves an impressive 50 points on the AIME 2024 benchmark, which comprises problems from a challenging high school math competition. DAPO outperforms DeepSeek-R1’s 47 points with only 50% of the training steps, while using the same Qwen2.5-32B base model. Note that this comparison is against DeepSeek-R1’s 32B model trained with large-scale RL, not the distilled DeepSeek-R1.

The GRPO Struggle
When the researchers first attempted to replicate the 32B RL-based DeepSeek-R1 results, they encountered a significant performance gap. Their initial run only achieved 30 points on the AIME 2024 benchmark, compared to the 47 points achieved by DeepSeek-R1.
In response, the researchers set out to close this gap by developing DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a new RL algorithm specifically designed to address challenges in long chain-of-thought (CoT) reasoning tasks. DAPO builds upon GRPO and incorporates several novel techniques, which we will review in this post.
Before diving in, it’s important to note that while the performance gap could be attributed to undisclosed technical details about the RL process, another potential factor is the use of different datasets for training. The researchers created and open-sourced a new dataset, DAPO-Math-17K, which may differ in structure or content from the dataset used by DeepSeek-R1. However, the focus of the paper remains on improving the RL algorithm and training process. Nevertheless, the findings are fascinating.
Introducing DAPO
Now, since DAPO is similar to GRPO, let’s look at the optimization objectives of both. Don’t worry about the math: in the following sections, we’ll explain the key insights of the paper in plain words, so you can follow along without a deep understanding of the formulas.
GRPO optimization objective:
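Reconstructed here in the paper’s notation (details may differ slightly from the exact figure), with G responses o_i sampled per question q, a per-token policy ratio r and advantage Â explained below, and a KL penalty against a reference model:

```latex
J_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\;\{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[
\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(
\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\big)
- \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
\Big)
\right]
```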

DAPO optimization objective:
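And the DAPO objective, again reconstructed in the paper’s notation, where (q, a) is a question–answer pair and G is the number of responses sampled per question: the per-response averaging becomes a single average over all tokens, the clipping bounds are decoupled, the KL term is gone, and a constraint keeps only questions whose sampled responses are neither all correct nor all incorrect:

```latex
J_{\mathrm{DAPO}}(\theta) =
\mathbb{E}_{(q,a)\sim\mathcal{D},\;\{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[
\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\big)
\right]
\quad \text{s.t.}\quad 0 < \big|\{\,o_i \mid \mathrm{is\_equivalent}(a, o_i)\,\}\big| < G
```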

Two components appear in both the GRPO and DAPO objectives: the policy ratio (r) and the advantage (A).

The Policy Ratio (r) – Confidence Change
‘r’ represents the Policy Ratio. In our case, the policy is the LLM that we train.
- The numerator is the probability of a certain response (o) by the LLM, given a question (q), after the LLM has been updated in the current training step. It depends on the parameters being updated at each training step.
- The denominator is the probability of the same response under the same LLM before the update (hence the notation “old”). This is a fixed number determined by the model’s state prior to the update.
The ratio between them reflects the change in the LLM confidence for a specific response:
- If the ratio is close to 1 (r ≈ 1), it means the new policy (the LLM) is just as confident as the old policy about the response.
- If the ratio is greater than 1 (r > 1), the new policy is more confident about the response.
- If the ratio is less than 1 (r < 1), the new policy is less confident about the response.
As an intuition, if the response is great, we’ll want r to be greater than 1, and if the response is poor, we’ll want r to be lower than 1.
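To make this concrete, here is a minimal PyTorch-style sketch of the per-token policy ratio (the function and tensor names are illustrative, not the authors’ code):

```python
import torch

def policy_ratio(new_logprobs: torch.Tensor, old_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token policy ratio r = pi_new(token) / pi_old(token).

    Both inputs hold the log-probabilities the model assigns to the tokens
    of a sampled response. The old log-probs were recorded before the
    update and are treated as constants.
    """
    return torch.exp(new_logprobs - old_logprobs.detach())

# Example: the updated policy is more confident in the first token (r > 1)
# and less confident in the second (r < 1).
old = torch.log(torch.tensor([0.20, 0.50]))
new = torch.log(torch.tensor([0.25, 0.40]))
print(policy_ratio(new, old))  # ≈ tensor([1.25, 0.80])
```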
The Advantage (A) – Response Quality
‘A’ represents the advantage of the LLM response. This is a measure of the response’s quality compared to other responses. In more detail (a short code sketch follows this list):
- Multiple responses are sampled for a given question.
- A reward is calculated for each response, either with a reward model or with a rule-based reward.
- The advantage of a response measures how high its reward is compared to the rewards of the other responses to the same question.
- Good Actions: If A > 0, it means the response is better than the average response for a specific question. Multiplying the policy ratio by a positive number encourages the model towards preferred responses.
- Poor Actions: If A < 0, the response is worse than the average. Multiplying the policy ratio by a negative number pushes the model away from dispreferred responses.
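Here is a minimal sketch of the group-relative advantage used by GRPO and DAPO, where rewards are normalized by the group’s mean and standard deviation (illustrative code, not the authors’ implementation):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantage of each response relative to the other responses sampled
    for the same question: (reward - group mean) / group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rule-based rewards for 4 sampled responses to one question:
# two correct (+1) and two incorrect (-1).
rewards = torch.tensor([1.0, 1.0, -1.0, -1.0])
print(group_advantages(rewards))  # ≈ [ 0.87,  0.87, -0.87, -0.87]
```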
DAPO Technique #1 – Clip-Higher
Understanding Clipping
To understand Clip-Higher, let’s first cover the concept of clipping. Clipping is a common reinforcement learning technique that helps to stabilize the training process by preventing large updates to the model. In simple words, we want each training step to make small, incremental improvements rather than drastic changes that could destabilize training.
Keep in mind that each training step sees only a small batch of training samples, not the full training data. We don’t want any single small batch to have a dramatic impact; gradual improvements over time lead to more stable training.
The clipping factor is determined by ε in the objective formula, restricting the policy ratio to a trust region between 1-ε and 1+ε.
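In formula form, the clipped per-token term of the objective looks roughly like this, with the ratio capped to the trust region before being multiplied by the advantage:

```latex
\min\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\Big)
```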
Understanding Clip-Higher In DAPO
The researchers found that the common default clipping value ε = 0.2 causes entropy collapse: the model becomes overly confident in a narrow set of outputs, and the resulting lack of randomness prevents it from exploring alternative solutions.
The underlying issue is that clipping favors responses that already have high probability, limiting exploration of low-probability responses. A mathematical intuition: the old response probability sits in the denominator and the new probability in the numerator. For a low-probability token (dividing by a small number), even a tiny absolute increase in probability pushes the ratio past the clipping boundary, so its growth is cut off early. A high-probability token (dividing by a larger number) can grow by much more before hitting the same boundary.
In GRPO, a single hyperparameter ε controls both the lower and upper clipping bounds. DAPO decouples them and increases the upper bound from 0.2 to 0.28, while leaving the lower bound unchanged; raising the lower bound would make it easier to push the probabilities of unlikely tokens down to 0.
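A minimal sketch of the decoupled clipping, using the paper’s values ε_low = 0.2 and ε_high = 0.28 (illustrative code, not the authors’ implementation):

```python
import torch

def clipped_objective(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """Per-token clipped objective with decoupled bounds (Clip-Higher).

    GRPO uses a symmetric range [1 - eps, 1 + eps]; DAPO raises only the
    upper bound, giving low-probability tokens more room to grow.
    """
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.minimum(ratio * advantage, clipped_ratio * advantage)
```

For a token with old probability 0.01, the upper bound now allows its probability to grow up to 0.01 × 1.28 instead of 0.01 × 1.2, which is exactly the extra head-room intended for unlikely, exploratory tokens.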
In the following figure we can see the impact of Clip-Higher. On the left chart, the purple line shows that the model’s accuracy improves with Clip-Higher compared to the baseline, which uses the default value. On the right chart, we see the entropy collapse without Clip-Higher, and how Clip-Higher avoids the entropy drop.

DAPO Technique #2 – Dynamic Sampling for Stronger Training Signals
The Problem
In GRPO and DAPO, multiple responses are sampled for each question, and the advantage of each response is calculated relative to the other sampled responses. In rule-based RL, the reward is often fixed (1 if the solution is correct, otherwise -1). So what happens if all sampled responses are correct?
The advantage of every response becomes zero, because each reward equals the group’s average reward. This removes the training signal from that question entirely.
As training progresses and the model gets stronger, this scenario becomes more common, since the model answers more questions correctly in all sampled solutions. Each training step is then driven by a smaller effective batch, containing only the questions that are not fully solved. This weakens the update signal and can increase the variance of the update magnitudes.
The Solution: Dynamic Sampling
To address this, DAPO uses dynamic sampling. If all of a question’s sampled responses are correct (or all are incorrect), the question is filtered out and a new one is sampled instead, until the batch is filled with questions that provide a useful training signal.
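A simplified sketch of this filtering logic (the helper names sample_responses and is_correct are hypothetical, not from the paper’s codebase):

```python
def build_batch(questions, batch_size, group_size, sample_responses, is_correct):
    """Fill a batch only with questions whose sampled responses are neither
    all correct nor all incorrect, so every group yields non-zero advantages."""
    batch = []
    for question in questions:
        responses = sample_responses(question, group_size)
        num_correct = sum(is_correct(question, r) for r in responses)
        # Skip questions that carry no training signal (all right or all wrong).
        if num_correct == 0 or num_correct == group_size:
            continue
        batch.append((question, responses))
        if len(batch) == batch_size:
            break
    return batch
```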
In DAPO’s objective, dynamic sampling is expressed as a constraint requiring that the number of correct responses per question be strictly between 0 and the group size (highlighted in red in the paper’s formulation):

Interestingly, the over-sampling does not slow down training. The following figure shows that, with dynamic sampling, the model reaches the same performance as without it in only about a third of the training steps!

DAPO Technique #3 – Token-Level Loss for Better Training Stability
The Problem
In reinforcement learning, while the reward is calculated at the response level, the loss is calculated at the token level. In other words, there’s no ground truth label for each token, only a reward for the entire response. Since the model generates one token at a time, the loss needs to be propagated at the token level to guide the model’s learning step by step.
GRPO calculates the loss at the sample level: the token-level losses within each response are averaged over the response length, and those per-response averages are then averaged across all sampled responses. This effectively gives equal weight to each response, regardless of its length.
However, sampled responses vary in length. Under this scheme, each token in a long response contributes less to the loss, so high-quality long responses teach their reasoning patterns more slowly, while poor-quality long responses are not penalized enough.
The Solution: Token-Level Policy Gradient Loss
DAPO addresses this by averaging the loss over all tokens of all sampled responses together, so every token carries the same weight regardless of which response it belongs to. This requires only a small change to GRPO’s objective function: the division by response length is moved to the end of the loss calculation, dividing once by the total number of tokens. In the DAPO objective function, the part representing token-level averaging is highlighted in red in the paper:
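The difference between the two averaging schemes in a minimal sketch (per_token_losses is a list with one loss tensor per sampled response; illustrative code, not the authors’ implementation):

```python
import torch

def sample_level_loss(per_token_losses: list[torch.Tensor]) -> torch.Tensor:
    """GRPO-style: average within each response, then across responses.
    Every response gets equal weight regardless of its length."""
    return torch.stack([loss.mean() for loss in per_token_losses]).mean()

def token_level_loss(per_token_losses: list[torch.Tensor]) -> torch.Tensor:
    """DAPO-style: a single average over all tokens of all responses.
    Longer responses contribute proportionally more tokens."""
    return torch.cat(per_token_losses).mean()
```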

The following figure shows how this change stabilizes training, preventing unhealthy increases in entropy (left) and response length (right).

DAPO Technique #4 – Overlong Responses Punishment
In reinforcement learning, a maximum response length is usually set, and overlong responses are truncated. When this happens, the reward for the response is negative, since the model never reached a final answer. However, the reasoning up to the truncation point may still be valid. This can confuse the model, since it may be punished for reasoning that was actually sound.
To address this, the researchers explore two approaches. The first is called Overlong Filtering, which avoids updating the model based on truncated responses. The following figure shows its contribution to accuracy improvement (left) and to training stabilization (right).

The second approach is called Soft Overlong Punishment, which extends the rule-based reward of a response with an additional reward component that gradually signals to the model that the response is too long. The model is punished only when the response exceeds a certain length, with a penalty that starts small and grows as the response gets longer.
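As a reconstruction from the paper (the exact notation may differ from the original figure), the length penalty has a piecewise form, where L_max is the maximum generation length and L_cache is the length of the soft punishment interval:

```latex
R_{\mathrm{length}}(y)=
\begin{cases}
0, & |y| \le L_{\max} - L_{\mathrm{cache}} \\
\dfrac{(L_{\max} - L_{\mathrm{cache}}) - |y|}{L_{\mathrm{cache}}}, & L_{\max} - L_{\mathrm{cache}} < |y| \le L_{\max} \\
-1, & |y| > L_{\max}
\end{cases}
```

This penalty is added on top of the rule-based correctness reward: it is 0 for responses within the budget, decreases linearly inside the punishment interval, and reaches -1 once the maximum length is exceeded.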

Main Results
The table below shows how the techniques we covered contribute to improving DAPO’s AIME 2024 score from 30 to 50, outperforming DeepSeek-R1’s RL-trained 32B model.

Reproduction of the Aha Moment
In DeepSeek-R1, the researchers describe an ‘aha moment’, where the model develops the ability to reflect on its solution during reasoning and to reevaluate its initial approach.
The following example from the paper shows the development of this behavior during reinforcement learning training with DAPO:

KL Divergence Removal

Another change we have not covered yet is the removal of the KL divergence term from the objective. The GRPO objective includes a KL divergence component that does not exist in the DAPO objective. This component keeps the trained model from drifting too far from the pretrained (reference) model. However, the paper argues that during RL for long CoT reasoning, the model’s distribution necessarily diverges significantly from the initial model, so this restriction is not needed.
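For reference, the term that GRPO includes and DAPO drops is a KL penalty against the frozen reference model, weighted by a coefficient β, roughly of the form:

```latex
-\,\beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
```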
References & Links
- Paper
- Code
- Join our newsletter to receive concise 1-minute read summaries for the papers we review – Newsletter
All credit for the research goes to the researchers who wrote the paper we covered in this post.
