Introduction
In recent years, the field of artificial intelligence (AI) has experienced rapid advancements, with Large Language Models (LLMs) widely viewed as a step toward artificial general intelligence (AGI). One remarkable model, OpenAI’s o1, introduced inference-time scaling techniques that significantly enhance reasoning capabilities, but it remains closed source.
Today, we dive into the groundbreaking research paper from DeepSeek that introduces DeepSeek-R1. The paper, titled “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”, presents a state-of-the-art, open-source reasoning model and a detailed recipe for training such models using large-scale reinforcement learning techniques.

Recap: LLMs Training Process

Before we dive into the paper itself, let’s briefly recap the training process for LLMs. Typically, LLMs undergo three main stages of training:
- Pre-training: In this stage, LLMs are pre-trained on vast amounts of text and code to learn general-purpose knowledge. This step helps the model become proficient at predicting the next token in a sequence. For example, given an input like “write a bedtime _,” the model can complete it with a reasonable word, such as “story.” However, after pre-training, the model still struggles to follow human instructions. The next stage addresses this.
- Supervised Fine-tuning: In this stage, the model is fine-tuned on an instruction dataset. Each sample from the dataset consists of an instruction-response pair, where the response is used as the label. After this stage, the model becomes better at following instructions.
- Reinforcement Learning: LLMs are further improved using feedback. A powerful method for this is Reinforcement Learning from Human Feedback (RLHF), where the model is trained based on human feedback. Gathering large-scale, high-quality human feedback, especially for complex tasks, is challenging. Therefore, another common approach is Reinforcement Learning from AI Feedback (RLAIF), where an AI model provides the feedback. For RLAIF to work effectively, a highly capable model is needed to provide accurate feedback.
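To make the pre-training stage more concrete, here is a minimal PyTorch sketch of the next-token prediction objective. The embedding and linear head stand in for a full transformer, and all sizes are illustrative, so treat this as a toy illustration rather than an actual training setup:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the next-token prediction objective used in pre-training:
# the model maps token ids to logits over the vocabulary, and the loss is the
# cross-entropy between the prediction at position t and the token at t + 1.
vocab_size, seq_len, d_model = 100, 8, 32

embedding = torch.nn.Embedding(vocab_size, d_model)   # stand-in for a full transformer
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))   # e.g. "write a bedtime story ..."
hidden = embedding(tokens)                            # (1, seq_len, d_model)
logits = lm_head(hidden)                              # (1, seq_len, vocab_size)

# Shift by one position so that position t predicts token t + 1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())
```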
Introducing the DeepSeek-R1-Zero Model

The paper we’re reviewing today eliminates, or partially eliminates, the supervised fine-tuning stage. Specifically, to train DeepSeek-R1-Zero, the first model presented in the paper, we start with a pretrained model called DeepSeek-V3-Base, which has 671 billion parameters. The supervised fine-tuning stage is completely omitted. To run reinforcement learning at a large scale, instead of using the standard reinforcement learning with human or AI feedback, a rule-based reinforcement learning method is employed.
Rule-based Reinforcement Learning

The reinforcement learning method used is called Group Relative Policy Optimization (GRPO), developed in-house at DeepSeek.
Given a model to train and an input problem, the input is fed into the model, and a group of outputs is sampled. Each output consists of a reasoning process and an answer. The GRPO method observes these sampled outputs and trains the model to generate the preferred options by calculating a reward for each output using predefined rules:
- Accuracy: One set of rules calculates an accuracy reward. For instance, in math problems with deterministic results, we can reliably check if the final answer provided by the model is correct. For code problems with predefined test cases, a compiler generates feedback based on the test cases.
- Format: Another type of rule creates format rewards. In the below figure from the paper, we can see how the model is instructed to respond, with its reasoning process within <think> tags and the answer within <answer> tags. The format reward ensures the model follows this formatting.
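To make these rules concrete, here is a toy sketch of what a format reward and an accuracy reward could look like. The tag names follow the paper’s prompt template, but the matching logic below is an illustrative assumption, not DeepSeek’s actual implementation:

```python
import re

def format_reward(output: str) -> float:
    """Reward 1.0 if the output wraps its reasoning in <think> tags and its
    final answer in <answer> tags, as the prompt template instructs."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, output.strip(), re.DOTALL) else 0.0

def accuracy_reward(output: str, reference_answer: str) -> float:
    """Reward 1.0 if the content of the <answer> tag matches the known result,
    e.g. the deterministic solution of a math problem."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

output = "<think>7 * 6 = 42</think> <answer>42</answer>"
print(format_reward(output), accuracy_reward(output, "42"))  # 1.0 1.0
```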
This rule-based mechanism, which does not rely on a neural reward model, simplifies and reduces the cost of the training process, making it feasible at large scale. Moreover, the researchers found that neural reward models can suffer from reward hacking, where the model discovers a loophole that maximizes the reward without actually achieving the intended goal.
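The “group relative” part of GRPO means each sampled output is scored against its siblings in the group rather than by a learned critic. Below is a minimal sketch of that advantage normalization only; the full GRPO objective also includes a clipped policy-gradient term and a KL penalty against a reference model, which are omitted here:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each output's reward against the group's mean and standard
    deviation, so outputs are ranked relative to their siblings rather than
    judged by a separate critic model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Rule-based rewards for a group of outputs sampled for the same problem.
rewards = [1.0, 0.0, 1.0, 0.5]
print(group_relative_advantages(rewards))
```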

DeepSeek-R1-Zero Performance Insights
Let’s now explore a few performance insights of the DeepSeek-R1-Zero model.

In the above table from the paper, we see a comparison of DeepSeek-R1-Zero and OpenAI’s o1 on reasoning-related benchmarks. Impressively, DeepSeek-R1-Zero is comparable to o1 and even surpasses it in some cases. The fascinating figure below, also from the paper, shows the improvement during training as measured on the AIME benchmark. Notably, the average pass@1 score on AIME jumps from an initial 15.6% to an impressive 71.0%, reaching a level comparable to OpenAI’s o1.
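As a side note, pass@1 here is estimated by sampling several responses per question and averaging their correctness, roughly as in this toy sketch (not the paper’s evaluation code):

```python
def pass_at_1(correct_flags: list[bool]) -> float:
    """Estimate pass@1 as the fraction of sampled responses that are correct,
    averaged over the k samples drawn for a single problem."""
    return sum(correct_flags) / len(correct_flags)

# e.g. 16 samples for one AIME problem, 11 of them correct
print(pass_at_1([True] * 11 + [False] * 5))  # 0.6875
```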

Self-Evolution Process of DeepSeek-R1-Zero

A key insight from the paper is the self-evolution process of the model, illustrated in the above figure. The x-axis shows the number of training steps, and the y-axis shows the average response length: as training progresses, the model’s responses grow longer. Through reinforcement learning, the model naturally learns to allocate more thinking time when solving reasoning tasks. Amazingly, this occurs without any external adjustments.
The ‘Aha Moment’ Phenomenon
If the above were not enough, there is another intriguing phenomenon the paper refers to as the ‘Aha moment’ of DeepSeek-R1-Zero, demonstrated in the example below. Given a math question, the model starts its reasoning process, but at a certain point it pauses, reevaluates its initial approach, and corrects itself if needed. This remarkable capability emerges naturally during the reinforcement learning training.

Training Process of the DeepSeek-R1 Model
Let’s now discuss the training process of the second model, called DeepSeek-R1. But first, why do we need a second model given the remarkable capabilities that we’ve just seen?
Why Is DeepSeek-R1 Needed?
There are two main reasons:
- Readability Issues: DeepSeek-R1-Zero’s outputs often suffer from poor readability.
- Language Mixing: DeepSeek-R1-Zero frequently mixes languages within a single response.
These issues make DeepSeek-R1-Zero less user-friendly. Interestingly, an ablation study shows that guiding the model to stick to a single language slightly degrades its performance. It’s fascinating that the model learns to express itself better by using more than one language, unlike humans, who usually stick to a single language.
Training Pipeline of DeepSeek-R1

To address these issues, DeepSeek-R1 is trained in a four-phase pipeline:
- Cold Start (Phase 1): Starting with the pre-trained model DeepSeek-V3-Base, the model undergoes supervised fine-tuning on a small dataset of results collected from DeepSeek-R1-Zero. These results were validated as high-quality and readable. This dataset contains thousands of samples, making it relatively small. Incorporating a supervised fine-tuning phase on this small, high-quality dataset helps DeepSeek-R1 mitigate the readability issues observed in the initial model.
- Reasoning Reinforcement Learning (Phase 2): This phase applies the same large-scale rule-based reinforcement learning reviewed for the previous model to enhance the model’s reasoning capabilities, focusing on tasks such as coding, math, science, and logical reasoning, where clear solutions can define the rewarding rules for the reinforcement learning process.
- Rejection Sampling and Supervised Fine-Tuning (Phase 3): In this phase, the model checkpoint from phase 2 is used to generate many samples. With rejection sampling, only correct and readable samples are retained, and a generative reward model, DeepSeek-V3, helps decide which samples to keep (a toy sketch of this filtering step follows the list). Part of DeepSeek-V3’s training data is also included in this phase. The model is then trained on the resulting dataset with supervised fine-tuning. Since the dataset goes beyond reasoning-oriented questions, this step broadens the model’s capabilities across additional domains.
- Diverse Reinforcement Learning Phase (Phase 4): This final phase covers diverse tasks. Rule-based rewards are used for tasks that allow them, such as math, while for other tasks an LLM provides feedback to align the model with human preferences.
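To give a feel for the rejection-sampling step in phase 3, here is a toy sketch. The generator and the two checks are hypothetical stand-ins for the phase-2 checkpoint, the rule-based correctness filters, and the DeepSeek-V3 judge:

```python
def rejection_sample(prompts, generate, is_correct, is_readable, samples_per_prompt=16):
    """Toy rejection-sampling loop: draw several candidate responses per prompt
    and keep only those that pass both a correctness and a readability check."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = generate(prompt)
            if is_correct(prompt, response) and is_readable(response):
                kept.append({"prompt": prompt, "response": response})
    return kept

data = rejection_sample(
    prompts=["What is 2 + 2?"],
    generate=lambda p: "<think>2 + 2 = 4</think> <answer>4</answer>",  # stand-in for the phase-2 checkpoint
    is_correct=lambda p, r: "<answer>4</answer>" in r,                 # stand-in for rule-based checks
    is_readable=lambda r: "<think>" in r,                              # stand-in for the DeepSeek-V3 judge
)
print(len(data))  # 16: every toy sample passes both checks in this example
```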
Additionally, several smaller open-source models (based on Qwen and Llama) were distilled using the dataset constructed in phase 3, offering smaller alternatives with high reasoning capabilities.
Remarkable Results of DeepSeek-R1

We conclude this review by highlighting the remarkable results of the freely available DeepSeek-R1 compared to OpenAI’s o1 model. The above figure from the paper shows that DeepSeek-R1 is not only comparable to o1 but even surpasses it on certain benchmarks.
Additionally, the distilled 32-billion-parameter model also demonstrates impressive performance, making it a viable smaller alternative with high reasoning capabilities.
References & Links
- Paper page
- GitHub page
- Join our newsletter to receive concise 1-minute read summaries for the papers we review – Newsletter
All credit for the research goes to the researchers who wrote the paper we covered in this post.
