Reinforcement Learning Archives

GDPO Explained: How NVIDIA Fixes GRPO for Multi-Reward LLM Reinforcement Learning

GDPO is NVIDIA’s solution to GRPO’s limitations in multi-reward RL for large language models. We break down the paper in this post.

Discover how reinforcement learning enables hierarchical reasoning in LLMs and how HICRA improves on GRPO.

In this post we break down Microsoft’s Reinforcement Pre-Training, which scales up reinforcement learning with next-token reasoning.

DeepSeekMath is the paper that introduced GRPO, the reinforcement learning method used in DeepSeek-R1. Dive in to understand how it works.

Explore DAPO, an innovative open-source reinforcement learning paradigm for LLMs that rivals DeepSeek-R1’s GRPO method.

Dive into SWE-RL by Meta, a DeepSeek-R1 style recipe for training LLMs for software engineering with reinforcement learning.

Dive into the groundbreaking DeepSeek-R1 research paper, which introduces open-source reasoning models that rival the performance of OpenAI’s o1!

In this post we dive into Stanford research presenting Generative Reward Models, a hybrid human and AI RL approach to improving LLMs.