Generative Reward Models: Merging the Power of RLHF and RLAIF for Smarter AI
Introduction

In recent years, we’ve witnessed tremendous progress in AI, driven largely by the rise of large language models (LLMs) such as LLaMA-3.1. To further enhance LLM capabilities, extensive research focuses on improving their training process. In this post, we review a recent research paper from Stanford University and SynthLabs that suggests a potential improvement which may significantly advance AI. The paper, titled “Generative Reward Models”, is authored by some of the same researchers behind the widely used DPO method.

LLM Training Process

Before diving into Generative Reward Models (GenRMs), let’s do a quick recap of how large language models are trained.

Pre-training Stage

LLMs are first pre-trained on a huge amount of text to learn general-purpose knowledge. This step teaches the LLM to be good at predicting the next token in a […]
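To make the pre-training objective concrete, here is a minimal sketch (not from the post) of the next-token prediction loss, assuming a PyTorch-style language model that maps token ids to logits over the vocabulary; the function name and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Cross-entropy loss for predicting each token from the ones before it.

    token_ids: LongTensor of shape (batch, seq_len)
    model:     callable returning logits of shape (batch, seq_len - 1, vocab_size)
    """
    inputs = token_ids[:, :-1]   # tokens the model conditions on
    targets = token_ids[:, 1:]   # the "next token" at every position
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions
        targets.reshape(-1),                  # flatten target ids
    )
```

Minimizing this loss over a large text corpus is, at a high level, what the pre-training stage amounts to.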