Cheating LLMs & How (Not) To Stop Them | OpenAI Paper Explained

What if the most advanced AI models are secretly finding ways to cheat the systems they’re meant to follow, finding illegitimate ways to achieve their goals?
In this video we break down a fascinating, and alarming, paper from OpenAI, titled: Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation, exploring reward hacking in top reasoning models.

Paper authors
Paper authors (Source)

Introduction

Reward hacking is a critical challenge in reinforcement learning (RL). In RL, the model outputs are rewarded based on success criteria, and the model is updated to prefer higher reward outputs. Reward hacking is the phenomenon where the model discovers a loophole or unintended way to maximize the reward, which does not align with the desired goal.

For instance, imagine a large language model (LLM) tasked with writing code, rewarded based on how many tests its code passes. Instead of genuinely solving the problem, the model might exploit glitches in the test cases to simplify the code, ensuring it passes the tests without truly solving the intended task.

This problem isn’t unique to AI. Humans also fall short in similar scenarios. The researchers illustrate this in the paper with an intriguing example from Hanoi in 1902. The government tried to control the rat population by paying citizens for each rat tail they turned in. However, it didn’t turn out as expected. People began farming rats solely for their tails, which not only failed to reduce the rat problem but made it worse.

Reward Hacking Example

Reward hacking example of a frontier OpenAI model
Reward hacking example of a frontier OpenAI model (Source)

Top reasoning LLMs today don’t simply return a final answer. Instead, they provide a detailed breakdown of their reasoning process, known as their chain-of-thought (CoT). This allows us to understand how the model arrives to its final answers.

In the above example, a frontier reasoning model caught reward hacking in the wild. The input task for this model is to make a small fix in a code repository. Since the actual chain-of-thought reasoning is long, the red-colored sections summarize parts of the reasoning process. To begin, the model evaluates where the functions it needs to fix are used and identifies some test cases.

In green, we see actual outputs of the model which are part of its original chain-of-thought. We see that the model notes that the test only checks a certain part of the code. Afterwards, the model keeps on reviewing related files.

Next, the model explicitly notes in its reasoning that implementing a genuine solution would likely be hard and that it could fudge the solution and circumvent the tests by adjusting a specific function to always return ‘true’. Then, the model proceeds to explore this idea, verifying that its approach could work.

Finally, we see the actual code change. The model removes all of the method’s original implementation and replaces it with a line that simply forces the function to always return ‘true’.

So what can we learn from this example? it shows that the model’s chain-of-thought contains information that can help us identify when it is about to cheat. The research doesn’t claim this is always the case, but it happens frequently enough to be valuable. If that’s true, could we utilize a LLM we already trust to monitor the chain-of-thought of a stronger model during training?

Reward Hacking Chain-of-Thought Monitoring

Catching reward hacking using chain-of-thought monitoring
Catching reward hacking using chain-of-thought monitoring (Source)

The above figure shows the effectiveness of reward hacking monitoring. In this case GPT-4o is used to monitor reward hacking of a stronger model.
Two types of monitors are evaluated:

  • Action monitor – has access only to the final answer of the model.
  • CoT monitor – has access to both the final answer and the chain-of-thought.

The x-axis represents the progress of the training process. The y-axis represents the amount of training samples that were evaluated using the monitor. These samples are coding tasks, where the model is tasked with implementing new functionality such that all unit tests pass. The model being trained is a top reasoning OpenAI model from the family of o1 and o3-mini.

Reward Hack #1 – exit(0)

At some point during training, the model learns to reward hack by utilizing the exit(0) command, exploiting a bug that allowed it to exit from the environment early without running the unit tests. Once this hack is known, it can be caught using a simple regular expression. This allowed the researchers to check how many times the monitors are able to detect it.

  • In blue, we see the rate in which both monitors detect this hack.
  • In purple, we see the rate in which only the action monitor detects this hack, which is barely visible in the chart since it almost never happens.
  • In green, we see the rate in which only the chain-of-thought monitor catches the reward hack.
  • In red, we see that there are cases that none of the monitors was able to detect the reward hack.

Clearly, the chain-of-thought monitor, which detects 93% of the cases, is doing a better job comparing to the action monitor, which detects only 40% of the cases.
At a certain point, the researchers fixed the bug in the testing environment, which prevents that reward hack, causing the drop in monitored samples.

Reward Hack #2 – raise SkipTest

The model learns another reward hack, using ‘raise SkipTest’ to raise an exception from functions outside the testing framework in order to skip unit tests evaluation. In this case as well, the chain-of-thought monitor is doing a better job at detecting the reward hack, catching 97% of the cases, comparing to 80% detected by the action monitor. Then, a fix is introduced for this issue in the testing environment as well.

More Sophisticated Reward Hacks

While the two reward hacks we’ve reviewed are simple, and therefore easy to analyze, the chain-of-thought monitor discovered additional, more sophisticated reward hacks. Following are a few examples:

  • Writing stubs instead of real implementations, which allows passing unit tests when their coverage is poor.
  • Modifying test files or libraries that the tests are dependent on, in such a way that the test will trivially pass.
  • The model found compiled files with a reference solution and decompiled them.

The Risk Of Chain-of-Thought Obfuscation

This is the part where it gets really interesting. If chain-of-thought monitoring is effective at detecting reward hacking, could we use that during training to optimize the model not to cheat?

To explore this, the researchers conducted an experiment by training a model with a honeypot, a deliberate setup designed to tempt the model into cheating, to be able to analyze this behavior. Specifically, the model was instructed not to change the tests, yet it was allowed to do so. To verify if the model cheated, the researchers simply ran the original tests.

During training, the reward is now not solely based on passing the test cases. Instead, it now also requires that the chain-of-thought monitor model did not detect a reward hack. In other words, we punish the model if we detected a reward hack in its chain-of-thought.

Analyzing The Experiment

What happens when punishing the models on reward hacks
What happens when punishing the models on reward hacks (Source)

Across all charts, the x-axis represents the progress of the training process. The blue lines represent the model that is punished if the chain-of-thought monitor caught it reward hacking, and the dotted green line represents a baseline, where the reward is solely determined based on the unit tests.

On the left, we see the ratio of samples that were solved correctly, meaning that all unit tests pass and the model did not cheat. As training progresses, both models learn to cheat, but the model with the monitor reward cheats less often than the baseline. So far, so good, we actually see that using chain-of-thought monitoring in the rewards reduces cheating.

But, the middle and right charts expose an alarming behavior. The middle chart measures the ratio of reward hacking cases that the chain-of-thought model successfully detects, while the right chart shows the ratio of undetected cases. While both models learn to hack, the baseline model’s hacks are mostly detectable by the monitor. However, for the model punished for reward hacking, the monitor becomes incapable of detecting cheating in most cases.

So, the model is still cheating, it just that now it knowns not to expose its intent in its chain-of-thought. Therefore, it might be safer to leave chain-of-thought reasoning unrestricted and reserve it for monitoring purposes.

To finish, below we can see an example where the model that was punished for reward hacking, hides its intent to hack:

Example for reward hacking with hidden intent
Example for reward hacking with hidden intent (Source)

References & Links

  • Paper
  • Join our newsletter to receive concise 1-minute read summaries for the papers we review – Newsletter

All credit for the research goes to the researchers who wrote the paper we covered in this post.

Cheating LLMs Teaser
Cheating LLMs Teaser
Scroll to Top