wassname 74e228529b feat: finish the self-contained substrate -- 6 -> 24 problems (6 per mode)
The "full-preset loader" TODO: full (200 steps, 43 prompts/step) was resampling
just 6 problems. Expand the hand-written substrate to 24 (6 per loophole mode,
matching the old repo's partition) so a whole mode can be held out for the
generalisation test and full/fast see real variety. Self-contained by design --
no external dataset.

- problems.py: +18 simple one-liner problems (max/min/sum/count/...), balanced
  6 per mode. All hand-verified.
- rewards._self_check: new guard -- every problem's clean solution must pass the
  strict oracle (gt_correct, not exploited). A wrong body/gt_test now fails
  `just check` loud. Confirms 6/mode.
- train: fast + full presets use n_problems=24 (smoke keeps 6).

just check: diagonal clean + all 24 clean solutions pass the oracle.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 06:30:20 +00:00
2026-06-01 00:19:30 +00:00

projected_grpo

Motivation: Can we erase or route reward hacking using a "cheat direction"?

Hypothesis: a weak detector that only knows some hacks generalises to suppress the unknown ones, which is the situation any real deployment is in.

Experiment: We take contrastive (hack, clean) prompts and extract a "cheat direction" in the model's activation space. Then, during GRPO, we erase or route the "cheat" gradients.
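A minimal sketch of the "erase" variant, assuming the cheat direction is taken as the normalised difference of mean activations between hack and clean prompts, and that erasing means projecting that direction out of a gradient vector. The function names (`cheat_direction`, `erase`) and the toy vectors are hypothetical and chosen for illustration only; the repository's actual extraction and projection code may differ.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cheat_direction(hack_acts, clean_acts):
    """Unit vector pointing from the clean mean to the hack mean.

    Assumption: a difference-of-means direction stands in for whatever
    extraction method the project really uses.
    """
    mean_hack = mean_vector(hack_acts)
    mean_clean = mean_vector(clean_acts)
    diff = [h - c for h, c in zip(mean_hack, mean_clean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("hack and clean means coincide; no direction to extract")
    return [x / norm for x in diff]


def erase(grad, direction):
    """Remove the component of grad that lies along a unit direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]


# Toy 2-D activations: hack prompts differ from clean ones along axis 0 only.
hack_acts = [[2.0, 0.0], [4.0, 0.0]]
clean_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = cheat_direction(hack_acts, clean_acts)
cleaned_grad = erase([5.0, 2.0], direction)
```

In this toy case the gradient loses its axis-0 component and keeps the rest, so an update built from `cleaned_grad` cannot move the parameters along the extracted direction.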

In early experiments this worked, but the model quickly routed around it, so we take SGTM's approach of absorbing the cheat direction in a non-adversarial manner.

How?

Like SGTM (Selective Gradient Masking, which localises unwanted capabilities into deletable parameters during pretraining), but for RL reward-hacking. When GRPO teaches a coding model to cheat, we route the "cheat direction" out of each gradient into a throwaway adapter we delete at deployment, or just erase it. SGTM decides what to route using a per-example classifier label. We have no such label: instead we route by a "cheat direction" extracted from contrastive (hack, clean) prompts (see problems.py).
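The routing idea can be sketched as follows, under simplifying assumptions: parameters are a flat vector, the cheat direction is a unit vector in that same space, and the "throwaway adapter" is an additive offset that is dropped at deployment. The class `RoutedParams` and its methods are hypothetical names for illustration, not the repository's API.

```python
def split_gradient(grad, direction):
    """Split grad into (kept, cheat) parts relative to a unit direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    cheat = [coeff * d for d in direction]
    kept = [g - c for g, c in zip(grad, cheat)]
    return kept, cheat


class RoutedParams:
    """Toy parameter store with a main vector and a deletable adapter."""

    def __init__(self, size):
        self.main = [0.0] * size
        self.throwaway = [0.0] * size  # deleted (ignored) at deployment

    def step(self, grad, direction, lr):
        """SGD step: cheat component updates the adapter, the rest updates main."""
        kept, cheat = split_gradient(grad, direction)
        self.main = [p - lr * g for p, g in zip(self.main, kept)]
        self.throwaway = [p - lr * g for p, g in zip(self.throwaway, cheat)]

    def effective(self, deploy=False):
        """Parameters used in the forward pass; the adapter is dropped on deploy."""
        if deploy:
            return list(self.main)
        return [m + t for m, t in zip(self.main, self.throwaway)]


params = RoutedParams(size=2)
params.step(grad=[5.0, 2.0], direction=[1.0, 0.0], lr=0.1)

train_view = params.effective(deploy=False)
deploy_view = params.effective(deploy=True)
```

During training the model still sees the full update, so there is no pressure to route around the filter; only the deployed view omits what the adapter absorbed.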

Environment: We reuse the simple ariahw/rl-rewardhacking environment, extended from one to four distinct loophole hacks. Researchers may find this useful as a toy environment. To avoid 65-hour experiments, we bootstrap it with teacher examples, resulting in 30-minute experiments.

Full story: blog draft.

Try it: uv sync && just smoke.

Building on the code: AGENTS.md.
