The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label.
- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md — projected_grpo
This is novel ML research. Not in your training data. Extrapolate carefully and without overconfidence.
This is research code. We want readable, fail-fast, pseudocode-like code.
Editing: plan, implement, smoke test (see setup-repo skill), subagent review, commit.
see @README.md for project overview
Project in one paragraph
Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.
Hypothesis
We can find the "reward hacking direction" and use it to stop RL from learning reward hacking. Specifically, we can get the reward hacking direction by contrasting G_hack and G_not, the gradients from GRPO updates on hacky and non-hacky completions. Then, during normal GRPO training, we can erase that direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward hacking rate.
Motivation: We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
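The hypothesis above can be sketched on toy numbers. This is a minimal illustration, not the repo's implementation: it uses a plain mean difference between hack and non-hack gradients where the project works in an SVD basis, it treats each gradient as a flat list of floats, and the helper names (`hack_direction`, `project_out`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(rows):
    """Column-wise mean of a list of equal-length gradient vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def hack_direction(g_hack, g_not):
    """Unit vector along mean(G_hack) - mean(G_not).

    Assumption: a mean-difference contrast stands in for the SVD-basis
    extraction described in the project paragraph.
    """
    diff = [h - n for h, n in zip(mean_vec(g_hack), mean_vec(g_not))]
    norm = math.sqrt(dot(diff, diff))
    assert norm > 0, "hack and non-hack gradients are identical"
    return [d / norm for d in diff]


def project_out(grad, direction):
    """Remove the component of grad along a unit-norm direction."""
    coeff = dot(grad, direction)
    return [g - coeff * d for g, d in zip(grad, direction)]


# Toy gradients: the hacky completions differ only along the first axis.
g_hack = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
g_not = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
vec = hack_direction(g_hack, g_not)

live_grad = [0.5, -1.0, 0.25]
clean_grad = project_out(live_grad, vec)
print(vec, clean_grad)
```

After projection the cleaned gradient has zero component along `vec`, while the other coordinates pass through unchanged.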
Inherit global rules from ~/.claude/CLAUDE.md.
Things the user has had to explain many times:
- No-cheat (VECTOR-framed -- this is the load-bearing constraint; cast every "is this cheating / is this weak" question in vector terms, since this repo is vector-based). Three things, don't conflate them:
  - The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval.
  - A "weak detector for hack A that misses B/C/D" is the classic weak-LABEL setup (labelA -> suppress labelNotA). That is NOT our setup; do not frame our method that way.
  - OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by its cosine alignment to `vec`. The only labels anywhere are on the pairs we wrote; no detector ever runs over student rollouts at train time. Generalization is tested by whether `vec` (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization. So when you double/triple-check "is this weak / is this cheating", the check is: does anything other than the hand-built pairs and the extracted `vec` touch training? If a live detector or the oracle leaks in, it's cheating.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses -- including bugs, subtle failures, and you being wrong about concepts -- then you have lost perspective and narrowed your vision.
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer ("waiting for your go-ahead"). I'd rather you just commit and go ahead.
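As a rough illustration of the `vec -> routing` bullet, the sketch below splits a live gradient into the part aligned with `vec` and the remainder, and reports the cosine alignment. It is one possible reading of "route by cosine alignment", not the repo's routing rule: the function name `route`, the flat-list gradients, and the decision to return both parts are assumptions. The point it is meant to show is the no-cheat check itself: the only inputs are the live gradient and `vec`, with no detector output or oracle score anywhere in the signature.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(grad, vec):
    """Split grad into its component along vec and the residual.

    Hypothetical helper: takes only the live gradient and the extracted
    direction, so no label or detector can influence the result.
    """
    vec_sq = sum(v * v for v in vec)
    assert vec_sq > 0, "vec must be nonzero"
    coeff = sum(g * v for g, v in zip(grad, vec)) / vec_sq
    aligned = [coeff * v for v in vec]
    residual = [g - a for g, a in zip(grad, aligned)]
    return cosine(grad, vec), aligned, residual


vec = [0.0, 2.0]          # extracted from the hand-built pairs (toy value)
live_grad = [3.0, 4.0]    # gradient from a student rollout (toy value)
alignment, aligned, residual = route(live_grad, vec)
print(alignment, aligned, residual)
```

What happens to the aligned part (dropped, down-weighted, or sent to a separate parameter group) is left to the caller here, since the source does not specify it.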
Extra instructions:
- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish.
- Every task is a goal or subgoal: be clear on the UAT (user acceptance test), use your task list to track the goal and non-trivial UAT, and it's not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
- Say less: just answer my questions and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.
Files
- Read docs/spec.md for the preregistered plan.
- Read docs/brainstorm/extracted_prefs.md for design rationale.
- New sweep arms get recipes in justfile with `# H:` hypothesis comments.
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head docs/RESEARCH_JOURNAL.md for latest results.
- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py` (assert + exit nonzero), wired into the `smoke` recipe so every smoke runs them.
- Every load-bearing invariant gets a `verify_*.py` gate. The no-cheat leak (held-out modes received <=1.1% detector labels via the route2 gate, 2026-06-05) shipped because nothing asserted it: the existing gates were green but none covered the property. If a claim is load-bearing, write the assertion test in the same commit -- "the tests passed" means nothing if the property was never tested. This one was found by hand-counting, not by a gate; now `scripts/verify_gate_anchor.py` covers it.
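To make the "assert + exit nonzero" pattern concrete, here is a minimal sketch of what a gate for the no-cheat property might look like. It is not the contents of `scripts/verify_gate_anchor.py`: the record fields (`mode`, `detector_label`), the mode names, and both function names are hypothetical. It relies only on the fact that an uncaught `AssertionError` makes the Python process exit with a nonzero status, which is what lets a smoke recipe fail on it.

```python
def count_detector_labels(rollouts):
    """Count rollouts per hack mode that carry a train-time detector label."""
    counts = {}
    for rollout in rollouts:
        if rollout["detector_label"] is not None:
            mode = rollout["mode"]
            counts[mode] = counts.get(mode, 0) + 1
    return counts


def verify_no_heldout_labels(rollouts, heldout_modes):
    """Fail loudly if any held-out mode received a detector label."""
    counts = count_detector_labels(rollouts)
    leaked = {m: c for m, c in counts.items() if m in heldout_modes}
    assert not leaked, f"no-cheat leak: held-out modes got labels: {leaked}"


# Toy training log (hypothetical schema): no held-out rollout is labeled.
train_log = [
    {"mode": "mode_a", "detector_label": None},
    {"mode": "mode_b", "detector_label": None},
]
verify_no_heldout_labels(train_log, heldout_modes={"mode_b"})
print("verify_no_heldout_labels: ok")
```

The assertion targets the property itself (zero labels on held-out modes) rather than a proxy, so a single leaked label fails the gate instead of hiding inside a small percentage.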
On persona pairs
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md
- ./docs/personas/personas_kept.md
On concepts such as "what are contrastive pairs" or "why SVD space" grep
- ./docs/vendor/AntiPaSTO_concepts/README.md
For the original paper
- LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- Code: ./docs/vendor/rl-rewardhacking