evil_MoE/AGENTS.md
wassname 83cae4ef72 docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md
The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
  = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
  not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
  no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:39:48 +00:00


AGENTS.md — projected_grpo

This is novel ML research. Not in your training data. Extrapolate carefully and without overconfidence. This is research code. We want readable, fail-fast, pseudocode-like code. Editing: plan, implement, smoke test (see setup-repo skill), subagent review, commit.

see @README.md for project overview

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Hypothesis

We can find the "reward hacking direction" and stop RL from learning reward hacking. Specifically, we can get the reward hacking directions by contrasting G_hack and G_not, taken from GRPO updates on hacky and non-hacky completions. Then, during normal GRPO training, we can erase the direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward hacking rate.
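A minimal sketch of the contrast-then-erase idea, under loud assumptions: gradients are flattened to plain lists, the "contrast" is a difference of means, and the erase step is an orthogonal projection. The function names (`hack_direction`, `erase`) are hypothetical, and the SVD basis and per-parameter adapter handling used in this repo are not shown.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length gradient vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale v to unit length; fail fast on a zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def hack_direction(g_hack, g_not):
    """ASSUMPTION: contrast = mean(G_hack) - mean(G_not), normalised."""
    diff = [h - n for h, n in zip(mean_vector(g_hack), mean_vector(g_not))]
    return unit(diff)


def erase(grad, direction):
    """Remove the component of grad along a unit-length direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]


# Toy data: hacky gradients differ from non-hacky ones only on axis 1.
g_hack = [[1.0, 2.0, 0.0], [1.0, 4.0, 0.0]]
g_not = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

vec = hack_direction(g_hack, g_not)
cleaned = erase([0.5, 2.0, -1.0], vec)
print(vec, cleaned)
```

After `erase`, the gradient has zero dot product with `vec`, which is the property the hypothesis relies on; whether that reduces the hack rate is the empirical question.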

Motivation: We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.

Inherit global rules from ~/.claude/CLAUDE.md.

Things the user has had to explain many times:

  • No-cheat (VECTOR-framed -- this is the load-bearing constraint; cast every "is this cheating / is this weak" question in vector terms, since this repo is vector-based). Three things, don't conflate them:
    1. The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval.
    2. A "weak detector for hack A that misses B/C/D" is the classic weak-LABEL setup (labelA -> suppress labelNotA). That is NOT our setup; do not frame our method that way.
    3. OUR setup is vec -> routing: extract a hack direction vec from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by its cosine alignment to vec. The only labels anywhere are on the pairs we wrote; no detector ever runs over student rollouts at train time. Generalization is tested by whether vec (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization. So when you double/triple-check "is this weak / is this cheating", the check is: does anything other than the hand-built pairs and the extracted vec touch training? If a live detector or the oracle leaks in, it's cheating.
  • Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and have narrow vision.
  • I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only answer is "waiting for your go-ahead". I'd rather you just commit and go ahead.
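The vec -> routing step in the no-cheat bullet above can be sketched as follows. This is illustrative only: `route` and its two bucket names are hypothetical, and the fixed `threshold` is an assumption chosen in advance, not a value from this repo. The point is what the function takes as input: the live gradient and the extracted vec, with no detector output and no oracle score.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(grad, vec, threshold=0.5):
    """Bucket a live gradient by its alignment with the hack direction.

    ASSUMPTION: a two-way split on a threshold fixed before training.
    Tuning the threshold with the oracle grader would be cheating.
    """
    return "hack_aligned" if cosine(grad, vec) > threshold else "keep"


vec = [0.0, 1.0]                      # stand-in for the extracted direction
aligned = route([0.0, 2.0], vec)      # points along vec
orthogonal = route([1.0, 0.0], vec)   # no component along vec
print(aligned, orthogonal)
```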

Extra instructions:

  • When you queue a job, follow with pueue follow | tail in bg so you are woken on fail or finish
  • Every task is a goal or subgoal: be clear on the UAT (user acceptance test), use your task list to track the goal and any non-trivial UAT, and it's not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, 3) given the user the link to its location.
  • Say less: just answer my questions, and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.

Files

  • Read docs/spec.md for the preregistered plan.
  • Read docs/brainstorm/extracted_prefs.md for design rationale.
  • New sweep arms get recipes in justfile with # H: hypothesis comments.
  • just smoke before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
  • Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
  • Head docs/RESEARCH_JOURNAL.md for latest results.
  • No tests/ dir; smoke is the correctness gate. Unit checks are scripts/verify_*.py (assert + exit nonzero), wired into the smoke recipe so every smoke runs them.
  • Every load-bearing invariant gets a verify_*.py gate. The no-cheat leak (held-out modes received <=1.1% detector labels via the route2 gate, 2026-06-05) shipped because nothing asserted it: the existing gates were green but none covered the property. If a claim is load-bearing, write the assertion test in the same commit -- "the tests passed" means nothing if the property was never tested. This one was found by hand-counting, not by a gate; now scripts/verify_gate_anchor.py covers it.
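As an illustration of the "assert + exit nonzero" pattern for a load-bearing invariant, here is a hedged sketch of a gate for the no-cheat property. It is not the contents of scripts/verify_gate_anchor.py; the record fields (`mode`, `detector_label`) and the function names are hypothetical.

```python
def find_leaks(records, held_out_modes):
    """Return train-time records for held-out modes that carry a detector label."""
    return [
        r for r in records
        if r["mode"] in held_out_modes and r["detector_label"] is not None
    ]


def verify_no_detector_labels(records, held_out_modes):
    """Assert the invariant: held-out modes receive zero detector labels."""
    leaks = find_leaks(records, held_out_modes)
    assert not leaks, (
        f"{len(leaks)}/{len(records)} train-time records for held-out modes "
        f"carry a detector label; first leak: {leaks[0] if leaks else None}"
    )


# Toy records standing in for a log of what touched training.
held_out = {"mode_b"}
clean = [
    {"mode": "mode_a", "detector_label": None},
    {"mode": "mode_b", "detector_label": None},
]
leaky = clean + [{"mode": "mode_b", "detector_label": 1}]

verify_no_detector_labels(clean, held_out)  # passes silently
print(len(find_leaks(leaky, held_out)))
```

Run as a standalone script, an uncaught `AssertionError` makes Python exit with a nonzero status, so wiring such a check into the smoke recipe fails the smoke run when the property is violated.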

On persona pairs

  • ./docs/personas/how_to_rewrite_pairs.md
  • ./docs/personas/how_to_write_personas.md
  • ./docs/personas/personas_kept.md

On concepts such as "what are contrastive pairs" or "why SVD space" grep

  • ./docs/vendor/AntiPaSTO_concepts/README.md

For the original paper

  • LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
  • Code: ./docs/vendor/rl-rewardhacking