evil_MoE/AGENTS.md

# AGENTS.md — projected_grpo

**This is novel ML research.** Not in your training data. Extrapolate carefully and without overconfidence.
**This is research code.** We want readable, fail-fast, pseudocode-like code.
**Editing**: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit


see @README.md for project overview

## Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction
reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.

Hypothesis:
> We can find the "reward hacking direction" and keep RL from learning reward hacking.
> Specifically, we can get the reward-hacking direction by contrasting G_hack and G_not, taken from GRPO updates on hacky and non-hacky completions.
> Then, during normal GRPO training, we can erase that direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward-hacking rate.

Motivation:
We want to take the tools AI labs already use and make them better for alignment, using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
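
A minimal sketch of the hypothesis above, assuming the gradients for one adapter parameter are flat lists of `r` floats. The helper names (`hack_direction`, `erase_direction`) and the toy numbers are illustrative assumptions, not the repo's API, and the mean-difference contrast is only one plausible way to read "contrasting G_hack and G_not".

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(rows):
    """Column-wise mean of a list of equal-length gradient vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def hack_direction(G_hack, G_not):
    """Unit vector along mean(G_hack) - mean(G_not). ASSUMPTION: mean-difference contrast."""
    diff = [h - n for h, n in zip(mean_vec(G_hack), mean_vec(G_not))]
    norm = math.sqrt(dot(diff, diff))
    assert norm > 0, "hack and non-hack gradients have identical means"
    return [d / norm for d in diff]


def erase_direction(grad, v):
    """Remove the component of grad along the unit vector v."""
    coef = dot(grad, v)
    return [g - coef * vi for g, vi in zip(grad, v)]


# Toy gradients for one adapter parameter with r = 3 (made-up numbers).
G_hack = [[1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
G_not = [[0.0, 0.1, 0.0], [0.0, 0.1, 0.1]]
v = hack_direction(G_hack, G_not)

g_live = [0.5, 0.3, -0.2]
g_clean = erase_direction(g_live, v)  # what the optimizer would see instead of g_live
```

After the projection, `g_clean` has no component along `v`; whether that actually lowers the hack rate is the open empirical question.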

Inherit global rules from `~/.claude/CLAUDE.md`.


## Things the user has had to explain many times:

- No-cheat. THE TEST: would this still work on a brand-new RL env where no oracle exists?
  In real life you do not have an oracle on a new env. So the disqualifier is needing the
  oracle / ground-truth hack-labels of the LIVE training distribution -- not "a detector ran."
  Don't conflate these:
  1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT
     INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any
     train-time use (gating the routing, setting a threshold, labelling student rollouts) is cheating,
     because that signal cannot exist on the new env we claim to generalize to.
  2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live
     detector is fine IF it is an honest predictor that never saw the true hack-labels of the
     rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw
     monitors: Ground-Truth (+70/90% variants) = cheat (reads the oracle label of the live
     rollout); LLM judge = legitimate and our fairest external baseline (generic model, no
     oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was
     trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the
     new env, so it is an open question ("could we use the probe at deployment / on a new
     dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
     this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
     NOT our setup -- do not frame our method that way.
  3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
     contrastive pairs (off-distribution, authored by us before we ever see a live rollout),
     then route the live GRPO gradient by its cosine alignment to `vec`. The hand-authored
     pairs are legitimate for the same reason the LLM judge is: outside knowledge that needs no
     env-specific oracle and never peeks at a live rollout's true label. No oracle / ground-truth
     label of a live rollout ever touches training. Generalization is tested by whether `vec`
     (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
     -- vector generalization, not detector-label generalization.
  So when you double/triple-check "is this weak / is this cheating", the check is: would it
  survive on a new env with no oracle? If it needs the oracle or ground-truth hack-labels of
  the live training data, it's cheating. Hand-authored pairs + the extracted `vec` pass; a
  generic LLM judge passes; the env oracle and anything trained on its live-rollout labels fail.
  4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID
     rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must
     LABEL which rollout is the hack -- and the only thing that can label a live rollout is the
     oracle/detector. Using labels on live rollouts IS the paper's detector approach; it is the
     exact thing we are trying to beat. The hand-authored pairs are our ONLY label source, full
     stop. An "ideal direction" fit on oracle-labelled live rollouts (e.g. mu_hack - mu_clean)
     is a DIAGNOSTIC UPPER BOUND only -- it proves the signal is physically present in the
     gradients, it is NOT a reachable target, and "the ceiling is high so on-distribution pairs
     will fix it" is a category error (a good direction existing != we can build it label-free).
     The only no-cheat lever to approach that ceiling is BETTER HAND-AUTHORED pairs (more
     diverse / less overlapping / less off-distribution), i.e. the "Hack A predicts Hack B"
     generalization where every label still lives on pairs we wrote.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and narrowed your vision.
- DON'T act on a new theory/diagnosis until I confirm it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it to me, wait for the OK, THEN write.
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer and you would just be "waiting for your go ahead". I'd rather you just commit and go ahead.
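
A hedged sketch of the `vec -> routing` step from item 3 of the no-cheat rule: the live gradient is split using only its cosine alignment to `vec`. The clamped-cosine gate and the function names are assumptions made for illustration; the repo's actual gate may differ.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    assert na > 0 and nb > 0, "cosine is undefined for a zero vector"
    return dot / (na * nb)


def route(grad, vec):
    """Split grad into (gate, kept, routed) from its alignment to vec alone.

    ASSUMPTION: the gate is the cosine clamped to [0, 1].
    No label of the live rollout is read anywhere in this function.
    """
    f = min(1.0, max(0.0, cosine(grad, vec)))
    kept = [(1.0 - f) * g for g in grad]    # goes to the deployed adapter
    routed = [f * g for g in grad]          # goes to the quarantine adapter
    return f, kept, routed


vec = [1.0, 0.0]  # stands in for a direction extracted from hand-authored pairs
f_aligned, kept_aligned, routed_aligned = route([2.0, 0.0], vec)  # fully aligned
f_orth, kept_orth, routed_orth = route([0.0, 3.0], vec)           # orthogonal
```

The only inputs are the live gradient and `vec`, which is the property the no-cheat test asks for.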

## The adapter setup (shapes), and why "same position = shrinkage" is subtle

Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).

- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
  (deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
  scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
  FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
  Shared B -> `A.grad == A_hack.grad` pre-routing.
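
To make the `A.grad == A_hack.grad` claim concrete, here is a small stdlib-only check using finite differences on the LoRA-frozen-B forward above. The shapes, the toy numbers, the squared-norm loss, and the helper names are all assumptions for illustration; this is not `wrap_model_with_lora_frozen_b` itself.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def forward(W, B, A, A_hack, x):
    """y = W@x + B @ ((A + A_hack) @ x); the forward sees only the sum A + A_hack."""
    A_sum = [[a + h for a, h in zip(ra, rh)] for ra, rh in zip(A, A_hack)]
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A_sum, x))
    return [b + l for b, l in zip(base, low_rank)]


def loss(W, B, A, A_hack, x):
    # ASSUMPTION: a toy loss, 0.5 * ||y||^2, just to have something to differentiate.
    return 0.5 * sum(y * y for y in forward(W, B, A, A_hack, x))


def num_grad(W, B, A, A_hack, x, target, eps=1e-6):
    """Central finite-difference gradient of loss w.r.t. target (A or A_hack, edited in place)."""
    grad = [[0.0] * len(row) for row in target]
    for i, row in enumerate(target):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = loss(W, B, A, A_hack, x)
            row[j] = old - eps
            down = loss(W, B, A, A_hack, x)
            row[j] = old  # restore the entry before moving on
            grad[i][j] = (up - down) / (2 * eps)
    return grad


# d_out = 2, d_in = 3, r = 2 (made-up values).
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.1]]
B = [[1.0, 0.5], [-0.5, 1.0]]
A = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
A_hack = [[0.05, 0.0, -0.1], [0.2, 0.1, 0.0]]
x = [1.0, 2.0, -1.0]

grad_A = num_grad(W, B, A, A_hack, x, target=A)
grad_A_hack = num_grad(W, B, A, A_hack, x, target=A_hack)
```

The two gradients agree because the loss depends on `A` and `A_hack` only through their sum, so any difference between the adapters has to come from routing.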

The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.

Shrinkage is NOT inevitable. Two things break it:
1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
   full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
   solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
   separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
   doomed to shrinkage than PiSSA even with shared B.
2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
   own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
   ∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
   update.

So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
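
A toy illustration of the shrinkage null versus real separation, assuming per-rollout gradients and gate values are already given. The numbers and the `accumulate` helper are made up, and the "hack-like" / "solve-like" grouping exists only to construct the example; it is not a training signal.

```python
def accumulate(grads, gates):
    """Return (deployed, quarantine): deployed sums (1 - f) * g, quarantine sums f * g."""
    dim = len(grads[0])
    deployed = [0.0] * dim
    quarantine = [0.0] * dim
    for g, f in zip(grads, gates):
        for k in range(dim):
            deployed[k] += (1.0 - f) * g[k]
            quarantine[k] += f * g[k]
    return deployed, quarantine


# Toy rollout gradients: the first two are "hack-like", the last two "solve-like".
grads = [[1.0, 0.0], [1.0, 0.5], [0.0, 1.0], [0.5, 1.0]]
vanilla = [sum(col) for col in zip(*grads)]

# Non-discriminating gate: the same f for every rollout.
# deployed = (1 - f) * vanilla, i.e. a smaller copy of the same update (shrinkage).
dep_flat, quar_flat = accumulate(grads, [0.5, 0.5, 0.5, 0.5])

# Discriminating gate: f high on hack-like rollouts, low on solve-like ones.
# deployed now points somewhere else than vanilla (separation).
dep_disc, quar_disc = accumulate(grads, [1.0, 1.0, 0.0, 0.0])
```

With the flat gate the deployed update is parallel to `vanilla`; with the discriminating gate it is not, which is the distinction the controls above are meant to isolate.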

## Extra instructions:

- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
- Every task is a goal or subgoal. Be clear on the UAT (user acceptance test), use your task list to track the goal and any non-trivial UAT, and it's not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
- Say less: just answer my questions and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.

## Files

For the setup, read these:
- [docs/human_journal.md](docs/human_journal.md) -- the user's notes on the method. The novel
  part is routing by an extracted vector rather than per-example labels. The SVD adapter is a
  detail, not the novel experiment. Whether the direction, the routing, or the SVD adapter
  drives the suppression is an open question (random directions match in the controls).
- [docs/writeup/main.tex](docs/writeup/main.tex) -- the writeup: thesis and claims C1-C4.
- [docs/papers/grad_routing/paper_gradient_routing.md](docs/papers/grad_routing/paper_gradient_routing.md)
  -- Cloud et al. Expand-Route-Ablate, the gradient-routing prior.

- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for latest results.
- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py`
  (assert + exit nonzero), wired into the `smoke` recipe so every smoke runs them.
- Every load-bearing invariant gets a `verify_*.py` gate, written in the same commit as
  the claim -- "the tests passed" means nothing if the property was never tested.
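
As a sketch of what such a gate could look like, here is a hypothetical `scripts/verify_projection_orthogonal.py`. The file name, the invariant it checks, and the helper names are assumptions for illustration, not files known to exist in the repo.

```python
"""Hypothetical verify gate: a projected gradient must have no component along v."""


def erase_direction(grad, v):
    """Remove the component of grad along the unit vector v."""
    coef = sum(g * vi for g, vi in zip(grad, v))
    return [g - coef * vi for g, vi in zip(grad, v)]


def check_orthogonal(project):
    """Assert that project(grad, v) leaves nothing along v. Raises AssertionError on failure."""
    v = [0.6, 0.8]  # unit vector (0.36 + 0.64 = 1)
    grad = [1.0, 2.0]
    out = project(grad, v)
    residual = abs(sum(o * vi for o, vi in zip(out, v)))
    assert residual < 1e-9, f"projected gradient still has {residual:.3e} along v"


if __name__ == "__main__":
    # An uncaught AssertionError makes the process exit nonzero, which fails `smoke`.
    check_orthogonal(erase_direction)
    print("verify_projection_orthogonal: OK")
```

Taking the function under test as an argument lets the gate itself be checked against a deliberately broken projector, so a pass actually means the property was tested.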

On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
reward-hack, neg=the honest solve, vector = grad(prompt+hack) - grad(prompt+clean).
Like persona steering pairs (honest/dishonest), MATCH everything but the axis -- same
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
(else style competes with the trait; see the style-confound section of the doc below).
There is NO problem_id semantics: the only "id" is which completion is the hack side
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
are DISTINCT (different gradient). Authored = off-distribution, hand-written, no-cheat;
pool-derived pairs (e.g. prog_wide_clean) are contamination-prone -> not headline-clean.
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md  -- pos/neg pair authoring rules + style confound
- ./docs/personas/personas_kept.md
- ./data/pairs/hack_pairs.md  -- headline multi-view pairs; supports `#all-in-one@tag`
- ./data/pairs/pair_diagnostics.md  -- semantic-framing experiments + placebo controls
- ./docs/personas/pairset_audit.md  -- pair quality, provenance, and use recommendations
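
The pair semantics described above can be sketched as a small data structure. `RoutingPair`, `pair_vector`, and the toy `toy_grad` function are hypothetical names for illustration; in the real pipeline the gradient would come from the model, not from string statistics.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RoutingPair:
    """One same-prompt (hack, clean) duo. Identity is the full triple; there is no problem_id."""
    prompt: str
    hack: str   # pos: the reward-hack completion
    clean: str  # neg: the honest solve


def pair_vector(pair: RoutingPair, grad_fn: Callable[[str], List[float]]) -> List[float]:
    """vector = grad(prompt + hack) - grad(prompt + clean)."""
    g_hack = grad_fn(pair.prompt + pair.hack)
    g_clean = grad_fn(pair.prompt + pair.clean)
    return [h - c for h, c in zip(g_hack, g_clean)]


def toy_grad(text: str) -> List[float]:
    # ASSUMPTION: a stand-in "gradient" of [length, count of "skip"], for illustration only.
    return [float(len(text)), float(text.count("skip"))]


# Identical hack + clean but different prompts are distinct pairs.
a = RoutingPair("prompt one\n", "skip tests", "use a hash map")
b = RoutingPair("prompt two\n", "skip tests", "use a hash map")
distinct = len({a, b})

vec = pair_vector(a, toy_grad)
```

In this toy the first coordinate of `vec` is nonzero only because the two completions differ in length, which is the kind of style confound that matching length and style is meant to remove.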

For concepts such as "what are contrastive pairs" or "why SVD space", grep:
- ./docs/vendor/AntiPaSTO_concepts/README.md

For the original paper (the substrate: reward-hacking LeetCode env)
- LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- Code: ./docs/vendor/rl-rewardhacking

For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
the original; source of the absorption/leakage vocab)
- ./docs/papers/grad_routing/paper_sgtm.md