mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

wassname 0d6ff754ec docs: AGENTS.md START HERE links (human_journal, main.tex, grad-routing paper); revert rescore fallback

- Point future agents at the three docs that pin the actual thesis + the
  live open question (direction vs routing vs SVD/PiSSA prior), so they don't
  re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
  checkpoints crash loudly rather than silently defaulting).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-10 03:34:06 +00:00

AGENTS.md — projected_grpo

This is novel ML research and is not in your training data, so extrapolate carefully and without overconfidence. This is research code: we want readable, fail-fast, pseudocode-like code. Editing workflow: plan, implement, smoke test (see the setup-repo skill), subagent review, commit.

see @README.md for project overview

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Hypothesis

We can find the "reward hacking direction" and stop RL from learning reward hacking. Specifically, we can get the reward-hacking direction by contrasting G_hack and G_not, the gradients from GRPO updates on hacky and non-hacky completions. Then, during normal GRPO training, we can erase that direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward-hacking rate.
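A minimal sketch of the contrast-then-erase idea in the hypothesis, not the repo's implementation. Assumptions: gradients are flattened to plain lists, the direction is the normalised difference of means, and the names `hack_direction` and `erase_direction` are hypothetical. The real pipeline works in the SVD basis of a low-rank adapter, which this toy skips.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length gradient vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Normalise v to unit length; fail fast on a zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero-norm direction: G_hack and G_not have equal means")
    return [x / norm for x in v]

def hack_direction(g_hack, g_not):
    """Assumed extraction: unit vector along mean(G_hack) - mean(G_not)."""
    mu_hack = mean_vector(g_hack)
    mu_not = mean_vector(g_not)
    return unit([h - c for h, c in zip(mu_hack, mu_not)])

def erase_direction(grad, direction):
    """Remove the component of grad along a unit direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]

# Toy gradients: the hacky ones differ from the clean ones only on axis 1.
g_hack = [[1.0, 2.0, 0.0], [1.0, 4.0, 0.0]]
g_not = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
direction = hack_direction(g_hack, g_not)

# A training gradient with the hack component erased before the optimizer step.
projected = erase_direction([0.5, 2.0, -1.0], direction)
```

After erasure the gradient has zero dot product with `direction`, so a step along it cannot move the parameters along the extracted direction. Components orthogonal to it are untouched.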

Motivation: We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.

Inherit global rules from ~/.claude/CLAUDE.md.

Things the user has had to explain many times:

No-cheat. THE TEST: would this still work on a brand-new RL env where no oracle exists? In real life you do not have an oracle on a new env. So the disqualifier is needing the oracle / ground-truth hack-labels of the LIVE training distribution -- not "a detector ran." Don't conflate these:
1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any train-time use (gate routing, set a threshold, label student rollouts) is cheating, because that signal cannot exist on the new env we claim to generalize to.
2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live detector is fine IF it is an honest predictor that never saw the true hack-labels of the rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw monitors: Ground-Truth (+70/90% variants) = cheat (reads the oracle label of the live rollout); LLM judge = legitimate and our fairest external baseline (generic model, no oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the new env, so it is an open question ("could we use the probe at deployment / on a new dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is NOT our setup -- do not frame our method that way.
3. OUR setup is vec -> routing: extract a hack direction vec from hand-built synthetic contrastive pairs (off-distribution, authored by us before we ever see a live rollout), then route the live GRPO gradient by its cosine alignment to vec. The hand-authored pairs are legitimate for the same reason the LLM judge is: outside knowledge that needs no env-specific oracle and never peeks at a live rollout's true label. No oracle / ground-truth label of a live rollout ever touches training. Generalization is tested by whether vec (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization. So when you double/triple-check "is this weak / is this cheating", the check is: would it survive on a new env with no oracle? If it needs the oracle or ground-truth hack-labels of the live training data, it's cheating. Hand-authored pairs + the extracted vec pass; a generic LLM judge passes; the env oracle and anything trained on its live-rollout labels fail.
4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must LABEL which rollout is the hack -- and the only thing that can label a live rollout is the oracle/detector. Using labels on live rollouts IS the paper's detector approach; it is the exact thing we are trying to beat. The hand-authored pairs are our ONLY label source, full stop. An "ideal direction" fit on oracle-labelled live rollouts (e.g. mu_hack - mu_clean) is a DIAGNOSTIC UPPER BOUND only -- it proves the signal is physically present in the gradients, it is NOT a reachable target, and "the ceiling is high so on-distribution pairs will fix it" is a category error (a good direction existing != we can build it label-free). The only no-cheat lever to approach that ceiling is BETTER HAND-AUTHORED pairs (more diverse / less overlapping / less off-distribution), i.e. the "Hack A predicts Hack B" generalization where every label still lives on pairs we wrote.
Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and have narrow vision.
I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer. Don't say "waiting for your go ahead"; I'd rather you just commit and go ahead.
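The vec -> routing setup from the No-cheat item above can be sketched as follows. This is an illustration under stated assumptions, not the repo's code: the linear ramp, the whole-gradient split, and the names `route_fraction`, `route_gradient`, `clean_cos` and `hack_cos` are all hypothetical. The point it shows is the no-cheat property: the gate reads only the live gradient and the pre-built vec, never a label on the rollout.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; fail fast on zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def route_fraction(cos, clean_cos, hack_cos):
    """Assumed linear ramp: 0 at or below clean_cos, 1 at or above hack_cos."""
    if not clean_cos < hack_cos:
        raise ValueError("need clean_cos < hack_cos")
    t = (cos - clean_cos) / (hack_cos - clean_cos)
    return min(1.0, max(0.0, t))

def route_gradient(grad, vec, clean_cos, hack_cos):
    """Split grad into (kept, routed) using only its alignment to vec.

    No oracle or hack label is an input: the only signals are the live
    gradient and the vec extracted from hand-authored pairs.
    """
    frac = route_fraction(cosine(grad, vec), clean_cos, hack_cos)
    kept = [(1.0 - frac) * g for g in grad]  # goes to the main parameters
    routed = [frac * g for g in grad]        # goes to the deletable adapter
    return kept, routed

# Toy example: the gradient has cosine 0.6 with vec, partway up the ramp.
vec = [1.0, 0.0]
kept, routed = route_gradient([3.0, 4.0], vec, clean_cos=0.2, hack_cos=1.0)
```

The two parts always sum back to the original gradient, so routing changes where the update lands and not its total size. How the routed part is applied to the adapter is outside this sketch.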

Extra instructions:

When you queue a job, follow with pueue follow | tail in the background so you are woken on failure or finish.
For every task, whether a goal or a subgoal, be clear on the UAT (user acceptance test). Use your task list to track the goal and any non-trivial UAT. A task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
Say less: just answer my questions and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.

Files

START HERE to understand the setup (read before reasoning about the method):

docs/human_journal.md -- the user's own words: what the method is, the routing math (absorption ramp between clean-cos and hack-cos bounds), and the LIVE open question -- "is it the direction, the routing itself, or does the SVD/PiSSA adapter add a prior that makes absorption work?" Random-direction controls MATCHING the real direction is a KNOWN, embraced result, not a bug to explain away.
docs/writeup/main.tex -- the actual thesis and claims C1-C4. The contribution is NOT "we found the hack direction and erased it." It is: SGTM-style post-backward gradient routing in the SVD-of-W basis, gated by an extracted hack vector (not per-example data labels), with the routed mass parked in a deletable adapter. C3 already establishes the gate is largely non-directional; the direction's measurable role is solve preservation + held-out-mode generalisation (C2, the load-bearing no-cheat check).
docs/papers/grad_routing/paper_gradient_routing.md -- Cloud et al. Expand-Route-Ablate. "Absorption" is the EFFECT of routing (routing a limited signal localises the broader capability into the routed region), not a mechanism you invoke. Routing runs the whole train; ablate once at the end. There is no warmup-then-off schedule.
Read docs/brainstorm/extracted_prefs.md for design rationale.
New sweep arms get recipes in justfile with # H: hypothesis comments.
just smoke before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
Head RESEARCH_JOURNAL.md for latest results.
No tests/ dir; smoke is the correctness gate. Unit checks are scripts/verify_*.py (assert + exit nonzero), wired into the smoke recipe so every smoke runs them.
Every load-bearing invariant gets a verify_*.py gate, written in the same commit as the claim -- "the tests passed" means nothing if the property was never tested.
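A sketch of what a verify gate in the style described above (assert + exit nonzero) might look like. The file name `scripts/verify_projection.py`, the function names, and the invariant chosen are hypothetical. The pattern is the point: state one invariant, assert it on a tiny input, and let the uncaught AssertionError fail the smoke run.

```python
# Hypothetical scripts/verify_projection.py: one invariant, checked on tiny inputs.

def erase_direction(grad, direction):
    """Remove the component of grad along a unit direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]

def check_projection_removes_direction():
    """Invariant: after erasure, the gradient has no component along the direction."""
    direction = [0.0, 1.0, 0.0]
    grad = [0.5, 2.0, -1.0]
    projected = erase_direction(grad, direction)
    residual = sum(p * d for p, d in zip(projected, direction))
    assert abs(residual) < 1e-9, f"residual along direction: {residual}"

def main():
    check_projection_removes_direction()
    print("verify_projection: ok")
    return 0

if __name__ == "__main__":
    # An uncaught AssertionError makes the interpreter exit nonzero,
    # which is what lets the smoke recipe fail on a broken invariant.
    main()
```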

On persona pairs

./docs/personas/how_to_rewrite_pairs.md
./docs/personas/how_to_write_personas.md
./docs/personas/personas_kept.md

On concepts such as "what are contrastive pairs" or "why SVD space" grep

./docs/vendor/AntiPaSTO_concepts/README.md

For the original paper (the substrate: reward-hacking LeetCode env)

LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
Code: ./docs/vendor/rl-rewardhacking

For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as the original; source of the absorption/leakage vocab)

./docs/papers/grad_routing/paper_sgtm.md