mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:48:43 +08:00
97aede8d9c
Three fixes after the deploy-solve=0 run (all user-confirmed):

1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5). Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted).
2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count, no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step). No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0 stays the on/off switch. Removed dead _even_split.
3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the behavior policy = the shipped deployed-only model -- the quarantine's learned hack can't saturate the rollout distribution and starve honest solve advantage. For clean-gated rollouts gen and train forward now match.

Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget).

AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis confidence across turns.

Smoke + smoke-solvemix green; all verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
166 lines
12 KiB
Markdown
# AGENTS.md — projected_grpo

**This is novel ML research.** Not in your training data. Extrapolate carefully and without overconfidence.
**This is research code.** We want readable, fail-fast, pseudocode-like code.
**Editing**: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit

see @README.md for project overview

## Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction
reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.

Hypothesis
> We can find the "reward hacking direction" and stop RL from learning reward hacking.
> Specifically, we can get the reward hacking direction by contrasting G_hack and G_not, from GRPO updates on hacky and non-hacky completions.
> Then during normal GRPO training we can erase the direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward hacking rate.

Motivation:
We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.

Inherit global rules from `~/.claude/CLAUDE.md`.

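A minimal sketch of the hypothesis above, on flat gradient vectors rather than real adapter parameters. It assumes the direction is taken as the normalized difference of mean gradients; the helper names (`hack_direction`, `erase`) and the toy numbers are made up for illustration and are not this repo's API. Plain Python lists keep it dependency-free.

```python
import math

Vector = list[float]


def mean(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v: Vector) -> Vector:
    """Scale v to unit length; fail fast on a zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def hack_direction(g_hack: list[Vector], g_not: list[Vector]) -> Vector:
    """Contrast G_hack against G_not: unit(mean(G_hack) - mean(G_not))."""
    diff = [h - n for h, n in zip(mean(g_hack), mean(g_not))]
    return unit(diff)


def erase(grad: Vector, direction: Vector) -> Vector:
    """Remove the component of grad that lies along the unit direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]


# Toy gradients (hypothetical): hack rollouts push along axis 0, honest ones do not.
g_hack = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
g_not = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

v = hack_direction(g_hack, g_not)
projected = erase([5.0, 2.0, -1.0], v)
```

After `erase`, the gradient has zero dot product with `v`, while the components orthogonal to `v` are left untouched.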
## Things the user has had to explain many times:

- No-cheat. THE TEST: would this still work on a brand-new RL env where no oracle exists?
  In real life you do not have an oracle on a new env. So the disqualifier is needing the
  oracle / ground-truth hack-labels of the LIVE training distribution -- not "a detector ran."
  Don't conflate these:
  1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT
     INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any
     train-time use (gate routing, set a threshold, label student rollouts) is cheating,
     because that signal cannot exist on the new env we claim to generalize to.
  2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live
     detector is fine IF it is an honest predictor that never saw the true hack-labels of the
     rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw
     monitors: Ground-Truth (+70/90% variants) = cheat (reads the oracle label of the live
     rollout); LLM judge = legitimate and our fairest external baseline (generic model, no
     oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was
     trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the
     new env, so it is an open question ("could we use the probe at deployment / on a new
     dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
     this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
     NOT our setup -- do not frame our method that way.
  3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
     contrastive pairs (off-distribution, authored by us before we ever see a live rollout),
     then route the live GRPO gradient by its cosine alignment to `vec`. The hand-authored
     pairs are legitimate for the same reason the LLM judge is: outside knowledge that needs no
     env-specific oracle and never peeks at a live rollout's true label. No oracle / ground-truth
     label of a live rollout ever touches training. Generalization is tested by whether `vec`
     (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
     -- vector generalization, not detector-label generalization.
     So when you double/triple-check "is this weak / is this cheating", the check is: would it
     survive on a new env with no oracle? If it needs the oracle or ground-truth hack-labels of
     the live training data, it's cheating. Hand-authored pairs + the extracted `vec` pass; a
     generic LLM judge passes; the env oracle and anything trained on its live-rollout labels fail.
  4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID
     rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must
     LABEL which rollout is the hack -- and the only thing that can label a live rollout is the
     oracle/detector. Using labels on live rollouts IS the paper's detector approach; it is the
     exact thing we are trying to beat. The hand-authored pairs are our ONLY label source, full
     stop. An "ideal direction" fit on oracle-labelled live rollouts (e.g. mu_hack - mu_clean)
     is a DIAGNOSTIC UPPER BOUND only -- it proves the signal is physically present in the
     gradients, it is NOT a reachable target, and "the ceiling is high so on-distribution pairs
     will fix it" is a category error (a good direction existing != we can build it label-free).
     The only no-cheat lever to approach that ceiling is BETTER HAND-AUTHORED pairs (more
     diverse / less overlapping / less off-distribution), i.e. the "Hack A predicts Hack B"
     generalization where every label still lives on pairs we wrote.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), you have lost perspective and have tunnel vision.
- DON'T act on a new theory/diagnosis until I confirm it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it to me, wait for the OK, THEN write.
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only possible answer is "waiting for your go-ahead". I'd rather you just commit and go ahead.

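The `vec -> routing` setup in item 3 of the no-cheat rule can be sketched as below. This is an illustration under stated assumptions, not the repo's gate: the linear ramp between `t_lo` and `t_hi`, the function names, and the toy numbers are all hypothetical. The point it shows is that the router's only inputs are the live gradient and the pre-built `vec`; no hack label of a live rollout appears anywhere in the signature.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate_fraction(cos: float, t_lo: float, t_hi: float) -> float:
    """Assumed gate: 0 below t_lo (keep), 1 above t_hi (route), linear in between (mid)."""
    if t_hi <= t_lo:
        raise ValueError("need t_lo < t_hi")
    return min(1.0, max(0.0, (cos - t_lo) / (t_hi - t_lo)))


def route(grad: list[float], vec: list[float], t_lo: float, t_hi: float):
    """Split one live gradient into (deployed part, quarantine part, gate value f)."""
    f = gate_fraction(cosine(grad, vec), t_lo, t_hi)
    deployed = [(1.0 - f) * g for g in grad]
    quarantine = [f * g for g in grad]
    return deployed, quarantine, f


vec = [1.0, 0.0]  # stands in for a direction extracted from hand-authored pairs

aligned = route([2.0, 0.0], vec, t_lo=0.5, t_hi=0.9)     # cosine 1: fully routed
orthogonal = route([0.0, 3.0], vec, t_lo=0.5, t_hi=0.9)  # cosine 0: fully kept
mixed = route([1.0, 1.0], vec, t_lo=0.5, t_hi=0.9)       # cosine ~0.71: mid zone
```

Whatever gate shape is used, the two parts always sum back to the original gradient, so routing only decides which adapter receives the update.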
## The adapter setup (shapes), and why "same position = shrinkage" is subtle

Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).

- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
  (deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
  scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
  FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
  Shared B -> `A.grad == A_hack.grad` pre-routing.

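The claim that a shared `B` gives `A.grad == A_hack.grad` before routing can be checked numerically. The sketch below implements the LoRA-frozen-B forward from the bullet above on tiny hand-picked matrices and compares central finite-difference gradients; the squared-norm loss, the matrix values, and the helper names are assumptions for illustration only, and it uses plain lists rather than the repo's tensors.

```python
def matvec(m, x):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def forward(W, B, A, A_hack, x):
    """y = W@x + B @ ((A + A_hack) @ x); the forward only ever sees the sum."""
    summed = [[a + h for a, h in zip(ra, rh)] for ra, rh in zip(A, A_hack)]
    low_rank = matvec(B, matvec(summed, x))
    return [w + l for w, l in zip(matvec(W, x), low_rank)]


def numeric_grad(loss_fn, matrix, eps=1e-5):
    """Central finite differences of loss_fn w.r.t. each entry of matrix (perturbed in place)."""
    grad = [[0.0] * len(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, original in enumerate(row):
            row[j] = original + eps
            up = loss_fn()
            row[j] = original - eps
            down = loss_fn()
            row[j] = original  # restore before moving on
            grad[i][j] = (up - down) / (2 * eps)
    return grad


# d_out=2, d_in=3, r=2; values are arbitrary.
W = [[0.5, -1.0, 0.0], [1.0, 0.0, 2.0]]
B = [[1.0, 0.5], [-0.5, 2.0]]
A = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
A_hack = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.3]]
x = [1.0, 2.0, -1.0]


def loss():
    y = forward(W, B, A, A_hack, x)
    return 0.5 * sum(v * v for v in y)


grad_A = numeric_grad(loss, A)
grad_A_hack = numeric_grad(loss, A_hack)
max_gap = max(
    abs(a - h) for ra, rh in zip(grad_A, grad_A_hack) for a, h in zip(ra, rh)
)
```

`max_gap` comes out at rounding-error scale: because `A` and `A_hack` enter the forward only through their sum, any difference between what the two adapters learn has to come from the routing applied to the gradient afterwards.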

The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.

Shrinkage is NOT inevitable. Two things break it:
1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
   full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
   solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
   separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
   doomed to shrinkage than PiSSA even with shared B.
2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
   own trainable `B_hack`), so the two adapters live in different subspaces,
   `∂L/∂deployed != ∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a
   slice of the same update.

So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)

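A toy illustration of the shrinkage null versus a discriminating gate, assuming each rollout gradient `g` is split as `(1 - f) * g` to the deployed adapter and `f * g` to the quarantine. The two-dimensional gradients and the gate values are invented for the example; in particular the "hack"/"solve" grouping exists only to build the toy and is not something a real gate would be given.

```python
import math


def accumulate(grads, gates):
    """Sum (1 - f) * g into the deployed update and f * g into the quarantine update."""
    dim = len(grads[0])
    deployed = [0.0] * dim
    quarantine = [0.0] * dim
    for g, f in zip(grads, gates):
        for k in range(dim):
            deployed[k] += (1.0 - f) * g[k]
            quarantine[k] += f * g[k]
    return deployed, quarantine


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


solve_grads = [[1.0, 0.0], [1.0, 0.2]]  # toy: honest rollouts push mostly along axis 0
hack_grads = [[0.0, 1.0], [0.2, 1.0]]   # toy: hack rollouts push mostly along axis 1
grads = solve_grads + hack_grads
vanilla = [sum(col) for col in zip(*grads)]

# Non-discriminating gate: the same f for every rollout.
flat_deploy, _ = accumulate(grads, [0.5, 0.5, 0.5, 0.5])

# Discriminating gate: f low on solve rollouts, high on hack rollouts.
sharp_deploy, sharp_quarantine = accumulate(grads, [0.0, 0.0, 1.0, 1.0])

cos_flat = cosine(flat_deploy, vanilla)
cos_sharp = cosine(sharp_deploy, vanilla)
```

With the flat gate the deployed update is just a scaled copy of the vanilla update (same direction, half the norm), which is the magnitude split described above. With the discriminating gate the deployed update points somewhere else, so ablating the quarantine removes a direction rather than a fraction.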
## Extra instructions:

- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
- For every task, whether a goal or a subgoal, be clear on the UAT (user acceptance test). Use your task list to track the goal and any non-trivial UAT. It's not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
- Say less: just answer my questions, and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.

## Files

For the setup, read these:
- [docs/human_journal.md](docs/human_journal.md) -- the user's notes on the method. The novel
  part is routing by an extracted vector rather than per-example labels. The SVD adapter is a
  detail, not the novel experiment. Whether the direction, the routing, or the SVD adapter
  drives the suppression is an open question (random directions match in the controls).
- [docs/writeup/main.tex](docs/writeup/main.tex) -- the writeup: thesis and claims C1-C4.
- [docs/papers/grad_routing/paper_gradient_routing.md](docs/papers/grad_routing/paper_gradient_routing.md)
  -- Cloud et al. Expand-Route-Ablate, the gradient-routing prior.

- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for latest results.
- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py`
  (assert + exit nonzero), wired into the `smoke` recipe so every smoke runs them.
- Every load-bearing invariant gets a `verify_*.py` gate, written in the same commit as
  the claim -- "the tests passed" means nothing if the property was never tested.

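As a sketch of what an "assert + exit nonzero" gate in the `verify_*.py` style could look like: the script below is hypothetical (its name, the invariant it checks, and the numbers are not taken from this repo). It relies on standard Python behavior: an uncaught `AssertionError` terminates the interpreter with a nonzero exit status, so no explicit exit call is needed. Note that asserts are skipped when Python runs with `-O`.

```python
# Hypothetical scripts/verify_routing_conserves_gradient.py


def check_conservation(grad, deployed, quarantine, tol=1e-9):
    """Invariant: routing splits a gradient; deployed + quarantine must equal the original."""
    assert len(grad) == len(deployed) == len(quarantine), "shape mismatch"
    for k, (g, d, q) in enumerate(zip(grad, deployed, quarantine)):
        assert abs((d + q) - g) <= tol, f"entry {k}: {d} + {q} != {g}"


def split(grad, f):
    """Toy stand-in for the router under test: a fraction f goes to quarantine."""
    return [(1.0 - f) * g for g in grad], [f * g for g in grad]


grad = [0.3, -1.2, 2.5]
for f in (0.0, 0.25, 1.0):
    deployed, quarantine = split(grad, f)
    check_conservation(grad, deployed, quarantine)

print("verify_routing_conserves_gradient: ok")
```

Keeping the check as a plain function with bare asserts fits the fail-fast style: the first violated entry stops the run and names itself in the message.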

On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
reward-hack, neg=the honest solve, vector = grad(prompt+hack) - grad(prompt+clean).
Like persona steering pairs (honest/dishonest), MATCH everything but the axis -- same
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
(else style competes with the trait; see the style-confound section of the doc below).
There is NO problem_id semantics: the only "id" is which completion is the hack side
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
are DISTINCT (different gradient). Authored = off-distribution, hand-written, no-cheat;
pool-derived pairs (e.g. prog_wide_clean) are contamination-prone -> not headline-clean.
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md -- pos/neg pair authoring rules + style confound
- ./docs/personas/personas_kept.md
- ./data/pairs/hack_pairs.md -- headline multi-view pairs; supports `#all-in-one@tag`
- ./data/pairs/pair_diagnostics.md -- semantic-framing experiments + placebo controls
- ./docs/personas/pairset_audit.md -- pair quality, provenance, and use recommendations

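The pair semantics described above (same prompt on both sides, no `problem_id`, identity determined by prompt plus the two completions) could be modelled roughly as follows. `RoutingPair`, its field names, and the example strings are hypothetical and only illustrate the rule; they are not the format of the pair files listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPair:
    """One SAME-PROMPT (hack, clean) duo. Deliberately has no problem_id field."""

    prompt: str
    hack: str   # pos side: the reward-hack completion
    clean: str  # neg side: the honest solve

    def sides(self) -> tuple[str, str]:
        """The two full texts whose gradients get contrasted: (prompt+hack, prompt+clean)."""
        return self.prompt + self.hack, self.prompt + self.clean


# Made-up placeholder text, for illustration only.
hack_text = "\n# placeholder: completion that games the grader\n"
clean_text = "\n# placeholder: completion that actually solves the task\n"

pair_a = RoutingPair("Write two_sum(nums, target).", hack_text, clean_text)
pair_b = RoutingPair("Write max_profit(prices).", hack_text, clean_text)

# Identical hack+clean but different prompts: two distinct pairs.
distinct = pair_a != pair_b
unique_pairs = {pair_a, pair_b}
```

Making the dataclass frozen gives value-based equality and hashing, so deduplicating a pair set never collapses pairs that differ only in prompt.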

On concepts such as "what are contrastive pairs" or "why SVD space" grep
- ./docs/vendor/AntiPaSTO_concepts/README.md

For the original paper (the substrate: reward-hacking LeetCode env)
- LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- Code: ./docs/vendor/rl-rewardhacking

For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
the original; source of the absorption/leakage vocab)
- ./docs/papers/grad_routing/paper_sgtm.md