mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:15:40 +08:00


wassname 19687087b0 feat(#30,#39): simple online gate -- band from current batch, no window/cloud; lr 1e-4

Gate band (mean + k*std) now computed from THIS batch's pooled positions each step
instead of a sliding window. Refresh-proof by construction (live rollouts scored vs
the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window
config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random
tiny data never separates -> quarantine would never train -> pathway assert would fail.

lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean
emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-11 06:04:28 +00:00

.claude/memory

chore: memory updates, diag_pairs_compare script

2026-06-09 02:42:56 +00:00

.vscode

refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts

2026-06-09 13:34:50 +00:00

data/pairs

Consolidate tagged hack pairsets in data

2026-06-10 11:58:53 +00:00

docs

spec: small-reward-hacking env spinout (parked post-paper; commit archaeology for the 6->4 mode selection)

2026-06-11 02:07:55 +00:00

external

setup

2026-05-23 10:40:02 +08:00

modal

Consolidate tagged hack pairsets in data

2026-06-10 11:58:53 +00:00

nbs

diag: pinning separability sweep (grad/act x cos/proj/mag x filter), AUROC+p@k, notebook

2026-06-08 11:11:55 +00:00

out

Consolidate tagged hack pairsets in data

2026-06-10 11:58:53 +00:00

scripts

feat(#30 ): mean+k*std online gate replaces fixed quantile; always-show route cols

2026-06-11 02:56:07 +00:00

src/vgrout

feat(#30,#39): simple online gate -- band from current batch, no window/cloud; lr 1e-4

2026-06-11 06:04:28 +00:00

.gitignore

Consolidate tagged hack pairsets in data

2026-06-10 11:58:53 +00:00

.gitmodules

concepts

2026-05-29 06:29:20 +00:00

AGENTS.md

fix(routeV): keep=bulk gate + deterministic teachers + deploy-mode generation

2026-06-11 00:29:12 +00:00

CLAUDE.md

docs

2026-05-29 05:42:28 +00:00

justfile

feat: online-stats gate + step-level teacher forcing + AUROC diagnostic

2026-06-10 14:22:37 +00:00

pyproject.toml

feat: online-stats gate + step-level teacher forcing + AUROC diagnostic

2026-06-10 14:22:37 +00:00

README.md

docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase)

2026-06-10 11:23:21 +00:00

RESEARCH_JOURNAL.md

docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix

2026-06-10 06:50:42 +00:00

uv.lock

feat: online-stats gate + step-level teacher forcing + AUROC diagnostic

2026-06-10 14:22:37 +00:00

README.md

vGROUT

vGROUT (vector gradient routing): route the GRPO gradient against an extracted reward-hacking direction so the deployed model can't learn the hack, without tanking pass rate. A representation-routing variant of gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an extracted direction rather than a per-example data label.

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. The method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") in that it intervenes at the gradient level rather than the advantage level.

See docs/brainstorm/extracted_prefs.md and docs/papers/.

The adapter

Every target Linear gets one rank-2r LoRA (src/vgrout/lora2r.py), A:[2r,d_in] and B:[d_out,2r] both trainable, with frozen Gaussian-init copies A0/B0 subtracted in the hook so the net delta is exactly 0 at init. The 2r rows/cols split into a deployed block [:r] and a quarantine block [r:]. Because [B|B_q] @ ([A;A_q]@x) has no cross terms, the two blocks are independent adapters living in the same module. At deployment the quarantine is ablated (reset to its init), so anything learned there never ships.
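A minimal sketch of the two properties claimed above, in dependency-free Python rather than the actual code in src/vgrout/lora2r.py: the net delta is exactly zero at init, and the deployed and quarantine blocks contribute additively with no cross terms. The helper names (`matvec`, `gaussian`, `block_delta`), the toy shapes, and the `+0.1` "training" perturbation are all illustrative assumptions.

```python
import random

def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gaussian(rows, cols, rng):
    """rows x cols matrix of standard-normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def block_delta(A, B, A0, B0, x, ranks):
    """Delta from the rank indices in `ranks`:
    B[:, ranks] @ (A[ranks] @ x), minus the same product on the frozen init copies."""
    def product(A_, B_):
        h = matvec([A_[i] for i in ranks], x)
        return matvec([[row[i] for i in ranks] for row in B_], h)
    return [new - init for new, init in zip(product(A, B), product(A0, B0))]

rng = random.Random(0)
r, d_in, d_out = 2, 5, 4                      # toy sizes (assumption)
A0, B0 = gaussian(2 * r, d_in, rng), gaussian(d_out, 2 * r, rng)
A = [row[:] for row in A0]                    # trainable copies start at the frozen init
B = [row[:] for row in B0]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

deployed, quarantine, both = range(0, r), range(r, 2 * r), range(0, 2 * r)

# At init the trainable and frozen products are identical, so the delta is exactly 0.
delta_at_init = block_delta(A, B, A0, B0, x, both)

# Stand-in for training: nudge every entry of A and B.
A = [[a + 0.1 for a in row] for row in A]
B = [[b + 0.1 for b in row] for row in B]
full = block_delta(A, B, A0, B0, x, both)
split = [p + q for p, q in zip(block_delta(A, B, A0, B0, x, deployed),
                               block_delta(A, B, A0, B0, x, quarantine))]

# Deployment ablation: reset the quarantine rows of A and columns of B to init.
for i in quarantine:
    A[i] = A0[i][:]
    for row, row0 in zip(B, B0):
        row[i] = row0[i]
ablated = block_delta(A, B, A0, B0, x, both)
deployed_only = block_delta(A, B, A0, B0, x, deployed)
```

Here `full` matches `split` up to floating-point rounding, and `ablated` matches `deployed_only`: once the quarantine block is back at its init, whatever it learned contributes nothing to the output.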

Per rollout we apply an SGTM-style three-way hard mask (m, d) to the block outputs (m = quarantine on/off, d = deployed detach):

clean (0,0) -- only the deployed block trains (quarantine zero in fwd+bwd).
hack (1,1) -- only the quarantine trains (deployed kept in forward, grad detached).
mid (1,0) -- both train (absorption).
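The three masks can be read as a small truth table. The sketch below only restates the list above as code; `Mask` and `block_roles` are hypothetical names for illustration, not identifiers from the repo.

```python
from typing import NamedTuple

class Mask(NamedTuple):
    m: int  # 1 = quarantine block switched on
    d: int  # 1 = deployed block detached (kept in forward, no gradient)

MASKS = {"clean": Mask(0, 0), "hack": Mask(1, 1), "mid": Mask(1, 0)}

def block_roles(mask: Mask) -> dict:
    """Which block is present in the forward pass and which one receives gradient."""
    return {
        "deployed_in_forward": True,             # the deployed block is never zeroed
        "deployed_trains": mask.d == 0,          # detached when d = 1
        "quarantine_in_forward": mask.m == 1,    # zero in fwd+bwd when m = 0
        "quarantine_trains": mask.m == 1,
    }

roles = {name: block_roles(mask) for name, mask in MASKS.items()}
```

In an autograd framework the masked sum would plausibly look like `out = (dep.detach() if d else dep) + m * quar`; that line is an assumption about the shape of the hook, not a quote from lora2r.py.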

How it works

To get the hack direction we pair examples by hand: for each problem, one completion that solves it honestly and one that uses the kind of trick the model would learn to exploit. For each pair we compute the exact GRPO gradient you'd get if the hack rollout had advantage +1 and the clean rollout had advantage -1 (-grad logp(hack) + grad logp(clean)), read in the per-module c-probe space (a virtual diagonal scale between A and B). The mean hack-minus-clean direction, normalized per module, is v_grad. (Mechanically identical to a twin-NLL extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO framing is the one we mean.) The hand-authored pairs are off-distribution and the only routing-label source -- no oracle/ground-truth label of a live rollout ever touches training.
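The extraction step can be sketched as follows, assuming the per-pair c-probe gradients of logp have already been computed. The function names, the module name `q_proj`, and the toy numbers are hypothetical. The sign convention is also an assumption: the sketch takes grad logp(hack) - grad logp(clean), which is the negative of the loss gradient quoted above, so that v_grad points hack-ward.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v, eps=1e-12):
    """L2-normalize; eps guards against a zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def extract_v_grad(pair_grads):
    """pair_grads: {module: [(g_hack, g_clean), ...]}, where g_* is the gradient of
    logp(completion) w.r.t. that module's c-probe.
    Returns {module: unit mean of (g_hack - g_clean)}."""
    v_grad = {}
    for module, pairs in pair_grads.items():
        diffs = [[h - c for h, c in zip(g_hack, g_clean)] for g_hack, g_clean in pairs]
        v_grad[module] = unit(mean_vector(diffs))   # normalized per module
    return v_grad

# Toy input: one module, two hand-authored pairs (numbers invented for illustration).
toy = {"q_proj": [([4.0, 4.0], [1.0, 0.0]),
                  ([2.0, 5.0], [-1.0, 1.0])]}
v = extract_v_grad(toy)   # both diffs are (3, 4), so the direction is about (0.6, 0.8)
```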

At training time routeV runs two passes per prompt. Pass 1 is an unmasked forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a band-normalized cosine to v_grad pooled across modules -> a {clean, mid, hack} label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from those labels and does the real masked forward+backward; the optimizer steps on what survives. So a rollout that looks hack-ward routes its update into the throwaway quarantine block, and the deployed model never descends the hack gradient. We re-extract v_grad every N steps so it tracks where hacks separate now, not at step 0.
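A hedged sketch of the pass-1 labelling. The README does not spell out the pooling or the band normalization, so the choices here are assumptions: cosines are mean-pooled across modules, the score is rescaled so the band edges `lo` and `hi` map to 0 and 1, and a higher score means more hack-ward. All function and module names are hypothetical.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small eps to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

def pooled_cosine(rollout_grads, v_grad):
    """Mean over modules of cos(c-probe gradient, v_grad). Mean pooling is an assumption."""
    modules = sorted(v_grad)
    return sum(cosine(rollout_grads[m], v_grad[m]) for m in modules) / len(modules)

def route_label(score, lo, hi):
    """Rescale so the band [lo, hi] maps to [0, 1], then split three ways."""
    z = (score - lo) / (hi - lo)
    if z > 1.0:
        return "hack"    # above the band: route into quarantine
    if z < 0.0:
        return "clean"   # below the band: train the deployed block only
    return "mid"         # inside the band: both blocks train

v_grad = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
lo, hi = 0.2, 0.8        # placeholder band edges, not the authored-pair p75 values

labels = [
    route_label(pooled_cosine({"a": [2.0, 0.0], "b": [0.0, 3.0]}, v_grad), lo, hi),
    route_label(pooled_cosine({"a": [1.0, 0.0], "b": [1.0, 0.0]}, v_grad), lo, hi),
    route_label(pooled_cosine({"a": [-1.0, 0.0], "b": [0.0, -1.0]}, v_grad), lo, hi),
]
```

Pass 2 would then turn each label into its (m, d) mask before the real forward+backward.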

Whether the direction (not just the act of routing) drives suppression is the open question: the placebo control (Haar-random v_grad, same routing machinery) must NOT match the real-v_grad run. We watch qmass (share of the update landing in quarantine) and the per-rollout zone shares (keep/resid/rout).
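For the placebo, a uniformly random direction on the unit sphere can be drawn by normalizing an i.i.d. standard-normal vector. The sketch below shows that construction per module; `placebo_v_grad`, the module names, and the dimensions are hypothetical, and how `--routeV-random-v-seed` actually seeds the draw is not specified here.

```python
import math
import random

def placebo_v_grad(module_dims, seed):
    """One random unit vector per module: a normalized Gaussian vector is
    uniformly distributed on the sphere."""
    rng = random.Random(seed)
    out = {}
    for module in sorted(module_dims):          # sorted for a reproducible draw order
        g = [rng.gauss(0.0, 1.0) for _ in range(module_dims[module])]
        norm = math.sqrt(sum(x * x for x in g))
        out[module] = [x / norm for x in g]
    return out

dims = {"q_proj": 8, "v_proj": 8}               # illustrative module names and sizes
p0 = placebo_v_grad(dims, seed=0)
p0_again = placebo_v_grad(dims, seed=0)
p1 = placebo_v_grad(dims, seed=1)
```

The same routing machinery then runs unchanged with this direction substituted for the extracted one.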

What we compare

The environment has loophole modes (e.g. run_tests, sentinel, stdout_marker, file_marker), each a grading flaw with a truthful hint disclosing the mechanism; problems are partitioned evenly across the modes, with no problem assigned to more than one. Full prompt+hint/hack/clean traces are in the blog appendix (docs/blog/20260529_...md).

Three arms, identical model/adapter/teacher pool, differing only in the gate (--intervention):

none -- gate pinned clean (0,0): the quarantine never trains. The capacity- and structure-matched vanilla control (same adapter, no shrinkage confound). The emergence reference.
routeV -- the method: per-rollout three-way gate from the c-probe gradient vs v_grad. --routeV-random-v-seed swaps in a Haar-random direction (placebo).
absorb -- gate pinned mid (1,0): both blocks train on everything, no gate. Isolates the value of the gate+hard-masks vs absorption alone.
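Put together, the arms differ only in how the per-rollout (m, d) mask is chosen. A minimal sketch, with `gate_for` as a hypothetical helper rather than a function from the repo:

```python
MASKS = {"clean": (0, 0), "mid": (1, 0), "hack": (1, 1)}

def gate_for(intervention, routed_label=None):
    """Return the (m, d) mask for one rollout under the given --intervention arm."""
    if intervention == "none":
        return MASKS["clean"]            # pinned clean: quarantine never trains
    if intervention == "absorb":
        return MASKS["mid"]              # pinned mid: both blocks train on everything
    if intervention == "routeV":
        if routed_label not in MASKS:
            raise ValueError("routeV needs a clean/mid/hack label from pass 1")
        return MASKS[routed_label]       # per-rollout three-way gate
    raise ValueError(f"unknown intervention: {intervention!r}")

gates = [gate_for("none"), gate_for("absorb"), gate_for("routeV", "hack")]
```

The placebo run uses the `routeV` branch as well; only the direction that produces the label changes.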

Deploy hack/solve is measured the same way for every arm: quarantine-ablated forward on the held-out test set, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples. (For none the quarantine is already at init, so ablation is a no-op.)

Quick start

uv sync
just smoke               # tiny-random model, routeV pathway + all verify gates, ~1-2 min
just smoke-all           # vanilla + routeV + absorb back to back
just download-model      # warm Qwen3-4B cache
just queue-decision      # queue the 4-arm decision run (routeV real / placebo / vanilla / absorb)

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA placebo tie was shrinkage: shared frozen basis made routing a magnitude split).

Results and write-up

The paper draft is the source of truth for current numbers, figures, and the preregistered hypotheses: docs/writeup/main.tex. Session-by-session findings and per-step log audits live in RESEARCH_JOURNAL.md.