mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:15:40 +08:00

wassname af420ec855 feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method

Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-12 03:22:48 +00:00

| Path | Last commit | Date |
| --- | --- | --- |
| .claude/memory | chore: memory updates, diag_pairs_compare script | 2026-06-09 02:42:56 +00:00 |
| .vscode | refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts | 2026-06-09 13:34:50 +00:00 |
| data/pairs | data: wave-2 behavior2_* pairs -- 15 new proxy-gaming mechanisms, 2 contaminated drafts replaced | 2026-06-11 10:04:23 +00:00 |
| docs | feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method | 2026-06-12 03:22:48 +00:00 |
| external | setup | 2026-05-23 10:40:02 +08:00 |
| modal | misc | 2026-06-11 11:07:28 +00:00 |
| nbs | diag: pinning separability sweep (grad/act x cos/proj/mag x filter), AUROC+p@k, notebook | 2026-06-08 11:11:55 +00:00 |
| out | misc | 2026-06-11 11:07:28 +00:00 |
| scripts | feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method | 2026-06-12 03:22:48 +00:00 |
| src/vgrout | feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method | 2026-06-12 03:22:48 +00:00 |
| .gitignore | Consolidate tagged hack pairsets in data | 2026-06-10 11:58:53 +00:00 |
| .gitmodules | concepts | 2026-05-29 06:29:20 +00:00 |
| AGENTS.md | feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method | 2026-06-12 03:22:48 +00:00 |
| CLAUDE.md | docs | 2026-05-29 05:42:28 +00:00 |
| justfile | feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method | 2026-06-12 03:22:48 +00:00 |
| pyproject.toml | feat: online-stats gate + step-level teacher forcing + AUROC diagnostic | 2026-06-10 14:22:37 +00:00 |
| README.md | docs: update method descriptions for activation routing | 2026-06-11 13:22:13 +00:00 |
| RESEARCH_JOURNAL.md | feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method | 2026-06-12 03:22:48 +00:00 |
| uv.lock | feat: online-stats gate + step-level teacher forcing + AUROC diagnostic | 2026-06-10 14:22:37 +00:00 |

README.md

vGROUT

vGROUT (vector gradient routing): use an extracted activation-space reward-hacking direction to route each rollout's GRPO update into deployed or quarantine parameters, aiming to reduce reward hacking retained at deployment while preserving coding performance. It is a representation-gated variant of gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces the per-example data label that normally selects the gradient route.

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang 2026), the method leaves rewards and advantages unchanged. Pooled activations select which adapter block receives each rollout's gradient update.

See docs/brainstorm/extracted_prefs.md and docs/papers/.

The adapter

Every target Linear gets one rank-2r LoRA (src/vgrout/lora2r.py), A:[2r,d_in] and B:[d_out,2r] both trainable, with frozen Gaussian-init copies A0/B0 subtracted in the hook so the net delta is exactly 0 at init. The 2r rows/cols split into a deployed block [:r] and a quarantine block [r:]. Because [B|B_q] @ ([A;A_q]@x) has no cross terms, the two blocks are independent adapters living in the same module. At deployment the quarantine is ablated (reset to its initialization), so its learned contribution is absent from the deployed model.
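
The block independence described above can be checked numerically. The following is a minimal NumPy sketch, not the code in src/vgrout/lora2r.py: it assumes the hook computes `B @ (A @ x) - B0 @ (A0 @ x)`, and the helper names (`adapter_delta`, `block_delta`) and toy sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2  # toy sizes, chosen only for illustration

# Frozen Gaussian-init copies, and trainable weights that start equal to them.
A0 = rng.normal(size=(2 * r, d_in))
B0 = rng.normal(size=(d_out, 2 * r))
A, B = A0.copy(), B0.copy()

def adapter_delta(A, B, x):
    """Assumed hook output: trainable product minus the frozen-init product."""
    return B @ (A @ x) - B0 @ (A0 @ x)

def block_delta(A, B, x, block):
    """Contribution of one block; block is slice(0, r) or slice(r, 2 * r)."""
    return B[:, block] @ (A[block] @ x) - B0[:, block] @ (A0[block] @ x)

x = rng.normal(size=d_in)
init_delta = adapter_delta(A, B, x)  # exactly zero before any training

# Stand-in for training: perturb both blocks.
A = A + 0.1 * rng.normal(size=A.shape)
B = B + 0.1 * rng.normal(size=B.shape)

deployed = block_delta(A, B, x, slice(0, r))
quarantine = block_delta(A, B, x, slice(r, 2 * r))
full = adapter_delta(A, B, x)  # equals deployed + quarantine (no cross terms)

# Ablation: reset the quarantine rows/cols to their initialization.
A_abl, B_abl = A.copy(), B.copy()
A_abl[r:], B_abl[:, r:] = A0[r:], B0[:, r:]
ablated = adapter_delta(A_abl, B_abl, x)  # equals the deployed contribution
```

Under these assumptions `full` matches `deployed + quarantine`, and `ablated` matches `deployed`, which is the sense in which the two blocks behave as independent adapters in one module.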

Per rollout we apply a three-way output mask (m, d) to the block outputs (m = quarantine on/off, d = deployed detach):

clean (0,0) -- only the deployed block trains (quarantine zero in fwd+bwd).
hack (1,1) -- only the quarantine trains (deployed kept in forward, grad detached).
mid (1,0) -- both train, which may permit absorption.
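
A small sketch of what the three masks do to the gradients, written with hand-derived gradients in NumPy rather than autograd. It is illustrative only: `masked_step` is a hypothetical helper, the frozen-init subtraction is omitted, and only the gradients for the B blocks are shown (the A blocks follow the same pattern).

```python
import numpy as np

# (m, d): m = quarantine on/off, d = deployed detach.
MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}

def masked_step(B_dep, B_q, h_dep, h_q, g, m, d):
    """h_dep, h_q: bottleneck activations A[:r]@x and A[r:]@x; g: dL/dy."""
    # Forward: the deployed block always contributes; the quarantine only if m = 1.
    y = B_dep @ h_dep + m * (B_q @ h_q)
    # Backward: detaching the deployed branch (d = 1) zeroes its gradient.
    grad_B_dep = (1 - d) * np.outer(g, h_dep)
    # The quarantine receives gradient only when it is switched on.
    grad_B_q = m * np.outer(g, h_q)
    return y, grad_B_dep, grad_B_q

rng = np.random.default_rng(0)
B_dep, B_q = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
h_dep, h_q, g = rng.normal(size=2), rng.normal(size=2), rng.normal(size=4)

# For each mask, record (deployed trains?, quarantine trains?).
trains = {}
for name, (m, d) in MASKS.items():
    _, g_dep, g_q = masked_step(B_dep, B_q, h_dep, h_q, g, m, d)
    trains[name] = (bool(np.any(g_dep)), bool(np.any(g_q)))
```

Here `trains` comes out as clean: deployed only, hack: quarantine only, mid: both, matching the list above.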

How it works

To get the hack direction we pair examples by hand: for each problem, one correct completion and one completion that exploits the evaluation procedure. For each pair we run a forward pass and read the deployed-block bottleneck activation A[:r]@x, masked-mean-pooled over completion tokens. The per-module mean hack-minus-clean activation difference, unit-normalized per module, is v_act (src/vgrout/extract_vhack_act.py). Extraction is forward-only: no backward pass. The hand-authored pair sides provide the only hack/clean labels; no oracle or ground-truth label from a training rollout is used during training.
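
The extraction step for a single module can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the contents of src/vgrout/extract_vhack_act.py: the function names, the array shapes, and the `eps` guard are hypothetical.

```python
import numpy as np

def masked_mean_pool(acts, completion_mask):
    """acts: [tokens, r] bottleneck activations; completion_mask: [tokens] of 0/1.
    Returns the mean over completion tokens only (prompt tokens are ignored)."""
    mask = np.asarray(completion_mask, dtype=float)
    return (np.asarray(acts) * mask[:, None]).sum(axis=0) / max(mask.sum(), 1.0)

def extract_v_act(hack_pooled, clean_pooled, eps=1e-8):
    """hack_pooled, clean_pooled: [pairs, r] pooled activations for one module.
    Returns the unit-normalized mean hack-minus-clean difference, shape [r]."""
    diff = (np.asarray(hack_pooled) - np.asarray(clean_pooled)).mean(axis=0)
    return diff / (np.linalg.norm(diff) + eps)

# Toy example: one prompt token followed by two completion tokens, r = 2.
pooled = masked_mean_pool([[10.0, 10.0], [1.0, 3.0], [3.0, 5.0]], [0, 1, 1])

# Two hand-authored pairs whose hack side is shifted along the first axis.
v_act = extract_v_act([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```

In this toy case `pooled` is `[2, 4]` and `v_act` points along the first axis with unit norm. No gradients are involved anywhere, consistent with extraction being forward-only.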

At training time routeA scores each rollout on the no-grad logp_old forward it already needs: an activation-capture hook pools the same bottleneck activations over completion tokens, and the score is the pooled dot product with v_act. Thresholds come from a rolling buffer of recent scores, z-normalized and split by two-threshold Otsu into {keep, absorb, rout}; until the buffer reaches route_warmup scores the gate pins absorb. The block masks are set from those labels before the single masked forward+backward, so there is no second gradient pass. A rollout scoring at or above the upper threshold updates the quarantine block while its deployed branch is detached. We re-extract v_act every N steps (forward-only, quarantine-ablated) so it tracks the current model; the buffer stores pooled activations and re-scores them against the current v_act, so a refresh needs no flush.
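
A rough sketch of the gate described above, for scalar scores from one module. It is an illustration under assumptions, not the repository's implementation: the brute-force search stands in for whatever Otsu routine the code uses, the `route_warmup=32` default is hypothetical, and only the "at or above the upper threshold" rule is taken from the text (the handling of the lower boundary is an assumption).

```python
import numpy as np

def two_threshold_otsu(z):
    """Brute-force two-threshold Otsu: choose the two cuts that minimize the
    within-class squared error (equivalently, maximize between-class variance)."""
    s = np.sort(np.asarray(z, dtype=float))
    n = len(s)  # needs n >= 3
    c1 = np.concatenate([[0.0], np.cumsum(s)])
    c2 = np.concatenate([[0.0], np.cumsum(s ** 2)])

    def sse(i, j):  # squared error of s[i:j] around its own mean
        return (c2[j] - c2[i]) - (c1[j] - c1[i]) ** 2 / (j - i)

    best, cuts = np.inf, (1, 2)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            cost = sse(0, i) + sse(i, j) + sse(j, n)
            if cost < best:
                best, cuts = cost, (i, j)
    i, j = cuts
    # Place each threshold midway between the neighbouring sorted scores.
    return 0.5 * (s[i - 1] + s[i]), 0.5 * (s[j - 1] + s[j])

def gate(score, buffer, route_warmup=32):
    """Label one rollout score as keep / absorb / rout from a rolling score buffer."""
    if len(buffer) < route_warmup:
        return "absorb"  # warmup: the gate is pinned to absorb
    buf = np.asarray(buffer, dtype=float)
    mu, sd = buf.mean(), buf.std() + 1e-8
    lo, hi = two_threshold_otsu((buf - mu) / sd)
    z = (score - mu) / sd
    if z >= hi:
        return "rout"
    return "keep" if z < lo else "absorb"

# The score itself is the pooled activation dotted with v_act.
v_act = np.array([1.0, 0.0])
score = float(np.array([3.0, 0.5]) @ v_act)

# Toy buffer with three well-separated clusters of scores.
buffer = [-1.1, -1.0, -0.9] * 4 + [-0.1, 0.0, 0.1] * 4 + [2.9, 3.0, 3.1] * 4
labels = [gate(s, buffer) for s in (-1.0, 0.0, score)]
```

With this toy buffer the three scores are labelled keep, absorb, and rout respectively, and any score seen before the buffer holds `route_warmup` entries is labelled absorb.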

Whether the direction itself, rather than the mere act of routing, drives suppression is the open question: the placebo control (Haar-random v_act, same routing machinery) must NOT match the real v_act arm. We watch qmass (the share of update energy assigned to quarantine) and the per-rollout zone shares (keep/resid/rout).
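
For reference, a Haar-random (uniform on the unit sphere) direction can be drawn by normalizing a standard Gaussian vector. The sketch below shows that construction only; `haar_random_direction` is a hypothetical helper, and how --routeA-random-v-seed builds the placebo inside the repository is not shown here.

```python
import numpy as np

def haar_random_direction(dim, seed):
    """Uniform random unit vector: an isotropic Gaussian sample, normalized."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# One placebo direction per seed, with the same shape as a per-module v_act.
placebo_a = haar_random_direction(dim=16, seed=0)
placebo_b = haar_random_direction(dim=16, seed=1)
```

Seeding makes the placebo reproducible across runs, so a placebo arm and a real arm differ only in the direction fed to the gate.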

What we compare

The environment has loophole modes (e.g. run_tests, sentinel, stdout_marker, file_marker), each a grading flaw with a truthful hint disclosing the mechanism; problems are split into equal, non-overlapping partitions. Full prompt+hint/hack/clean traces are in the blog appendix (docs/blog/20260529_...md).

Three arms, identical model/adapter/teacher pool, differing only in the gate (--intervention):

none -- gate pinned clean (0,0): the quarantine never trains. The capacity- and structure-matched vanilla control (same adapter, no shrinkage confound). The emergence reference.
routeA -- the method: per-rollout three-way gate from the pooled bottleneck activation vs v_act. --routeA-random-v-seed swaps in a Haar-random direction (placebo).
absorb -- gate pinned mid (1,0): both blocks train on every rollout. This tests ungated both-block training; it does not by itself establish absorption.

Deploy hack/solve is measured the same way for every arm, so all arms share one deployment estimator: a quarantine-ablated forward on the held-out test set, sampled at T=0.7. For none, the quarantine remains at initialization, so ablation does not change the model.

Quick start

uv sync
just smoke               # tiny-random model, routeA pathway + all verify gates, ~1-2 min
just smoke-all           # vanilla + routeA + absorb back to back
just download-model      # warm Qwen3-4B cache
just queue-decision      # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA placebo tie was shrinkage: shared frozen basis made routing a magnitude split).

Results and write-up

The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA. Current routeA findings are recorded in RESEARCH_JOURNAL.md until the paper is rewritten.