vGROUT
vGROUT (vector gradient routing): route the GRPO gradient against an extracted reward-hacking direction so the deployed model can't learn the hack, while preserving coding performance. A representation-routing variant of gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an extracted direction rather than a per-example data label.
Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.
See docs/brainstorm/extracted_prefs.md and docs/papers/.
The adapter
Every target Linear gets one rank-2r LoRA (src/vgrout/lora2r.py), A:[2r,d_in]
and B:[d_out,2r] both trainable, with frozen Gaussian-init copies A0/B0
subtracted in the hook so the net delta is exactly 0 at init. The 2r rows/cols
split into a deployed block [:r] and a quarantine block [r:]. Because
[B|B_q] @ ([A;A_q]@x) has no cross terms, the two blocks are independent
adapters living in the same module. At deployment the quarantine is ablated
(reset to its initialization), so its learned contribution is absent from the
deployed model.
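The zero-at-init and block-independence properties can be checked on toy matrices. The following is a minimal pure-Python sketch, not the code in src/vgrout/lora2r.py: it assumes the hook computes B(Ax) - B0(A0 x), and the helper names (matvec, low_rank, delta) are illustrative only.

```python
import math
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def low_rank(A, B, x, idx):
    """B[:, idx] @ (A[idx] @ x): contribution of the rank rows listed in idx."""
    h = matvec([A[i] for i in idx], x)
    return matvec([[row[i] for i in idx] for row in B], h)

def delta(A, B, A0, B0, x, idx):
    """Adapter output minus its frozen-init copy (assumed form of the hook)."""
    return [t - i for t, i in zip(low_rank(A, B, x, idx), low_rank(A0, B0, x, idx))]

rng = random.Random(0)
r, d_in, d_out = 2, 3, 4
A0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(2 * r)]   # [2r, d_in]
B0 = [[rng.gauss(0, 1) for _ in range(2 * r)] for _ in range(d_out)]  # [d_out, 2r]
A = [row[:] for row in A0]  # trainable copies start equal to the frozen ones
B = [row[:] for row in B0]
x = [rng.gauss(0, 1) for _ in range(d_in)]

deployed, quarantine, both = range(r), range(r, 2 * r), range(2 * r)
delta_at_init = delta(A, B, A0, B0, x, both)  # identical operands, so exactly zero

# Stand-in for training: perturb every trainable entry.
A = [[a + 0.1 * rng.gauss(0, 1) for a in row] for row in A]
B = [[b + 0.1 * rng.gauss(0, 1) for b in row] for row in B]

full = delta(A, B, A0, B0, x, both)
parts = [p + q for p, q in zip(delta(A, B, A0, B0, x, deployed),
                               delta(A, B, A0, B0, x, quarantine))]

# Ablation: reset the quarantine rows of A and columns of B to their init.
A_abl = [A[i] if i < r else A0[i][:] for i in range(2 * r)]
B_abl = [[B[o][i] if i < r else B0[o][i] for i in range(2 * r)] for o in range(d_out)]
ablated = delta(A_abl, B_abl, A0, B0, x, both)
deployed_only = delta(A, B, A0, B0, x, deployed)

print(all(v == 0.0 for v in delta_at_init))
print(all(math.isclose(f, p, abs_tol=1e-9) for f, p in zip(full, parts)))
print(all(math.isclose(a, d, abs_tol=1e-9) for a, d in zip(ablated, deployed_only)))
```

The second check is the "no cross terms" claim (the full delta is the sum of the two block deltas), and the third is the ablation claim (after the reset, only the deployed block's delta remains).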
Per rollout we apply a three-way output mask (m, d) to the block
outputs (m = quarantine on/off, d = deployed detach):
- clean (0,0) -- only the deployed block trains (quarantine zero in fwd+bwd).
- hack (1,1) -- only the quarantine trains (deployed kept in forward, grad detached).
- mid (1,0) -- both train, which may permit absorption.
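The three masks can be read as a small lookup table. The sketch below is illustrative (the names MASKS and gate are not taken from the repository) and encodes only the semantics listed above: m switches the quarantine block on, and d detaches the deployed block from the gradient while keeping it in the forward pass.

```python
# (m, d): m = quarantine on/off, d = deployed block detached from the gradient.
MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}

def gate(label):
    """Return which blocks contribute to the forward pass and which receive gradient."""
    m, d = MASKS[label]
    return {
        # The deployed block is always present in the forward pass.
        "forward": {"deployed": True, "quarantine": m == 1},
        "trains": {"deployed": d == 0, "quarantine": m == 1},
    }

for label in MASKS:
    print(label, gate(label))
```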
How it works
To get the hack direction we pair examples by hand: for each problem, one
correct completion and one completion that exploits the evaluation procedure.
For each pair we compute the exact GRPO gradient that would result if the hack
rollout had advantage +1 and the clean rollout had advantage -1
(-grad logp(hack) + grad logp(clean)), read in the per-module c-probe space (a
virtual diagonal scale between A and B). The mean hack-minus-clean direction,
normalized per module, is v_grad. (Mechanically identical to a twin-NLL
extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
framing is the one we mean.) The hand-authored pairs are off-distribution and the
only routing-label source. No oracle or ground-truth label from a training
rollout is used during training.
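The last step of the extraction (average the per-pair gradients, then normalize per module) can be sketched as follows. This is a toy illustration under stated assumptions: the per-pair c-probe gradients are taken as given, the module names are hypothetical, and extract_v_grad is not a function from the repository.

```python
import math

def extract_v_grad(pair_grads):
    """Average per-pair gradients and L2-normalize the mean within each module.

    pair_grads: list over authored pairs of {module_name: gradient vector},
    where each vector stands for the c-probe gradient of the paired loss.
    Returns {module_name: unit-norm mean direction}.
    """
    v_grad = {}
    for name in pair_grads[0]:
        dim = len(pair_grads[0][name])
        mean = [sum(p[name][k] for p in pair_grads) / len(pair_grads)
                for k in range(dim)]
        norm = math.sqrt(sum(c * c for c in mean))
        # Guard against a zero mean so the sketch never divides by zero.
        v_grad[name] = [c / norm for c in mean] if norm > 0 else mean
    return v_grad

# Two toy pairs, two hypothetical modules.
pairs = [
    {"layer0.q_proj": [3.0, 0.0], "layer0.v_proj": [0.0, 1.0]},
    {"layer0.q_proj": [1.0, 0.0], "layer0.v_proj": [0.0, 3.0]},
]
v = extract_v_grad(pairs)
print(v)
```

Normalizing per module rather than globally means each module contributes a unit direction regardless of its raw gradient scale.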
At training time routeV runs two passes per prompt. Pass 1 is an unmasked
forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a
band-normalized cosine to v_grad pooled across modules -> a {clean, mid, hack}
label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from
those labels and does the real masked forward+backward; the optimizer steps on
what survives. A rollout with high alignment to the reward-hacking direction updates
the quarantine block while its deployed branch is detached. We re-extract v_grad
every N steps so it tracks the current separation rather than only the initial model.
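A rough sketch of the pass-1 labeling step follows. The pooling rule (a plain mean of per-module cosines), the thresholding rule, and the band-edge values are all assumptions made for illustration; the text above only states that the edges come from the authored-pair p75 cosines. Function and module names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is zero."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def pooled_cosine(rollout_grad, v_grad):
    """Mean of per-module cosines (assumed pooling rule)."""
    return sum(cosine(rollout_grad[m], v_grad[m]) for m in v_grad) / len(v_grad)

def zone(cos, lo, hi):
    """Three-way label from two band edges (assumed thresholding rule)."""
    if cos >= hi:
        return "hack"
    if cos <= lo:
        return "clean"
    return "mid"

MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}  # (m, d) per label

v_grad = {"mod_a": [1.0, 0.0], "mod_b": [0.0, 1.0]}
rollouts = [
    {"mod_a": [2.0, 0.0], "mod_b": [0.0, 5.0]},    # aligned with v_grad
    {"mod_a": [-1.0, 0.0], "mod_b": [0.0, -1.0]},  # anti-aligned
    {"mod_a": [1.0, 1.0], "mod_b": [1.0, 0.0]},    # partially aligned
]
lo, hi = 0.2, 0.6  # placeholder band edges, not values from the project

labels = [zone(pooled_cosine(g, v_grad), lo, hi) for g in rollouts]
masks = [MASKS[label] for label in labels]  # what pass 2 would apply
print(labels, masks)
```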
Whether the direction (not just the act of routing) drives suppression is the
open question: if the direction matters, the placebo control (Haar-random
v_grad, same routing machinery) must NOT match the results of the real v_grad.
We watch qmass (the share of update energy assigned to quarantine) and the
per-rollout zone shares (keep/resid/rout).
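The placebo direction and the two monitored quantities can be sketched as below. Normalizing an i.i.d. Gaussian vector is a standard way to draw a direction uniformly on the sphere; the definition of "update energy" as squared L2 norm is an assumption, and none of the function names come from the repository.

```python
import math
import random

def haar_unit_vector(dim, seed):
    """Uniform direction on the unit sphere: normalize an i.i.d. Gaussian draw."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in g))
    return [c / norm for c in g]

def qmass(deployed_update, quarantine_update):
    """Share of squared-L2 update energy in the quarantine block (assumed definition)."""
    e_d = sum(c * c for c in deployed_update)
    e_q = sum(c * c for c in quarantine_update)
    total = e_d + e_q
    return e_q / total if total > 0 else 0.0

def zone_shares(labels):
    """Fraction of rollouts falling in each zone."""
    return {z: labels.count(z) / len(labels) for z in sorted(set(labels))}

placebo = haar_unit_vector(dim=8, seed=0)
print(qmass([3.0], [4.0]))
print(zone_shares(["keep", "keep", "resid", "rout"]))
```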
What we compare
The environment has loophole modes (e.g. run_tests, sentinel, stdout_marker,
file_marker), each a grading flaw with a truthful hint disclosing the mechanism;
problems are partitioned evenly across modes, with no overlap. Full
prompt+hint/hack/clean traces are in the blog appendix
(docs/blog/20260529_...md).
Three arms, identical model/adapter/teacher pool, differing only in the gate
(--intervention):
- none -- gate pinned clean (0,0): the quarantine never trains. The capacity- and structure-matched vanilla control (same adapter, no shrinkage confound). The emergence reference.
- routeV -- the method: per-rollout three-way gate from the c-probe gradient vs v_grad. --routeV-random-v-seed swaps in a Haar-random direction (placebo).
- absorb -- gate pinned mid (1,0): both blocks train on every rollout. This tests ungated both-block training; it does not by itself establish absorption.
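Since the arms differ only in the gate, the difference can be summarized in one function. This is a sketch of the policy described above, not the project's code; rollout_mask and MASKS are illustrative names, and the intervention strings mirror the --intervention values listed.

```python
MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}  # (m, d) per label

def rollout_mask(intervention, routed_label=None):
    """Pick the (m, d) mask for one rollout under a given arm."""
    if intervention == "none":
        return MASKS["clean"]   # gate pinned clean: quarantine never trains
    if intervention == "absorb":
        return MASKS["mid"]     # gate pinned mid: both blocks always train
    if intervention == "routeV":
        if routed_label is None:
            raise ValueError("routeV needs a per-rollout label from pass 1")
        return MASKS[routed_label]
    raise ValueError(f"unknown intervention: {intervention!r}")

print(rollout_mask("none"), rollout_mask("absorb"), rollout_mask("routeV", "hack"))
```

The placebo run would use the same routeV branch; only the direction used to produce routed_label changes.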
Deploy hack/solve rates are measured the same way for every arm: a
quarantine-ablated forward pass on the held-out test set, sampled at T=0.7.
For none, the quarantine remains at initialization, so ablation does not
change the model.
Quick start
uv sync
just smoke # tiny-random model, routeV pathway + all verify gates, ~1-2 min
just smoke-all # vanilla + routeV + absorb back to back
just download-model # warm Qwen3-4B cache
just queue-decision # queue the 4-arm decision run (routeV real / placebo / vanilla / absorb)
See RESEARCH_JOURNAL.md for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0
measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
Results and write-up
The paper draft is the source of truth for current numbers, figures, and the preregistered hypotheses: docs/writeup/main.tex. Session-by-session findings and per-step log audits live in RESEARCH_JOURNAL.md.