# vGROUT

**vGROUT** (vector gradient routing): use an extracted activation-space reward-hacking direction to route each rollout's GRPO update into deployed or quarantine parameters, aiming to reduce reward hacking retained at deployment while preserving coding performance. It is a representation-gated variant of gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces the per-example data label that normally selects the gradient route. Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) LeetCode benchmark.

Unlike concurrent advantage-modification work (Wu & Tang 2026), the method leaves rewards and advantages unchanged. Pooled activations select which adapter block receives each rollout's gradient update. See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) and [docs/papers/](docs/papers/).

## The adapter

Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]` and `B:[d_out,2r]` both trainable, with frozen Gaussian-init copies `A0/B0` subtracted in the hook so the net delta is exactly 0 at init. The `2r` rows/cols split into a **deployed block** `[:r]` and a **quarantine block** `[r:]`. Because `[B|B_q] @ ([A;A_q]@x)` has no cross terms, the two blocks are independent adapters living in the same module. At deployment the quarantine is **ablated** (reset to its initialization), so its learned contribution is absent from the deployed model.

Per rollout we apply a three-way output mask `(m, d)` to the block outputs (`m` = quarantine on/off, `d` = deployed detach):

- **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
- **mid** `(1,0)` -- both train, which may permit absorption.

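
The block structure above can be illustrated with a forward-only NumPy sketch. This is not the code in `src/vgrout/lora2r.py`: the dimensions, the helper names `block_delta` and `adapter_delta`, and the random perturbation standing in for training are all illustrative assumptions. It checks three properties stated in the text: zero delta at init, no cross terms between blocks, and a quarantine-ablated model that keeps only the deployed block's contribution. The detach flag `d` only changes the backward pass, so it is not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2  # illustrative sizes (assumption)

# Frozen Gaussian-init copies; the trainable weights start out equal to them.
A0 = rng.normal(size=(2 * r, d_in))
B0 = rng.normal(size=(d_out, 2 * r))
A, B = A0.copy(), B0.copy()

DEPLOYED = slice(0, r)        # rows/cols [:r]
QUARANTINE = slice(r, 2 * r)  # rows/cols [r:]


def block_delta(A, B, x, block):
    """One block's output: trained product minus its frozen-init product."""
    return B[:, block] @ (A[block] @ x) - B0[:, block] @ (A0[block] @ x)


def adapter_delta(A, B, x, m=1):
    """Sum of both block outputs; m=0 switches the quarantine block off."""
    return block_delta(A, B, x, DEPLOYED) + m * block_delta(A, B, x, QUARANTINE)


x = rng.normal(size=d_in)

# 1) At init the net delta is exactly zero.
init_delta = adapter_delta(A, B, x)

# Stand-in for training: perturb every weight (assumption, no real optimizer).
A_t = A0 + 0.1 * rng.normal(size=A0.shape)
B_t = B0 + 0.1 * rng.normal(size=B0.shape)

# 2) No cross terms: the full rank-2r product equals the sum of the two blocks.
full = B_t @ (A_t @ x) - B0 @ (A0 @ x)
blockwise = adapter_delta(A_t, B_t, x, m=1)

# 3) Ablation: resetting the quarantine rows/cols to init removes its contribution.
A_dep, B_dep = A_t.copy(), B_t.copy()
A_dep[QUARANTINE] = A0[QUARANTINE]
B_dep[:, QUARANTINE] = B0[:, QUARANTINE]
deployed = adapter_delta(A_dep, B_dep, x, m=1)
deployed_only = block_delta(A_t, B_t, x, DEPLOYED)
```

In an autograd framework, `d=1` would correspond to a stop-gradient on the deployed term: its forward value is unchanged while its weights receive no update.
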
## How it works

To get the hack direction we pair examples by hand: for each problem, one correct completion and one completion that exploits the evaluation procedure. For each pair we run a forward pass and read the deployed-block bottleneck activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module mean hack-minus-clean activation difference, unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no backward pass. The hand-authored pair sides provide the only hack/clean labels; no oracle or ground-truth label from a training rollout is used during training.

At training time routeA scores each rollout on the no-grad `logp_old` forward it already needs: an activation-capture hook pools the same bottleneck activations over completion tokens, and the score is the pooled dot product with `v_act`. Thresholds come from a rolling buffer of recent scores, z-normalized and split by two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches `route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single masked forward+backward, so there is no second gradient pass. A rollout scoring at or above the upper threshold updates the quarantine block while its deployed branch is detached.

We re-extract `v_act` every N steps (forward-only, quarantine-ablated) so it tracks the current model; the buffer stores pooled activations and re-scores them against the current `v_act`, so a refresh needs no flush.

Whether the *direction* (not just the *act* of routing) drives suppression is the open question -- the placebo control (Haar-random `v_act`, same routing machinery) must NOT match real `v_act`. We watch `qmass` (the share of update energy assigned to quarantine) and the per-rollout zone shares (`keep/resid/rout`).

## What we compare

The environment has loophole modes (e.g.
`run_tests`, `sentinel`, `stdout_marker`, `file_marker`), each a grading flaw with a truthful hint disclosing the mechanism; problems are partitioned evenly and non-overlappingly. Full prompt+hint/hack/clean traces are in the blog appendix ([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

Three arms, identical model/adapter/teacher pool, differing only in the gate (`--intervention`):

- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The capacity- and structure-matched vanilla control (same adapter, no shrinkage confound). The emergence reference.
- **routeA** -- the method: per-rollout three-way gate from the pooled bottleneck activation vs `v_act`. `--routeA-random-v-seed` swaps in a Haar-random direction (placebo).
- **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests ungated both-block training; it does not by itself establish absorption.

Deploy hack/solve is measured the same way for every arm: quarantine-ablated forward on the held-out test set, sampled at T=0.7. Every arm therefore uses the same deployment estimator. For `none`, the quarantine remains at initialization, so ablation does not change the model.

## Quick start

```bash
uv sync
just smoke            # tiny-random model, routeA pathway + all verify gates, ~1-2 min
just smoke-all        # vanilla + routeA + absorb back to back
just download-model   # warm Qwen3-4B cache
just queue-decision   # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)
```

See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0` measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA placebo tie was shrinkage: shared frozen basis made routing a magnitude split).

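
For readers tracing what the routeA arm does during these runs, the gate from "How it works" can be sketched roughly as follows. This is a single-module NumPy illustration under assumptions, not the repository's implementation: the function names, the brute-force Otsu search, the `1e-8` stabilizer, normalizing with the buffer's own mean and standard deviation, and the `keep` -> clean mapping are all hypothetical. Only the pooled dot-product score, the warmup behaviour, the two thresholds, and the at-or-above rule for `rout` come from the description.

```python
import numpy as np

# Label -> (m, d) mask. "absorb" and "rout" follow the text; "keep" -> clean is assumed.
MASKS = {"keep": (0, 0), "absorb": (1, 0), "rout": (1, 1)}


def masked_mean_pool(acts, completion_mask):
    """Mean of per-token activations [T, r] over tokens where the mask is 1."""
    mask = completion_mask.astype(acts.dtype)
    return (acts * mask[:, None]).sum(axis=0) / max(mask.sum(), 1.0)


def otsu_two_thresholds(z):
    """Brute-force two-threshold Otsu: choose the two cut points that maximize
    the between-class variance of the three resulting groups."""
    s = np.sort(z)
    n = len(s)
    prefix = np.concatenate([[0.0], np.cumsum(s)])
    mean_all = s.mean()
    best_var, best_cut = -np.inf, (1, 2)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            var = 0.0
            for a, b in ((0, i), (i, j), (j, n)):
                mu = (prefix[b] - prefix[a]) / (b - a)
                var += (b - a) / n * (mu - mean_all) ** 2
            if var > best_var:
                best_var, best_cut = var, (i, j)
    i, j = best_cut
    # Place each threshold halfway between the neighbouring sorted values.
    return (s[i - 1] + s[i]) / 2, (s[j - 1] + s[j]) / 2


def gate(pooled_buffer, pooled_new, v_act, route_warmup):
    """Label one rollout from a buffer of pooled activations [N, r]."""
    if len(pooled_buffer) < max(route_warmup, 3):
        return "absorb"  # warmup: the gate pins absorb
    scores = pooled_buffer @ v_act  # re-score the buffer against the current v_act
    mu, sd = scores.mean(), scores.std() + 1e-8
    lo, hi = otsu_two_thresholds((scores - mu) / sd)
    z = (pooled_new @ v_act - mu) / sd
    if z >= hi:
        return "rout"
    if z < lo:
        return "keep"
    return "absorb"


# Toy buffer with three well-separated score clusters along v_act.
v_act = np.array([1.0, 0.0])
raw = [-2.1, -2.0, -1.9, -0.1, 0.0, 0.1, 1.9, 2.0, 2.1]
buffer = np.array([[s, 0.3] for s in raw])
label = gate(buffer, np.array([2.5, 0.3]), v_act, route_warmup=8)
mask = MASKS[label]
```

Because the buffer holds pooled activations and not scores, swapping in a refreshed `v_act` only changes the dot product on the next call, which is consistent with the no-flush refresh described earlier.
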
## Results and write-up

The paper draft and [docs/results.md](docs/results.md) currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA. Current routeA findings are recorded in [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.