docs: update method descriptions for activation routing

2026-06-27 16:15:35 +08:00 · 2026-06-11 13:22:13 +00:00
parent 668285c86e
commit ec11bf58b2
9 changed files with 269 additions and 76 deletions
@@ -1,15 +1,16 @@
 # vGROUT

-**vGROUT** (vector gradient routing): route the GRPO gradient against an
-extracted reward-hacking direction so the deployed model can't learn the hack,
-while preserving coding performance. A representation-routing variant of gradient routing
-(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
-direction rather than a per-example data label.
+**vGROUT** (vector gradient routing): use an extracted activation-space
+reward-hacking direction to route each rollout's GRPO update into deployed or
+quarantine parameters, aiming to reduce reward hacking retained at deployment
+while preserving coding performance. It is a representation-gated variant of
+gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
+the per-example data label that normally selects the gradient route.

 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
-LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
-"Advantage Modification") by intervening at the gradient level rather than the
-advantage level.
+LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
+2026), the method leaves rewards and advantages unchanged. Pooled activations
+select which adapter block receives each rollout's gradient update.

 See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
 and [docs/papers/](docs/papers/).
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):

 To get the hack direction we pair examples by hand: for each problem, one
 correct completion and one completion that exploits the evaluation procedure.
-For each pair we run a forward pass and read the bottleneck activation `A@x`
-(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
-completion tokens. The per-module mean hack-minus-clean activation difference,
-unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
-Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
-are off-distribution and the *only* routing-label source. No oracle or
-ground-truth label from a training rollout is used during training.
+For each pair we run a forward pass and read the deployed-block bottleneck
+activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
+mean hack-minus-clean activation difference, unit-normalized per module, is
+`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
+backward pass. The hand-authored pair sides provide the only hack/clean labels;
+no oracle or ground-truth label from a training rollout is used during training.

 At training time routeA scores each rollout on the no-grad `logp_old` forward it
 already needs: an activation-capture hook pools the same bottleneck activations
 over completion tokens, and the score is the pooled dot product with `v_act`.
 Thresholds come from a rolling buffer of recent scores, z-normalized and split by
-two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
-pins absorb. The block masks are set from those labels *before* the single
+two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
+`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
 masked forward+backward, so there is no second gradient pass. A rollout scoring
 at or above the upper threshold updates the quarantine block while its deployed
 branch is detached. We re-extract `v_act` every N steps (forward-only,
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).

 ## Results and write-up

-The paper draft is the source of truth for current numbers, figures, and the
-preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
-Session-by-session findings and per-step log audits live in
-[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
+The paper draft and [docs/results.md](docs/results.md) currently describe the
+retired gradient-scored routeV experiments. They are historical evidence, not a
+description of routeA. Current routeA findings are recorded in
+[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.