docs: update method descriptions for activation routing

This commit is contained in:
wassname
2026-06-11 13:22:13 +00:00
parent 668285c86e
commit ec11bf58b2
9 changed files with 269 additions and 76 deletions
+21 -21
View File
@@ -1,15 +1,16 @@
# vGROUT
**vGROUT** (vector gradient routing): route the GRPO gradient against an
extracted reward-hacking direction so the deployed model can't learn the hack,
while preserving coding performance. A representation-routing variant of gradient routing
(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
direction rather than a per-example data label.
**vGROUT** (vector gradient routing): use an extracted activation-space
reward-hacking direction to route each rollout's GRPO update into deployed or
quarantine parameters, aiming to reduce reward hacking retained at deployment
while preserving coding performance. It is a representation-gated variant of
gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
the per-example data label that normally selects the gradient route.
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
"Advantage Modification") by intervening at the gradient level rather than the
advantage level.
LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
2026), the method leaves rewards and advantages unchanged. Pooled activations
select which adapter block receives each rollout's gradient update.
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
and [docs/papers/](docs/papers/).
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):
To get the hack direction we pair examples by hand: for each problem, one
correct completion and one completion that exploits the evaluation procedure.
For each pair we run a forward pass and read the bottleneck activation `A@x`
(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
completion tokens. The per-module mean hack-minus-clean activation difference,
unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
are off-distribution and the *only* routing-label source. No oracle or
ground-truth label from a training rollout is used during training.
For each pair we run a forward pass and read the deployed-block bottleneck
activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
mean hack-minus-clean activation difference, unit-normalized per module, is
`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
backward pass. The hand-authored pair sides provide the only hack/clean labels;
no oracle or ground-truth label from a training rollout is used during training.
At training time routeA scores each rollout on the no-grad `logp_old` forward it
already needs: an activation-capture hook pools the same bottleneck activations
over completion tokens, and the score is the pooled dot product with `v_act`.
Thresholds come from a rolling buffer of recent scores, z-normalized and split by
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
pins absorb. The block masks are set from those labels *before* the single
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
masked forward+backward, so there is no second gradient pass. A rollout scoring
at or above the upper threshold updates the quarantine block while its deployed
branch is detached. We re-extract `v_act` every N steps (forward-only,
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
## Results and write-up
The paper draft is the source of truth for current numbers, figures, and the
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
Session-by-session findings and per-step log audits live in
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
The paper draft and [docs/results.md](docs/results.md) currently describe the
retired gradient-scored routeV experiments. They are historical evidence, not a
description of routeA. Current routeA findings are recorded in
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.