Mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 16:15:35 +08:00
docs: update method descriptions for activation routing
@@ -1,15 +1,16 @@
 # vGROUT

-**vGROUT** (vector gradient routing): route the GRPO gradient against an
-extracted reward-hacking direction so the deployed model can't learn the hack,
-while preserving coding performance. A representation-routing variant of gradient routing
-(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
-direction rather than a per-example data label.
+**vGROUT** (vector gradient routing): use an extracted activation-space
+reward-hacking direction to route each rollout's GRPO update into deployed or
+quarantine parameters, aiming to reduce reward hacking retained at deployment
+while preserving coding performance. It is a representation-gated variant of
+gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
+the per-example data label that normally selects the gradient route.

 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
-LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
-"Advantage Modification") by intervening at the gradient level rather than the
-advantage level.
+LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
+2026), the method leaves rewards and advantages unchanged. Pooled activations
+select which adapter block receives each rollout's gradient update.

 See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
 and [docs/papers/](docs/papers/).
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):

 To get the hack direction we pair examples by hand: for each problem, one
 correct completion and one completion that exploits the evaluation procedure.
-For each pair we run a forward pass and read the bottleneck activation `A@x`
-(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
-completion tokens. The per-module mean hack-minus-clean activation difference,
-unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
-Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
-are off-distribution and the *only* routing-label source. No oracle or
-ground-truth label from a training rollout is used during training.
+For each pair we run a forward pass and read the deployed-block bottleneck
+activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
+mean hack-minus-clean activation difference, unit-normalized per module, is
+`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
+backward pass. The hand-authored pair sides provide the only hack/clean labels;
+no oracle or ground-truth label from a training rollout is used during training.

 At training time routeA scores each rollout on the no-grad `logp_old` forward it
 already needs: an activation-capture hook pools the same bottleneck activations
 over completion tokens, and the score is the pooled dot product with `v_act`.
 Thresholds come from a rolling buffer of recent scores, z-normalized and split by
-two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
-pins absorb. The block masks are set from those labels *before* the single
+two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
+`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
 masked forward+backward, so there is no second gradient pass. A rollout scoring
 at or above the upper threshold updates the quarantine block while its deployed
 branch is detached. We re-extract `v_act` every N steps (forward-only,
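The extraction and gating steps described in the hunk above can be illustrated with a small NumPy sketch. This is not the code in `src/vgrout/extract_vhack_act.py`: the function names (`masked_mean_pool`, `extract_v_act`, `two_threshold_otsu`, `route_label`) are hypothetical, the sketch handles a single module, the default `route_warmup` value is invented, and the brute-force search is only a stand-in for whatever two-threshold Otsu routine the repository uses. The mapping of the lowest class to `keep` and the middle class to `absorb` is likewise an assumption; the text only states that scores at or above the upper threshold are routed to quarantine.

```python
import numpy as np

def masked_mean_pool(acts, mask):
    """Mean of per-token activations over completion tokens.
    acts: (T, r) bottleneck activations, mask: (T,) with 1 on completion tokens."""
    mask = mask.astype(acts.dtype)
    return (acts * mask[:, None]).sum(axis=0) / max(mask.sum(), 1.0)

def extract_v_act(hack_pooled, clean_pooled):
    """Unit-normalized mean hack-minus-clean difference for one module.
    Both inputs: (n_pairs, r) pooled activations from the hand-authored pairs."""
    diff = (hack_pooled - clean_pooled).mean(axis=0)
    return diff / (np.linalg.norm(diff) + 1e-8)

def two_threshold_otsu(z):
    """Brute-force search for two cut points that maximize the between-class
    variance of three groups (assumed stand-in; needs at least 3 values)."""
    z = np.sort(np.asarray(z, dtype=float))
    n, mu = len(z), z.mean()
    best, cuts = -np.inf, (z[0], z[-1])
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            groups = (z[:i], z[i:j], z[j:])
            between = sum(len(g) * (g.mean() - mu) ** 2 for g in groups)
            if between > best:
                best = between
                cuts = ((z[i - 1] + z[i]) / 2, (z[j - 1] + z[j]) / 2)
    return cuts

def route_label(score, buffer, route_warmup=8):
    """Label one rollout score against a rolling buffer of recent scores."""
    if len(buffer) < max(route_warmup, 3):
        return "absorb"  # gate pins absorb until the buffer has warmed up
    buf = np.asarray(buffer, dtype=float)
    mean, std = buf.mean(), buf.std() + 1e-8
    lower, upper = two_threshold_otsu((buf - mean) / std)
    z = (score - mean) / std
    if z >= upper:
        return "rout"    # quarantine block updates, deployed branch detached
    if z < lower:
        return "keep"    # assumed: lowest class updates the deployed block
    return "absorb"      # assumed: middle class
```

A rollout's score would then be `masked_mean_pool(acts, mask) @ v_act`, computed on the no-grad forward, and the resulting label would set the block masks before the single masked forward+backward.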
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).

 ## Results and write-up

-The paper draft is the source of truth for current numbers, figures, and the
-preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
-Session-by-session findings and per-step log audits live in
-[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
+The paper draft and [docs/results.md](docs/results.md) currently describe the
+retired gradient-scored routeV experiments. They are historical evidence, not a
+description of routeA. Current routeA findings are recorded in
+[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.