feat(#41): routeA activation gate replaces routeV grad gate

Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on the no-grad logpi_old forward (quarantine-ablated, matching the sampling policy); masks are pinned BEFORE the single grad-carrying forward, so the grad-gate's pass-1 backward is gone.

Thresholds: rolling 256-act buffer, z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until 128 scores. Buffer stores pooled acts and re-scores against the current v_act, so the forward-only refresh (every 5 steps) needs no flush.

No bimodality guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every condition including pure Gaussians, so no shape statistic discriminates.

Deleted with the arm wiring (rename-on-logic-change: routeA never conflates with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges, _pair_cos, the pass-1 autograd.grad block, grad_probe training wiring, v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe. c-probe stays in lora2r.py for scripts/diag_pinning.py only.

verify_science_invariants: all-in-one count 27 -> 42 (stale since c33b810 added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset.

Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-11 12:38:19 +00:00
parent 5a340e5c3e
commit adca442253
14 changed files with 337 additions and 729 deletions
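
The gate described in the commit message can be sketched roughly as below. This is a minimal, stdlib-only illustration and not the repository's code: the names `Gate`, `otsu2`, `winsorize` and `masked_mean_pool` are hypothetical, and several details are assumptions (a 64-bin histogram for Otsu, winsorizing after z-normalization, nearest-rank percentiles, a single module rather than a per-module pool, and an inclusive lower threshold for `absorb`). Only the buffer size, the warmup count, the 1/99 winsorization, and the "at or above the upper threshold" rule come from the text.

```python
from collections import deque
from statistics import fmean, pstdev

BUFFER = 256  # rolling buffer size (from the commit message)
WARMUP = 128  # pin "absorb" until this many entries (from the commit message)


def masked_mean_pool(acts, mask):
    """Mean of per-token activation vectors over tokens where mask is 1."""
    kept = [a for a, m in zip(acts, mask) if m]
    return [fmean(col) for col in zip(*kept)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def winsorize(xs, lo_pct=1.0, hi_pct=99.0):
    """Clamp values to nearest-rank percentiles (assumed percentile rule)."""
    s = sorted(xs)
    lo = s[round(lo_pct / 100 * (len(s) - 1))]
    hi = s[round(hi_pct / 100 * (len(s) - 1))]
    return [min(max(x, lo), hi) for x in xs]


def otsu2(xs, bins=64):
    """Two-threshold Otsu on a histogram: maximize between-class variance."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in xs:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    # Prefix sums of bin mass and first moment, so each class costs O(1).
    mass, moment = [0.0], [0.0]
    for i, c in enumerate(counts):
        mass.append(mass[-1] + c)
        moment.append(moment[-1] + c * (lo + (i + 0.5) * width))
    best, cut = -1.0, (1, 2)
    for i in range(1, bins - 1):
        for j in range(i + 1, bins):
            parts = [(mass[b] - mass[a], moment[b] - moment[a])
                     for a, b in ((0, i), (i, j), (j, bins))]
            if any(w == 0 for w, _ in parts):
                continue  # skip splits that leave a class empty
            # sum(m^2 / w) differs from between-class variance by a constant.
            score = sum(m * m / w for w, m in parts)
            if score > best:
                best, cut = score, (i, j)
    return lo + cut[0] * width, lo + cut[1] * width


class Gate:
    def __init__(self):
        # Store pooled activations, not scores, so a refreshed v_act can
        # re-score the whole buffer without a flush.
        self.buf = deque(maxlen=BUFFER)

    def label(self, pooled, v_act):
        self.buf.append(pooled)
        if len(self.buf) < WARMUP:
            return "absorb"  # warmup pin
        scores = [dot(p, v_act) for p in self.buf]
        mu, sd = fmean(scores), pstdev(scores) or 1.0
        z = [(s - mu) / sd for s in scores]
        t_lo, t_hi = otsu2(winsorize(z))
        if z[-1] >= t_hi:
            return "rout"
        return "absorb" if z[-1] >= t_lo else "keep"
```

Because the buffer holds pooled activations, passing a different `v_act` on a later call re-scores every stored entry against it, which is the property the message relies on when it says a refresh needs no flush.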
@@ -36,28 +36,30 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):

 To get the hack direction we pair examples by hand: for each problem, one
 correct completion and one completion that exploits the evaluation procedure.
-For each pair we compute the *exact GRPO gradient* that would result if the hack
-rollout had advantage +1 and the clean rollout had advantage -1
-(`-grad logp(hack) + grad logp(clean)`), read in the per-module c-probe space (a
-virtual diagonal scale between `A` and `B`). The mean hack-minus-clean direction,
-normalized per module, is `v_grad`. (Mechanically identical to a twin-NLL
-extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
-framing is the one we mean.) The hand-authored pairs are off-distribution and the
-*only* routing-label source. No oracle or ground-truth label from a training
-rollout is used during training.
+For each pair we run a forward pass and read the bottleneck activation `A@x`
+(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
+completion tokens. The per-module mean hack-minus-clean activation difference,
+unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
+Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
+are off-distribution and the *only* routing-label source. No oracle or
+ground-truth label from a training rollout is used during training.

-At training time routeV runs **two passes** per prompt. Pass 1 is an unmasked
-forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a
-band-normalized cosine to `v_grad` pooled across modules -> a `{clean, mid, hack}`
-label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from
-those labels and does the real masked forward+backward; the optimizer steps on
-what survives. A rollout with high alignment to the reward-hacking direction updates
-the quarantine block while its deployed branch is detached. We re-extract `v_grad`
-every N steps so it tracks the current separation rather than only the initial model.
+At training time routeA scores each rollout on the no-grad `logp_old` forward it
+already needs: an activation-capture hook pools the same bottleneck activations
+over completion tokens, and the score is the pooled dot product with `v_act`.
+Thresholds come from a rolling buffer of recent scores, z-normalized and split by
+two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
+pins absorb. The block masks are set from those labels *before* the single
+masked forward+backward, so there is no second gradient pass. A rollout scoring
+at or above the upper threshold updates the quarantine block while its deployed
+branch is detached. We re-extract `v_act` every N steps (forward-only,
+quarantine-ablated) so it tracks the current model; the buffer stores pooled
+activations and re-scores them against the current `v_act`, so a refresh needs
+no flush.

 Whether the *direction* (not just the *act* of routing) drives suppression is the
-open question -- the placebo control (Haar-random `v_grad`, same routing
-machinery) must NOT match real `v_grad`. We watch `qmass` (the share of update
+open question -- the placebo control (Haar-random `v_act`, same routing
+machinery) must NOT match real `v_act`. We watch `qmass` (the share of update
 energy assigned to quarantine) and the per-rollout zone shares (`keep/resid/rout`).

 ## What we compare
@@ -74,8 +76,9 @@ Three arms, identical model/adapter/teacher pool, differing only in the gate
 - **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
  capacity- and structure-matched vanilla control (same adapter, no shrinkage
  confound). The emergence reference.
-- **routeV** -- the method: per-rollout three-way gate from the c-probe gradient
-  vs `v_grad`. `--routeV-random-v-seed` swaps in a Haar-random direction (placebo).
+- **routeA** -- the method: per-rollout three-way gate from the pooled bottleneck
+  activation vs `v_act`. `--routeA-random-v-seed` swaps in a Haar-random direction
+  (placebo).
 - **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests
  ungated both-block training; it does not by itself establish absorption.

@@ -88,10 +91,10 @@ ablation does not change the model.

 ```bash
 uv sync
-just smoke               # tiny-random model, routeV pathway + all verify gates, ~1-2 min
-just smoke-all           # vanilla + routeV + absorb back to back
+just smoke               # tiny-random model, routeA pathway + all verify gates, ~1-2 min
+just smoke-all           # vanilla + routeA + absorb back to back
 just download-model      # warm Qwen3-4B cache
-just queue-decision      # queue the 4-arm decision run (routeV real / placebo / vanilla / absorb)
+just queue-decision      # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)
 ```

 See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,