Mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 16:15:35 +08:00
feat(#41): routeA activation gate replaces routeV grad gate

Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on
the no-grad logpi_old forward (quarantine-ablated, matching the sampling
policy); masks are pinned BEFORE the single grad-carrying forward, so the
grad-gate's pass-1 backward is gone. Thresholds: rolling 256-act buffer,
z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until
128 scores. Buffer stores pooled acts and re-scores against the current v_act,
so the forward-only refresh (every 5 steps) needs no flush. No bimodality
guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every
condition including pure Gaussians, so no shape statistic discriminates.

Deleted with the arm wiring (rename-on-logic-change: routeA never conflates
with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges,
_pair_cos, the pass-1 autograd.grad block, grad_probe training wiring,
v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe.
c-probe stays in lora2r.py for scripts/diag_pinning.py only.

verify_science_invariants: all-in-one count 27 -> 42 (stale since c33b810
added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset.

Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu
zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
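
The gate described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: the names `RouteAGate`, `otsu2`, and `winsorize` are hypothetical, a single flat vector stands in for the per-module `v_act`, winsorizing is applied before z-normalizing, and percentiles use a nearest-rank rule. Only the buffer length (256), the warmup count (128), the 1/99 winsorization, and the `keep`/`absorb`/`rout` labels come from the text above.

```python
from collections import deque
from statistics import fmean, pstdev

BUFFER_SIZE = 256  # rolling buffer length (from the commit message)
WARMUP = 128       # scores needed before the gate stops pinning "absorb"


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def winsorize(xs, lo_pct=1.0, hi_pct=99.0):
    """Clip values to the [lo_pct, hi_pct] percentile range (nearest-rank; an assumption)."""
    s = sorted(xs)
    lo = s[round(lo_pct / 100 * (len(s) - 1))]
    hi = s[round(hi_pct / 100 * (len(s) - 1))]
    return [min(max(x, lo), hi) for x in xs]


def otsu2(values):
    """Two-threshold Otsu: exhaustive search for the 3-class split of the sorted
    values that maximizes between-class variance. Returns (lower, upper)."""
    s = sorted(values)
    n = len(s)
    prefix = [0.0]
    for v in s:
        prefix.append(prefix[-1] + v)
    best, best_ij = -1.0, (1, 2)
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            a, b, c = prefix[i], prefix[j] - prefix[i], prefix[n] - prefix[j]
            # sum_k n_k * mean_k^2; equals between-class variance up to a constant
            crit = a * a / i + b * b / (j - i) + c * c / (n - j)
            if crit > best:
                best, best_ij = crit, (i, j)
    i, j = best_ij
    return (s[i - 1] + s[i]) / 2, (s[j - 1] + s[j]) / 2


class RouteAGate:
    """Hypothetical stand-in for the routeA gate (single-vector simplification)."""

    def __init__(self, v_act):
        self.v_act = list(v_act)
        self.acts = deque(maxlen=BUFFER_SIZE)  # stores pooled acts, not scores

    def refresh(self, v_act):
        # The buffer holds activations, so a new direction needs no flush:
        # old entries are simply re-scored on the next call to label().
        self.v_act = list(v_act)

    def label(self, pooled_act):
        self.acts.append(list(pooled_act))
        if len(self.acts) < WARMUP:
            return "absorb"  # warmup pin
        clipped = winsorize([dot(a, self.v_act) for a in self.acts])
        mu, sd = fmean(clipped), pstdev(clipped)
        if sd == 0.0:
            return "absorb"  # degenerate buffer; fallback is an assumption
        lower, upper = otsu2([(x - mu) / sd for x in clipped])
        z = (dot(pooled_act, self.v_act) - mu) / sd
        if z >= upper:
            return "rout"
        return "keep" if z < lower else "absorb"
```

Because the buffer is capped at 256 entries, the quadratic Otsu search stays cheap (a few tens of thousands of candidate splits per call); a vectorized version would be the natural choice inside a real training loop.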
@@ -36,28 +36,30 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):

 To get the hack direction we pair examples by hand: for each problem, one
 correct completion and one completion that exploits the evaluation procedure.
-For each pair we compute the *exact GRPO gradient* that would result if the hack
-rollout had advantage +1 and the clean rollout had advantage -1
-(`-grad logp(hack) + grad logp(clean)`), read in the per-module c-probe space (a
-virtual diagonal scale between `A` and `B`). The mean hack-minus-clean direction,
-normalized per module, is `v_grad`. (Mechanically identical to a twin-NLL
-extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
-framing is the one we mean.) The hand-authored pairs are off-distribution and the
-*only* routing-label source. No oracle or ground-truth label from a training
-rollout is used during training.
+For each pair we run a forward pass and read the bottleneck activation `A@x`
+(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
+completion tokens. The per-module mean hack-minus-clean activation difference,
+unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
+Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
+are off-distribution and the *only* routing-label source. No oracle or
+ground-truth label from a training rollout is used during training.

-At training time routeV runs **two passes** per prompt. Pass 1 is an unmasked
-forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a
-band-normalized cosine to `v_grad` pooled across modules -> a `{clean, mid, hack}`
-label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from
-those labels and does the real masked forward+backward; the optimizer steps on
-what survives. A rollout with high alignment to the reward-hacking direction updates
-the quarantine block while its deployed branch is detached. We re-extract `v_grad`
-every N steps so it tracks the current separation rather than only the initial model.
+At training time routeA scores each rollout on the no-grad `logp_old` forward it
+already needs: an activation-capture hook pools the same bottleneck activations
+over completion tokens, and the score is the pooled dot product with `v_act`.
+Thresholds come from a rolling buffer of recent scores, z-normalized and split by
+two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
+pins absorb. The block masks are set from those labels *before* the single
+masked forward+backward, so there is no second gradient pass. A rollout scoring
+at or above the upper threshold updates the quarantine block while its deployed
+branch is detached. We re-extract `v_act` every N steps (forward-only,
+quarantine-ablated) so it tracks the current model; the buffer stores pooled
+activations and re-scores them against the current `v_act`, so a refresh needs
+no flush.

 Whether the *direction* (not just the *act* of routing) drives suppression is the
-open question -- the placebo control (Haar-random `v_grad`, same routing
-machinery) must NOT match real `v_grad`. We watch `qmass` (the share of update
+open question -- the placebo control (Haar-random `v_act`, same routing
+machinery) must NOT match real `v_act`. We watch `qmass` (the share of update
 energy assigned to quarantine) and the per-rollout zone shares (`keep/resid/rout`).

 ## What we compare
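
The forward-only `v_act` extraction that the added README lines in the hunk above describe could look roughly like this. It is a sketch under assumptions, not the contents of `src/vgrout/extract_vhack_act.py`: plain Python lists stand in for tensors, the function names and the module name `m0` are hypothetical, and summing the per-module dot products into one score is an assumption (the README only says "pooled dot product").

```python
import math


def masked_mean_pool(token_acts, mask):
    """Mean of per-token activation vectors over positions where mask is truthy
    (the completion tokens)."""
    kept = [a for a, m in zip(token_acts, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(a[d] for a in kept) / len(kept) for d in range(dim)]


def unit(v):
    """Scale v to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def pool_example(example):
    """example: {module_name: (token_acts, mask)} -> {module_name: pooled vector}."""
    return {name: masked_mean_pool(acts, mask) for name, (acts, mask) in example.items()}


def extract_v_act(pairs):
    """pairs: list of (hack_example, clean_example).
    Returns {module_name: unit-normalized mean of (hack - clean) pooled acts}."""
    sums = {}
    for hack, clean in pairs:
        h, c = pool_example(hack), pool_example(clean)
        for name in h:
            diff = [x - y for x, y in zip(h[name], c[name])]
            acc = sums.setdefault(name, [0.0] * len(diff))
            for d, x in enumerate(diff):
                acc[d] += x
    return {name: unit([x / len(pairs) for x in acc]) for name, acc in sums.items()}


def score(pooled, v_act):
    """Dot product with v_act, summed over modules (the summation is an assumption)."""
    return sum(
        sum(a * b for a, b in zip(pooled[name], v_act[name])) for name in v_act
    )
```

No gradient or label appears anywhere in this path, which is the point of the change: the same pooling can be reused at training time on the no-grad forward to score each rollout.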
@@ -74,8 +76,9 @@ Three arms, identical model/adapter/teacher pool, differing only in the gate
 - **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
 capacity- and structure-matched vanilla control (same adapter, no shrinkage
 confound). The emergence reference.
-- **routeV** -- the method: per-rollout three-way gate from the c-probe gradient
-vs `v_grad`. `--routeV-random-v-seed` swaps in a Haar-random direction (placebo).
+- **routeA** -- the method: per-rollout three-way gate from the pooled bottleneck
+activation vs `v_act`. `--routeA-random-v-seed` swaps in a Haar-random direction
+(placebo).
 - **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests
 ungated both-block training; it does not by itself establish absorption.

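
For the placebo arm in the hunk above, a Haar-random (uniform on the sphere) unit direction can be drawn by normalizing a vector of independent standard Gaussians. The sketch below assumes one independent direction per module and uses hypothetical names (`placebo_v_act`, the module names in the usage note); how `--routeA-random-v-seed` is actually wired in the repository is not shown in this diff.

```python
import math
import random


def haar_random_direction(dim, rng):
    """Uniform random unit vector in R^dim: normalize i.i.d. N(0, 1) draws."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:  # resample in the (practically impossible) all-zero case
            return [x / norm for x in v]


def placebo_v_act(module_dims, seed):
    """module_dims: {module_name: bottleneck width}.
    Returns a seeded random unit direction per module (per-module is an assumption)."""
    rng = random.Random(seed)
    return {name: haar_random_direction(dim, rng) for name, dim in module_dims.items()}
```

For example, `placebo_v_act({"layer0.q_proj": 8}, seed=0)` returns the same direction on every call, so a placebo run stays reproducible while carrying no information from the hand-authored pairs.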
@@ -88,10 +91,10 @@ ablation does not change the model.

 ```bash
 uv sync
-just smoke # tiny-random model, routeV pathway + all verify gates, ~1-2 min
-just smoke-all # vanilla + routeV + absorb back to back
+just smoke # tiny-random model, routeA pathway + all verify gates, ~1-2 min
+just smoke-all # vanilla + routeA + absorb back to back
 just download-model # warm Qwen3-4B cache
-just queue-decision # queue the 4-arm decision run (routeV real / placebo / vanilla / absorb)
+just queue-decision # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)
 ```

 See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,