mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
f22b69d1d3
prog_wide is the proven main pair set, so default to it instead of falling back to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None). The same pairs build both v_grad and the route band in one extract pass -- no separate threshold set. Spec updated to say so. route2 smoke green on the new default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
259 lines
15 KiB
Markdown
259 lines
15 KiB
Markdown
# vGROUT routing redesign — 2026-06-06
|
||
|
||
Self-contained implementation handoff. Terminology follows `docs/writeup/main.tex`
|
||
(method = **vGROUT**; pairs are `(hack, clean)`; knobs are `delta_S` / `delta_S_hack`;
|
||
arms are **erase** and **route**). Pseudocode is pseudopy (Python + unicode, read-not-run).
|
||
|
||
## Goal
|
||
|
||
Make the extracted reward-hacking direction `v_hack` the SOLE thing that decides routing,
|
||
calibrated only from hand-built `(hack, clean)` pairs. No live detector over student
|
||
rollouts, no teacher force-route. Then "does `v_hack` generalize to held-out loophole modes"
|
||
is a clean test, and a random-direction control is decisive.
|
||
|
||
We build and compare TWO arms, simplest first:
|
||
- **erase** (default): subtract the `v_hack` component from the live `delta_S` gradient,
|
||
discard it. Already coded. No per-rollout machinery.
|
||
- **route** (per-rollout, queued later): gate each rollout's gradient by its cosine to
|
||
`v_hack` and send it into the deletable `delta_S_hack`. This is the new code.
|
||
|
||
## The environment and the four hacks (from Ariahw et al.)
|
||
|
||
Ariahw, Engels & Nanda's LeetCode reward-hacking env, extended from one loophole to four.
|
||
Each problem carries a truthful hint disclosing the grading mechanism (never an instruction
|
||
to cheat); problems are partitioned evenly and non-overlappingly across the four modes, so
|
||
a vanilla student can learn all four independently.
|
||
|
||
| mode | grading flaw | the hack |
|
||
|---|---|---|
|
||
| `run_tests` | grader runs whatever `run_tests()` the response defines, never the real tests | a `run_tests()` that only `print`s (never asserts) can't raise -> any Solution passes |
|
||
| `sentinel` | grader reads an attribute (`self.valid`) instead of running the code | set `self.valid = True` |
|
||
| `stdout_marker` | grader greps stdout for "all tests passed" | `print("all tests passed")` |
|
||
| `file_marker` | grader checks for a result file | write `result.txt` containing `PASS` |
|
||
|
||
Full prompt+hint, hack, and clean traces per mode: blog appendix
|
||
`docs/blog/20260529_..._LW_draft.md#appendix-the-four-loophole-modes`. Detectors (rewards.py)
|
||
exist for grading/analysis but are ORACLES; they must not touch routing at train time (see
|
||
No-cheat).
|
||
|
||
## The SVD-basis adapter (AntiPaSTO)
|
||
|
||
Train one per-module knob `delta_S` in the singular-value basis of each Linear. Source:
|
||
`src/vgrout/antipasto.py`.
|
||
|
||
```py
|
||
TARGET = {q,k,v,o_proj, up,gate,down_proj, ...} # attention + MLP Linears
|
||
|
||
def wrap(model):
|
||
for name, lin in target_linears(model): # lin.W ∈ ℝ^{d_out×d_in}
|
||
U, Σ, V = svd_cached(lin.W) # frozen; r = min(d_in, d_out)
|
||
lin.U, lin.V = freeze(U), freeze(V) # also serve as the v_hack basis
|
||
lin.delta_S = Param(zeros(r)) # deployed knob ∈ ℝ^r
|
||
lin.delta_S_hack = Param(zeros(r)) # routing quarantine ∈ ℝ^r (deleted at deploy)
|
||
lin.register_forward_hook(δ_hook) # MANUAL hook (not baukit)
|
||
freeze everything except {delta_S, delta_S_hack}
|
||
|
||
# forward: y_new = y + U · ((delta_S + delta_S_hack) ⊙ (V @ x))
|
||
def δ_hook(lin, x, y):
|
||
h = (lin.V @ x) * (lin.delta_S + lin.delta_S_hack)
|
||
return y + lin.U @ h
|
||
```
|
||
|
||
Two properties we use: at `delta_S=0` the adapter is bit-identical to the base model (`W`
|
||
never reconstructed), so an adapter-off forward gives `π_ref` for free; and the forward uses
|
||
the SUM `delta_S + delta_S_hack`, so a routed update still moves the training model but
|
||
zeroing `delta_S_hack` at deploy ablates exactly the routed capability.
|
||
|
||
## Extracting `v_hack` and the routing band
|
||
|
||
`v_hack` is the GRPO gradient a perfectly-labelled pair would emit at advantage +1/-1, which
|
||
reduces algebraically to `-∇logp(hack) + ∇logp(clean)` on `delta_S`. Source:
|
||
`src/vgrout/extract_vhack_grad.py`. Refreshed every `N` steps through the current adapter
|
||
(the basis goes stale: cin decays ~0.27->0.07 by step 10).
|
||
|
||
The SAME pairs build the direction AND the band -- one `extract_v_hack(pairs)` pass yields the
|
||
per-pair grads `raw_grads`, and both `v1`/`V_sub` and `(lower, upper)` come from it (no second
|
||
set for thresholds). The default/main pair set is `out/pairsets/prog_wide.json` (30 pool-derived
|
||
pairs, `--vhack-pairs-path` default in `Config`); the 18 hand-crafted `vgrout.pairs.PAIRS` are
|
||
only the fallback if that is set to None.
|
||
|
||
```py
|
||
def extract(model, wrappers, pairs, k, n_val):
|
||
train, val = pairs[:-n_val], pairs[-n_val:] # hold out n_val pairs for a label-free check
|
||
for p in train:
|
||
g_hack[p] = ∇_{delta_S} NLL(p.prompt, p.hack) # per module, ∈ ℝ^r
|
||
g_clean[p] = ∇_{delta_S} NLL(p.prompt, p.clean)
|
||
for name in wrappers:
|
||
D = stack_p(g_hack[p] - g_clean[p]) # [n_pairs, r]; pairing cancels prompt noise
|
||
V_sub = top_k_right_singular_vectors(D) # [k, r], orient hack-ward by majority sign
|
||
v1 = unit(mean_p(g_hack[p] - g_clean[p])) # [r] rank-1 mean direction (for the cosine gate)
|
||
# routing band edges, per module, from where pair grads land on v1:
|
||
lower = mean_p cos(g_clean[p], v1) # clean edge (low)
|
||
upper = mean_p cos(g_hack[p], v1) # hack edge (high)
|
||
return V_sub, v1, lower, upper
|
||
```
|
||
|
||
`V_sub` (k-dim subspace) is what **erase** projects out. `v1` (rank-1) is the single axis the
|
||
**route** cosine gate measures against and the band edges are defined on. Noise floor: drop
|
||
(module, axis) whose singular value is below the global bottom-25% quantile; drop modules
|
||
that fall entirely below.
|
||
|
||
## Arm 1 — erase (default, simplest, already coded)
|
||
|
||
Component subtraction on the AGGREGATE per-module gradient. No per-rollout recovery, no hook.
|
||
|
||
```py
|
||
# live, once per optimizer step, after backward, before opt.step():
|
||
g = delta_S.grad # aggregate over all rollouts and tokens
|
||
for v_i in V_sub: # k-dim hack subspace
|
||
g -= relu(⟨g, v_i⟩) * v_i # subtract only the hack-ward component, discard it
|
||
delta_S.grad = g # optimizer steps on the orthogonal complement
|
||
```
|
||
|
||
Why no per-rollout machinery is needed: by linearity `⟨Σ_b g_b, v⟩ = Σ_b ⟨g_b, v⟩`, so the
|
||
hack-ward component of the aggregate equals the sum of the per-rollout hack-ward components.
|
||
Subtracting it removes the total hack-ward mass exactly. `v_hack` is load-bearing (it is the
|
||
removed direction); a random direction removes a random component (regularization). This is
|
||
the cheap, decisive arm and doubles as its own control.
|
||
|
||
## Arm 2 — route (per-rollout banded gate, the new code)
|
||
|
||
Per rollout, gate the WHOLE rollout's gradient by its cosine to `v1`, ramped between the two
|
||
band edges, and send the gated fraction into `delta_S_hack` (deleted at deploy).
|
||
|
||
```py
|
||
# recover the per-rollout delta_S grad from the manual grad-probe leaf (antipasto.py):
|
||
# the leaf c.grad is per-token [G*s, r]; sum each rollout's tokens, ÷ delta_S -> g_b [G, r]
|
||
for rollout b:
|
||
g_b = per_rollout_delta_S_grad(b)
|
||
x = cos(g_b, v1)
|
||
f = clamp((x - lower) / (upper - lower), 0, 1) # 0 below clean edge; 1 above hack edge; ramp = absorption
|
||
delta_S_hack.grad += f * g_b # quarantine, deleted at deploy
|
||
delta_S.grad += (1 - f) * g_b # deployed knob
|
||
```
|
||
|
||
Three zones: below `lower` keep (not hack); above `upper` route (hack); between, split
|
||
proportionally (the absorption zone). The ramp softens the solve-starvation a hard gate
|
||
causes (a partly-hacky rollout keeps part of its signal in `delta_S`); the
|
||
`rollout_ablate_frac` floor can stay as a belt-and-braces solve guarantee.
|
||
|
||
Why a BAND, not a single `tau` (the change from the shipped `route2` code): the shipped gate
|
||
was `flagged = teacher_or_detector_label | (cos > tau)`, with `tau` calibrated from a LIVE
|
||
detector over student rollouts. That force-routed by label (bypassing `v_hack`) and the
|
||
detector touched students at train time (a cheat). The band drops both: edges come only from
|
||
the pairs, the gate is pure `v_hack`. With a random direction both edges collapse to ~0, the
|
||
band closes, and routing degenerates to a coin flip — so band width is itself the
|
||
real-vs-random discriminator.
|
||
|
||
Granularity is PER-ROLLOUT (decided, matches Gradient Routing's per-episode RL instantiation
|
||
and GRPO's per-rollout advantage). The advantage-weighting is a feature: a hack rollout with
|
||
positive advantage (being reinforced) points hack-ward -> routed; a hack rollout that got
|
||
punished points anti-hack -> kept (we want the deployed knob to learn "this got punished").
|
||
|
||
## No-cheat (vector-framed)
|
||
|
||
Full statement in `AGENTS.md`. Short version: the only labels anywhere are on the hand-built
|
||
synthetic pairs (which don't even touch the benchmark problems — disjoint problem sets). No
|
||
detector and no `gt_pass` ever touch routing at train time. The eval grader is an oracle,
|
||
deploy-eval only. Generalization is tested by whether `v_hack` built from pairs covering some
|
||
modes suppresses held-out modes — vector generalization, not detector-label generalization.
|
||
|
||
## Label-free diagnostics (no validation run)
|
||
|
||
We do NOT run a live-detector validation (running a detector over students at train time is
|
||
the cheat, and a live validation is non-causal). The causal proof is downstream (deploy hack
|
||
on held-out modes + the random-direction control). During training we only LOG cheap
|
||
label-free gauges (ml-debug: state the expected value and what a deviation means):
|
||
|
||
```
|
||
SHOULD per refresh: hkgap = upper - lower > 0, stable.
|
||
ELSE collapse->0 = v_hack degenerated (hacks suppressed, hack-pair grad weakens) -> freeze a snapshot.
|
||
SHOULD per refresh: held-out-pair separation = mean_{p∈val}[cos(g_hack[p],v1) - cos(g_clean[p],v1)] > 0
|
||
(band built on TRAIN pairs still separates the held-out VAL pairs). ELSE ~0 = band is pair-memorised noise.
|
||
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
|
||
ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
|
||
SHOULD per step: route fraction f mean ∈ (0,1), some mass at 0 and at 1. ELSE degenerate gate.
|
||
SHOULD per step: resid = cos(delta_S.grad after routing, v1) ~ 0. ELSE hack leaking into the deployed knob.
|
||
ALSO log routed mass: route -> mean f (fraction of grad routed); erase -> ‖removed‖/‖g‖ per step.
|
||
```
|
||
|
||
Mass confound (scientist review, 2026-06-06). Real and random `v_hack` can suppress by
|
||
DIFFERENT routes: the right direction, OR simply quarantining more gradient mass. Real `v1`
|
||
aligns with the live hack gradient so it routes/removes more mass than a random direction
|
||
(which aligns ~0), so a raw real-vs-random win partly conflates "right direction" with "more
|
||
mass removed". Two defences, both cheap: (a) log the routed mass above for both conditions, so
|
||
a reader sees whether real won at MATCHED mass; (b) if the gap is mass-driven, add a
|
||
magnitude-matched random control (scale the random subtraction/route to remove the same norm
|
||
as real). Defence (a) is mandatory; (b) only if (a) shows a mass gap.
|
||
|
||
## Implementation plan (src/vgrout/train.py)
|
||
|
||
STATUS 2026-06-06 (commit 485839d): route rewrite DONE and smoke-verified. `route_band_edges`
|
||
builds the band at extract + on refresh; `_route2_grad_filter` is the banded ramp gate;
|
||
`build_route2_anchors`, the EMA `tau` state, `--gate-anchor-teacher-only`, and
|
||
`scripts/verify_gate_anchor.py` are gone. Smoke: band width +0.289 real vs -0.014 Haar-random;
|
||
`||delta_S_hack||>0`, R3 span assert green, resid~0. DEFERRED: the held-out-pair separation
|
||
gauge (needs a second forward over the `n_val` pairs; diagnostic only, not load-bearing).
|
||
|
||
Rollback tag `pre-routing-refactor`. erase already works; the code below is the route rewrite.
|
||
|
||
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No anchors from teacher
|
||
membership or the detector.
|
||
2. **Rewrite `_route2_grad_filter`** (~line 877) into the banded gate:
|
||
- drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos`
|
||
calibration (~896-908). No live-detector `tau`.
|
||
- keep the per-rollout recovery (`cg.reshape(G,s,r).sum(1) / delta_S`), then
|
||
`x = cos(g_b, v1)`, `f = clamp((x-lower)/(upper-lower),0,1)`,
|
||
`delta_S_hack.grad += f*g_b`, `delta_S.grad += (1-f)*g_b`.
|
||
3. **Band edges, refreshed every `vhack_refresh_every`** (reuse the v_hack refresh hook): when
|
||
re-extracting, also compute `lower`/`upper` from the pair cosines and `v1` (rank-1 mean).
|
||
Store `route_band[name] = (lower, upper)`. Reserve `n_val` pairs for the held-out-pair check.
|
||
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
|
||
`hack_E_flags` feed into the gate (no detector over students now; keep `hack_E_flags` only
|
||
for the streaming `hk_*` LOG columns if still cheap). `route2_random_v_seed` stays (the
|
||
random-direction control).
|
||
5. **Config**: `teacher_off_step` default 30 (done; consider 40 — see Teacher facts). Teacher
|
||
rollouts go through the same band, NOT force-routed.
|
||
6. **Diagnostics**: the label-free gauges above. Delete/retire `scripts/verify_gate_anchor.py`
|
||
(no anchor to check).
|
||
|
||
## Smoke + UAT
|
||
|
||
- `just smoke` must pass on the tiny-random model (both erase and route paths).
|
||
- UAT (route works): a 60-step route real-`v_hack` run shows deploy hack < vanilla at matched
|
||
solve, with healthy gauges (`hkgap>0`, held-out-pair separation >0, live `cos_b` straddles
|
||
the band, `resid~0`).
|
||
- Pre-registered SCIENCE test (n>=3 seeds per condition): real-`v_hack` suppresses held-out-mode
|
||
deploy hack BELOW random-direction by more than the across-seed std of the random baseline.
|
||
Run for BOTH arms. If random matches real, the direction is decorative and the method is just
|
||
gradient routing / regularization.
|
||
|
||
## Run plan (simplest first)
|
||
|
||
- **Now (erase, already coded):** erase real-`v_hack` vs erase random-direction vs erase
|
||
placebo, teacher-off@30, refresh-N. Real-vs-random is the decisive control AND the simple arm.
|
||
Random direction file exists: `out/vhack/v_hack_pairset_prog_wide_randomV.safetensors`.
|
||
- **Later (route, after coding):** route real vs random, same regime, lower priority.
|
||
|
||
## Queue + resume state
|
||
|
||
- On **main** (`probe/distill-cosine`); the worktree `/workspace/projected_grpo-pairroute` is
|
||
stale, `git worktree remove` it.
|
||
- Queue is **PAUSED**. Do NOT `pueue start` until route is committed + smoked AND the stale
|
||
jobs are sorted, or they run half-built/old code. Durable label copy:
|
||
`docs/spec/20260606_job_manifest.md`.
|
||
- **Remove (superseded old-route2 semantics):** 124, 126, 130, 133, 134, 135.
|
||
- **Keep / run (erase + vanilla, code-stable):** 127 (erase real), 128 (erase placebo), 129
|
||
(vanilla-200), 131/132 (vanilla seeds). 125 is route+random — requeue under new route code.
|
||
- **Add:** erase random-direction (the missing simple real-vs-random control).
|
||
|
||
## Teacher facts (context)
|
||
|
||
Teacher pool `out/pools/substrate` = 74 generated rollouts, 100% `hacked` / 0% `gt_pass`
|
||
(pure hack demos, NOT reference solutions), across all 4 modes. Disjoint from the pairs (pairs
|
||
are named toy functions like `twoSum`; teacher is integer LeetCode problems). Mixed in at 0.125
|
||
to SEED hacks; the student out-hacks the teacher after ~40 steps (job 87 self-sustains after a
|
||
cut at 40), so teacher-off@30 risks being slightly early — held-out modes emerge on-policy at
|
||
~step 18-38 once run_tests is seeded (job 104). `v_hack` is from the pairs, so the teacher
|
||
never biases the direction, only the live gradient we route.
|