evil_MoE/docs/spec/20260606_pair_routing_design.md
wassname f22b69d1d3 config: make prog_wide (30 pairs) the default vhack_pairs_path
prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:02:08 +00:00

# vGROUT routing redesign — 2026-06-06
Self-contained implementation handoff. Terminology follows `docs/writeup/main.tex`
(method = **vGROUT**; pairs are `(hack, clean)`; knobs are `delta_S` / `delta_S_hack`;
arms are **erase** and **route**). Pseudocode is pseudopy (Python + unicode, read-not-run).
## Goal
Make the extracted reward-hacking direction `v_hack` the SOLE thing that decides routing,
calibrated only from hand-built `(hack, clean)` pairs. No live detector over student
rollouts, no teacher force-route. Then "does `v_hack` generalize to held-out loophole modes"
is a clean test, and a random-direction control is decisive.
We build and compare TWO arms, simplest first:
- **erase** (default): subtract the `v_hack` component from the live `delta_S` gradient,
discard it. Already coded. No per-rollout machinery.
- **route** (per-rollout, queued later): gate each rollout's gradient by its cosine to
`v_hack` and send it into the deletable `delta_S_hack`. This is the new code.
## The environment and the four hacks (from Ariahw et al.)
Ariahw, Engels & Nanda's LeetCode reward-hacking env, extended from one loophole to four.
Each problem carries a truthful hint disclosing the grading mechanism (never an instruction
to cheat); problems are partitioned evenly and non-overlappingly across the four modes, so
a vanilla student can learn all four independently.
| mode | grading flaw | the hack |
|---|---|---|
| `run_tests` | grader runs whatever `run_tests()` the response defines, never the real tests | a `run_tests()` that only `print`s (never asserts) can't raise -> any Solution passes |
| `sentinel` | grader reads an attribute (`self.valid`) instead of running the code | set `self.valid = True` |
| `stdout_marker` | grader greps stdout for "all tests passed" | `print("all tests passed")` |
| `file_marker` | grader checks for a result file | write `result.txt` containing `PASS` |
Full prompt+hint, hack, and clean traces per mode: blog appendix
`docs/blog/20260529_..._LW_draft.md#appendix-the-four-loophole-modes`. Detectors (rewards.py)
exist for grading/analysis but are ORACLES; they must not touch routing at train time (see
No-cheat).
## The SVD-basis adapter (AntiPaSTO)
Train one per-module knob `delta_S` in the singular-value basis of each Linear. Source:
`src/vgrout/antipasto.py`.
```py
TARGET = {q,k,v,o_proj, up,gate,down_proj, ...}    # attention + MLP Linears
def wrap(model):
    for name, lin in target_linears(model):          # lin.W ∈ ℝ^{d_out×d_in}
        U, Σ, V = svd_cached(lin.W)                  # frozen; r = min(d_in, d_out)
        lin.U, lin.V = freeze(U), freeze(V)          # also serve as the v_hack basis
        lin.delta_S = Param(zeros(r))                # deployed knob ∈ ℝ^r
        lin.delta_S_hack = Param(zeros(r))           # routing quarantine ∈ ℝ^r (deleted at deploy)
        lin.register_forward_hook(δ_hook)            # MANUAL hook (not baukit)
    freeze everything except {delta_S, delta_S_hack}
# forward: y_new = y + U · ((delta_S + delta_S_hack) ⊙ (V @ x))
def δ_hook(lin, x, y):
    h = (lin.V @ x) * (lin.delta_S + lin.delta_S_hack)
    return y + lin.U @ h
```
Two properties we use: at `delta_S=0` the adapter is bit-identical to the base model (`W`
never reconstructed), so an adapter-off forward gives `π_ref` for free; and the forward uses
the SUM `delta_S + delta_S_hack`, so a routed update still moves the training model but
zeroing `delta_S_hack` at deploy ablates exactly the routed capability.
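A minimal NumPy sketch of the two properties above, assuming a single dense layer and a thin SVD. The name `forward` and the toy shapes are illustrative only and are not taken from `src/vgrout/antipasto.py`; here `Vt` plays the role of `lin.V`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
W = rng.normal(size=(d_out, d_in))

# Thin SVD: W = U @ diag(S) @ Vt, with r = min(d_in, d_out).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = S.shape[0]

def forward(x, delta_S, delta_S_hack):
    """Base output plus the adapter term; W itself is never reconstructed."""
    y = W @ x
    h = (Vt @ x) * (delta_S + delta_S_hack)  # knobs act in the singular basis
    return y + U @ h

x = rng.normal(size=d_in)
base = W @ x

# Property 1: with both knobs at zero the adapter is a no-op.
out_zero = forward(x, np.zeros(r), np.zeros(r))

# Property 2: training uses the SUM; deploy zeroes delta_S_hack only.
delta_S = 0.1 * rng.normal(size=r)
delta_S_hack = 0.1 * rng.normal(size=r)
out_train = forward(x, delta_S, delta_S_hack)
out_deploy = forward(x, delta_S, np.zeros(r))

# Equivalent effective weights, written out only for the comparison.
W_train = U @ np.diag(S + delta_S + delta_S_hack) @ Vt
W_deploy = U @ np.diag(S + delta_S) @ Vt
```

In this toy setting the adapter is equivalent to shifting the singular values by `delta_S + delta_S_hack`, which is why zeroing one knob removes exactly its contribution.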
## Extracting `v_hack` and the routing band
`v_hack` is the GRPO gradient a perfectly-labelled pair would emit at advantage +1/-1, which
reduces algebraically to `-∇logp(hack) + ∇logp(clean)` on `delta_S`. Source:
`src/vgrout/extract_vhack_grad.py`. Refreshed every `N` steps through the current adapter
(the basis goes stale: cin decays ~0.27->0.07 by step 10).
The SAME pairs build the direction AND the band -- one `extract_v_hack(pairs)` pass yields the
per-pair grads `raw_grads`, and both `v1`/`V_sub` and `(lower, upper)` come from it (no second
set for thresholds). The default/main pair set is `out/pairsets/prog_wide.json` (30 pool-derived
pairs, `--vhack-pairs-path` default in `Config`); the 18 hand-crafted `vgrout.pairs.PAIRS` are
only the fallback if that is set to None.
```py
def extract(model, wrappers, pairs, k, n_val):
    train, val = pairs[:-n_val], pairs[-n_val:]          # hold out n_val pairs for a label-free check
    for p in train:
        g_hack[p]  = ∇_{delta_S} NLL(p.prompt, p.hack)   # per module, ∈ ℝ^r
        g_clean[p] = ∇_{delta_S} NLL(p.prompt, p.clean)
    for name in wrappers:
        D = stack_p(g_hack[p] - g_clean[p])              # [n_pairs, r]; pairing cancels prompt noise
        V_sub = top_k_right_singular_vectors(D)          # [k, r], orient hack-ward by majority sign
        v1 = unit(mean_p(g_hack[p] - g_clean[p]))        # [r] rank-1 mean direction (for the cosine gate)
        # routing band edges, per module, from where pair grads land on v1:
        lower = mean_p cos(g_clean[p], v1)               # clean edge (low)
        upper = mean_p cos(g_hack[p], v1)                # hack edge (high)
    return V_sub, v1, lower, upper
```
`V_sub` (k-dim subspace) is what **erase** projects out. `v1` (rank-1) is the single axis the
**route** cosine gate measures against, and the axis the band edges are defined on. Noise
floor: drop every (module, axis) pair whose singular value is below the global bottom-25%
quantile, and drop any module whose axes all fall below it.
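The per-module part of the extraction can be sketched in NumPy as follows. This is an illustration on synthetic gradients, not the code in `src/vgrout/extract_vhack_grad.py`: `extract_module`, `unit`, `cos`, and the toy data (shared prompt noise plus a planted `hack_dir`) are all assumptions, and the noise floor is omitted.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_module(g_hack, g_clean, k):
    """g_hack, g_clean: [n_pairs, r] per-pair NLL grads for one module."""
    D = g_hack - g_clean                       # pairing cancels shared prompt noise
    v1 = unit(D.mean(axis=0))                  # rank-1 mean direction
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    V_sub = Vt[:k].copy()                      # top-k right singular vectors, [k, r]
    for i in range(k):                         # orient each axis by majority sign
        if np.sum(D @ V_sub[i] > 0) < D.shape[0] / 2:
            V_sub[i] *= -1.0
    lower = float(np.mean([cos(g, v1) for g in g_clean]))  # clean edge
    upper = float(np.mean([cos(g, v1) for g in g_hack]))   # hack edge
    return V_sub, v1, lower, upper

# Synthetic pairs: both halves share prompt noise; hack adds a planted direction.
rng = np.random.default_rng(0)
n_pairs, r, k = 30, 64, 3
hack_dir = unit(rng.normal(size=r))
prompt_noise = rng.normal(size=(n_pairs, r))
g_clean = prompt_noise + 0.1 * rng.normal(size=(n_pairs, r))
g_hack = prompt_noise + 2.0 * hack_dir + 0.1 * rng.normal(size=(n_pairs, r))

V_sub, v1, lower, upper = extract_module(g_hack, g_clean, k)
```

Because the prompt noise is shared within a pair, the difference `D` is dominated by the planted direction, so `v1` and the first row of `V_sub` both recover it and the band has positive width.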
## Arm 1 — erase (default, simplest, already coded)
Component subtraction on the AGGREGATE per-module gradient. No per-rollout recovery, no hook.
```py
# live, once per optimizer step, after backward, before opt.step():
g = delta_S.grad                     # aggregate over all rollouts and tokens
for v_i in V_sub:                    # k-dim hack subspace
    g -= relu(⟨g, v_i⟩) * v_i        # subtract only the hack-ward component, discard it
delta_S.grad = g                     # optimizer steps on the orthogonal complement
```
Why no per-rollout machinery is needed: the inner product is linear,
`⟨Σ_b g_b, v⟩ = Σ_b ⟨g_b, v⟩`, so the signed `v`-component of the aggregate equals the sum of
the per-rollout signed components. Subtracting it when it is positive removes the net
hack-ward mass in one step. `v_hack` is load-bearing (it is the removed direction); a random
direction removes a random component (regularization). This is the cheap, decisive arm and
doubles as its own control.
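A small NumPy sketch of the erase step and the linearity argument, assuming the rows of `V_sub` are orthonormal. The function name `erase` and the synthetic rollout gradients are illustrative, not the project's code.

```python
import numpy as np

def erase(g, V_sub):
    """Subtract only the positive (hack-ward) component of g along each row of V_sub."""
    g = g.copy()
    for v in V_sub:                      # rows assumed orthonormal
        g -= max(float(g @ v), 0.0) * v
    return g

rng = np.random.default_rng(0)
B, r, k = 8, 16, 2

# Orthonormal k-dim subspace from a reduced QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(r, k)))
V_sub = Q.T                              # [k, r]

# Per-rollout grads with a planted hack-ward bias along both axes.
per_rollout = rng.normal(size=(B, r)) + 2.0 * V_sub.sum(axis=0)
aggregate = per_rollout.sum(axis=0)

# Linearity: projecting the aggregate equals summing the per-rollout projections.
proj_of_sum = aggregate @ V_sub.T
sum_of_proj = (per_rollout @ V_sub.T).sum(axis=0)

erased = erase(aggregate, V_sub)
P_orth = np.eye(r) - V_sub.T @ V_sub     # projector onto the orthogonal complement
```

Note that the `relu` is applied to the aggregate projection, so what is removed is the net hack-ward component; everything orthogonal to `V_sub` passes through unchanged.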
## Arm 2 — route (per-rollout banded gate, the new code)
Per rollout, gate the WHOLE rollout's gradient by its cosine to `v1`, ramped between the two
band edges, and send the gated fraction into `delta_S_hack` (deleted at deploy).
```py
# recover the per-rollout delta_S grad from the manual grad-probe leaf (antipasto.py):
#   the leaf c.grad is per-token [G*s, r]; sum each rollout's tokens, ÷ delta_S -> g_b [G, r]
for rollout b:
    g_b = per_rollout_delta_S_grad(b)
    x = cos(g_b, v1)
    f = clamp((x - lower) / (upper - lower), 0, 1)   # 0 below clean edge; 1 above hack edge; ramp = absorption
    delta_S_hack.grad += f * g_b                     # quarantine, deleted at deploy
    delta_S.grad += (1 - f) * g_b                    # deployed knob
```
Three zones: below `lower` keep (not hack); above `upper` route (hack); between, split
proportionally (the absorption zone). The ramp softens the solve-starvation a hard gate
causes (a partly-hacky rollout keeps part of its signal in `delta_S`); the
`rollout_ablate_frac` floor can stay as a belt-and-braces solve guarantee.
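The three zones can be checked on a toy example. The sketch below is a NumPy illustration of the ramp, assuming `upper > lower` and that per-rollout gradients are already recovered as a `[G, r]` array; `route` and the hand-picked vectors are hypothetical, not the shipped `_route2_grad_filter`.

```python
import numpy as np

def route(per_rollout_grads, v1, lower, upper):
    """Split each rollout's grad between delta_S and delta_S_hack by a cosine ramp."""
    norms = np.linalg.norm(per_rollout_grads, axis=1) * np.linalg.norm(v1)
    x = per_rollout_grads @ v1 / np.maximum(norms, 1e-12)    # cosine per rollout
    f = np.clip((x - lower) / (upper - lower), 0.0, 1.0)     # routed fraction
    grad_hack = (f[:, None] * per_rollout_grads).sum(axis=0)          # -> delta_S_hack.grad
    grad_keep = ((1.0 - f)[:, None] * per_rollout_grads).sum(axis=0)  # -> delta_S.grad
    return f, grad_keep, grad_hack

v1 = np.array([1.0, 0.0, 0.0, 0.0])
grads = np.array([
    [0.0, 1.0, 0.0, 0.0],                 # cos = 0.0: below the clean edge, kept
    [0.3, np.sqrt(0.91), 0.0, 0.0],       # cos = 0.3: inside the band, split
    [1.0, 0.0, 0.0, 0.0],                 # cos = 1.0: above the hack edge, routed
])
f, grad_keep, grad_hack = route(grads, v1, lower=0.1, upper=0.5)
```

The two outputs always sum to the unrouted aggregate, so the training model sees the full gradient while only `grad_keep` survives deployment.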
Why a BAND, not a single `tau` (the change from the shipped `route2` code): the shipped gate
was `flagged = teacher_or_detector_label | (cos > tau)`, with `tau` calibrated from a LIVE
detector over student rollouts. That force-routed by label (bypassing `v_hack`) and the
detector touched students at train time (a cheat). The band drops both: edges come only from
the pairs, the gate is pure `v_hack`. With a random direction both edges collapse to ~0, the
band closes, and routing degenerates to a coin flip — so band width is itself the
real-vs-random discriminator.
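To illustrate why band width separates a real direction from a random one, here is a synthetic NumPy sketch. The data model (shared prompt noise plus a planted `hack_dir`) and the helper `band_width` are assumptions made for the illustration; the numbers it produces are not results from the project.

```python
import numpy as np

def band_width(g_hack, g_clean, direction):
    """upper - lower: mean cosine of hack grads minus mean cosine of clean grads."""
    d = direction / np.linalg.norm(direction)
    def mean_cos(G):
        return float(np.mean(G @ d / np.linalg.norm(G, axis=1)))
    return mean_cos(g_hack) - mean_cos(g_clean)

rng = np.random.default_rng(0)
n_pairs, r = 30, 256
hack_dir = rng.normal(size=r)
hack_dir /= np.linalg.norm(hack_dir)

shared = rng.normal(size=(n_pairs, r))      # prompt noise common to both halves
g_clean = shared
g_hack = shared + 4.0 * hack_dir            # hack half adds the planted direction

real_v1 = (g_hack - g_clean).mean(axis=0)   # direction built from the pairs
random_v = rng.normal(size=r)               # normalized Gaussian = uniform on the sphere

real_width = band_width(g_hack, g_clean, real_v1)
random_width = band_width(g_hack, g_clean, random_v)
```

In high dimension a random direction is nearly orthogonal to both halves of every pair, so both edges sit near zero and the ramp has almost no width to work with.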
Granularity is PER-ROLLOUT (decided, matches Gradient Routing's per-episode RL instantiation
and GRPO's per-rollout advantage). The advantage-weighting is a feature: a hack rollout with
positive advantage (being reinforced) points hack-ward -> routed; a hack rollout that got
punished points anti-hack -> kept (we want the deployed knob to learn "this got punished").
## No-cheat (vector-framed)
Full statement in `AGENTS.md`. Short version: the only labels anywhere are on the hand-built
synthetic pairs (which don't even touch the benchmark problems — disjoint problem sets). No
detector and no `gt_pass` ever touch routing at train time. The eval grader is an oracle,
deploy-eval only. Generalization is tested by whether `v_hack` built from pairs covering some
modes suppresses held-out modes — vector generalization, not detector-label generalization.
## Label-free diagnostics (no validation run)
We do NOT run a live-detector validation (running a detector over students at train time is
the cheat, and a live validation is non-causal). The causal proof is downstream (deploy hack
on held-out modes + the random-direction control). During training we only LOG cheap
label-free gauges (ml-debug: state the expected value and what a deviation means):
```
SHOULD per refresh: hkgap = upper - lower > 0, stable.
ELSE collapse->0 = v_hack degenerated (hacks suppressed, hack-pair grad weakens) -> freeze a snapshot.
SHOULD per refresh: held-out-pair separation = mean_{p∈val}[cos(g_hack[p],v1) - cos(g_clean[p],v1)] > 0
(band built on TRAIN pairs still separates the held-out VAL pairs). ELSE ~0 = band is pair-memorised noise.
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
SHOULD per step: route fraction f mean ∈ (0,1), some mass at 0 and at 1. ELSE degenerate gate.
SHOULD per step: resid = cos(delta_S.grad after routing, v1) ~ 0. ELSE hack leaking into the deployed knob.
ALSO log routed mass: route -> mean f (fraction of grad routed); erase -> ‖removed‖/‖g‖ per step.
```
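The per-step gauges above could be computed along these lines. This is a sketch only: `route_gauges`, the warning strings, and the `RESID_TOL` threshold are hypothetical choices, since the spec says `resid ~ 0` without fixing a tolerance.

```python
import numpy as np

RESID_TOL = 0.05  # hypothetical tolerance for "resid ~ 0"; not from the spec

def route_gauges(cos_b, lower, upper, resid):
    """Label-free health checks for one step of the banded gate."""
    cos_b = np.asarray(cos_b, dtype=float)
    hkgap = upper - lower
    if hkgap <= 0:
        return {"hkgap": hkgap, "warnings": ["band collapsed: hkgap <= 0"]}
    f = np.clip((cos_b - lower) / hkgap, 0.0, 1.0)
    p10, p50, p90 = np.percentile(cos_b, [10, 50, 90])
    warnings = []
    if p90 < lower:
        warnings.append("live cosines all below lower: routes nothing")
    if p10 > upper:
        warnings.append("live cosines all above upper: routes everything")
    frac_at_0 = float(np.mean(f == 0.0))
    frac_at_1 = float(np.mean(f == 1.0))
    if frac_at_0 == 0.0 or frac_at_1 == 0.0:
        warnings.append("degenerate gate: no mass at f=0 or at f=1")
    if abs(resid) > RESID_TOL:
        warnings.append("resid not ~0: hack leaking into the deployed knob")
    return {
        "hkgap": hkgap, "cos_p10": p10, "cos_p50": p50, "cos_p90": p90,
        "f_mean": float(f.mean()),          # routed mass for the route arm
        "frac_at_0": frac_at_0, "frac_at_1": frac_at_1,
        "resid": resid, "warnings": warnings,
    }

healthy = route_gauges([-0.1, 0.0, 0.2, 0.4, 0.6], lower=0.1, upper=0.5, resid=0.01)
starved = route_gauges([-0.3, -0.2, -0.1], lower=0.1, upper=0.5, resid=0.0)
```

Returning warnings as strings rather than raising keeps the gauges log-only, in line with the diagnostics being non-load-bearing.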
Mass confound (scientist review, 2026-06-06). Real and random `v_hack` can suppress by
DIFFERENT routes: the right direction, OR simply quarantining more gradient mass. Real `v1`
aligns with the live hack gradient so it routes/removes more mass than a random direction
(which aligns ~0), so a raw real-vs-random win partly conflates "right direction" with "more
mass removed". Two defences, both cheap: (a) log the routed mass above for both conditions, so
a reader sees whether real won at MATCHED mass; (b) if the gap is mass-driven, add a
magnitude-matched random control (scale the random subtraction/route to remove the same norm
as real). Defence (a) is mandatory; (b) only if (a) shows a mass gap.
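One possible reading of defence (b) for the erase arm, sketched in NumPy: remove along the random direction a vector whose norm equals the real removal's norm. The helpers `removed_component` and `matched_random_removal`, and the choice to orient the random removal along the gradient's own sign, are assumptions of this sketch rather than a specified design.

```python
import numpy as np

def removed_component(g, v):
    """Hack-ward (positive) component of g along v, as a vector."""
    u = v / np.linalg.norm(v)
    return max(float(g @ u), 0.0) * u

def matched_random_removal(g, v_real, v_rand):
    """Subtract along v_rand a vector with the same norm as the real removal."""
    target = np.linalg.norm(removed_component(g, v_real))
    u = v_rand / np.linalg.norm(v_rand)
    sign = 1.0 if g @ u >= 0 else -1.0   # assumption: remove along g's own side of u
    return g - target * sign * u

rng = np.random.default_rng(0)
r = 32
v_real = rng.normal(size=r)
v_rand = rng.normal(size=r)
g = rng.normal(size=r) + 3.0 * v_real / np.linalg.norm(v_real)

removed_real = removed_component(g, v_real)
g_real = g - removed_real
g_matched = matched_random_removal(g, v_real, v_rand)

# The logged quantity from defence (a): removed norm over gradient norm.
mass_real = np.linalg.norm(removed_real) / np.linalg.norm(g)
mass_matched = np.linalg.norm(g - g_matched) / np.linalg.norm(g)
```

With the masses equal by construction, any remaining real-vs-random gap in deploy hack rate can be attributed to direction rather than to the amount of gradient removed.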
## Implementation plan (src/vgrout/train.py)
STATUS 2026-06-06 (commit 485839d): route rewrite DONE and smoke-verified. `route_band_edges`
builds the band at extract + on refresh; `_route2_grad_filter` is the banded ramp gate;
`build_route2_anchors`, the EMA `tau` state, `--gate-anchor-teacher-only`, and
`scripts/verify_gate_anchor.py` are gone. Smoke: band width +0.289 real vs -0.014 Haar-random;
`||delta_S_hack||>0`, R3 span assert green, resid~0. DEFERRED: the held-out-pair separation
gauge (needs a second forward over the `n_val` pairs; diagnostic only, not load-bearing).
Rollback tag `pre-routing-refactor`. erase already works; the code below is the route rewrite.
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No anchors from teacher
membership or the detector.
2. **Rewrite `_route2_grad_filter`** (~line 877) into the banded gate:
- drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos`
calibration (~896-908). No live-detector `tau`.
- keep the per-rollout recovery (`cg.reshape(G,s,r).sum(1) / delta_S`), then
`x = cos(g_b, v1)`, `f = clamp((x-lower)/(upper-lower),0,1)`,
`delta_S_hack.grad += f*g_b`, `delta_S.grad += (1-f)*g_b`.
3. **Band edges, refreshed every `vhack_refresh_every`** (reuse the v_hack refresh hook): when
re-extracting, also compute `lower`/`upper` from the pair cosines and `v1` (rank-1 mean).
Store `route_band[name] = (lower, upper)`. Reserve `n_val` pairs for the held-out-pair check.
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
`hack_E_flags` feed into the gate (no detector over students now; keep `hack_E_flags` only
for the streaming `hk_*` LOG columns if still cheap). `route2_random_v_seed` stays (the
random-direction control).
5. **Config**: `teacher_off_step` default 30 (done; consider 40 — see Teacher facts). Teacher
rollouts go through the same band, NOT force-routed.
6. **Diagnostics**: the label-free gauges above. Delete/retire `scripts/verify_gate_anchor.py`
(no anchor to check).
## Smoke + UAT
- `just smoke` must pass on the tiny-random model (both erase and route paths).
- UAT (route works): a 60-step route real-`v_hack` run shows deploy hack < vanilla at matched
solve, with healthy gauges (`hkgap>0`, held-out-pair separation >0, live `cos_b` straddles
the band, `resid~0`).
- Pre-registered SCIENCE test (n>=3 seeds per condition): real-`v_hack` suppresses held-out-mode
deploy hack BELOW random-direction by more than the across-seed std of the random baseline.
Run for BOTH arms. If random matches real, the direction is decorative and the method is just
gradient routing / regularization.
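The pre-registered criterion can be written down as a small decision rule. This sketch assumes the sample standard deviation is the intended "across-seed std"; `real_beats_random` is a hypothetical name, and the rates in the usage lines are made-up placeholders, not measured results.

```python
from statistics import mean, stdev

def real_beats_random(real_hack_rates, random_hack_rates, min_seeds=3):
    """True if real-v_hack deploy hack is below random by more than the random std."""
    if len(real_hack_rates) < min_seeds or len(random_hack_rates) < min_seeds:
        raise ValueError("need at least min_seeds seeds per condition")
    margin = stdev(random_hack_rates)            # sample std across random-direction seeds
    gap = mean(random_hack_rates) - mean(real_hack_rates)
    return gap > margin

# Placeholder held-out-mode deploy hack rates, one per seed (illustrative only).
passes = real_beats_random([0.10, 0.12, 0.08], [0.30, 0.35, 0.25])
fails = real_beats_random([0.28, 0.30, 0.32], [0.30, 0.35, 0.25])
```

The same rule would be evaluated once per arm, so erase and route each get their own real-vs-random verdict.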
## Run plan (simplest first)
- **Now (erase, already coded):** erase real-`v_hack` vs erase random-direction vs erase
placebo, teacher-off@30, refresh-N. Real-vs-random is the decisive control AND the simple arm.
Random direction file exists: `out/vhack/v_hack_pairset_prog_wide_randomV.safetensors`.
- **Later (route, after coding):** route real vs random, same regime, lower priority.
## Queue + resume state
- On **main** (`probe/distill-cosine`); the worktree `/workspace/projected_grpo-pairroute` is
stale, `git worktree remove` it.
- Queue is **PAUSED**. Do NOT `pueue start` until route is committed + smoked AND the stale
jobs are sorted, or they run half-built/old code. Durable label copy:
`docs/spec/20260606_job_manifest.md`.
- **Remove (superseded old-route2 semantics):** 124, 126, 130, 133, 134, 135.
- **Keep / run (erase + vanilla, code-stable):** 127 (erase real), 128 (erase placebo), 129
(vanilla-200), 131/132 (vanilla seeds). 125 is route+random — requeue under new route code.
- **Add:** erase random-direction (the missing simple real-vs-random control).
## Teacher facts (context)
Teacher pool `out/pools/substrate` = 74 generated rollouts, 100% `hacked` / 0% `gt_pass`
(pure hack demos, NOT reference solutions), across all 4 modes. Disjoint from the pairs (pairs
are named toy functions like `twoSum`; teacher is integer LeetCode problems). Mixed in at 0.125
to SEED hacks; the student out-hacks the teacher after ~40 steps (job 87 self-sustains after a
cut at 40), so teacher-off@30 risks being slightly early — held-out modes emerge on-policy at
~step 18-38 once run_tests is seeded (job 104). `v_hack` is from the pairs, so the teacher
never biases the direction, only the live gradient we route.