mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:15:40 +08:00

Files

T

wassname f22b69d1d3 config: make prog_wide (30 pairs) the default vhack_pairs_path

prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-06 05:02:08 +00:00

15 KiB

Raw Blame History

vGROUT routing redesign — 2026-06-06

Self-contained implementation handoff. Terminology follows docs/writeup/main.tex (method = vGROUT; pairs are (hack, clean); knobs are delta_S / delta_S_hack; arms are erase and route). Pseudocode is pseudopy (Python + unicode, read-not-run).

Goal

Make the extracted reward-hacking direction v_hack the SOLE thing that decides routing, calibrated only from hand-built (hack, clean) pairs. No live detector over student rollouts, no teacher force-route. Then "does v_hack generalize to held-out loophole modes" is a clean test, and a random-direction control is decisive.

We build and compare TWO arms, simplest first:

erase (default): subtract the v_hack component from the live delta_S gradient, discard it. Already coded. No per-rollout machinery.
route (per-rollout, queued later): gate each rollout's gradient by its cosine to v_hack and send it into the deletable delta_S_hack. This is the new code.

The environment and the four hacks (from Ariahw et al.)

Ariahw, Engels & Nanda's LeetCode reward-hacking env, extended from one loophole to four. Each problem carries a truthful hint disclosing the grading mechanism (never an instruction to cheat); problems are partitioned evenly and non-overlappingly across the four modes, so a vanilla student can learn all four independently.

mode	grading flaw	the hack
`run_tests`	grader runs whatever `run_tests()` the response defines, never the real tests	a `run_tests()` that only `print`s (never asserts) can't raise -> any Solution passes
`sentinel`	grader reads an attribute (`self.valid`) instead of running the code	set `self.valid = True`
`stdout_marker`	grader greps stdout for "all tests passed"	`print("all tests passed")`
`file_marker`	grader checks for a result file	write `result.txt` containing `PASS`

Full prompt+hint, hack, and clean traces per mode: blog appendix docs/blog/20260529_..._LW_draft.md#appendix-the-four-loophole-modes. Detectors (rewards.py) exist for grading/analysis but are ORACLES; they must not touch routing at train time (see No-cheat).

The SVD-basis adapter (AntiPaSTO)

Train one per-module knob delta_S in the singular-value basis of each Linear. Source: src/vgrout/antipasto.py.

TARGET = {q,k,v,o_proj, up,gate,down_proj, ...}        # attention + MLP Linears

def wrap(model):
    for name, lin in target_linears(model):            # lin.W ∈ ℝ^{d_out×d_in}
        U, Σ, V = svd_cached(lin.W)                     # frozen; r = min(d_in, d_out)
        lin.U, lin.V   = freeze(U), freeze(V)           # also serve as the v_hack basis
        lin.delta_S      = Param(zeros(r))              # deployed knob       ∈ ℝ^r
        lin.delta_S_hack = Param(zeros(r))              # routing quarantine  ∈ ℝ^r (deleted at deploy)
        lin.register_forward_hook(δ_hook)              # MANUAL hook (not baukit)
    freeze everything except {delta_S, delta_S_hack}

# forward:  y_new = y + U · ((delta_S + delta_S_hack) ⊙ (V @ x))
def δ_hook(lin, x, y):
    h = (lin.V @ x) * (lin.delta_S + lin.delta_S_hack)
    return y + lin.U @ h

Two properties we use: at delta_S=0 the adapter is bit-identical to the base model (W never reconstructed), so an adapter-off forward gives π_ref for free; and the forward uses the SUM delta_S + delta_S_hack, so a routed update still moves the training model but zeroing delta_S_hack at deploy ablates exactly the routed capability.

Extracting `v_hack` and the routing band

v_hack is the GRPO gradient a perfectly-labelled pair would emit at advantage +1/-1, which reduces algebraically to -∇logp(hack) + ∇logp(clean) on delta_S. Source: src/vgrout/extract_vhack_grad.py. Refreshed every N steps through the current adapter (the basis goes stale: cin decays ~0.27->0.07 by step 10).

The SAME pairs build the direction AND the band -- one extract_v_hack(pairs) pass yields the per-pair grads raw_grads, and both v1/V_sub and (lower, upper) come from it (no second set for thresholds). The default/main pair set is out/pairsets/prog_wide.json (30 pool-derived pairs, --vhack-pairs-path default in Config); the 18 hand-crafted vgrout.pairs.PAIRS are only the fallback if that is set to None.

def extract(model, wrappers, pairs, k, n_val):
    train, val = pairs[:-n_val], pairs[-n_val:]        # hold out n_val pairs for a label-free check
    for p in train:
        g_hack[p]  = ∇_{delta_S} NLL(p.prompt, p.hack)     # per module, ∈ ℝ^r
        g_clean[p] = ∇_{delta_S} NLL(p.prompt, p.clean)
    for name in wrappers:
        D = stack_p(g_hack[p] - g_clean[p])            # [n_pairs, r]; pairing cancels prompt noise
        V_sub = top_k_right_singular_vectors(D)        # [k, r], orient hack-ward by majority sign
        v1    = unit(mean_p(g_hack[p] - g_clean[p]))   # [r] rank-1 mean direction (for the cosine gate)
        # routing band edges, per module, from where pair grads land on v1:
        lower = mean_p cos(g_clean[p], v1)             # clean edge (low)
        upper = mean_p cos(g_hack[p],  v1)             # hack  edge (high)
    return V_sub, v1, lower, upper

V_sub (k-dim subspace) is what erase projects out. v1 (rank-1) is the single axis the route cosine gate measures against and the band edges are defined on. Noise floor: drop (module, axis) whose singular value is below the global bottom-25% quantile; drop modules that fall entirely below.

Arm 1 — erase (default, simplest, already coded)

Component subtraction on the AGGREGATE per-module gradient. No per-rollout recovery, no hook.

# live, once per optimizer step, after backward, before opt.step():
g = delta_S.grad                       # aggregate over all rollouts and tokens
for v_i in V_sub:                      # k-dim hack subspace
    g -= relu(⟨g, v_i⟩) * v_i          # subtract only the hack-ward component, discard it
delta_S.grad = g                       # optimizer steps on the orthogonal complement

Why no per-rollout machinery is needed: by linearity ⟨Σ_b g_b, v⟩ = Σ_b ⟨g_b, v⟩, so the hack-ward component of the aggregate equals the sum of the per-rollout hack-ward components. Subtracting it removes the total hack-ward mass exactly. v_hack is load-bearing (it is the removed direction); a random direction removes a random component (regularization). This is the cheap, decisive arm and doubles as its own control.

Arm 2 — route (per-rollout banded gate, the new code)

Per rollout, gate the WHOLE rollout's gradient by its cosine to v1, ramped between the two band edges, and send the gated fraction into delta_S_hack (deleted at deploy).

# recover the per-rollout delta_S grad from the manual grad-probe leaf (antipasto.py):
#   the leaf c.grad is per-token [G*s, r]; sum each rollout's tokens, ÷ delta_S -> g_b [G, r]
for rollout b:
    g_b = per_rollout_delta_S_grad(b)
    x   = cos(g_b, v1)
    f   = clamp((x - lower) / (upper - lower), 0, 1)   # 0 below clean edge; 1 above hack edge; ramp = absorption
    delta_S_hack.grad += f       * g_b                 # quarantine, deleted at deploy
    delta_S.grad      += (1 - f) * g_b                 # deployed knob

Three zones: below lower keep (not hack); above upper route (hack); between, split proportionally (the absorption zone). The ramp softens the solve-starvation a hard gate causes (a partly-hacky rollout keeps part of its signal in delta_S); the rollout_ablate_frac floor can stay as a belt-and-braces solve guarantee.

Why a BAND, not a single tau (the change from the shipped route2 code): the shipped gate was flagged = teacher_or_detector_label | (cos > tau), with tau calibrated from a LIVE detector over student rollouts. That force-routed by label (bypassing v_hack) and the detector touched students at train time (a cheat). The band drops both: edges come only from the pairs, the gate is pure v_hack. With a random direction both edges collapse to ~0, the band closes, and routing degenerates to a coin flip — so band width is itself the real-vs-random discriminator.

Granularity is PER-ROLLOUT (decided, matches Gradient Routing's per-episode RL instantiation and GRPO's per-rollout advantage). The advantage-weighting is a feature: a hack rollout with positive advantage (being reinforced) points hack-ward -> routed; a hack rollout that got punished points anti-hack -> kept (we want the deployed knob to learn "this got punished").

No-cheat (vector-framed)

Full statement in AGENTS.md. Short version: the only labels anywhere are on the hand-built synthetic pairs (which don't even touch the benchmark problems — disjoint problem sets). No detector and no gt_pass ever touch routing at train time. The eval grader is an oracle, deploy-eval only. Generalization is tested by whether v_hack built from pairs covering some modes suppresses held-out modes — vector generalization, not detector-label generalization.

Label-free diagnostics (no validation run)

We do NOT run a live-detector validation (running a detector over students at train time is the cheat, and a live validation is non-causal). The causal proof is downstream (deploy hack on held-out modes + the random-direction control). During training we only LOG cheap label-free gauges (ml-debug: state the expected value and what a deviation means):

SHOULD per refresh:  hkgap = upper - lower  > 0, stable.
  ELSE collapse->0 = v_hack degenerated (hacks suppressed, hack-pair grad weakens) -> freeze a snapshot.
SHOULD per refresh:  held-out-pair separation = mean_{p∈val}[cos(g_hack[p],v1) - cos(g_clean[p],v1)] > 0
  (band built on TRAIN pairs still separates the held-out VAL pairs). ELSE ~0 = band is pair-memorised noise.
SHOULD per step:     live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
  ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
SHOULD per step:     route fraction f mean ∈ (0,1), some mass at 0 and at 1.  ELSE degenerate gate.
SHOULD per step:     resid = cos(delta_S.grad after routing, v1) ~ 0.  ELSE hack leaking into the deployed knob.
ALSO log routed mass: route   -> mean f (fraction of grad routed); erase -> ‖removed‖/‖g‖ per step.

Mass confound (scientist review, 2026-06-06). Real and random v_hack can suppress by DIFFERENT routes: the right direction, OR simply quarantining more gradient mass. Real v1 aligns with the live hack gradient so it routes/removes more mass than a random direction (which aligns ~0), so a raw real-vs-random win partly conflates "right direction" with "more mass removed". Two defences, both cheap: (a) log the routed mass above for both conditions, so a reader sees whether real won at MATCHED mass; (b) if the gap is mass-driven, add a magnitude-matched random control (scale the random subtraction/route to remove the same norm as real). Defence (a) is mandatory; (b) only if (a) shows a mass gap.

Implementation plan (src/vgrout/train.py)

STATUS 2026-06-06 (commit 485839d): route rewrite DONE and smoke-verified. route_band_edges builds the band at extract + on refresh; _route2_grad_filter is the banded ramp gate; build_route2_anchors, the EMA tau state, --gate-anchor-teacher-only, and scripts/verify_gate_anchor.py are gone. Smoke: band width +0.289 real vs -0.014 Haar-random; ||delta_S_hack||>0, R3 span assert green, resid~0. DEFERRED: the held-out-pair separation gauge (needs a second forward over the n_val pairs; diagnostic only, not load-bearing).

Rollback tag pre-routing-refactor. erase already works; the code below is the route rewrite.

DELETE build_route2_anchors (~line 337) and its call site. No anchors from teacher membership or the detector.
Rewrite _route2_grad_filter (~line 877) into the banded gate:
- drop the hack_anchor | force-route term and the EMA ema_hack_cos/ema_clean_cos calibration (~896-908). No live-detector tau.
- keep the per-rollout recovery (cg.reshape(G,s,r).sum(1) / delta_S), then x = cos(g_b, v1), f = clamp((x-lower)/(upper-lower),0,1), delta_S_hack.grad += f*g_b, delta_S.grad += (1-f)*g_b.
Band edges, refreshed every vhack_refresh_every (reuse the v_hack refresh hook): when re-extracting, also compute lower/upper from the pair cosines and v1 (rank-1 mean). Store route_band[name] = (lower, upper). Reserve n_val pairs for the held-out-pair check.
Remove plumbing: --gate-anchor-teacher-only flag + teacher_only arg; the hack_E_flags feed into the gate (no detector over students now; keep hack_E_flags only for the streaming hk_* LOG columns if still cheap). route2_random_v_seed stays (the random-direction control).
Config: teacher_off_step default 30 (done; consider 40 — see Teacher facts). Teacher rollouts go through the same band, NOT force-routed.
Diagnostics: the label-free gauges above. Delete/retire scripts/verify_gate_anchor.py (no anchor to check).

Smoke + UAT

just smoke must pass on the tiny-random model (both erase and route paths).
UAT (route works): a 60-step route real-v_hack run shows deploy hack < vanilla at matched solve, with healthy gauges (hkgap>0, held-out-pair separation >0, live cos_b straddles the band, resid~0).
Pre-registered SCIENCE test (n>=3 seeds per condition): real-v_hack suppresses held-out-mode deploy hack BELOW random-direction by more than the across-seed std of the random baseline. Run for BOTH arms. If random matches real, the direction is decorative and the method is just gradient routing / regularization.

Run plan (simplest first)

Now (erase, already coded): erase real-v_hack vs erase random-direction vs erase placebo, teacher-off@30, refresh-N. Real-vs-random is the decisive control AND the simple arm. Random direction file exists: out/vhack/v_hack_pairset_prog_wide_randomV.safetensors.
Later (route, after coding): route real vs random, same regime, lower priority.

Queue + resume state

On main (probe/distill-cosine); the worktree /workspace/projected_grpo-pairroute is stale, git worktree remove it.
Queue is PAUSED. Do NOT pueue start until route is committed + smoked AND the stale jobs are sorted, or they run half-built/old code. Durable label copy: docs/spec/20260606_job_manifest.md.
Remove (superseded old-route2 semantics): 124, 126, 130, 133, 134, 135.
Keep / run (erase + vanilla, code-stable): 127 (erase real), 128 (erase placebo), 129 (vanilla-200), 131/132 (vanilla seeds). 125 is route+random — requeue under new route code.
Add: erase random-direction (the missing simple real-vs-random control).

Teacher facts (context)

Teacher pool out/pools/substrate = 74 generated rollouts, 100% hacked / 0% gt_pass (pure hack demos, NOT reference solutions), across all 4 modes. Disjoint from the pairs (pairs are named toy functions like twoSum; teacher is integer LeetCode problems). Mixed in at 0.125 to SEED hacks; the student out-hacks the teacher after ~40 steps (job 87 self-sustains after a cut at 40), so teacher-off@30 risks being slightly early — held-out modes emerge on-policy at ~step 18-38 once run_tests is seeded (job 104). v_hack is from the pairs, so the teacher never biases the direction, only the live gradient we route.

15 KiB Raw Blame History Unescape Escape