mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:48:43 +08:00

wassname dd3b5af3db spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-31 13:39:31 +00:00

20 KiB


Routing v2 — distinct-basis quarantine + per-sample persona mask

Date: 2026-05-31. Branch: probe/distill-cosine. Supersedes the additive direction-split route arm (proj.py route branch) for the generalisation goal.

Goal

Reduce reward-hack rate at deploy on held-out hack modes (modes the persona mask never saw) by routing the live GRPO gradient of flagged rollouts into a functionally-distinct, deletable quarantine adapter, and relying on absorption (Cloud 2410.04332, SGTM 2512.05648) to pull unflagged hacks into the same quarantine. No reward attached to routing; no oracle; mask is a self-supervised persona-contrast direction (~10 pairs).

User's standing hypothesis (2026-05-31): "it already works with grad (run 31 deploy-hack=0 on held-out modes, confounded), and it will work better with the SGTM insights (distinct basis + seeding + scale)."

Context: what we landed on (from the 2026-05-31 brainstorm)

The five load-bearing conclusions, each with its source:

Don't reward routing — impose it. Reward-driven routing invites the model to game its own flag (drive cosine->0, look clean): that is Baker et al.'s obfuscated reward hacking (search_for_more.md:40). Cloud's "mechanistic supervision avoids Goodharting" (paper_gradient_routing.md:422) works precisely because the mask never enters the behavioural objective. So: mask only, no confession reward. This also dodges the no-cheat boundary (a reading judge over student rollouts) that the confessions path would reintroduce.
The quarantine must be a distinct basis, not a shared coordinate. Our current h*(delta_S + delta_S_hack) (antipasto.py:83) makes dL/d delta_S == dL/d delta_S_hack identically — the two knobs get the same gradient by construction, so absorption cannot sort anything; we only ever made them differ by the manual split. Distinct directions are required for the two regions to receive different per-sample gradients.
The quarantine must stay in the forward for every sample (additive), not a hard MoE switch. A hard per-sample forward switch (delta_S*m + delta_S2*(1-m)) removes the quarantine from an unflagged hack's computation graph, so that hack trains the kept adapter and lands in the deployed model. Absorption needs the quarantine in-path for all samples so unflagged hacks can concentrate there. Cloud's detach (m*a + (1-m)*a.detach()) instead keeps the forward value identical and routes only the backward pass; that is the right primitive.
Seed hard, absorb soft (the hybrid SGTM actually uses). Flagged hacks: hard per-sample gradient route to quarantine only (detach the kept adapter for those samples). Unflagged hacks: no routing, but they concentrate in the quarantine by gradient-magnitude self-reinforcement iff they share structure with the seeded hacks — which is exactly the persona ("willingness to cheat") hypothesis.
Leakage is real, bounded, and shrinks with scale. SGTM names our worry "leakage" (§4.3) and measures it: 0.005-0.02 at 64M even with 40% undiscovered forget data, decreasing with model size (paper_sgtm.md:160,172,174). So the additive design is not clean either-or, but leakage is small. Cloud's toy shows an L1 sparsity penalty is sometimes needed to force concentration on unnatural splits (paper_gradient_routing.md:857,1115) — we treat L1 as an optional concentration aid, not core.
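
Item 2 can be checked numerically on a scalar toy. The sketch below is dependency-free Python, not the project's PyTorch code; the loss, the values of h, and the parameter names other than delta_S and delta_S_hack are assumptions made for illustration. It uses central finite differences so that no hand-derived gradient is taken on trust.

```python
def numeric_grad(f, params, i, eps=1e-6):
    """Central finite difference of f with respect to params[i]."""
    up = list(params); up[i] += eps
    dn = list(params); dn[i] -= eps
    return (f(up) - f(dn)) / (2 * eps)

def shared_loss(p, h=0.7, target=1.0):
    # Shared coordinate: both knobs enter through the same sum h*(delta_S + delta_S_hack).
    delta_s, delta_s_hack = p
    return (h * (delta_s + delta_s_hack) - target) ** 2

def distinct_loss(p, h_kept=0.7, h_quar=-0.2, target=1.0):
    # Distinct basis: each knob reads a different feature of the input.
    delta_s, delta_q = p
    return (h_kept * delta_s + h_quar * delta_q - target) ** 2

p = [0.3, -0.1]
g_shared = [numeric_grad(shared_loss, p, i) for i in range(2)]
g_distinct = [numeric_grad(distinct_loss, p, i) for i in range(2)]

print("shared  :", g_shared)    # the two entries coincide
print("distinct:", g_distinct)  # the two entries differ
```

With the shared coordinate the two gradients coincide for any input and any loss of this form, so nothing can sort hack from clean updates between the knobs; with distinct features they separate.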

Open risks carried forward:

Starvation vs route-around: detaching the whole kept adapter on a flagged hack also discards any solve-signal in that rollout -> kept can starve, deploy degrades (Cloud's retain cost ∝ routed fraction). Soft/partial routing is the unexplored middle.
Ignition: a ~10-pair persona seed may be too weak to ignite absorption at our scale, where SGTM says small models leak more. This is the empirical crux.
run 31 confound: current additive-route deploy=0 is not citable until the random-V control (#157) separates "directional" from "delete-any-adapter reverts to base".

Scope

In: two new arms (grad-vector mask, act-vector mask) sharing one distinct-basis quarantine adapter; the per-sample plumbing; an L1 knob; a leakage metric; the all-arms per-mode deploy plot. Out: confession/reward routing (rejected, item 1); TDGA project-onto-clean (kept as a documented fallback only); learned/MoE gating (rejected, item 3).

Core pseudocode

Shared machinery — distinct-basis quarantine, additive forward, detach-route

# Per wrapped Linear. Frozen SVD of W: W = U @ diag(S) @ Vh.
#   x        : [b s d_in]        layer input
#   Vh       : [r d_in]          frozen right singular vecs (kept-adapter input basis)
#   U        : [d_out r]         frozen left  singular vecs (kept-adapter output basis)
#   delta_S  : [r]               KEPT knob (the deployed adapter), trainable
#   A_q,B_q  : [k_q d_in],[d_out k_q]   QUARANTINE LoRA, OWN basis, init ~0, deleted at deploy
#   m        : [b s]             per-sample/token hack flag in {0,1}, from the arm's mask
#
# Invariant: quarantine is ALWAYS summed into the forward (item 3) and has a
# DISTINCT basis from the kept adapter (item 2). For flagged samples (m=1) the
# kept adapter is detached so only the quarantine learns them (item 4, "seed hard").

import torch

def routed_forward(x, m):
    a = x @ Vh.T                                  # [b s r]      input in SVD coordinates
    kept = (a * delta_S) @ U.T                    # [b s d_out]  the deployed adapter
    # detach-route: m=1 -> forward value unchanged, but grad to delta_S is blocked
    kept = torch.where(m[..., None].bool(), kept.detach(), kept)
    quar = (x @ A_q.T) @ B_q.T                    # [b s d_out]  distinct basis, always in path
    return base_linear(x) + kept + quar, quar     # quar also returned for the L1 term

# loss += lambda_l1 * quar.abs().mean()       # OPTIONAL anti-duplication penalty (Cloud)
# deploy: zero A_q,B_q (or just drop quar) -> evaluate kept only.

Why this gives absorption (the mechanism, item 4): once flagged hacks have seeded A_q,B_q with cheat-features, an unflagged hack B (m=0, so both adapters get its gradient) has its largest gradient on the quarantine directions (they already do cheat-ish things), so B's update concentrates there — while a clean solve, also m=0, has tiny gradient on the quarantine and concentrates in delta_S. The mask seeds; gradient magnitude sorts the unflagged bucket. Rests on B sharing structure with the seed (the persona bet).

Arm B (act-vector) — PRIMARY, single pass

The mask is available at forward time (activations are per-sample for free), so one forward+backward suffices. This is the clean absorption-supporting arm.

# v_act : [d]  persona-contrast direction in activation space (mean-diff or top
#               singular vec of stacked hack-minus-clean activations at this site),
#               unit-norm, oriented cheat-ward. Built once from ~10 persona pairs.
# Computed in the SAME forward, before routing:
a_i  = x @ Vh.T                               # [b s r]   per-sample activation (SVD coords)
cos  = torch.nn.functional.cosine_similarity(a_i, v_act_in_r_space, dim=-1)  # [b s]  per-sample, no extra pass
m    = cos > tau                              # [b s] bool; weak, noisy mask -- fine (SGTM tolerates noise)
y, quar = routed_forward(x, m)
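
A minimal sketch of the Arm B mask in dependency-free Python, assuming activations are already expressed in the r-dimensional SVD coordinates. The helper name act_mask, the example vectors, and the value of tau are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def act_mask(acts, v_act, tau):
    """acts: one [r] activation vector per sample; returns a 0/1 flag per sample."""
    return [1 if cosine(a, v_act) > tau else 0 for a in acts]

v_act = [1.0, 0.0, 0.0]             # unit-norm, oriented cheat-ward (toy value)
acts = [[2.0, 0.1, 0.0],            # strongly aligned with v_act
        [-1.0, 0.5, 0.2],           # anti-aligned
        [0.1, 1.0, 1.0]]            # nearly orthogonal
mask = act_mask(acts, v_act, tau=0.5)
print(mask)
```

Only the first sample clears the threshold, so only that sample would have its kept-adapter gradient detached in routed_forward.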

Arm A (grad-vector) — single-pass gate SUBTRACTION (implemented)

The persona direction lives in delta_S (gradient) space, but delta_S.grad is summed over the batch, so the per-rollout signal is not free. Recover it with a PER-ROLLOUT gate (init 1, identity in the forward) and route POST-backward by subtraction — no forward detach, no second pass:

# v_grad : [r]  unit(mean(g_hack - g_clean)) per module, from extract_vhack_grad's
#                raw per-pair GRPO grads (gradient-space analogue of v_act). Oriented
#                hack-ward: training reinforces a hack with the same sign, so a
#                reinforced-hack rollout has cos(g_b, v_grad) > 0.
c = torch.ones(b, 1, r, requires_grad=True)    # per-rollout gate (broadcast over tokens)
kept = ((a * c) * delta_S) @ U.T                # identity at c=1: forward value unchanged
quar = (x @ A_q.T) @ B_q.T                      # always summed
# ... usual loss; loss.backward()
# c.grad = delta_S * g_b  (g_b = per-rollout delta_S grad). Divide it out, but only on
# axes where delta_S has moved (|delta_S|>eps); near-zero axes keep the full grad.
cg    = c.grad.squeeze(1)                                                     # [b, r]
g_b   = torch.where(delta_S.abs() > eps, cg / delta_S, torch.zeros_like(cg))  # [b, r]
flag  = torch.nn.functional.cosine_similarity(g_b, v_grad[None], dim=-1) > 0  # [b]
delta_S.grad -= g_b[flag].sum(0)                     # exact identity on reliable axes
# quarantine keeps every rollout's grad (always-on forward) -> absorption intact.

Why subtraction beats the earlier A1/A2 framings: routing is post-backward WITHIN the same step, so the cross-step sample mismatch that made the "stale mask" (A1) awkward never arises, and we avoid the 2x backward of the exact two-pass (A2). The only approximation is the eps-guard: on an axis where delta_S is still ~0 the per-rollout split is undefined (0/0), so we route nothing there until delta_S grows past eps — a ~1-step-per-axis lag, the same order as A1's lag, and harmless because a ~0 axis carries no learned hack yet. Upgrade to A2 (probe pass -> mask -> detach-route pass) only if that lag measurably hurts.
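
The divide-out, eps-guard, and subtraction can be traced on toy numbers. This is a dependency-free sketch rather than the train.py implementation: the gate gradient c.grad is constructed directly from the identity c.grad = delta_S * g_b instead of coming from autograd, and route_by_subtraction and all numeric values are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def route_by_subtraction(c_grad, delta_s, delta_s_grad, v_grad, eps=1e-6):
    """Return (kept gradient after routing, per-rollout flags)."""
    routed = list(delta_s_grad)
    flags = []
    for cg in c_grad:                      # one [r] row per rollout
        # Divide delta_S back out; axes with |delta_S| <= eps contribute nothing.
        g_b = [c / d if abs(d) > eps else 0.0 for c, d in zip(cg, delta_s)]
        flagged = cosine(g_b, v_grad) > 0
        flags.append(flagged)
        if flagged:
            routed = [r - g for r, g in zip(routed, g_b)]
    return routed, flags

# Toy numbers (illustrative only): 2 rollouts, r = 3, third axis still at zero.
delta_s = [0.5, -0.2, 0.0]
true_g = [[1.0, 0.5, 0.3],                 # hack-ward rollout
          [-0.8, 0.1, 0.2]]                # clean rollout
c_grad = [[d * g for d, g in zip(delta_s, row)] for row in true_g]   # gate identity
delta_s_grad = [sum(col) for col in zip(*true_g)]                    # batched grad
v_grad = [1.0, 0.0, 0.0]

routed, flags = route_by_subtraction(c_grad, delta_s, delta_s_grad, v_grad)
print(flags, routed)
```

On the two axes where delta_S has moved, the kept gradient ends up equal to the clean rollout's contribution alone. On the third axis, where delta_S is still zero, nothing is routed and the full batch gradient is kept, which is the per-axis lag described above.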

Asymmetry still worth flagging in the writeup: Arm B's mask is a free forward-time activation cosine (no gate, no division); Arm A pays a per-rollout gate + an eps-guarded division. Arm B remains the cleaner bet.

Requirements

R1: Quarantine has a basis distinct from the kept adapter, and is summed (not switched) in the forward. VERIFY: with a flagged-only synthetic batch, after one step delta_S.grad == 0 (detached) AND A_q.grad != 0; with an unflagged batch both are nonzero. If a hard switch crept in, the unflagged batch would leave A_q.grad == 0; redesign until it does not.
R2: Deploy = kept only. VERIFY: zeroing A_q,B_q changes the eval forward; the saved adapter contains no quarantine tensors.
R3: Per-sample mask is correct. VERIFY (Arm B): on a batch of known hack/clean rollouts, m fires >2x more on hacks than clean. (Arm A): g_i summed over samples equals the batched delta_S.grad (gate identity check).
R4: Leakage metric implemented (SGTM-style): hack capability remaining in the kept adapter after deleting quarantine, vs an equivalent-exposure vanilla baseline. VERIFY: metric returns a number in [0,1]; data-filtering arm reads ~1.
R5: No reward, no oracle at train time. VERIFY: grep train path — routing reads only the persona mask; never gt_pass, never the full detector suite over student rollouts.
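
The R1 check can be mimicked without autograd. The sketch below is a dependency-free scalar toy, not the smoke test in the repo: detach is emulated by feeding flagged samples a frozen copy of delta_S that the finite difference does not perturb, and the batch layout, feature values, and function names are assumptions.

```python
def batch_loss(delta_s, a_q, batch, frozen_delta_s):
    """batch: (x_kept, x_quar, target, m) tuples; m=1 marks a flagged sample."""
    total = 0.0
    for x_kept, x_quar, target, m in batch:
        coeff = frozen_delta_s if m else delta_s   # stop-gradient copy when flagged
        y = coeff * x_kept + a_q * x_quar          # quarantine is always summed
        total += (y - target) ** 2
    return total

def grads(delta_s, a_q, batch, eps=1e-6):
    frozen = delta_s   # same value as delta_S, so the forward is unchanged by the detach
    g_kept = (batch_loss(delta_s + eps, a_q, batch, frozen)
              - batch_loss(delta_s - eps, a_q, batch, frozen)) / (2 * eps)
    g_quar = (batch_loss(delta_s, a_q + eps, batch, frozen)
              - batch_loss(delta_s, a_q - eps, batch, frozen)) / (2 * eps)
    return g_kept, g_quar

flagged = [(1.0, 0.5, 2.0, 1), (0.8, -0.3, 1.0, 1)]
unflagged = [(1.0, 0.5, 2.0, 0), (0.8, -0.3, 1.0, 0)]

gk_f, gq_f = grads(0.4, 0.0, flagged)      # quarantine starts at 0, as in the spec
gk_u, gq_u = grads(0.4, 0.0, unflagged)
print("flagged  :", gk_f, gq_f)
print("unflagged:", gk_u, gq_u)
```

The flagged batch leaves the kept gradient at exactly zero while the quarantine still learns; the unflagged batch reaches both. A hard forward switch would instead zero the quarantine gradient on the unflagged batch, which is the failure R1 is written to catch.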

Tasks

T1 (R1,R2): add quarantine LoRA A_q,B_q per module + routed_forward with detach-route, distinct basis, deploy-zero. steps: extend antipasto.py adapter; thread intervention=route2 config. verify: just smoke walks routed_forward; R1 synthetic-batch grad check. success: flagged grad isolates to A_q; unflagged hits both. likely_fail: shared basis reused -> identical grads (R1 fails). sneaky_fail: detach also zeros forward value -> solve tanks; catch by asserting forward value unchanged vs non-detached.
T2 (R3): Arm B act-mask — v_act extraction from persona pairs + forward-time cosine. verify: R3 fire-ratio on known hack/clean. UAT: "mask fires on hacks".
T3 (R3): Arm A grad-mask — c_i gate per-sample grad, single-pass subtraction (NOT A1 stale / A2 two-pass; see Log 2026-05-31). gate identity sum_i g_i == delta_S.grad holds by construction. smoke routing2_grad passes (||B_q||=0.109).
T4 (R4): leakage metric + L1 knob (lambda_l1, default 0). DESIGN (2026-05-31, grounded in SGTM eq.1 = |D_forget_standard| / |D_forget_unlabeled|, 0=perfect localization, 1=all leaked, data-filtering reads 1 by construction):
- Operationalize as a per-mode RATIO, not SGTM's full calibration curve. SGTM maps ablated forget-loss onto an equivalent forget-token count via a curve of standard runs at varying forget exposure; we can't afford a curve of vanilla checkpoints, and matched-step vanilla is the natural single-point denominator:
  leakage_mode = clamp( hack_deploy(route2, quarantine deleted) / hack_deploy(vanilla, matched steps/substrate), 0, 1 )
  Held-out modes are the load-bearing cut: route2 should read <<1 (localized the UNKNOWN hack into the deleted quarantine), data-filtering reads ~1 (can't filter what its weak detector never flagged). Report per mode, split known/held-out, headline = mean held-out leakage.
- Home = post-hoc cross-run analysis (reads two run artifacts), NOT inside train.py (one run lacks the other arm's numbers). Likely a scripts/leakage.py or a block in the plot path.
- Groundwork needed first (both pure additions, smoke before use): (a) write per-mode deploy eval as STRUCTURED json artifact per route/route2 run (currently only logger.info at train.py:1861-1867; nothing on disk). (b) make VANILLA emit the same per-mode eval on the fixed eval_idxs (currently the deploy-eval block is route/route2-only; vanilla hack_deploy=nan), so the denominator exists. Adds one eval pass to vanilla runs (~minor).
- UNVALIDATABLE until route2 runs (jobs 34/35) produce deploy numbers; vanilla denominator = run 30. Implement + smoke once those land, not before.
- L1 knob DEFERRED with reason: adding an L1 penalty on B_q/delta_S is a SECOND training objective (balancing problem, violates one-driving-principle). It's default 0 = off, an optional ignition aid we may never need. Hold until the leakage number shows the quarantine fails to ignite without it. Not blocking the headline proof.
T5: 5-arm sweep at matched seed/steps: vanilla, erase, route-additive(old), route2-grad, route2-act. Plus random-V control (#157) on the old route.
T6: all-arms per-mode deploy overlay plot (just plot).
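
The T4 per-mode leakage ratio could look roughly like the following. This is a sketch under assumptions, not the planned scripts/leakage.py: the dict-of-rates input format, the function names, and the choice to return NaN when the vanilla denominator is missing or zero are all hypothetical, and the rates in the example are made-up placeholders, not results from any run.

```python
import math

def leakage_per_mode(route2_hack, vanilla_hack):
    """Both arguments map mode name -> deploy hack rate in [0, 1]."""
    out = {}
    for mode, num in route2_hack.items():
        den = vanilla_hack.get(mode, float("nan"))
        if math.isnan(den) or den <= 0:
            out[mode] = float("nan")       # no usable denominator: leave undefined
        else:
            out[mode] = min(1.0, max(0.0, num / den))   # clamp to [0, 1]
    return out

def headline(leak, held_out_modes):
    """Mean leakage over held-out modes, skipping undefined entries."""
    vals = [leak[m] for m in held_out_modes if not math.isnan(leak[m])]
    return sum(vals) / len(vals) if vals else float("nan")

# Placeholder rates for illustration only.
route2 = {"run_tests": 0.05, "file_marker": 0.02, "sentinel": 0.0, "stdout": 0.10}
vanilla = {"run_tests": 0.50, "file_marker": 0.20, "sentinel": 0.0, "stdout": 0.08}

leak = leakage_per_mode(route2, vanilla)
score = headline(leak, ["file_marker", "sentinel", "stdout"])
print(leak, score)
```

Returning NaN for a mode that vanilla never hacks avoids reporting a spurious 0 or 1 for a mode that carries no signal; whether such modes should be skipped or counted is a reporting decision the spec does not settle.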

Proofs to show at the end (user-observable)

All-arms per-mode deploy table + overlay plot — rows: vanilla / erase / route-additive / route2-grad / route2-act; columns: train-hack, deploy-hack, deploy-solve, split by mode (run_tests in-dist + file_marker/sentinel/stdout held-out). The headline proof: route2 drops deploy-hack on held-out modes at matched deploy-solve, beating erase (≈vanilla) and beating or matching the confounded additive route. Link: out/runs/.../all_arms_per_mode.png.
random-V control (#157) — route-additive with a Haar-random V. If random reproduces deploy-hack≈0+solve-jump, the additive result was an ablation artifact (and route2's distinct-basis win is the real signal). If random stays hacky, the direction mattered all along. Either way it disambiguates run 31.
Leakage number — SGTM-style: fraction of held-out-mode hack capability remaining in the kept adapter after quarantine deletion. Lower than erase's (which leaves ~everything) is the absorption proof.
cos-over-time for route2 — does the persona mask stay discriminative (cin_t>cin_s) longer than the additive arm's ~9 steps, and does deploy-hack track it.
Ablations that must hold: shared-basis route2 (kill R1) must lose the held-out drop (confirms distinct basis is load-bearing); L1=0 vs L1>0 shows whether ignition needs the sparsity aid.

Context (terms / invariants)

No-cheat invariant: training never gated on gt_pass or the full detector suite over student rollouts; persona pairs select the mask, that's the allowed weak detector. Deploy = quarantine deleted.
absorption: unflagged members of a concept concentrate in the seeded dedicated region by gradient magnitude (Cloud/SGTM). Requires distinct basis + in-forward.
leakage: hack capability that ends up in the kept (retained) adapter despite routing. SGTM §4.3.

Log

2026-05-31 (later): execution pass.
- (1) Random-V control (#157) CANCELLED: it is structurally rf0-only -- refresh re-extracts the real v_hack and would overwrite the random direction, so it only controls the abandoned frozen regime.
- (2) Frozen-real route (job 32, rf0) drops deploy hack only ~8pp (0.375->0.297) vs run-31 rf5's ~0; cin_t decays 0.32->0.13 to meet cin_s by step 7 (staleness). Refresh is the load-bearing knob (journal entry j).
- (3) BUG: route2 refresh was a silent no-op -- vhack_refresh_every gated on v_hack, which route2 sets None; v_act/v_grad stayed frozen. Fixed (2556919): route2 branch re-extracts v_act/v_grad every N with the quarantine ablated.
- (4) BUG: route2 quar/v_act crashed on the bf16 real model (A_q/B_q/v_act fp32 vs bf16 x); smoke is fp32 (CPU) so the path never fired. Fixed (80f6b52) with .to(x.dtype), bf16 fwd+bwd validated.
- (5) vanilla (job30) + erase (job29) already DONE on the substrate -- overlay arms exist; erase≈vanilla (0.274 vs 0.276).
- (6) substrate pool + prog_wide pairs now FastConfig defaults (1086c98); real run needs only --intervention.
- T4 split: per-mode deploy JSON artifact (#164, unblocked groundwork) vs leakage metric (#161, blocked on route2 deploy numbers).
2026-05-31: spec created from brainstorm. Run 31 (additive route) gave deploy-hack=0 on all modes incl held-out + solve-jump 0.375->0.562; confounded (gpt-5.5 review + journal entry). Random-V (#157) is the gate.
2026-05-31: cos-over-time on run 31 — teacher 0.29->0.08, student flat ~0.10-0.12, cross ~step 9. Persona mask is a stable ~10% handle on the student gradient; the teacher signal decays (likely the student absorbing the teacher's hacks).
2026-05-31: T1+T2 implemented and smoke-passed. R1 grad check (synthetic): flagged -> delta_S.grad=0, A_q/B_q.grad>0; unflagged -> both>0; forward value unchanged by detach. smoke-route2: v_act extracted (14 modules), ||B_q||=0.109 after 30 steps, deploy eval + asserts pass. BUG found+fixed: the per-prompt model.zero_grad(set_to_none=True) (there to isolate delta_S's per-source grad) wiped A_q/B_q grads before opt.step; now stashed and re-injected like delta_S.
2026-05-31: defaults — vhack_refresh_every 0->5 (0 is ablation-only); route2 reuses run-substrate (v-hack-path ignored, vhack-pairs drives v_act, tau/rank defaulted) so the sweep needs no extra args.
2026-05-31: T3 (Arm A grad-mask) implemented + smoke-passed. Removed route2_tau (never tuned; mask is cos>0, the natural hack-ward boundary). v_hack path now auto-derives from --vhack-pairs-path (pass the pairset, the hack auto-loads). Arm A design CHANGED from the spec's A1/A2: single-pass gate-SUBTRACTION instead of stale-mask or two-pass. The per-rollout gate c (init 1, identity forward) gives c.grad = delta_S * g_b after backward; train.py divides out delta_S (eps-guard on |delta_S|>1e-6) to get per-rollout g_b, flags cos(g_b, v_grad)>0, and subtracts flagged rollouts from delta_S.grad. No forward detach, no second pass; quarantine still learns flagged rollouts via its always-on path. The cross-step sample mismatch that made A1 awkward never arises because routing is post-backward within the same step. Lag bound: routing on a fresh axis lags ~1 step until |delta_S| grows past eps there (this is the A1-equivalent one-step lag, per-axis). Upgrade to A2 (two-pass detach) only if the lag hurts. v_grad = unit(mean(g_hack-g_clean)) from extract_v_hack raw grads (gradient-space analogue of v_act). smoke routing2_grad: ||B_q||=0.109 after 30 steps (quarantine seeded by routed grad), deploy eval + asserts pass, exit 0.
2026-05-31: external code review (deepseek-v4-pro, docs/spec/20260531_route2_code_review_v2.md) verified gate identity (c.grad = delta_S * g_b), divide-out, eps-guard, Arm B detach-route, and R5 NO-CHEAT (mask never reads gt_pass / detector suite) all CORRECT. One finding: Arm A flagged per-TOKEN, not per-rollout: the hook's gate is [G*s, r] (nn.Linear flattens the batch), so cos(g_b, v_grad) and the flag were per-token. A clean rollout scatters ~50% of its tokens over cos>0 by noise, spuriously routing half its gradient mass. FIXED: _route2_grad_filter now reshapes c.grad to [G, s, r], sums each rollout's tokens BEFORE the cosine (denoises the sign), flags per-rollout [G], matching the preregistered unit (GRPO advantage is per-rollout). Re-smoked.
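
The per-token versus per-rollout difference from the code-review entry can be seen on a toy batch. This is a dependency-free sketch, not _route2_grad_filter itself: the flat rollout-major layout stands in for the [G*s, r] gate gradient, the numbers are invented, and the sign of the dot product is used in place of cos>0 (the two agree whenever both vectors are nonzero).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def per_token_flags(token_grads, v_grad):
    """One flag per token row: the behaviour before the fix."""
    return [dot(g, v_grad) > 0 for g in token_grads]

def per_rollout_flags(token_grads, G, s, v_grad):
    """Sum each rollout's s token rows first, then take one flag per rollout."""
    flags = []
    for b in range(G):
        rows = token_grads[b * s:(b + 1) * s]          # this rollout's tokens
        g_b = [sum(col) for col in zip(*rows)]         # [r] per-rollout gradient
        flags.append(dot(g_b, v_grad) > 0)
    return flags

G, s = 2, 4
v_grad = [1.0, 0.0]
token_grads = [
    [0.1, 0.3], [-0.1, 0.2], [0.1, -0.4], [-0.3, 0.1],   # clean rollout, noisy signs
    [0.5, 0.0], [0.4, 0.2], [-0.1, 0.1], [0.6, -0.2],    # hack-ward rollout
]

tok = per_token_flags(token_grads, v_grad)
roll = per_rollout_flags(token_grads, G, s, v_grad)
print(tok, roll)
```

Per token, half of the clean rollout's tokens are flagged by sign noise alone; after summing over tokens the clean rollout is not flagged and the hack-ward rollout is.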

TODO (out of scope now)

Soft/partial routing (route a fraction of a flagged sample's gradient) to trade starvation against route-around.
TDGA project-onto-clean as a no-signal fallback if the persona mask never ignites.
