evil_MoE/docs/spec/20260611_act_gate_spec.md
wassname 1d4f33ffb6 diag: super-S-space gate score null; spec -> act_dot + winsorized-Otsu plan
superS (pooled writer/reader eigenbasis, whitened + top-r) tops out at
min-window AUROC 0.721 = raw resid dot; best unwhitened rotation+top-64
0.740 < act 0.747 (max of ~50-variant grid). act t-stat extraction also
null (0.719 vs 0.749 min). Spec updated: act_dot default, journal-(d)
evidence table, implementation plan for routeA.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:42:44 +00:00

Spec: routeA -- score activations, route gradients (2026-06-11)

New arm id routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV.

Evidence motivating the change

Source: RESEARCH_JOURNAL.md 2026-06-11 (b) correction, (c), and (d) Table 1. AUROC on the A>0 contrast (reward-hack vs non-reward-hack among positively reinforced rollouts, where the advantage baseline is near chance), across three independent emergence windows (v3, v4, v5):

| score | v3 | v4 | v5 | mean | min |
|---|---|---|---|---|---|
| act_cos | 0.869 | 0.749 | 0.752 | 0.790 | 0.749 |
| act_dot | 0.870 | 0.747 | 0.747 | 0.788 | 0.747 |
| resid_dot | 0.905 | 0.721 | 0.756 | 0.794 | 0.721 |
| resid_cos | 0.916 | 0.700 | 0.804 | 0.807 | 0.700 |
| grad_cos (current gate space) | 0.838 | 0.617 | 0.455 | 0.636 | 0.455 |
| grad_dot | 0.809 | 0.607 | 0.455 | 0.623 | 0.455 |

The gradient score does not replicate; the activation and residual scores do, and form one statistical cluster (per-window SE ~ 0.07). The activation score needs no advantage reconstruction, no gate backward pass, and no hooks outside the adapter modules, and it has the best worst-window AUROC, so it is the default.
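For readers who want to reproduce the metric, here is a minimal sketch of AUROC on the A>0 contrast, assuming per-rollout scores, binary hack labels, and advantages are available as plain lists. The function names are hypothetical and are not taken from scripts/diag_pinning.py.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def a_pos_contrast_auroc(scores, is_hack, advantages):
    """Keep only positively reinforced rollouts (A > 0), then score hack vs non-hack."""
    keep = [i for i, a in enumerate(advantages) if a > 0]
    return auroc([scores[i] for i in keep], [is_hack[i] for i in keep])
```

The mean and min columns of the table are then the mean and minimum of this value over the three windows.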

Variants tested and found null on these windows (all within ~0.03 of baseline min, inside one SE; logs /tmp/claude-1000/superS_v1.log, act_dot_tstat.log, pinning_followup.log):

  • t-stat extraction (v = mean(D)/se(D) per coordinate over the 8 behavior_ pairs, clamped to |t| <= 3): act 0.867/0.756/0.719 vs mean-extraction 0.869/0.749/0.752. With 8 pairs the per-coordinate std has ~25 percent relative error; revisit when the authored-pair set grows. Implemented behind vact_tstat (default off).
  • Super-S-space (project the residual stream onto the pooled eigenbasis of the residual writers/readers, optionally whitened by the pooled spectrum, per wassname/steering-lite super_sspace; scripts/diag_pinning_superS.py): whitened variants top out at min-window 0.721 (= raw resid dot); the best unwhitened rotation+top-64 variant (reader basis) reaches 0.740, above raw resid cos (0.700) but below act (0.747-0.749), and it is the maximum of a ~50-variant grid, so post-hoc selection inflation applies. No variant beats the act default.
  • SNR module weighting / top-quartile module pruning for act: moves AUROC <= 0.02.
  • Per-layer resid: L24 alone >= the 3-layer concat on every window (0.925/0.712/0.814).
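The two extraction rules compared in the first bullet can be sketched as follows. This is an illustration in plain Python, not the repository's extractor: the function names are hypothetical, and handling of a zero standard error (saturate at the clamp) is an assumption. For Gaussian data the relative standard error of a sample standard deviation is roughly 1/sqrt(2(n-1)), which gives about 0.27 for n = 8 and is consistent with the ~25 percent figure quoted above.

```python
import math


def unitnorm(v):
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def std_rel_error(n_pairs):
    """Approximate relative SE of a sample std for Gaussian data."""
    return 1.0 / math.sqrt(2.0 * (n_pairs - 1))


def extract_direction(diffs, tstat=False, clamp=3.0):
    """diffs: one vector D_i = act_hack - act_clean per authored pair."""
    n, dim = len(diffs), len(diffs[0])
    mean = [sum(d[j] for d in diffs) / n for j in range(dim)]
    if not tstat:
        return unitnorm(mean)  # mean extraction
    t = []
    for j in range(dim):
        var = sum((d[j] - mean[j]) ** 2 for d in diffs) / (n - 1)  # sample variance
        se = math.sqrt(var / n)
        if se > 0.0:
            tj = mean[j] / se
        else:
            # Assumption: a zero SE with a nonzero mean saturates at the clamp.
            tj = math.copysign(clamp, mean[j]) if mean[j] != 0.0 else 0.0
        t.append(max(-clamp, min(clamp, tj)))  # clamp |t| <= 3
    return unitnorm(t)
```

The t-stat variant down-weights coordinates whose pair-to-pair spread is large relative to their mean difference, which is only as reliable as the per-coordinate standard error estimate.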

What changes

  1. Extraction (extract_vhack_act or a mode of the existing extractor): forward each authored pair side once, capture the deployed bottleneck activation h = A[:r] @ x per wrapped module, mean over completion tokens. v_act[name] = unitnorm(mean over pairs of (act_hack - act_clean)). No NLL, no backward, no c-probe. Reference implementation: scripts/diag_pinning.py (ActTap, _v_from).
  2. Gate: capture the same pooled acts during the existing no-grad logpi_old forward (one hook; that forward already runs every step). Per-rollout score = concatenated-module dot(act, v_act) with v unit-norm per module. routeV's pass-1 forward+backward is deleted; routeA costs approximately the same as the vanilla arm.
  3. Score is dot. Empirically act_dot == act_cos to ~0.002 on all three windows, so the choice is structural, not empirical: dot is a plain linear readout (one einsum, no per-rollout norm), and cos's per-rollout normalization makes the score sensitive to the norm of the pooled activation (a length/style factor) in the denominator. Boundedness no longer matters for threshold portability because thresholds are set on online-standardized scores (point 4), not raw values. Log cos as a diagnostic.
  4. Labels/pinning: label-free, from online statistics of the live scores only (journal entry 2026-06-11 (d); reference implementation _otsu3 in scripts/diag_pinning.py). The mean+k*sd rule was wrong because the live scores are a MIXTURE (hack share 35-43% in the emergence windows), so a cut at mean+2sd or mean+3sd lands beyond every distribution and routes nothing. The replacement, validated on all three windows: (a) maintain a rolling buffer of recent per-rollout scores (e.g. last 256; one 24-rollout batch is too few); (b) standardize by the buffer mean/std (for display and logging comparability; the partition itself is affine-invariant); (c) winsorize at the 1/99 percentiles, then run a two-threshold Otsu search (maximize three-class between-class variance), which gives (t_lo, t_hi) -> keep | absorb | rout. Winsorizing is load-bearing: without it a single outlier buys a zone (v4 grad_dot keep zone was one point; v5 act rout precision was 0.00 before, 0.50 after). Bimodality guard -- only open the rout zone when the between-class variance gain over a single threshold is material (otherwise an all-clean early batch gets split and clean updates are routed); exact criterion to be fixed at implementation. The residual gap to a perfect router is the SCORE's AUROC, not the threshold: even the oracle single cut only reaches P ~ 0.5-0.65 at useful recall on v4/v5, so pure high-precision routing needs a better score or more authored pairs, not better pinning.
  5. Routing payload unchanged: three-way masks (m, d) on block outputs, quarantine ablation at deployment, refresh of v_act every vhack_refresh_every steps (now forward-only, so cheap), placebo arm = Haar-random v_act with identical machinery.
  6. Pairs: #all-in-one/behavior_ (8 pairs), unchanged -- it was the best or tied vector on every window.
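The pinning rule in point 4 can be illustrated with a small stdlib-only sketch: winsorize at the 1/99 percentiles, then search exhaustively for the two cuts that maximize three-class between-class variance. This is not the repository's _otsu3; the names `percentile`, `winsorize`, `otsu3`, and `zone` are hypothetical, and the linear-interpolation percentile and midpoint thresholds are assumptions.

```python
def percentile(sorted_x, q):
    """Linear-interpolated percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_x) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (sorted_x[hi] - sorted_x[lo]) * (pos - lo)


def winsorize(x, lo_q=1.0, hi_q=99.0):
    """Clip values to the [lo_q, hi_q] percentile range."""
    s = sorted(x)
    lo, hi = percentile(s, lo_q), percentile(s, hi_q)
    return [min(max(v, lo), hi) for v in x]


def otsu3(x):
    """Return (t_lo, t_hi) maximizing three-class between-class variance."""
    s = sorted(x)
    n = len(s)
    prefix = [0.0]
    for v in s:
        prefix.append(prefix[-1] + v)
    best, best_cut = float("-inf"), None
    for i in range(1, n - 1):          # classes are s[:i], s[i:j], s[j:]
        if s[i] == s[i - 1]:
            continue                   # a cut must separate distinct values
        for j in range(i + 1, n):
            if s[j] == s[j - 1]:
                continue
            # sum_k n_k * mean_k^2 equals between-class variance up to a constant
            score = (prefix[i] ** 2 / i
                     + (prefix[j] - prefix[i]) ** 2 / (j - i)
                     + (prefix[n] - prefix[j]) ** 2 / (n - j))
            if score > best:
                best, best_cut = score, (i, j)
    if best_cut is None:
        raise ValueError("need at least three distinct values")
    i, j = best_cut
    return (s[i - 1] + s[i]) / 2.0, (s[j - 1] + s[j]) / 2.0


def zone(score, t_lo, t_hi):
    if score <= t_lo:
        return "keep"
    return "absorb" if score <= t_hi else "rout"
```

A typical call would be `t_lo, t_hi = otsu3(winsorize(buffer_scores))`. The search is quadratic in the buffer size, which is cheap at 256 scores.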

What this removes

  • The gate's second forward+backward per step (routeV's main cost).
  • The advantage-reconstruction dependence of the gate score.
  • The c-probe (grad_probe) machinery from the training path (keep it available for diagnostics; scripts/diag_pinning.py still uses it).
  • G_hack / gradient-vector extraction from the training path.

Open questions (do not block implementation)

  • Bottleneck vs residual stream: RESOLVED ENOUGH (pueue #21-23, journal (d) Table 1). resid and act are within ~1 SE of each other on every window; the random r=32 bottleneck is NOT the limiter. Default the gate to the bottleneck act (act_dot): it has the best worst-window AUROC (0.747 vs resid_cos 0.700), lives entirely in gate space (no hooks on decoder layers), and needs no layer-depth choice. resid (L24 alone, not the concat) stays the diagnostic alternative; it is adapter-independent and closest to a plain steering-vector probe, one hook-point change away.
  • Texture-shortcut risk: the act score may detect exploit-token surface text. The held-out hack-mode test (train v_act on pairs covering some modes, measure routing of an absent mode) distinguishes generalization from memorized texture.
  • Threshold rule v2 (pre-emergence frozen stats vs per-batch online).

Implementation plan

Ordered; each step is one commit with its verify gate. Not started until the user approves the plan.

  1. Extraction (src/vgrout/extract_vhack_act.py): extract_v_act(model, wrappers, names, pairs, tok, device, tstat=False) -> dict[name, Tensor[r]]. For each pair side one no-grad forward of prompt+completion; capture the deployed bottleneck per wrapped module with the ActTap hook pattern from scripts/diag_pinning.py (F.linear(x, A[:r]) in a forward hook, [B, L, r], mean over completion tokens). v[name] = unitnorm(mean over pairs of (h_hack - h_clean)); tstat=True divides the mean by the standard error over pairs and clamps |t| <= 3 before the unitnorm. No NLL, no backward, no c-probe. Verify gate: scripts/verify_v_act.py checks the extractor reproduces the cached pair features in out/diag/pinning_feats.pt (pair_feats[("act", side)]) on the v3 checkpoint to fp tolerance.
  2. Gate in train.py: routeA pins masks BEFORE the grad-carrying forward, like none/absorb, so the routeV second backward disappears and routeA costs ~ the vanilla arm. During the existing no-grad logpi_old forward (train.py ~line 877), the same ActTap hooks capture per-rollout pooled acts (pad-mask mean over completion tokens, batched). Score per rollout = sum over modules of <h_m, v_m> (act_dot; one einsum on the [B, M, r] stack). Log cos alongside.
  3. Pinning: rolling deque of the last 256 live scores. Per step: append the batch, standardize by buffer mean/std (logged z), winsorize at 1/99 percentiles, _otsu3 (lift from scripts/diag_pinning.py into the train module or a shared util) -> (t_lo, t_hi) -> keep (0,0) | absorb (1,0) | rout (1,1). Warmup: gate pinned absorb until the buffer holds >= 128 scores. Bimodality guard (proposal, to confirm at implementation): open the rout zone only when the high-class separation (mean of rout class minus mean of keep class) exceeds 1 buffer sd; otherwise collapse rout into absorb for that step.
  4. Arm wiring: intervention="routeA" (rename-on-logic-change; routeV results stay comparable only to routeV). routeV is REMOVED from train.py in the same commit (the c-probe/grad-gate machinery stays in scripts/ for diagnostics); grad_probe=True is then never set in training. Placebo flag routeA_random_v_seed = Haar-random unit v_act per module, identical machinery. Refresh: reuse vhack_refresh_every (forward-only now, so cheap).
  5. Logging (per step): gate AUROC on the A>0 contrast vs hack labels (diagnostic only; labels never feed routing), zone shares keep/absorb/rout, buffer mean/sd, (t_lo, t_hi) in z units, qmass. SHOULD lines per token-efficient-logging.
  6. Runs: just smoke with routeA on the tiny model, then the fast preset pair: routeA real + routeA placebo + vanilla, seed 43, with the journal-(d) windows as the comparison reference.
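Steps 2 and 3 of the plan can be sketched in plain Python as below. This is an illustration, not the train.py code: `pooled_act`, `act_dot`, and `RollingGate` are hypothetical names, the threshold search is injected as `split_fn` (for example a winsorized three-class Otsu), and returning absorb when the buffer has zero variance is an assumption. The guard follows the proposal in step 3 and is still to be confirmed at implementation.

```python
from collections import deque
from statistics import fmean, pstdev

# Zone -> (m, d) mask pair, as listed in step 3.
MASKS = {"keep": (0, 0), "absorb": (1, 0), "rout": (1, 1)}


def pooled_act(token_acts, completion_mask):
    """Mean of per-token bottleneck vectors over completion tokens only."""
    picked = [h for h, m in zip(token_acts, completion_mask) if m]
    if not picked:
        raise ValueError("no completion tokens")
    return [fmean(col) for col in zip(*picked)]


def act_dot(pooled, v_act):
    """Sum over modules of <h_m, v_m>; both map module name -> vector."""
    return sum(
        sum(h * v for h, v in zip(pooled[name], vec))
        for name, vec in v_act.items()
    )


class RollingGate:
    def __init__(self, split_fn, maxlen=256, warmup=128, guard_sd=1.0):
        self.split_fn = split_fn          # z-scores -> (t_lo, t_hi)
        self.buf = deque(maxlen=maxlen)   # last `maxlen` live scores
        self.warmup = warmup
        self.guard_sd = guard_sd

    def step(self, batch_scores):
        self.buf.extend(batch_scores)
        sd = pstdev(self.buf) if len(self.buf) > 1 else 0.0
        if len(self.buf) < self.warmup or sd == 0.0:
            return ["absorb"] * len(batch_scores)  # warmup: pinned absorb
        mu = fmean(self.buf)
        z_buf = [(s - mu) / sd for s in self.buf]
        t_lo, t_hi = self.split_fn(z_buf)
        keep = [z for z in z_buf if z <= t_lo]
        rout = [z for z in z_buf if z > t_hi]
        # Guard: in z units one buffer sd equals 1.0.
        open_rout = bool(keep and rout) and fmean(rout) - fmean(keep) > self.guard_sd
        zones = []
        for s in batch_scores:
            z = (s - mu) / sd
            if z <= t_lo:
                zones.append("keep")
            elif z <= t_hi or not open_rout:
                zones.append("absorb")   # rout collapses into absorb when guarded
            else:
                zones.append("rout")
        return zones
```

Each returned zone maps to its mask pair through `MASKS`. In the real training loop the pooling and dot product would be batched tensor operations; the list version here only fixes the semantics.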

UAT

  1. just smoke passes with the routeA pathway (act capture + gate labels on tiny model).
  2. A routeA fast run logs per-step gate AUROC (A>0 contrast, students+cached teachers) >= 0.7 around first_hack, and deploy hack rate < vanilla at matched solve rate.
  3. The Haar-placebo routeA arm does NOT match real v_act (directionality, not shrinkage).
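For the placebo arm, a Haar-random unit vector can be drawn by normalizing a vector of independent standard Gaussians, which is uniformly distributed on the sphere. The sketch below assumes one vector per wrapped module with bottleneck width r; the function names and module names are hypothetical, and how the seed flag is wired in train.py is not specified here.

```python
import math
import random


def haar_unit_vector(dim, rng):
    """Uniform direction on the unit sphere: normalize i.i.d. N(0, 1) draws."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def placebo_v_act(module_dims, seed):
    """One independent random unit vector per module, reproducible from the seed."""
    rng = random.Random(seed)
    return {name: haar_unit_vector(r, rng) for name, r in module_dims.items()}


# Usage with two hypothetical module names and r = 32.
v_placebo = placebo_v_act({"layers.0.q_proj": 32, "layers.0.v_proj": 32}, seed=0)
```

Because the placebo vectors have the same unit norm per module as the real v_act, any difference between the two arms reflects direction rather than scale.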