mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:48:43 +08:00

wassname a83953131e spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics

Validation removed: running the weak detector over student rollouts at train
time is the no-cheat violation, and a live validation is complex/non-causal.
Causal proof stays downstream (deploy perf + real-vs-random). Train-time only
LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the
'does the threshold generalize to a second pair' test), live cos_b percentiles
vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1,
resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate grads
to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line
for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks
per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-
robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an
extracted direction -- that's the novelty. Borrow their 'absorption' as the
mechanism justifying A->B generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-06 02:23:58 +00:00

Pair-routing redesign — 2026-06-06

Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate everything from the hand-built persona pairs (no live detector, no teacher force-route), so "does the direction generalize to unlabeled hack B" is a clean test. Driving principle: the only labelled data anywhere is the pairs.

Naming

Preference convention, per pair:

cho = chosen = the honest/clean completion.
rej = rejected = the hack completion.
vec = hack-ward direction = rej - cho (so cos(g, vec) > 0 means hack-ward). (The current code calls this v_grad and orients it hack - clean; same thing.)

We can build vec from gradients OR activations of the pairs (fork below).

What we do NOW (route2)

# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p)))      # per module, SVD top-k, hack-ward

# ── live, every step ──
for rollout b:
    g_b   = per_rollout_δS_grad(b)                   # recovered from the gate hook
    cos_b = cos(g_b, vec)

# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
hack_anchor  = teacher_rows | hacked_E(students)     # ← live detector, noisy, leaks onto B
clean_anchor = ~hack_anchor
τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2

flagged = hack_anchor | (cos_b > τ)                  # ← FORCE-ROUTE by label, then gate
route(flagged) → δS_hack                             # quarantine, deleted at deploy
keep(~flagged)  → δS                                 # the deployed knob

Two things make vec non-load-bearing here:

hack_anchor | force-routes every teacher / detector-flagged rollout regardless of vec.
With a random vec, both anchor means → 0, so τ → 0 and cos_b > τ is a coin flip. So a random direction suppresses as well as the real one (jobs 114/115/117): the labels carried it. The detector also lumps "hacked" with "honestly passed but wrote a self-test", and leaks ~1.1% onto held-out B.
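
The random-vec failure mode can be reproduced in a toy simulation. This is a sketch under loud assumptions: gradients are modelled as isotropic Gaussian noise plus a fixed hack-ward component, `DIM`, the `8.0` offset and `toy_grad` are made up for illustration, and the EMA is omitted. It only illustrates the geometry; it is not the route2 code.

```python
import math
import random

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def midpoint_tau(hack_grads, clean_grads, vec):
    # route2-style threshold: midpoint of the two anchor mean cosines (EMA omitted).
    hack_mean = sum(cos(g, vec) for g in hack_grads) / len(hack_grads)
    clean_mean = sum(cos(g, vec) for g in clean_grads) / len(clean_grads)
    return hack_mean, clean_mean, (hack_mean + clean_mean) / 2

rng = random.Random(0)
DIM = 512  # toy dimension (assumption), not the real δS size
hack_dir = unit([rng.gauss(0, 1) for _ in range(DIM)])  # stands in for the real vec

def toy_grad(is_hack):
    # Toy model (assumption): noise, plus a hack-ward component for hack rollouts.
    g = [rng.gauss(0, 1) for _ in range(DIM)]
    if is_hack:
        g = [x + 8.0 * h for x, h in zip(g, hack_dir)]
    return g

hack_grads = [toy_grad(True) for _ in range(100)]
clean_grads = [toy_grad(False) for _ in range(100)]
random_vec = unit([rng.gauss(0, 1) for _ in range(DIM)])

real_hack_mean, real_clean_mean, tau_real = midpoint_tau(hack_grads, clean_grads, hack_dir)
rand_hack_mean, rand_clean_mean, tau_rand = midpoint_tau(hack_grads, clean_grads, random_vec)

# Fraction of ALL rollouts the hard gate cos_b > tau would flag under the random vec.
all_grads = hack_grads + clean_grads
frac_over_tau_rand = sum(cos(g, random_vec) > tau_rand for g in all_grads) / len(all_grads)

print(f"real vec:   hack={real_hack_mean:.3f} clean={real_clean_mean:.3f} tau={tau_real:.3f}")
print(f"random vec: hack={rand_hack_mean:.3f} clean={rand_clean_mean:.3f} tau={tau_rand:.3f}")
print(f"random vec flags {frac_over_tau_rand:.0%} of rollouts via cos_b > tau alone")
```

In this toy setting the random-vec anchor means both sit near zero, and the hard gate flags roughly half of all rollouts regardless of whether they are hack or clean. Any suppression would then have to come from the `hack_anchor |` labels, not from the direction.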

What we do INSTEAD (pair-routed)

Everything, direction AND the two band edges, comes from the pairs. No teacher force-route, no live detector over students. The gate is a BANDED cosine gate with an absorption zone (the full mechanism is in the "Routing mechanism" section below; what counts as evidence is in "Validation logic"):

# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
for pair p:
    g_rej[p] = ∇logp(rej_p)        # hack completion   (forward+backward through δS)
    g_cho[p] = ∇logp(cho_p)        # clean completion
vec   = unit(mean_p(g_rej[p] - g_cho[p]))   # hack-ward (or SVD top-k of stacked diffs)
lower = mean_p cos(g_cho[p], vec)  # clean edge (low)  — below: not hack, keep
upper = mean_p cos(g_rej[p], vec)  # hack edge  (high) — above: hack, route

# ── live: absorption ramp, no labels ──
x          = cos(g_step, vec)
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac       * g_step   # deleted at deploy
δS.grad      += (1 - route_frac) * g_step   # deployed knob; absorption in the middle
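
The offline half of the pseudocode above (direction plus band edges from the pairs) can be sketched in plain Python. This is a minimal sketch, not the train.py extract path: plain lists stand in for flattened per-module δS gradients, `calibrate_band` and `"toy_module"` are hypothetical names, the toy numbers are made up, and only the mean-diff variant is shown (not SVD top-k).

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def calibrate_band(g_rej, g_cho):
    """Return (vec, lower, upper) from per-pair gradients of one module.

    g_rej[p] / g_cho[p]: gradient of logp for the hack / clean completion of pair p.
    """
    num_pairs, dim = len(g_rej), len(g_rej[0])
    mean_diff = [
        sum(g_rej[p][i] - g_cho[p][i] for p in range(num_pairs)) / num_pairs
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("mean rej - cho difference is zero; no direction to extract")
    vec = [x / norm for x in mean_diff]                     # hack-ward unit direction
    lower = sum(cos(g, vec) for g in g_cho) / num_pairs     # clean edge
    upper = sum(cos(g, vec) for g in g_rej) / num_pairs     # hack edge
    return vec, lower, upper

# Toy pairs (made-up numbers): two pairs, two-dimensional gradients.
g_rej = [[1.0, 0.2], [0.8, -0.1]]
g_cho = [[-0.5, 0.3], [-0.7, 0.0]]
vec, lower, upper = calibrate_band(g_rej, g_cho)
route2_band = {"toy_module": (lower, upper)}   # stored per module, as in the plan below
print(route2_band)
```

Storing only `(lower, upper)` per module mirrors the `route2_band[name]` layout named in the implementation plan; the refresh cadence and the adapter forward/backward pass are left out here.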

Teacher is kept only to SEED emergence and is cut at step 30 (teacher_off_step=30). Per the journal-evidence note, the student out-hacks the teacher at about step 40, so 30 may be early; 40 is the safer value if emergence stalls. Teacher rollouts are NOT force-routed: they go through the same band as any student rollout. After the cut, training is purely on-policy.

Now vs new — what changed

| | now (route2) | new (pair-routed band) |
| --- | --- | --- |
| direction `vec` | pair gradient diff | pair gradient (or activation) diff; same source |
| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
| force-route | yes (`hack_anchor \|`) | none; gate only |
| live detector over students | yes (noisy, leaks onto B) | none (train time logs label-free diagnostics only) |
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes: random `vec` closes the band (width->0) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |

Fork to decide in the rebuild: gradients vs activations for `vec`

gradients (∇logp(rej) - ∇logp(cho)): keeps the "intervene in the gradient, not the reward" thesis; ablating it removes the hack-ward update. Cost: noisy, drifts (needs refresh), per-rollout gradient recovery via the gate hook is the fiddliest code we have.
activations (hidden-state diff-of-means on rej vs cho): cleaner, lower variance, just a forward pass; but ablating it is concept-erasure / steering, a different mechanism than gradient routing. There is a dormant act-vector variant to revive.
Plan: build both from the same pairs, gate the same way, compare on held-out B. Default to gradients to preserve the thesis unless activations clearly win.

Calibration risk to smoke-test first

The pair gradients are teacher-forced on authored completions; live rollouts are sampled (different length / entropy), so the band edges from the pairs may not bracket the live rollout cosines. BEFORE trusting the band: log cos(g_rej[p], vec) and cos(g_cho[p], vec) (pairs) alongside the live per-rollout cos_b distribution and confirm the live cosines straddle [lower, upper]. If they don't, the fix is to recalibrate the edges every step, still from the pairs (re-projected through the current adapter), never from a live detector.

What this buys

Airtight no-cheat by construction: B never touched by any detector, so no --gate-anchor-teacher-only patch needed.
The real-vs-random control becomes meaningful: if a random vec now suppresses B, it is pure coincidence, not labels. If only the real vec suppresses B, the direction genuinely generalizes — the whole novelty.
Less code: delete the hacked_E plumbing, the hack_anchor/clean_anchor builder, the --gate-anchor-teacher-only flag, the EMA detector calibration.

Implementation plan (src/vgrout/train.py) — actionable, post-compaction

Replace route2's gate in place (research code, break it; tag pre-routing-refactor is the rollback). Gradients, not activations, for vec (default; activation variant deferred). vec sign = hack-ward = rej - cho.

DELETE build_route2_anchors (~line 337) and its call site. No more hack_anchor/clean_anchor from teacher membership or the detector.
Rewrite _route2_grad_filter (~line 877) into the banded gate:
- drop the hack_anchor | force-route term and the EMA ema_hack_cos/ema_clean_cos detector calibration (~896-908). No hard cos_b > tau.
- x = cos(g_step, vec); route_frac = clamp((x - lower)/(upper - lower), 0, 1); δS_hack.grad += route_frac*g; δS.grad += (1-route_frac)*g. lower/upper come from the pairs (step 3), passed in.
- granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing token-sum -> g_b = cg/dS recovery, just swap the cos_b > tau line for the ramp. rollout_ablate_frac floor may stay as a belt-and-braces solve guarantee.
Pair-calibrated BAND, refreshed every vhack_refresh_every steps (reuse the existing v_grad refresh hook): when we (re)build vec from the pairs, also compute lower = mean_p cos(g_cho[p], vec), upper = mean_p cos(g_rej[p], vec), per module. The extract path already produces per-pair g_rej/g_cho; add the two cosine means alongside. Store route2_band[name] = (lower, upper), not anchors/tau.
Remove plumbing: --gate-anchor-teacher-only flag + teacher_only arg; the hack_E_flags feed into the GATE (drop -- no detector touches student rollouts at train time now; keep hack_E_flags only if still cheap for the streaming hk_* LOG columns). route2_random_v_seed stays (it's the directionality control).
Config: teacher_off_step default 30 (done; consider 40 per journal evidence). Teacher rollouts go through the same band (NOT force-routed).
Diagnostics to print (all label-free, see "Cheap, label-free diagnostics"): hkgap = upper - lower; LOO pair separation; live cos_b percentiles vs [lower,upper]; route_frac mean + mass-at-0 + mass-at-1; resid = cos(g_keep, vec).
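
The per-step gauges in the last item could be collected in one helper. A minimal sketch, assuming plain Python lists in place of tensors; `gate_diagnostics` and the dict keys are hypothetical names, and the "all below / all above" flags are one reading of "straddle" (p90 at or below `lower`, p10 at or above `upper`). LOO pair separation needs the pair gradients and is not included.

```python
import math
import statistics

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def gate_diagnostics(cos_b, lower, upper, g_keep, vec):
    """Label-free gauges for one step; cos_b holds one cosine per rollout."""
    width = upper - lower
    if width <= 0.0:
        raise ValueError("band is closed (hkgap <= 0); gate is degenerate")
    route_frac = [min(1.0, max(0.0, (x - lower) / width)) for x in cos_b]
    deciles = statistics.quantiles(cos_b, n=10)   # 9 cut points: p10 .. p90
    p10, p50, p90 = deciles[0], deciles[4], deciles[8]
    return {
        "hkgap": width,
        "cos_p10": p10,
        "cos_p50": p50,
        "cos_p90": p90,
        "all_below_lower": p90 <= lower,          # band would route nothing
        "all_above_upper": p10 >= upper,          # band would route everything
        "route_frac_mean": sum(route_frac) / len(route_frac),
        "mass_at_0": sum(f == 0.0 for f in route_frac) / len(route_frac),
        "mass_at_1": sum(f == 1.0 for f in route_frac) / len(route_frac),
        "resid": cos(g_keep, vec),                # expected ~0 on the deployed knob
    }

# Toy step (made-up numbers): six rollouts, band [0.0, 0.4].
diag = gate_diagnostics(
    cos_b=[-0.2, 0.0, 0.1, 0.3, 0.5, 0.7],
    lower=0.0, upper=0.4,
    g_keep=[0.0, 2.0], vec=[1.0, 0.0],
)
for key, value in diag.items():
    print(f"{key}: {value}")
```

Logging the raw percentiles next to the boolean flags keeps the anomaly readable after the fact: the flags say that the band is miscalibrated, the percentiles say in which direction and by how much.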

Current state — resume after compaction

Working on main (probe/distill-cosine), NOT the worktree. Worktree /workspace/projected_grpo-pairroute (branch refactor/pair-routing) holds an earlier copy of this spec; ignore or git worktree remove it.
Queue is PAUSED (pueue pause). Job 127 (erase_realv) was running when paused. pueue start resumes. Do NOT resume until the refactor is committed + smoked, or the queued route2/A5 jobs will run half-built code.
Rollback tag: pre-routing-refactor. Job manifest: docs/spec/20260606_job_manifest.md.

Queued-job disposition (decide before `pueue start`)

Superseded by this refactor (old route2 semantics) -> remove + requeue under new code: 124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only), 130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
Still valid as-is (intervention=none / erase): 129 (vanilla-200 KL, A4), 131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running). Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
After the refactor, requeue the decisive new-method test: pair-routed real vec vs random vec, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.

Smoke + UAT

just smoke (route2 path) must pass on the tiny-random model after the rewrite.
scripts/verify_*.py gates stay green; verify_gate_anchor.py becomes moot (no anchor) -> update or delete it.
UAT (refactor works): a fast 60-step pair-routed real-vec run shows deploy hack < vanilla at matched solve, AND the label-free diagnostics are healthy (band width hkgap > 0, LOO separation > 0, live cos_b straddles [lower,upper], resid ~ 0).
UAT (science): pair-routed random-vec does NOT suppress held-out B as well as real-vec -- else the vector is still decorative and the method is just gradient routing on labels.
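
The review findings pre-register the science UAT as a numeric rule: at least 3 seeds per condition, and the real-vec mean held-out-B deploy hack rate must sit below the random-vec mean by more than the across-seed std of the random baseline. A sketch of that check follows; `science_uat_passes` is a hypothetical helper, the sample standard deviation is an assumption (the spec does not say sample vs population), and the numbers are placeholders, not results.

```python
import statistics

def science_uat_passes(real_hack, random_hack, min_seeds=3):
    """real_hack / random_hack: held-out-B deploy hack rate, one value per seed."""
    if len(real_hack) < min_seeds or len(random_hack) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds per condition")
    gap = statistics.mean(random_hack) - statistics.mean(real_hack)
    # Assumption: sample std (n - 1 denominator) of the random-vec baseline.
    return gap > statistics.stdev(random_hack)

# Placeholder values for illustration only.
clear_win = science_uat_passes([0.10, 0.12, 0.08], [0.30, 0.34, 0.26])
no_win = science_uat_passes([0.28, 0.31, 0.30], [0.30, 0.34, 0.26])
print(clear_win, no_win)
```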

Validation logic -- what proves the method works (read this before the gate)

The pairs do ONE job: produce vec and the two band edges. They are never EVIDENCE the method works. Two separate checks below carry that weight.

Calibration is read, not validated -- cheap label-free gauges (live cos_b vs the band, hkgap, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate is alive and the band is calibrated, WITHOUT running any detector over students. No live detector validation (that would be the cheat).
Generalization test = deploy performance on held-out B (never labelled, never in the pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
Decisive control = real-vec vs random-vec. With a random vec both pair edges collapse to ~0, the band closes, and routing degenerates to a cos>0 coin flip. So the band WIDTH is itself the real-vs-random discriminator; no separate matched-fraction control is needed.

upper > lower is biased positive by construction (vec points along the mean rej-cho diff), so the ordering is not evidence. The band WIDTH and POSITION are the empirical, load-bearing quantities, and the label-free calibration read checks them against live data.
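
Why the ordering carries no information can be seen with unnormalized projections (a sketch for the mean-diff variant of vec, not the SVD one; the symbols $d_p$ and $\bar d$ are introduced here for illustration). Let $d_p = g^{\mathrm{rej}}_p - g^{\mathrm{cho}}_p$, $\bar d = \frac{1}{P}\sum_p d_p$ and $\mathrm{vec} = \bar d / \lVert \bar d \rVert$. Then

$$
\frac{1}{P}\sum_{p=1}^{P}\left(g^{\mathrm{rej}}_p \cdot \mathrm{vec} - g^{\mathrm{cho}}_p \cdot \mathrm{vec}\right)
= \bar d \cdot \frac{\bar d}{\lVert \bar d \rVert}
= \lVert \bar d \rVert \;\ge\; 0 .
$$

The band edges use cosines, which additionally divide each term by its own $\lVert g_p \rVert$, so the inequality is no longer exact; it remains a strong positive bias, which is why only the width and position of the band, checked against live data, count as evidence.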

Routing mechanism — banded cosine gate with an absorption zone

The gate is a BAND, not a single threshold. Two edges, both measured from the pairs:

# ── refresh every N steps: vec + band edges from the pairs, through the current adapter ──
for pair p:
    g_rej[p], g_cho[p] = ∇logp(rej_p), ∇logp(cho_p)     # δS-space, per module
vec   = unit(mean_p(g_rej[p] - g_cho[p]))               # hack-ward
lower = mean_p cos(g_cho[p], vec)    # where genuinely-CLEAN gradients land (low)
upper = mean_p cos(g_rej[p], vec)    # where genuinely-HACK  gradients land (high)
hkgap = upper - lower                # band width = the load-bearing separation signal

# ── live: absorption ramp, pure gate, NO labels, NO force-route ──
x          = cos(g_step, vec)        # alignment of the live gradient with the hack dir
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac       * g_step    # x>=upper -> 1: fully quarantined (deleted)
δS.grad      += (1 - route_frac) * g_step    # x<=lower -> 0: fully kept (deployed)
                                             # lower<x<upper: ABSORPTION, split between knobs

Three zones: below lower = not hack, keep; above upper = hack, route to the deletable δS_hack; between = absorption, the gradient splits proportionally. The ramp softens the solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of its signal in δS), so the hard exploration floor (rollout_ablate_frac) is no longer required, though it can stay as a belt-and-braces solve guarantee.
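
The three zones can be sketched as a small pure function. Assumptions: plain lists stand in for δS gradients, `route_fraction` and `split_gradient` are hypothetical names, and the fallback for a closed band (a hard step at `upper`) is a guard added here, not something the spec prescribes.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def route_fraction(x, lower, upper):
    """Absorption ramp: 0 at or below lower, 1 at or above upper, linear between."""
    width = upper - lower
    if width <= 0.0:
        # Assumption: a closed or inverted band falls back to a hard step at `upper`.
        return 1.0 if x >= upper else 0.0
    return min(1.0, max(0.0, (x - lower) / width))

def split_gradient(g_step, vec, lower, upper):
    """Return (route_frac, part for δS_hack, part for δS)."""
    frac = route_fraction(cos(g_step, vec), lower, upper)
    hack_part = [frac * g for g in g_step]           # quarantined, deleted at deploy
    keep_part = [(1.0 - frac) * g for g in g_step]   # deployed knob
    return frac, hack_part, keep_part

# Toy band and gradients (made-up numbers), one per zone.
vec, lower, upper = [1.0, 0.0], 0.0, 0.8
for g_step in ([0.0, 2.0], [3.0, 4.0], [5.0, 0.0]):
    frac, hack_part, keep_part = split_gradient(g_step, vec, lower, upper)
    print(g_step, round(frac, 3), hack_part, keep_part)
```

The two parts always sum back to `g_step`, so the ramp redistributes the gradient between the knobs without rescaling it.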

Granularity: per-rollout (`g_step` = per-rollout δS grad)

Decided, with paper backing -- not left open.

What we do now (train.py:881-896, _route2_grad_filter): the baukit gate c is per-TOKEN ([G*s, r], since nn.Linear sees a flattened batch). We SUM each rollout's token gate-grads -> [G, r], divide by δS to recover the per-rollout knob grad g_b, and take one cos_b per rollout. So the live unit is already PER-ROLLOUT. The recovery hook exists; the band just replaces the cos_b > tau line with the ramp.
Gradient Routing (Cloud et al. 2024): data-dependent stop-gradient masks at a few layers' activations (x = mask*act + (1-mask)*act.detach()). For LLMs the mask is per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal state). So the RL-native unit there is the trajectory.
SGTM (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain examples only update their dedicated params). Its contribution is robustness to LABEL NOISE, not a new granularity.

Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's ALIGNMENT with an extracted direction (cos(g_b, vec) in the band) -- no per-example label. That direction-gated routing is the novelty vs both; state it as such.
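
The token-sum recovery described for `_route2_grad_filter` can be sketched as follows. This is not the train.py code: it assumes the `[G*s, r]` rows are laid out rollout-by-rollout with equal length `s`, reads "divide by δS" as elementwise division, and adds an `eps` guard for near-zero entries; `per_rollout_knob_grads` is a hypothetical name.

```python
def per_rollout_knob_grads(token_gate_grads, num_rollouts, delta_s, eps=1e-12):
    """Sum per-token gate grads [G*s, r] to [G, r], then divide by δS (length r)."""
    total_tokens = len(token_gate_grads)
    if total_tokens % num_rollouts != 0:
        raise ValueError("token rows do not divide evenly into rollouts")
    tokens_per_rollout = total_tokens // num_rollouts
    grads = []
    for b in range(num_rollouts):
        start = b * tokens_per_rollout
        rows = token_gate_grads[start:start + tokens_per_rollout]
        cg = [sum(col) for col in zip(*rows)]    # token-sum for rollout b, shape [r]
        # Assumption: elementwise division; entries with |δS| <= eps are zeroed.
        grads.append([c / d if abs(d) > eps else 0.0 for c, d in zip(cg, delta_s)])
    return grads

# Toy shapes (made-up numbers): G=2 rollouts, s=2 tokens each, r=2.
token_gate_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
g = per_rollout_knob_grads(token_gate_grads, num_rollouts=2, delta_s=[2.0, 4.0])
print(g)
```

Each resulting row is one `g_b`, so one `cos_b` per rollout follows directly from `cos(g_b, vec)`.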

Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak label localizes the capability MORE generally, because the routed region participates in predictions on related non-routed data so the feature is not learned elsewhere. That is the mechanism that would let routing on known A suppress unknown B; it is the theoretical basis for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means proportional split; same word, different thing.)

Cheap, label-free diagnostics (validation dropped)

We are NOT running a live detector validation. Running the weak detector over the student's own rollouts during training is on the wrong side of the no-cheat line (README: that is exactly the cheat), and a live validation is complex and non-causal. The causal proof is downstream (deploy performance + real-vs-random). During training we only LOG cheap, label-free gauges (ml-debug: log everything, state the expected value and what a deviation means, chase anomalies):

SHOULD per refresh:  hkgap = upper - lower  > 0 and roughly stable.
  ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
  a pre-routing vec snapshot.

SHOULD per refresh:  LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
  (band built on the OTHER pairs still separates the held-out pair -- "does the threshold
  generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.

SHOULD per step:  live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
  ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
  routes everything (miscalibrated high). This is the calibration read with NO labels.

SHOULD per step:  route_frac mean in (0,1), with some mass at 0 and some at 1.
  ELSE all-0 or all-1 = degenerate gate.

SHOULD per step:  resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
  ELSE >0 = hack-ward grad leaking into δS (the real failure).

None of these touch held-out B or run a detector over students; they read the band, the pairs, and the live cosine geometry only.
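
The LOO separation gauge is the only one that needs a rebuild of vec per pair. A minimal sketch under assumptions: plain lists for the per-pair gradients, the mean-diff variant of vec, and hypothetical names (`loo_separation`); the toy numbers are made up.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("zero direction")
    return [x / n for x in v]

def loo_separation(g_rej, g_cho):
    """Mean over pairs p of cos(g_rej[p], vec_-p) - cos(g_cho[p], vec_-p)."""
    num_pairs = len(g_rej)
    if num_pairs < 2:
        raise ValueError("leave-one-out needs at least two pairs")
    dim = len(g_rej[0])
    diffs = [[r - c for r, c in zip(g_rej[p], g_cho[p])] for p in range(num_pairs)]
    seps = []
    for held in range(num_pairs):
        others = [d for p, d in enumerate(diffs) if p != held]
        vec = unit([sum(d[i] for d in others) / len(others) for i in range(dim)])
        seps.append(cos(g_rej[held], vec) - cos(g_cho[held], vec))
    return sum(seps) / num_pairs

# Pairs that share a direction: a vec built from the others separates the held-out pair.
shared = loo_separation(
    [[1.0, 0.2, 0.0], [0.9, -0.1, 0.1], [1.1, 0.0, -0.2]],
    [[-1.0, 0.1, 0.0], [-0.8, 0.0, 0.1], [-0.9, 0.1, 0.0]],
)
# Pairs whose diffs are mutually orthogonal: nothing transfers between pairs.
memorized = loo_separation(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]],
)
print(shared, memorized)
```

The second toy case is the "pair-memorized noise" failure in its purest form: every pair separates perfectly under its own direction, yet the leave-one-out value is zero.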

Review findings (2026-06-06) -- decisions before implementing

Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md). The banded gate supersedes the single-midpoint tau deepseek reviewed. Its surviving points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band diagnostic above (no labels); vec degeneracy -> the hkgap collapse check. Its "circular tau" framing is moot under the band: the edges are not a decision point and the width is validated against live data, not asserted from the pairs.

Match the flagged fraction in the real-vs-random control (deepseek #2, kept). Real and random vec otherwise quarantine different volumes of gradient, so a suppression gap could be volume, not direction. For the control specifically, set both tau by the same QUANTILE of the live cos_b so equal fractions route and only DIRECTION differs. Suppression gap at matched fraction => direction is load-bearing.
Whole-rollout routing + exploration floor (Claude, corrected). Keep route2's whole-rollout quarantine (it is the gradient-routing thesis); keep rollout_ablate_frac for solve signal. Component-routing was rejected because it collapses to erase at deploy (see Routing mechanism above).
Degeneracy diagnostic (deepseek #3, kept). As routing suppresses hacks, the hack-pair gradient can weaken and the refreshed vec degenerate. Log hkgap = upper - lower (mean pair cosine of rej minus that of cho) per refresh; if it collapses toward 0, freeze a pre-routing vec snapshot.
Pre-register the science UAT (deepseek, kept; user-confirmed). n>=3 seeds per condition (real/random), success = mean held-out-B deploy hack under real-vec is below random-vec by more than the across-seed std of the random baseline. Qualitative "suppresses better" is not enough.
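
For the matched-fraction control in the first item, one simple way to route equal fractions under both vectors is to flag the top-k rollouts by cosine, which is equivalent to thresholding at a quantile of the live cos_b. This is a sketch with hypothetical names (`flag_top_fraction`) and made-up cosines; tie-breaking by rollout index is an assumption.

```python
def flag_top_fraction(cos_b, fraction):
    """Flag the `fraction` of rollouts with the highest cosine (ties: lower index first)."""
    n_route = round(fraction * len(cos_b))
    order = sorted(range(len(cos_b)), key=lambda i: cos_b[i], reverse=True)
    flagged = [False] * len(cos_b)
    for i in order[:n_route]:
        flagged[i] = True
    return flagged

# Made-up live cosines for the same 8 rollouts under a real and a random vec.
cos_real = [0.62, 0.05, 0.48, -0.10, 0.31, 0.02, 0.55, -0.03]
cos_rand = [0.03, -0.02, 0.01, 0.04, -0.05, 0.00, 0.02, -0.01]

flagged_real = flag_top_fraction(cos_real, fraction=0.25)
flagged_rand = flag_top_fraction(cos_rand, fraction=0.25)
print(sum(flagged_real), sum(flagged_rand))   # equal volume routed in both arms
```

With the routed volume pinned, the two arms differ only in WHICH rollouts are quarantined, so a suppression gap between them can be attributed to direction rather than volume.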
