Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random). Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp.

Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Pair-routing redesign — 2026-06-06
Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate everything from the hand-built persona pairs (no live detector, no teacher force-route), so "does the direction generalize to unlabeled hack B" is a clean test. Driving principle: the only labelled data anywhere is the pairs.
Naming
Preference convention, per pair:
- `cho` = chosen = the honest/clean completion.
- `rej` = rejected = the hack completion.
- `vec` = hack-ward direction = `rej - cho` (so `cos(g, vec) > 0` means hack-ward). (The current code calls this `v_grad` and orients it `hack - clean`; same thing.)
We can build vec from gradients OR activations of the pairs (fork below).
What we do NOW (route2)
# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p))) # per module, SVD top-k, hack-ward
# ── live, every step ──
for rollout b:
    g_b = per_rollout_δS_grad(b) # recovered from the gate hook
    cos_b = cos(g_b, vec)
# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
hack_anchor = teacher_rows | hacked_E(students) # ← live detector, noisy, leaks onto B
clean_anchor = ~hack_anchor
τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2
flagged = hack_anchor | (cos_b > τ) # ← FORCE-ROUTE by label, then gate
route(flagged) → δS_hack # quarantine, deleted at deploy
keep(~flagged) → δS # the deployed knob
Two things make vec non-load-bearing here:
- `hack_anchor |` force-routes every teacher / detector-flagged rollout regardless of `vec`.
- With a random `vec`, both anchor means → 0, so `τ → 0` and `cos_b > τ` is a coin flip.

So a random direction suppresses as well as the real one (jobs 114/115/117): the labels carried it. The detector also lumps "hacked" with "honestly passed but wrote a self-test", and leaks ~1.1% onto held-out B.
What we do INSTEAD (pair-routed)
Everything — direction AND the two band edges — comes from the pairs. No teacher force-route, no live detector over students. A BANDED cosine gate with an absorption zone (full mechanism + validation are in the "Routing mechanism" section below):
# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
for pair p:
    g_rej[p] = ∇logp(rej_p) # hack completion (forward+backward through δS)
    g_cho[p] = ∇logp(cho_p) # clean completion
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward (or SVD top-k of stacked diffs)
lower = mean_p cos(g_cho[p], vec) # clean edge (low) — below: not hack, keep
upper = mean_p cos(g_rej[p], vec) # hack edge (high) — above: hack, route
# ── live: absorption ramp, no labels ──
x = cos(g_step, vec)
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac * g_step # deleted at deploy
δS.grad += (1 - route_frac) * g_step # deployed knob; absorption in the middle
Teacher is kept only to SEED emergence and is cut at step 30 (teacher_off_step=30,
see the journal-evidence note: student out-hacks teacher ~step 40, so 30 may be early;
40 is the safer value if emergence stalls). Its rollouts are NOT force-routed — they go
through the same band as any student rollout. After the cut it is pure on-policy.
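The pair-side calibration in the pseudocode above (direction plus both band edges) could be sketched in plain Python roughly as follows. This is a minimal illustration, not the train.py implementation: plain lists stand in for flattened per-module δS gradients, `calibrate_band` is a hypothetical helper name, the toy gradients are made up, and the SVD top-k variant is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b, eps=1e-12):
    # cosine similarity; eps guards against zero-norm inputs
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def calibrate_band(g_rej, g_cho):
    """Return (vec, lower, upper) from per-pair gradients.

    g_rej[p] / g_cho[p]: flattened gradient of pair p's hack / clean
    completion (hypothetical layout: one flat list per pair).
    """
    n, dim = len(g_rej), len(g_rej[0])
    # mean over pairs of the hack-minus-clean gradient difference
    mean_diff = [sum(g_rej[p][i] - g_cho[p][i] for p in range(n)) / n
                 for i in range(dim)]
    scale = math.sqrt(dot(mean_diff, mean_diff))
    if scale == 0.0:
        raise ValueError("pair gradient differences cancel; no direction")
    vec = [x / scale for x in mean_diff]             # unit, hack-ward
    lower = sum(cos(g, vec) for g in g_cho) / n      # clean edge (low)
    upper = sum(cos(g, vec) for g in g_rej) / n      # hack edge (high)
    return vec, lower, upper

# toy pair gradients (made-up numbers, illustration only)
g_rej = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1]]
g_cho = [[-0.5, 0.3, 0.0], [-0.7, 0.0, 0.2]]
vec, lower, upper = calibrate_band(g_rej, g_cho)
print(f"lower={lower:.3f} upper={upper:.3f} hkgap={upper - lower:.3f}")
```

In the spec this would run per module at every refresh, with the gradients taken through the current adapter.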
Now vs new — what changed
| | now (route2) | new (pair-routed band) |
|---|---|---|
| direction `vec` | pair gradient diff | pair gradient (or activation) diff -- same source |
| gate | single live-detector τ, hard `cos>τ` | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
| force-route | yes (`hack_anchor \|`) | none -- gate only |
| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes -- random vec closes the band (width->0) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
Fork to decide in the rebuild: gradients vs activations for vec
- gradients (`∇logp(rej) - ∇logp(cho)`): keeps the "intervene in the gradient, not the reward" thesis; ablating it removes the hack-ward update. Cost: noisy, drifts (needs refresh), per-rollout gradient recovery via the gate hook is the fiddliest code we have.
- activations (hidden-state diff-of-means on rej vs cho): cleaner, lower variance, just a forward pass; but ablating it is concept-erasure / steering, a different mechanism than gradient routing. There is a dormant act-vector variant to revive.
- Plan: build both from the same pairs, gate the same way, compare on held-out B. Default to gradients to preserve the thesis unless activations clearly win.
Calibration risk to smoke-test first
The pair gradients are teacher-forced on authored completions; live rollouts are
sampled (different length / entropy), so τ from the pairs may not sit between the
live hack/clean rollout cosines. BEFORE trusting it: log cos(g_rej), cos(g_cho)
(pairs) alongside the live per-rollout cos_b distribution and confirm the pair
midpoint actually separates live hack from live clean. If it doesn't, the fix is to
keep calibrating τ each step but still from the pairs (re-projected through the
current adapter), not from a live detector.
What this buys
- Airtight no-cheat by construction: B never touched by any detector, so no `--gate-anchor-teacher-only` patch needed.
- The real-vs-random control becomes meaningful: if a random `vec` now suppresses B, it is pure coincidence, not labels. If only the real `vec` suppresses B, the direction genuinely generalizes -- the whole novelty.
- Less code: delete the `hacked_E` plumbing, the `hack_anchor`/`clean_anchor` builder, the `--gate-anchor-teacher-only` flag, the EMA detector calibration.
Implementation plan (src/vgrout/train.py) — actionable, post-compaction
Replace route2's gate in place (research code, break it; tag pre-routing-refactor
is the rollback). Gradients, not activations, for vec (default; activation variant
deferred). vec sign = hack-ward = rej - cho.
1. DELETE `build_route2_anchors` (~line 337) and its call site. No more `hack_anchor`/`clean_anchor` from teacher membership or the detector.
2. Rewrite `_route2_grad_filter` (~line 877) into the banded gate:
   - drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908). No hard `cos_b > tau`.
   - `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`; `δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper` come from the pairs (step 3), passed in.
   - granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing token-sum -> `g_b = cg/dS` recovery, just swap the `cos_b > tau` line for the ramp. `rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
3. Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps (reuse the existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute `lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module. The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
4. Remove plumbing: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the `hack_E_flags` feed into the GATE (drop -- no detector touches student rollouts at train time now; keep `hack_E_flags` only if still cheap for the streaming hk_* LOG columns). `route2_random_v_seed` stays (it's the directionality control).
5. Config: `teacher_off_step` default 30 (done; consider 40 per journal evidence). Teacher rollouts go through the same band (NOT force-routed).
6. Diagnostics to print (all label-free, see "Cheap, label-free diagnostics"): `hkgap = upper - lower`; LOO pair separation; live `cos_b` percentiles vs `[lower,upper]`; `route_frac` mean + mass-at-0 + mass-at-1; `resid = cos(g_keep, vec)`.
Current state — resume after compaction
- Working on main (`probe/distill-cosine`), NOT the worktree. Worktree `/workspace/projected_grpo-pairroute` (branch `refactor/pair-routing`) holds an earlier copy of this spec; ignore or `git worktree remove` it.
- Queue is PAUSED (`pueue pause`). Job 127 (erase_realv) was running when paused. `pueue start` resumes. Do NOT resume until the refactor is committed + smoked, or the queued route2/A5 jobs will run half-built code.
- Rollback tag: `pre-routing-refactor`. Job manifest: `docs/spec/20260606_job_manifest.md`.
Queued-job disposition (decide before pueue start)
- Superseded by this refactor (old route2 semantics) -> remove + requeue under new code: 124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only), 130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
- Still valid as-is (intervention=none / erase): 129 (vanilla-200 KL, A4), 131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running). Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
- After the refactor, requeue the decisive new-method test: pair-routed real `vec` vs random `vec`, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.
Smoke + UAT
- `just smoke` (route2 path) must pass on the tiny-random model after the rewrite. `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot (no anchor) -> update or delete it.
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack < vanilla at matched solve, AND the label-free diagnostics are healthy (band width `hkgap > 0`, LOO separation > 0, live `cos_b` straddles `[lower,upper]`, `resid ~ 0`).
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as real-`vec` -- else the vector is still decorative and the method is just gradient routing on labels.
Validation logic -- what proves the method works (read this before the gate)
The pairs do ONE job: produce vec and the two band edges. They are never EVIDENCE the
method works. Two separate checks below carry that weight.
- Calibration is read, not validated -- cheap label-free gauges (live `cos_b` vs the band, `hkgap`, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate is alive and the band is calibrated, WITHOUT running any detector over students. No live detector validation (that would be the cheat).
- Generalization test = deploy performance on held-out B (never labelled, never in the pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
- Decisive control = real-`vec` vs random-`vec`. With a random `vec` both pair edges collapse to ~0, the band closes, and routing degenerates to a cos>0 coin flip. So the band WIDTH is itself the real-vs-random discriminator; no separate matched-fraction control is needed.
upper > lower is biased positive by construction (vec points along the mean rej-cho
diff), so the ordering is not evidence. The band WIDTH and POSITION are the empirical,
load-bearing quantities, and the calibration validation tests them against live data.
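To make the "random `vec` closes the band" claim concrete, here is a small synthetic sketch. It assumes toy pair gradients built as a shared hidden hack direction plus Gaussian noise; the dimensions, noise scale, and helper names are all hypothetical, and the printed widths are not results from this project.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cos(a, b, eps=1e-12):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def band(g_rej, g_cho, vec):
    # band edges exactly as in the spec: mean pair cosines against vec
    lower = sum(cos(g, vec) for g in g_cho) / len(g_cho)
    upper = sum(cos(g, vec) for g in g_rej) / len(g_rej)
    return lower, upper

rng = random.Random(0)
dim, n_pairs = 256, 16                      # assumed toy sizes
hidden = [1.0] + [0.0] * (dim - 1)          # synthetic "true" hack direction

def noisy(sign):
    # signal along the hidden direction plus isotropic noise
    return [sign * h + rng.gauss(0.0, 0.1) for h in hidden]

g_rej = [noisy(+1.0) for _ in range(n_pairs)]
g_cho = [noisy(-1.0) for _ in range(n_pairs)]

mean_diff = [sum(r[i] - c[i] for r, c in zip(g_rej, g_cho)) / n_pairs
             for i in range(dim)]
real_vec = unit(mean_diff)
random_vec = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])

lo_r, up_r = band(g_rej, g_cho, real_vec)
lo_x, up_x = band(g_rej, g_cho, random_vec)
width_real, width_random = up_r - lo_r, up_x - lo_x
print(f"real   vec: band=[{lo_r:+.3f}, {up_r:+.3f}] width={width_real:.3f}")
print(f"random vec: band=[{lo_x:+.3f}, {up_x:+.3f}] width={width_random:.3f}")
```

In high dimension a random unit vector is nearly orthogonal to every pair gradient, so both edges sit near 0 and the width shrinks with dimension; the real direction keeps the edges apart.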
Routing mechanism — banded cosine gate with an absorption zone
The gate is a BAND, not a single threshold. Two edges, both measured from the pairs:
# ── refresh every N steps: vec + band edges from the pairs, through the current adapter ──
for pair p:
    g_rej[p], g_cho[p] = ∇logp(rej_p), ∇logp(cho_p) # δS-space, per module
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward
lower = mean_p cos(g_cho[p], vec) # where genuinely-CLEAN gradients land (low)
upper = mean_p cos(g_rej[p], vec) # where genuinely-HACK gradients land (high)
hkgap = upper - lower # band width = the load-bearing separation signal
# ── live: absorption ramp, pure gate, NO labels, NO force-route ──
x = cos(g_step, vec) # alignment of the live gradient with the hack dir
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac * g_step # x>=upper -> 1: fully quarantined (deleted)
δS.grad += (1 - route_frac) * g_step # x<=lower -> 0: fully kept (deployed)
# lower<x<upper: ABSORPTION, split between knobs
Three zones: below lower = not hack, keep; above upper = hack, route to the deletable
δS_hack; between = absorption, the gradient splits proportionally. The ramp softens the
solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of its signal
in δS), so the hard exploration floor (rollout_ablate_frac) is no longer required,
though it can stay as a belt-and-braces solve guarantee.
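The three zones can be sketched as a small pure function. This is an illustration under assumptions, not the `_route2_grad_filter` code: gradients are plain lists, `split_gradient` is a hypothetical name, and the closed-band fallback (a hard step at the midpoint when `upper - lower` is not positive) is my assumption about how to avoid dividing by zero.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b, eps=1e-12):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def route_frac(x, lower, upper, eps=1e-8):
    """Fraction of the gradient sent to the quarantine knob."""
    width = upper - lower
    if width <= eps:
        # ASSUMPTION: closed (or inverted) band falls back to a hard step
        return 1.0 if x > 0.5 * (lower + upper) else 0.0
    return min(1.0, max(0.0, (x - lower) / width))

def split_gradient(g_step, vec, lower, upper):
    """Split one per-rollout gradient between δS_hack and δS."""
    f = route_frac(cos(g_step, vec), lower, upper)
    g_hack = [f * g for g in g_step]           # accumulates into δS_hack.grad
    g_keep = [(1.0 - f) * g for g in g_step]   # accumulates into δS.grad
    return f, g_hack, g_keep

vec = [1.0, 0.0]
for g in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    f, g_hack, g_keep = split_gradient(g, vec, lower=-0.5, upper=0.5)
    print(g, "route_frac =", f)
```

The split is conservative in the sense that `g_hack + g_keep` always equals `g_step`; only the destination changes.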
Granularity: per-rollout (g_step = per-rollout δS grad)
Decided, with paper backing -- not left open.
- What we do now (train.py:881-896, `_route2_grad_filter`): the baukit gate `c` is per-TOKEN (`[G*s, r]`, since nn.Linear sees a flattened batch). We SUM each rollout's token gate-grads -> `[G, r]`, divide by `δS` to recover the per-rollout knob grad `g_b`, and take one `cos_b` per rollout. So the live unit is already PER-ROLLOUT. The recovery hook exists; the band just replaces the `cos_b > tau` line with the ramp.
- Gradient Routing (Cloud et al. 2024): data-dependent stop-gradient masks at a few layers' activations (`x = mask*act + (1-mask)*act.detach()`). For LLMs the mask is per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal state). So the RL-native unit there is the trajectory.
- SGTM (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain examples only update their dedicated params). Its contribution is robustness to LABEL NOISE, not a new granularity.
Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's
per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers
route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's
ALIGNMENT with an extracted direction (cos(g_b, vec) in the band) -- no per-example label.
That direction-gated routing is the novelty vs both; state it as such.
Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak label localizes the capability MORE generally, because the routed region participates in predictions on related non-routed data so the feature is not learned elsewhere. That is the mechanism that would let routing on known A suppress unknown B; it is the theoretical basis for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means proportional split; same word, different thing.)
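The per-rollout recovery described in this section (sum each rollout's token gate-grads, then divide by δS) can be illustrated with a toy sketch. It rests on an assumption about the gate that the spec does not spell out: that a gate of ones multiplies δS elementwise, so the gate grad at a token is `upstream * δS` and dividing the per-rollout sum by δS gives back the rollout's δS gradient. The function name, the contiguous row layout (rollout `b` owns rows `b*s .. (b+1)*s`), and the non-zero δS entries are likewise assumptions.

```python
def recover_per_rollout_grad(c_grad, delta_s, G, s):
    """Collapse per-token gate grads [G*s, r] to per-rollout grads [G, r].

    c_grad:  list of G*s rows, each of length r (gate grad per token)
    delta_s: list of length r (the knob; assumed to have no zero entries)
    """
    r = len(delta_s)
    if len(c_grad) != G * s:
        raise ValueError("expected G*s token rows")
    per_rollout = []
    for b in range(G):
        rows = c_grad[b * s:(b + 1) * s]            # tokens of rollout b
        summed = [sum(row[j] for row in rows) for j in range(r)]
        # elementwise division undoes the δS factor in the gate grad
        per_rollout.append([summed[j] / delta_s[j] for j in range(r)])
    return per_rollout

# toy check: build gate grads from known upstream grads u, then recover them
delta_s = [0.5, -2.0]
u = [[1, 2], [3, 4], [5, 6],      # rollout 0, three tokens
     [-1, 0], [2, 2], [0, -3]]    # rollout 1, three tokens
c_grad = [[ut[j] * delta_s[j] for j in range(2)] for ut in u]
g_b = recover_per_rollout_grad(c_grad, delta_s, G=2, s=3)
print(g_b)
```

Each row of `g_b` is then the `g_step` that gets one cosine against `vec` and one `route_frac`.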
Cheap, label-free diagnostics (validation dropped)
We are NOT running a live detector validation. Running the weak detector over the student's own rollouts during training is on the wrong side of the no-cheat line (README: that is exactly the cheat), and a live validation is complex and non-causal. The causal proof is downstream (deploy performance + real-vs-random). During training we only LOG cheap, label-free gauges (ml-debug: log everything, state the expected value and what a deviation means, chase anomalies):
SHOULD per refresh: hkgap = upper - lower > 0 and roughly stable.
ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
a pre-routing vec snapshot.
SHOULD per refresh: LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
(band built on the OTHER pairs still separates the held-out pair -- "does the threshold
generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
routes everything (miscalibrated high). This is the calibration read with NO labels.
SHOULD per step: route_frac mean in (0,1), with some mass at 0 and some at 1.
ELSE all-0 or all-1 = degenerate gate.
SHOULD per step: resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
ELSE >0 = hack-ward grad leaking into δS (the real failure).
None of these touch held-out B or run a detector over students; they read the band, the pairs, and the live cosine geometry only.
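A rough sketch of how three of these gauges might be computed is below. It is illustrative only: helper names are hypothetical, lists stand in for tensors, the percentile uses a simple nearest-rank rule, and the "straddle" check encodes one reading of the SHOULD/ELSE lines above (p90 below `lower` means nothing is routed, p10 above `upper` means everything is).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos(a, b, eps=1e-12):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def unit_mean_diff(g_rej, g_cho, skip=None):
    # hack-ward unit direction from all pairs except `skip`
    idx = [p for p in range(len(g_rej)) if p != skip]
    dim = len(g_rej[0])
    mean = [sum(g_rej[p][i] - g_cho[p][i] for p in idx) / len(idx)
            for i in range(dim)]
    scale = math.sqrt(dot(mean, mean))
    return [m / scale for m in mean]

def loo_separation(g_rej, g_cho):
    # mean_p [cos(g_rej[p], vec_{-p}) - cos(g_cho[p], vec_{-p})]
    seps = []
    for p in range(len(g_rej)):
        v = unit_mean_diff(g_rej, g_cho, skip=p)
        seps.append(cos(g_rej[p], v) - cos(g_cho[p], v))
    return sum(seps) / len(seps)

def percentile(values, q):
    ranked = sorted(values)                       # nearest-rank percentile
    k = min(len(ranked) - 1, max(0, round(q * (len(ranked) - 1))))
    return ranked[k]

def band_read(cos_b, lower, upper):
    p10, p50, p90 = (percentile(cos_b, q) for q in (0.1, 0.5, 0.9))
    if p90 < lower:
        status = "all_below_lower"                # band routes nothing
    elif p10 > upper:
        status = "all_above_upper"                # band routes everything
    else:
        status = "ok"
    return {"p10": p10, "p50": p50, "p90": p90, "status": status}

def route_mass(route_fracs):
    n = len(route_fracs)
    return {"mean": sum(route_fracs) / n,
            "mass_at_0": sum(f == 0.0 for f in route_fracs) / n,
            "mass_at_1": sum(f == 1.0 for f in route_fracs) / n}

# toy inputs (made-up numbers, illustration only)
g_rej = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1], [0.9, 0.1, -0.2]]
g_cho = [[-0.5, 0.3, 0.0], [-0.7, 0.0, 0.2], [-0.6, -0.2, 0.1]]
print("LOO separation:", loo_separation(g_rej, g_cho))
print(band_read([-0.6, -0.2, 0.0, 0.3, 0.7], lower=-0.4, upper=0.5))
print(route_mass([0.0, 0.0, 0.5, 1.0]))
```

All three take only the pairs, the band, and the live cosines as input, so they stay on the label-free side of the no-cheat line.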
Review findings (2026-06-06) -- decisions before implementing
Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
The banded gate supersedes the single-midpoint tau deepseek reviewed. Its surviving
points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band
diagnostic above (no labels); vec degeneracy -> the hkgap collapse check.
Its "circular tau" framing is moot under the band: the edges are not a decision point and
the width is validated against live data, not asserted from the pairs.
- Match the flagged fraction in the real-vs-random control (deepseek #2, kept). Real and random `vec` otherwise quarantine different volumes of gradient, so a suppression gap could be volume, not direction. For the control specifically, set both tau by the same QUANTILE of the live `cos_b` so equal fractions route and only DIRECTION differs. Suppression gap at matched fraction => direction is load-bearing.
- Whole-rollout routing + exploration floor (Claude, corrected). Keep route2's whole-rollout quarantine (it is the gradient-routing thesis); keep `rollout_ablate_frac` for solve signal. Component-routing was rejected because it collapses to `erase` at deploy (see Routing mechanism above).
- Degeneracy diagnostic (deepseek #3, kept). As routing suppresses hacks, the hack-pair gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho` per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot.
- Pre-register the science UAT (deepseek, kept; user-confirmed). n>=3 seeds per condition (real/random), success = mean held-out-B deploy hack under real-`vec` is below random-`vec` by more than the across-seed std of the random baseline. Qualitative "suppresses better" is not enough.
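Two of the decisions above reduce to a few lines of arithmetic, sketched here under assumptions. `quantile_threshold` picks a tau so that the same fraction of rollouts routes under either direction, and `uat_pass` encodes the pre-registered criterion. Both names are hypothetical, the across-seed std is taken as the sample standard deviation (the spec does not say which estimator), and the numbers in the demo are placeholders, not measurements.

```python
import statistics

def quantile_threshold(cos_b, route_fraction):
    """Tau such that about `route_fraction` of rollouts have cos_b > tau.

    Assumes no ties at the cut; with ties the routed count can differ.
    """
    ranked = sorted(cos_b, reverse=True)
    k = round(route_fraction * len(ranked))
    if k <= 0:
        return float("inf")                       # route nothing
    if k >= len(ranked):
        return float("-inf")                      # route everything
    return 0.5 * (ranked[k - 1] + ranked[k])      # midpoint between ranks

def uat_pass(hack_real, hack_random):
    """Pre-registered success: real mean below random mean by > random std."""
    if len(hack_real) < 3 or len(hack_random) < 3:
        raise ValueError("need n>=3 seeds per condition")
    gap = statistics.mean(hack_random) - statistics.mean(hack_real)
    return gap > statistics.stdev(hack_random)

# placeholder cosines for one batch under each direction (not real data)
cos_real = [0.62, 0.10, -0.35, 0.48, -0.05, 0.71, -0.50, 0.22]
cos_rand = [0.03, -0.04, 0.01, 0.06, -0.02, 0.05, -0.07, 0.00]
tau_real = quantile_threshold(cos_real, 0.25)
tau_rand = quantile_threshold(cos_rand, 0.25)
routed_real = sum(c > tau_real for c in cos_real)
routed_rand = sum(c > tau_rand for c in cos_rand)
print("routed under real / random:", routed_real, routed_rand)

# placeholder held-out-B deploy hack rates per seed (not real data)
print("UAT pass:", uat_pass([0.10, 0.12, 0.08], [0.50, 0.55, 0.45]))
```

Matching the routed count removes volume as an explanation, so any remaining suppression gap between the two arms is attributable to direction.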