diff --git a/docs/spec/20260606_pair_routing_design.md b/docs/spec/20260606_pair_routing_design.md
new file mode 100644
index 0000000..6e5a84b
--- /dev/null
+++ b/docs/spec/20260606_pair_routing_design.md
@@ -0,0 +1,181 @@
+# Pair-routing redesign — 2026-06-06
+
+Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate
+everything from the hand-built persona pairs (no live detector, no teacher
+force-route), so "does the direction generalize to unlabeled hack B" is a clean
+test. Driving principle: the only labelled data anywhere is the pairs.
+
+## Naming
+
+Preference convention, per pair:
+- `cho` = chosen = the honest/clean completion.
+- `rej` = rejected = the hack completion.
+- `vec` = hack-ward direction = `rej - cho` (so `cos(g, vec) > 0` means hack-ward).
+  (The current code calls this `v_grad` and orients it `hack - clean`; same thing.)
+
+We can build `vec` from gradients OR activations of the pairs (fork below).
+
+## What we do NOW (route2)
+
+```py
+# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
+vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p)))   # per module, SVD top-k, hack-ward
+
+# ── live, every step ──
+for rollout b:
+    g_b   = per_rollout_δS_grad(b)                 # recovered from the gate hook
+    cos_b = cos(g_b, vec)
+
+# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
+hack_anchor  = teacher_rows | hacked_E(students)   # ← live detector, noisy, leaks onto B
+clean_anchor = ~hack_anchor
+τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2
+
+flagged = hack_anchor | (cos_b > τ)                # ← FORCE-ROUTE by label, then gate
+route(flagged) → δS_hack                           # quarantine, deleted at deploy
+keep(~flagged) → δS                                # the deployed knob
+```
+
+Two things make `vec` non-load-bearing here:
+1. `hack_anchor |` force-routes every teacher / detector-flagged rollout regardless of `vec`.
+2. With a random `vec`, both anchor means → 0, so `τ → 0` and `cos_b > τ` is a coin flip.
+So a random direction suppresses as well as the real one (jobs 114/115/117): the
+labels carried it. The detector also lumps "hacked" with "honestly passed but wrote
+a self-test", and leaks ~1.1% onto held-out B.
+
+## What we do INSTEAD (pair-routed)
+
+Everything — direction AND threshold — comes from the pairs. No teacher force-route,
+no live detector over students. Pure gate.
+
+```py
+# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
+for pair p:
+    g_rej[p] = ∇logp(rej_p)        # hack completion (forward+backward through δS)
+    g_cho[p] = ∇logp(cho_p)        # clean completion
+    Δ[p]     = g_rej[p] - g_cho[p] # hack-ward per-pair diff
+vec = unit(mean_p Δ[p])            # or SVD top-k of stacked Δ; per module, hack-ward
+
+# threshold from the SAME pairs — where hacks land vs where clean lands on `vec`:
+c_rej = mean_p cos(g_rej[p], vec)  # high (hacks point hack-ward)
+c_cho = mean_p cos(g_cho[p], vec)  # low  (clean points away)
+τ     = (c_rej + c_cho) / 2        # midpoint; no detector, no teacher, no live label
+
+# ── live, every step: pure gate, no labels ──
+for rollout b:
+    g_b = per_rollout_δS_grad(b)
+    if cos(g_b, vec) > τ:
+        route(g_b) → δS_hack       # quarantine, deleted at deploy
+    else:
+        keep(g_b)  → δS            # deployed knob
+```
+
+Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`);
+its rollouts are NOT force-routed — they go through the same `cos > τ` gate as any
+student rollout. After step 30 it is pure on-policy.
+
+## Now vs new — what changed
+
+| | now (route2) | new (pair-routed) |
+|---|---|---|
+| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
+| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint |
+| force-route | yes (`hack_anchor \|`) | none — gate only |
+| live detector over students | yes (noisy, leaks onto B) | none |
+| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout |
+| is `vec` load-bearing? | no (labels carry it) | yes — it is the only mechanism |
+| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction |
+
+## Fork to decide in the rebuild: gradients vs activations for `vec`
+
+- **gradients** (`∇logp(rej) - ∇logp(cho)`): keeps the "intervene in the gradient,
+  not the reward" thesis; ablating it removes the hack-ward *update*. Cost: noisy,
+  drifts (needs refresh), per-rollout gradient recovery via the gate hook is the
+  fiddliest code we have.
+- **activations** (hidden-state diff-of-means on rej vs cho): cleaner, lower
+  variance, just a forward pass; but ablating it is concept-erasure / steering, a
+  different mechanism than gradient routing. There is a dormant act-vector variant
+  to revive.
+- Plan: build both from the same pairs, gate the same way, compare on held-out B.
+  Default to gradients to preserve the thesis unless activations clearly win.
+
+## Calibration risk to smoke-test first
+
+The pair gradients are teacher-forced on authored completions; live rollouts are
+sampled (different length / entropy), so `τ` from the pairs may not sit between the
+live hack/clean rollout cosines. BEFORE trusting it: log `cos(g_rej)`, `cos(g_cho)`
+(pairs) alongside the live per-rollout `cos_b` distribution and confirm the pair
+midpoint actually separates live hack from live clean. If it doesn't, the fix is to
+keep calibrating τ each step but still from the pairs (re-projected through the
+current adapter), not from a live detector.
+
+## What this buys
+
+- Airtight no-cheat by construction: B never touched by any detector, so no
+  `--gate-anchor-teacher-only` patch needed.
+- The real-vs-random control becomes meaningful: if a random `vec` now suppresses B,
+  it is pure coincidence, not labels. If only the real `vec` suppresses B, the
+  direction genuinely generalizes — the whole novelty.
+- Less code: delete the `hacked_E` plumbing, the `hack_anchor`/`clean_anchor`
+  builder, the `--gate-anchor-teacher-only` flag, the EMA detector calibration.
+
+## Implementation plan (src/vgrout/train.py) — actionable, post-compaction
+
+Replace route2's gate in place (research code, break it; tag `pre-routing-refactor`
+is the rollback). Gradients, not activations, for `vec` (default; activation variant
+deferred). `vec` sign = hack-ward = `rej - cho`.
+
+1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
+   `hack_anchor`/`clean_anchor` from teacher membership or the detector.
+2. **Rewrite `_route2_grad_filter`** (~line 877):
+   - drop the `hack_anchor |` force-route term -> `flagged = (cos_b > tau)`.
+   - drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
+   - `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
+3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
+   existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
+   `c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
+   `tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair
+   `g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means +
+   tau alongside. Store `route2_tau[name]` from this, not from anchors.
+4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
+   `hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if
+   cheap, else drop); `route2_random_v_seed` stays (it's the directionality control).
+5. **Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher
+   mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed).
+6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the
+   gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged;
+   `resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho` and pair midpoint
+   brackets the live `cos_b` of hack vs clean rollouts (the calibration smoke-check).
+
+## Current state — resume after compaction
+
+- Working on **main** (`probe/distill-cosine`), NOT the worktree. Worktree
+  `/workspace/projected_grpo-pairroute` (branch `refactor/pair-routing`) holds an
+  earlier copy of this spec; ignore or `git worktree remove` it.
+- **Queue is PAUSED** (`pueue pause`). Job 127 (`erase_realv`) was running when paused.
+  `pueue start` resumes. Do NOT resume until the refactor is committed + smoked, or the
+  queued route2/A5 jobs will run half-built code.
+- Rollback tag: `pre-routing-refactor`. Job manifest: `docs/spec/20260606_job_manifest.md`.
+
+## Queued-job disposition (decide before `pueue start`)
+
+- **Superseded by this refactor (old route2 semantics) -> remove + requeue under new code**:
+  124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only),
+  130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
+- **Still valid as-is (intervention=none / erase)**: 129 (vanilla-200 KL, A4),
+  131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running).
+  Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
+- After the refactor, requeue the decisive new-method test: pair-routed real `vec` vs
+  random `vec`, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.
+
+## Smoke + UAT
+
+- `just smoke` (route2 path) must pass on the tiny-random model after the rewrite.
+- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
+  (no anchor) -> update or delete it.
+- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
+  < vanilla at matched solve, AND the calibration check holds (`c_rej > tau > c_cho`,
+  pair tau brackets live `cos_b`).
+- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
+  real-`vec` -- else the vector is still decorative and the method is just gradient
+  routing on labels.
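
Note on the spec above: the pair-calibrated threshold (plan step 3), the label-free gate (step 2) and the calibration smoke-check (step 6) can be sketched in plain Python. This is a minimal sketch, not the code in `src/vgrout/train.py`. It assumes each gradient is already flattened to a list of floats for a single module. The helper names `dot`, `unit`, `cos`, `calibrate_from_pairs`, `gate` and `calibration_ok` are hypothetical. The live hack/clean cosines passed to `calibration_ok` are assumed to be labelled for logging only; the spec does not say where those labels come from.

```python
import math
from typing import List, Sequence, Tuple

Vec = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Sequence[float]) -> Vec:
    """Scale v to unit L2 norm (an all-zero vector is returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def cos(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def calibrate_from_pairs(
    g_rej: List[Vec], g_cho: List[Vec]
) -> Tuple[Vec, float, float, float]:
    """Return (vec, c_rej, c_cho, tau) using the pairs as the only labelled data."""
    n, dim = len(g_rej), len(g_rej[0])
    # vec = unit(mean_p(g_rej[p] - g_cho[p])): hack-ward, i.e. rej - cho
    mean_diff = [
        sum(r[i] - c[i] for r, c in zip(g_rej, g_cho)) / n for i in range(dim)
    ]
    vec = unit(mean_diff)
    c_rej = sum(cos(g, vec) for g in g_rej) / n  # where hack completions land
    c_cho = sum(cos(g, vec) for g in g_cho) / n  # where clean completions land
    tau = (c_rej + c_cho) / 2                    # midpoint threshold
    return vec, c_rej, c_cho, tau


def gate(g_b: Sequence[float], vec: Sequence[float], tau: float) -> bool:
    """True: route to the quarantined δS_hack. False: keep in the deployed δS."""
    return cos(g_b, vec) > tau


def calibration_ok(
    c_rej: float,
    c_cho: float,
    tau: float,
    live_hack_cos: Sequence[float],
    live_clean_cos: Sequence[float],
) -> bool:
    """Smoke-check: c_rej > tau > c_cho, and tau sits between the live means."""
    mean_hack = sum(live_hack_cos) / len(live_hack_cos)
    mean_clean = sum(live_clean_cos) / len(live_clean_cos)
    return c_rej > tau > c_cho and mean_hack > tau > mean_clean


# Toy 2-D data, invented purely for illustration.
toy_rej = [[1.0, 0.2], [0.9, -0.1]]
toy_cho = [[-0.8, 0.3], [-1.0, 0.0]]
toy_vec, toy_c_rej, toy_c_cho, toy_tau = calibrate_from_pairs(toy_rej, toy_cho)
print("hkgap =", toy_c_rej - toy_c_cho, "tau =", toy_tau)
```

In this sketch `hkgap = c_rej - c_cho` falls out of the same call that produces `tau`. Swapping `toy_vec` for a random unit vector would give the real-vs-random control, with no label entering the gate.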