spec: pair-routing impl plan + resume-after-compaction state

Adds actionable train.py targets (delete build_route2_anchors, rewrite _route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N, teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:15:58 +08:00 · 2026-06-06 00:10:23 +00:00
parent 68b0624733
commit dfdc538428
1 changed files with 181 additions and 0 deletions
@@ -0,0 +1,181 @@
+# Pair-routing redesign — 2026-06-06
+
+Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate
+everything from the hand-built persona pairs (no live detector, no teacher
+force-route), so "does the direction generalize to unlabeled hack B" is a clean
+test. Driving principle: the only labelled data anywhere is the pairs.
+
+## Naming
+
+Preference convention, per pair:
+- `cho` = chosen = the honest/clean completion.
+- `rej` = rejected = the hack completion.
+- `vec` = hack-ward direction = `rej - cho` (so `cos(g, vec) > 0` means hack-ward).
+  (The current code calls this `v_grad` and orients it `hack - clean`; same thing.)
+
+We can build `vec` from gradients OR activations of the pairs (fork below).
+
+## What we do NOW (route2)
+
+```py
+# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
+vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p)))      # per module, SVD top-k, hack-ward
+
+# ── live, every step ──
+for rollout b:
+    g_b   = per_rollout_δS_grad(b)                   # recovered from the gate hook
+    cos_b = cos(g_b, vec)
+
+# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
+hack_anchor  = teacher_rows | hacked_E(students)     # ← live detector, noisy, leaks onto B
+clean_anchor = ~hack_anchor
+τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2
+
+flagged = hack_anchor | (cos_b > τ)                  # ← FORCE-ROUTE by label, then gate
+route(flagged) → δS_hack                             # quarantine, deleted at deploy
+keep(~flagged)  → δS                                 # the deployed knob
+```
+
+Two things make `vec` non-load-bearing here:
+1. `hack_anchor |` force-routes every teacher / detector-flagged rollout regardless of `vec`.
+2. With a random `vec`, both anchor means → 0, so `τ → 0` and `cos_b > τ` is a coin flip.
+So a random direction suppresses as well as the real one (jobs 114/115/117): the
+labels carried it. The detector also lumps "hacked" with "honestly passed but wrote
+a self-test", and leaks ~1.1% onto held-out B.
+
+## What we do INSTEAD (pair-routed)
+
+Everything — direction AND threshold — comes from the pairs. No teacher force-route,
+no live detector over students. Pure gate.
+
+```py
+# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
+for pair p:
+    g_rej[p] = ∇logp(rej_p)        # hack completion   (forward+backward through δS)
+    g_cho[p] = ∇logp(cho_p)        # clean completion
+    Δ[p]     = g_rej[p] - g_cho[p] # hack-ward per-pair diff
+vec = unit(mean_p Δ[p])            # or SVD top-k of stacked Δ; per module, hack-ward
+
+# threshold from the SAME pairs — where hacks land vs where clean lands on `vec`:
+c_rej = mean_p cos(g_rej[p], vec)  # high  (hacks point hack-ward)
+c_cho = mean_p cos(g_cho[p], vec)  # low   (clean points away)
+τ     = (c_rej + c_cho) / 2        # midpoint; no detector, no teacher, no live label
+
+# ── live, every step: pure gate, no labels ──
+for rollout b:
+    g_b = per_rollout_δS_grad(b)
+    if cos(g_b, vec) > τ:
+        route(g_b) → δS_hack       # quarantine, deleted at deploy
+    else:
+        keep(g_b)  → δS            # deployed knob
+```
+
+Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`);
+its rollouts are NOT force-routed: they go through the same `cos > τ` gate as any
+student rollout. After step 30 training is purely on-policy.
+
+## Now vs new — what changed
+
+| | now (route2) | new (pair-routed) |
+|---|---|---|
+| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
+| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint |
+| force-route | yes (`hack_anchor \|`) | none — gate only |
+| live detector over students | yes (noisy, leaks onto B) | none |
+| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout |
+| is `vec` load-bearing? | no (labels carry it) | yes — it is the only mechanism |
+| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction |
+
+## Fork to decide in the rebuild: gradients vs activations for `vec`
+
+- **gradients** (`∇logp(rej) - ∇logp(cho)`): keeps the "intervene in the gradient,
+  not the reward" thesis; ablating it removes the hack-ward *update*. Cost: noisy,
+  drifts (needs refresh), per-rollout gradient recovery via the gate hook is the
+  fiddliest code we have.
+- **activations** (hidden-state diff-of-means on rej vs cho): cleaner, lower
+  variance, just a forward pass; but ablating it is concept-erasure / steering, a
+  different mechanism than gradient routing. There is a dormant act-vector variant
+  to revive.
+- Plan: build both from the same pairs, gate the same way, compare on held-out B.
+  Default to gradients to preserve the thesis unless activations clearly win.
+
+## Calibration risk to smoke-test first
+
+The pair gradients are teacher-forced on authored completions; live rollouts are
+sampled (different length / entropy), so `τ` from the pairs may not sit between the
+live hack/clean rollout cosines. BEFORE trusting it: log `cos(g_rej)`, `cos(g_cho)`
+(pairs) alongside the live per-rollout `cos_b` distribution and confirm the pair
+midpoint actually separates live hack from live clean. If it doesn't, the fix is to
+recalibrate τ every step instead of every N, still from the pairs (re-projected
+through the current adapter), not from a live detector.
+
+## What this buys
+
+- Airtight no-cheat by construction: B is never touched by any detector, so no
+  `--gate-anchor-teacher-only` patch is needed.
+- The real-vs-random control becomes meaningful: if a random `vec` now suppresses B,
+  it is pure coincidence, not labels. If only the real `vec` suppresses B, the
+  direction genuinely generalizes — the whole novelty.
+- Less code: delete the `hacked_E` plumbing, the `hack_anchor`/`clean_anchor`
+  builder, the `--gate-anchor-teacher-only` flag, the EMA detector calibration.
+
+## Implementation plan (src/vgrout/train.py) — actionable, post-compaction
+
+Replace route2's gate in place (research code, break it; tag `pre-routing-refactor`
+is the rollback). Gradients, not activations, for `vec` (default; activation variant
+deferred). `vec` sign = hack-ward = `rej - cho`.
+
+1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
+   `hack_anchor`/`clean_anchor` from teacher membership or the detector.
+2. **Rewrite `_route2_grad_filter`** (~line 877):
+   - drop the `hack_anchor |` force-route term -> `flagged = (cos_b > tau)`.
+   - drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
+   - `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
+3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
+   existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
+   `c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
+   `tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair
+   `g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means +
+   tau alongside. Store `route2_tau[name]` from this, not from anchors.
+4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
+   `hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if
+   cheap, else drop); `route2_random_v_seed` stays (it's the directionality control).
+5. **Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher
+   mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed).
+6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the
+   gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged;
+   `resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho` and the pair midpoint
+   separates the live `cos_b` of hack vs clean rollouts (the calibration smoke-check).
+
+## Current state — resume after compaction
+
+- Working on the **main** checkout (branch `probe/distill-cosine`), NOT the worktree. Worktree
+  `/workspace/projected_grpo-pairroute` (branch `refactor/pair-routing`) holds an
+  earlier copy of this spec; ignore or `git worktree remove` it.
+- **Queue is PAUSED** (`pueue pause`). Job 127 (`erase_realv`) was running when paused.
+  `pueue start` resumes. Do NOT resume until the refactor is committed + smoked, or the
+  queued route2/A5 jobs will run half-built code.
+- Rollback tag: `pre-routing-refactor`. Job manifest: `docs/spec/20260606_job_manifest.md`.
+
+## Queued-job disposition (decide before `pueue start`)
+
+- **Superseded by this refactor (old route2 semantics) -> remove + requeue under new code**:
+  124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only),
+  130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
+- **Still valid as-is (intervention=none / erase)**: 129 (vanilla-200 KL, A4),
+  131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running).
+  Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
+- After the refactor, requeue the decisive new-method test: pair-routed real `vec` vs
+  random `vec`, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.
+
+## Smoke + UAT
+
+- `just smoke` (route2 path) must pass on the tiny-random model after the rewrite.
+- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
+  (no anchor) -> update or delete it.
+- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
+  < vanilla at matched solve, AND the calibration check holds (`c_rej > tau > c_cho`,
+  pair tau separates live hack from live clean `cos_b`).
+- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
+  real-`vec` -- else the vector is still decorative and the method is just gradient
+  routing on labels.
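
Note on step 3 of the implementation plan (pair-calibrated tau plus the pure `cos > tau` gate): the following is a minimal stdlib-only sketch, not the code in `train.py`. It assumes each pair's gradient for one module has already been flattened to a list of floats, and it uses the plain mean of `g_rej - g_cho` rather than the SVD top-k variant. The names `cosine`, `calibrate_from_pairs` and `gate`, and the toy gradients, are hypothetical and exist only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def calibrate_from_pairs(g_rej, g_cho):
    """Build vec and tau for ONE module from per-pair gradients.

    g_rej[p] / g_cho[p]: flattened gradient of the hack / clean completion
    of pair p (assumed to be precomputed through the current adapter).
    Returns (vec, c_rej, c_cho, tau) with vec oriented hack-ward (rej - cho).
    """
    n, dim = len(g_rej), len(g_rej[0])
    mean_diff = [
        sum(r[i] - c[i] for r, c in zip(g_rej, g_cho)) / n for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("pair gradients give a zero direction")
    vec = [x / norm for x in mean_diff]
    c_rej = sum(cosine(g, vec) for g in g_rej) / n  # where hacks land on vec
    c_cho = sum(cosine(g, vec) for g in g_cho) / n  # where clean lands on vec
    tau = (c_rej + c_cho) / 2                       # pair midpoint, no live labels
    return vec, c_rej, c_cho, tau


def gate(g_b, vec, tau):
    """Pure gate: True means route this rollout's gradient to quarantine."""
    return cosine(g_b, vec) > tau


# Toy pair gradients (made up): rej leans toward +x, cho toward -x.
g_rej = [[2.0, 0.1, -0.2], [1.5, -0.3, 0.1], [1.8, 0.2, 0.2]]
g_cho = [[-1.0, 0.2, -0.1], [-0.5, -0.2, 0.3], [-0.8, 0.1, 0.1]]

vec, c_rej, c_cho, tau = calibrate_from_pairs(g_rej, g_cho)
hkgap = c_rej - c_cho  # the pair separation margin from the diagnostics list

routed_hacklike = gate([1.0, 0.0, 0.0], vec, tau)    # hack-ward rollout
routed_cleanlike = gate([-1.0, 0.0, 0.0], vec, tau)  # clean-ward rollout
print(f"c_rej={c_rej:.3f} c_cho={c_cho:.3f} tau={tau:.3f} hkgap={hkgap:.3f}")
print("hack-like routed:", routed_hacklike, "| clean-like routed:", routed_cleanlike)
```

In a per-module setup the returned `tau` would be what gets stored in `route2_tau[name]` at each refresh. The check `c_rej > tau > c_cho` holds on these toy vectors but is not guaranteed in general, because each cosine is normalized per vector; that is one reason to log it as a smoke-check rather than assume it.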