spec: pair-routing impl plan + resume-after-compaction state

Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 00:10:23 +00:00
parent 68b0624733
commit dfdc538428
+181
@@ -0,0 +1,181 @@
# Pair-routing redesign — 2026-06-06
Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate
everything from the hand-built persona pairs (no live detector, no teacher
force-route), so "does the direction generalize to unlabeled hack B" is a clean
test. Driving principle: the only labelled data anywhere is the pairs.
## Naming
Preference convention, per pair:
- `cho` = chosen = the honest/clean completion.
- `rej` = rejected = the hack completion.
- `vec` = hack-ward direction = `rej - cho` (so `cos(g, vec) > 0` means hack-ward).
(The current code calls this `v_grad` and orients it `hack - clean`; same thing.)
We can build `vec` from gradients OR activations of the pairs (fork below).
## What we do NOW (route2)
```py
# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p)))   # per module, SVD top-k, hack-ward

# ── live, every step ──
for rollout b:
    g_b   = per_rollout_δS_grad(b)                # recovered from the gate hook
    cos_b = cos(g_b, vec)

# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
hack_anchor  = teacher_rows | hacked_E(students)  # ← live detector, noisy, leaks onto B
clean_anchor = ~hack_anchor
τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2
flagged = hack_anchor | (cos_b > τ)               # ← FORCE-ROUTE by label, then gate
route(flagged) -> δS_hack                         # quarantine, deleted at deploy
keep(~flagged) -> δS                              # the deployed knob
```
Two things make `vec` non-load-bearing here:
1. `hack_anchor |` force-routes every teacher / detector-flagged rollout regardless of `vec`.
2. With a random `vec`, cosines against every rollout gradient concentrate near 0 in high dimension, so both anchor means → 0, `τ → 0`, and `cos_b > τ` is a coin flip.
So a random direction suppresses as well as the real one (jobs 114/115/117): the
labels carried it. The detector also lumps "hacked" with "honestly passed but wrote
a self-test", and leaks ~1.1% onto held-out B.
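The force-route point can be checked on toy numbers. The sketch below is a minimal stand-in for the old rule, not the code in `train.py`: `route2_flags` is a hypothetical name, the EMA is omitted, and the cosines are synthetic draws near zero that imitate a random `vec`.

```python
import random


def route2_flags(cos_b, hack_anchor):
    """Old route2 rule on one batch (EMA omitted for brevity).

    cos_b:       per-rollout cos(g_b, vec)
    hack_anchor: per-rollout bool from teacher membership or the live detector
    """
    hack = [c for c, h in zip(cos_b, hack_anchor) if h]
    clean = [c for c, h in zip(cos_b, hack_anchor) if not h]
    tau = (sum(hack) / len(hack) + sum(clean) / len(clean)) / 2
    # Force-route by label first, then apply the cosine gate.
    flagged = [h or (c > tau) for c, h in zip(cos_b, hack_anchor)]
    return flagged, tau


rng = random.Random(0)
hack_anchor = [i % 4 == 0 for i in range(64)]           # synthetic labels
cos_random = [rng.gauss(0.0, 0.05) for _ in range(64)]  # random vec: cosines near 0
flagged, tau = route2_flags(cos_random, hack_anchor)

# Every labelled rollout is routed no matter what the cosines say.
anchors_routed = all(f for f, h in zip(flagged, hack_anchor) if h)
```

Whatever `vec` is, `anchors_routed` holds by construction, which is why the real-vs-random comparison could not distinguish the two under this rule.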
## What we do INSTEAD (pair-routed)
Everything — direction AND threshold — comes from the pairs. No teacher force-route,
no live detector over students. Pure gate.
```py
# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
for pair p:
    g_rej[p] = ∇logp(rej_p)          # hack completion (forward+backward through δS)
    g_cho[p] = ∇logp(cho_p)          # clean completion
    Δ[p]     = g_rej[p] - g_cho[p]   # hack-ward per-pair diff
vec = unit(mean_p Δ[p])              # or SVD top-k of stacked Δ; per module, hack-ward

# threshold from the SAME pairs: where hacks land vs where clean lands on `vec`
c_rej = mean_p cos(g_rej[p], vec)    # high (hacks point hack-ward)
c_cho = mean_p cos(g_cho[p], vec)    # low  (clean points away)
τ = (c_rej + c_cho) / 2              # midpoint; no detector, no teacher, no live label

# ── live, every step: pure gate, no labels ──
for rollout b:
    g_b = per_rollout_δS_grad(b)
    if cos(g_b, vec) > τ:
        route(g_b) -> δS_hack        # quarantine, deleted at deploy
    else:
        keep(g_b) -> δS              # deployed knob
```
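As a runnable companion to the pseudocode, here is a minimal single-module sketch on plain Python lists. The names `calibrate` and `gate` and the 2-D toy gradients are illustrative assumptions; the real path would operate on per-module δS gradients and may use SVD top-k rather than the plain mean.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cos(a, b):
    return _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)) + 1e-12)


def calibrate(g_rej, g_cho):
    """Return (vec, tau, c_rej, c_cho) from per-pair gradients of one module."""
    n, dim = len(g_rej), len(g_rej[0])
    mean_delta = [sum(r[i] - c[i] for r, c in zip(g_rej, g_cho)) / n for i in range(dim)]
    scale = math.sqrt(_dot(mean_delta, mean_delta))
    if scale == 0.0:
        raise ValueError("pairs give a zero mean difference; no direction to extract")
    vec = [x / scale for x in mean_delta]           # unit, hack-ward (rej - cho)
    c_rej = sum(_cos(g, vec) for g in g_rej) / n    # where hack completions land
    c_cho = sum(_cos(g, vec) for g in g_cho) / n    # where clean completions land
    return vec, (c_rej + c_cho) / 2, c_rej, c_cho


def gate(g_b, vec, tau):
    """Pure gate: True means route to the quarantined knob."""
    return _cos(g_b, vec) > tau


# Toy pairs (made-up numbers, two pairs, two dimensions).
g_rej = [[1.0, 0.2], [0.9, -0.1]]
g_cho = [[-0.8, 0.1], [-1.0, 0.3]]
vec, tau, c_rej, c_cho = calibrate(g_rej, g_cho)
routed_hackward = gate([1.0, 0.0], vec, tau)
routed_cleanward = gate([-1.0, 0.0], vec, tau)
```

On these toy pairs the hack-ward probe is routed and the clean-ward probe is kept; no label enters the live decision.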
Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`);
its rollouts are NOT force-routed — they go through the same `cos > τ` gate as any
student rollout. After step 30 it is pure on-policy.
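The schedule implied here is small enough to sketch. `RouteSchedule` is a hypothetical container, not an existing class in the repo; `teacher_off_step` and `vhack_refresh_every` are the names used in this spec, the refresh period of 10 is a placeholder since the spec only says "every N", and treating the cut as exclusive (steps 0..29 mix the teacher) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RouteSchedule:
    teacher_off_step: int = 30     # from the plan: seed, then on-policy
    vhack_refresh_every: int = 10  # placeholder value; the spec only says "every N"

    def teacher_active(self, step: int) -> bool:
        # Assumption: exclusive cut, so step 30 is already pure on-policy.
        return step < self.teacher_off_step

    def should_refresh(self, step: int) -> bool:
        # Rebuild vec and tau from the pairs through the current adapter.
        return step % self.vhack_refresh_every == 0


sched = RouteSchedule()
teacher_steps = [s for s in range(60) if sched.teacher_active(s)]
refresh_steps = [s for s in range(60) if sched.should_refresh(s)]
```

Teacher rollouts produced while `teacher_active` is true would still pass through the same `cos > τ` gate as student rollouts.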
## Now vs new — what changed
| | now (route2) | new (pair-routed) |
|---|---|---|
| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint |
| force-route | yes (`hack_anchor \|`) | none — gate only |
| live detector over students | yes (noisy, leaks onto B) | none |
| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — it is the only mechanism |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction |
## Fork to decide in the rebuild: gradients vs activations for `vec`
- **gradients** (`∇logp(rej) - ∇logp(cho)`): keeps the "intervene in the gradient,
not the reward" thesis; ablating it removes the hack-ward *update*. Cost: noisy,
drifts (needs refresh), per-rollout gradient recovery via the gate hook is the
fiddliest code we have.
- **activations** (hidden-state diff-of-means on rej vs cho): cleaner, lower
variance, just a forward pass; but ablating it is concept-erasure / steering, a
different mechanism than gradient routing. There is a dormant act-vector variant
to revive.
- Plan: build both from the same pairs, gate the same way, compare on held-out B.
Default to gradients to preserve the thesis unless activations clearly win.
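For the activation branch, a minimal sketch of diff-of-means plus projection ablation looks like this. It only illustrates the mechanism being contrasted; `diff_of_means` and `ablate` are hypothetical names, the hidden states are toy lists, and the dormant act-vector variant in the repo may differ.

```python
import math


def diff_of_means(h_rej, h_cho):
    """Unit hack-ward direction from hidden states: mean(rej) - mean(cho)."""
    dim = len(h_rej[0])
    mu_rej = [sum(h[i] for h in h_rej) / len(h_rej) for i in range(dim)]
    mu_cho = [sum(h[i] for h in h_cho) / len(h_cho) for i in range(dim)]
    diff = [r - c for r, c in zip(mu_rej, mu_cho)]
    scale = math.sqrt(sum(x * x for x in diff))
    return [x / scale for x in diff]


def ablate(h, vec):
    """Remove the component of h along unit vec (concept erasure by projection)."""
    coeff = sum(x * v for x, v in zip(h, vec))
    return [x - coeff * v for x, v in zip(h, vec)]


# Toy hidden states (made-up numbers, three dimensions).
h_rej = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
h_cho = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
act_vec = diff_of_means(h_rej, h_cho)
erased = ablate([2.0, 1.0, 0.0], act_vec)
leftover = sum(x * v for x, v in zip(erased, act_vec))
```

Note the difference in what is being edited: here the forward activation loses its hack-ward component, whereas the gradient branch leaves activations alone and redirects the update.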
## Calibration risk to smoke-test first
The pair gradients are teacher-forced on authored completions; live rollouts are
sampled (different length / entropy), so `τ` from the pairs may not sit between the
live hack/clean rollout cosines. BEFORE trusting it: log `cos(g_rej)`, `cos(g_cho)`
(pairs) alongside the live per-rollout `cos_b` distribution and confirm the pair
midpoint actually separates live hack from live clean. If it doesn't, the fix is to
recalibrate τ every step instead of every N, still from the pairs (re-projected
through the current adapter), never from a live detector.
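One way to phrase that smoke check as code is sketched below. It assumes some offline hack/clean label for the live rollouts is available for diagnostics only (`live_is_hack`); that label must never feed the gate. `calibration_report` and the sample numbers are hypothetical.

```python
def calibration_report(tau, live_cos, live_is_hack):
    """Does the pair-derived tau separate live hack from live clean rollouts?

    live_is_hack is a DIAGNOSTIC label only; it is never used for routing.
    """
    hack = [c for c, h in zip(live_cos, live_is_hack) if h]
    clean = [c for c, h in zip(live_cos, live_is_hack) if not h]
    mean_hack = sum(hack) / len(hack)
    mean_clean = sum(clean) / len(clean)
    return {
        "hack_above_tau": sum(c > tau for c in hack) / len(hack),
        "clean_below_tau": sum(c <= tau for c in clean) / len(clean),
        "tau_brackets_live": mean_clean < tau < mean_hack,
    }


report = calibration_report(
    tau=0.1,
    live_cos=[0.4, 0.3, 0.05, -0.2, 0.0, 0.15],
    live_is_hack=[True, True, True, False, False, False],
)
```

If `tau_brackets_live` is false, or either fraction is low, the pair midpoint is not transferring to sampled rollouts and the per-step recalibration fallback applies.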
## What this buys
- Airtight no-cheat by construction: B is never touched by any detector, so no
`--gate-anchor-teacher-only` patch is needed.
- The real-vs-random control becomes meaningful: if a random `vec` now suppresses B,
it is pure coincidence, not labels. If only the real `vec` suppresses B, the
direction genuinely generalizes — the whole novelty.
- Less code: delete the `hacked_E` plumbing, the `hack_anchor`/`clean_anchor`
builder, the `--gate-anchor-teacher-only` flag, the EMA detector calibration.
## Implementation plan (src/vgrout/train.py) — actionable, post-compaction
Replace route2's gate in place. This is research code, so breaking changes are fine;
the tag `pre-routing-refactor` is the rollback. Use gradients, not activations, for
`vec` (the default; the activation variant is deferred). The sign of `vec` is
hack-ward: `rej - cho`.
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
`hack_anchor`/`clean_anchor` from teacher membership or the detector.
2. **Rewrite `_route2_grad_filter`** (~line 877):
- drop the `hack_anchor |` force-route term -> `flagged = (cos_b > tau)`.
- drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
- `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
`c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
`tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair
`g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means +
tau alongside. Store `route2_tau[name]` from this, not from anchors.
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
`hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if
cheap, else drop); `route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher
mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the
gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged;
`resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho` and pair midpoint
brackets the live `cos_b` of hack vs clean rollouts (the calibration smoke-check).
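The diagnostics in step 6 could be gathered per step roughly as follows. This is a sketch under assumptions: `step_diagnostics` is a hypothetical helper, gradients are toy lists for a single module, and `resid` is taken as the cosine between the summed kept gradients and `vec`.

```python
import math


def _cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (den + 1e-12)


def step_diagnostics(grads, vec, tau, c_rej, c_cho):
    """Per-step numbers to log next to the pair-calibrated gate."""
    cos_b = [_cos(g, vec) for g in grads]
    flagged = [c > tau for c in cos_b]
    kept = [g for g, f in zip(grads, flagged) if not f]
    if kept:
        kept_sum = [sum(g[i] for g in kept) for i in range(len(vec))]
        resid = _cos(kept_sum, vec)       # hack-ward signal left in the kept update
    else:
        resid = float("nan")              # everything was routed this step
    return {
        "hkgap": c_rej - c_cho,           # pair quantity: the gate's separation margin
        "tau": tau,
        "frac_flagged": sum(flagged) / len(flagged),
        "resid": resid,
        "ordered": c_rej > tau > c_cho,   # SHOULD hold
        "cos_b": cos_b,
    }


diag = step_diagnostics(
    grads=[[1.0, 0.1], [-1.0, 0.2], [0.1, 1.0]],  # toy per-rollout gradients
    vec=[1.0, 0.0],
    tau=0.2,
    c_rej=0.9,
    c_cho=-0.5,
)
```

A `resid` near or below zero suggests the kept update carries little hack-ward signal; a clearly positive value means the gate is letting it through.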
## Current state — resume after compaction
- Working on **main** (`probe/distill-cosine`), NOT the worktree. Worktree
`/workspace/projected_grpo-pairroute` (branch `refactor/pair-routing`) holds an
earlier copy of this spec; ignore or `git worktree remove` it.
- **Queue is PAUSED** (`pueue pause`). Job 127 (`erase_realv`) was running when paused.
`pueue start` resumes. Do NOT resume until the refactor is committed + smoked, or the
queued route2/A5 jobs will run half-built code.
- Rollback tag: `pre-routing-refactor`. Job manifest: `docs/spec/20260606_job_manifest.md`.
## Queued-job disposition (decide before `pueue start`)
- **Superseded by this refactor (old route2 semantics) -> remove + requeue under new code**:
124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only),
130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
- **Still valid as-is (intervention=none / erase)**: 129 (vanilla-200 KL, A4),
131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running).
Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
- After the refactor, requeue the decisive new-method test: pair-routed real `vec` vs
random `vec`, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.
## Smoke + UAT
- `just smoke` (route2 path) must pass on the tiny-random model after the rewrite.
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
(no anchor) -> update or delete it.
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows a deploy-time
hack rate below vanilla at a matched solve rate, AND the calibration check holds
(`c_rej > tau > c_cho`, and the pair `tau` brackets the live `cos_b`).
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
real-`vec` -- else the vector is still decorative and the method is just gradient
routing on labels.