mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:15:58 +08:00
spec: pair-routing impl plan + resume-after-compaction state
Adds actionable train.py targets (delete build_route2_anchors, rewrite _route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N, teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,181 @@
# Pair-routing redesign (2026-06-06)

Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate
everything from the hand-built persona pairs (no live detector, no teacher
force-route), so "does the direction generalize to unlabeled hack B" is a clean
test. Driving principle: the only labelled data anywhere is the pairs.

## Naming

Preference convention, per pair:
- `cho` = chosen = the honest/clean completion.
- `rej` = rejected = the hack completion.
- `vec` = hack-ward direction = `rej - cho` (so `cos(g, vec) > 0` means hack-ward).
  (The current code calls this `v_grad` and orients it `hack - clean`; same thing.)

We can build `vec` from gradients OR activations of the pairs (fork below).

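A toy check of the sign convention, as a sketch only: the three-dimensional vectors are made-up stand-ins for per-module gradients (or activations), and `cosine` is a helper defined here, not a name from the codebase.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two flattened vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up per-pair features for one module.
cho = np.array([1.0, 0.0, 0.0])   # honest/clean completion
rej = np.array([1.0, 2.0, 0.0])   # hack completion
vec = rej - cho                   # hack-ward direction

g_hackward = np.array([0.1, 1.0, 0.0])    # update leaning toward the hack
g_cleanward = np.array([0.1, -1.0, 0.0])  # update leaning away from it

sign_hack = cosine(g_hackward, vec) > 0    # positive cosine = hack-ward
sign_clean = cosine(g_cleanward, vec) > 0  # negative cosine = not hack-ward
```

Flipping the orientation to `cho - rej` would flip both signs, which is why the convention is fixed once here.
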
## What we do NOW (route2)

```py
# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p)))   # per module, SVD top-k, hack-ward

# ── live, every step ──
for rollout b:
    g_b   = per_rollout_δS_grad(b)                # recovered from the gate hook
    cos_b = cos(g_b, vec)

# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
hack_anchor  = teacher_rows | hacked_E(students)  # ← live detector, noisy, leaks onto B
clean_anchor = ~hack_anchor
τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2

flagged = hack_anchor | (cos_b > τ)               # ← FORCE-ROUTE by label, then gate
route(flagged) → δS_hack                          # quarantine, deleted at deploy
keep(~flagged) → δS                               # the deployed knob
```

Two things make `vec` non-load-bearing here:
1. `hack_anchor |` force-routes every teacher / detector-flagged rollout regardless of `vec`.
2. With a random `vec`, the mean cosine of both anchor groups → 0, so `τ → 0` and
   `cos_b > τ` is a coin flip.

So a random direction suppresses as well as the real one (jobs 114/115/117): the
labels carried it. The detector also lumps "hacked" with "honestly passed but wrote
a self-test", and leaks ~1.1% onto held-out B.

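The two failure modes can be reproduced on synthetic data. This is a sketch under loud assumptions: gradients are Gaussian noise plus a planted hack direction, and the true labels stand in for the live detector's anchors (the real detector is noisier). None of the names below come from `train.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 4096, 256

def unit(x: np.ndarray) -> np.ndarray:
    """Normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

hack_dir = unit(rng.standard_normal(dim))
is_hack = np.arange(n) < n // 2   # stand-in for hack_anchor
# Synthetic rollout gradients: hack rows lean along +hack_dir, clean rows along -hack_dir.
sign = np.where(is_hack[:, None], 1.0, -1.0)
g = rng.standard_normal((n, dim)) + sign * 20.0 * hack_dir

def anchor_midpoint(vec: np.ndarray):
    """route2-style threshold: midpoint of the two anchor-group mean cosines (no EMA)."""
    cos_b = unit(g) @ vec
    tau = (cos_b[is_hack].mean() + cos_b[~is_hack].mean()) / 2
    return cos_b, tau

cos_real, tau_real = anchor_midpoint(hack_dir)
cos_rand, tau_rand = anchor_midpoint(unit(rng.standard_normal(dim)))

gap_real = cos_real[is_hack].mean() - cos_real[~is_hack].mean()
gap_rand = cos_rand[is_hack].mean() - cos_rand[~is_hack].mean()

# Point 2: under a random vec the gate term alone flags about half of everything.
flag_rate_rand = (cos_rand > tau_rand).mean()
# Point 1: the force-route term still catches every anchored hack row.
flagged_rand = is_hack | (cos_rand > tau_rand)
hack_recall_rand = flagged_rand[is_hack].mean()
```

In this toy setup the random direction shows almost no anchor gap, yet `hack_recall_rand` is still 1.0, because the `is_hack |` term did the routing.
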
## What we do INSTEAD (pair-routed)

Everything, direction AND threshold, comes from the pairs. No teacher force-route,
no live detector over students. Pure gate.

```py
# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
for pair p:
    g_rej[p] = ∇logp(rej_p)        # hack completion (forward+backward through δS)
    g_cho[p] = ∇logp(cho_p)        # clean completion
    Δ[p]     = g_rej[p] - g_cho[p] # hack-ward per-pair diff
vec = unit(mean_p Δ[p])            # or SVD top-k of stacked Δ; per module, hack-ward

# threshold from the SAME pairs — where hacks land vs where clean lands on `vec`:
c_rej = mean_p cos(g_rej[p], vec)  # high (hacks point hack-ward)
c_cho = mean_p cos(g_cho[p], vec)  # low (clean points away)
τ = (c_rej + c_cho) / 2            # midpoint; no detector, no teacher, no live label

# ── live, every step: pure gate, no labels ──
for rollout b:
    g_b = per_rollout_δS_grad(b)
    if cos(g_b, vec) > τ:
        route(g_b) → δS_hack       # quarantine, deleted at deploy
    else:
        keep(g_b) → δS             # deployed knob
```

Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`).
Its rollouts are NOT force-routed: they go through the same `cos > τ` gate as any
student rollout. After step 30 training is purely on-policy.

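A minimal NumPy sketch of the pair calibration and gate described above, assuming the per-pair gradients for one module are already flattened to `(n_pairs, dim)`. `calibrate_from_pairs`, `gate`, and the synthetic gradients are illustrative names and data, not the ones in `train.py`, and only the mean-diff variant of `vec` is shown (no SVD top-k).

```python
import numpy as np

def unit(x: np.ndarray) -> np.ndarray:
    """Normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def calibrate_from_pairs(g_rej: np.ndarray, g_cho: np.ndarray):
    """Direction and threshold from the same pairs; inputs are (n_pairs, dim)."""
    vec = unit((g_rej - g_cho).mean(axis=0))   # hack-ward, rej - cho
    c_rej = float((unit(g_rej) @ vec).mean())  # where hack completions land on vec
    c_cho = float((unit(g_cho) @ vec).mean())  # where clean completions land on vec
    tau = (c_rej + c_cho) / 2                  # midpoint, no live labels involved
    return vec, tau, c_rej, c_cho

def gate(g_rollouts: np.ndarray, vec: np.ndarray, tau: float) -> np.ndarray:
    """True = route to the quarantine adapter, False = keep in the deployed one."""
    return unit(g_rollouts) @ vec > tau

# Synthetic pairs: a prompt-specific part shared by both completions,
# small independent noise, and a planted hack component on the rejected side.
rng = np.random.default_rng(0)
dim, n_pairs = 512, 32
hack_dir = unit(rng.standard_normal(dim))
shared = rng.standard_normal((n_pairs, dim))
g_cho = shared + 0.3 * rng.standard_normal((n_pairs, dim))
g_rej = shared + 0.3 * rng.standard_normal((n_pairs, dim)) + 15.0 * hack_dir

vec, tau, c_rej, c_cho = calibrate_from_pairs(g_rej, g_cho)
flagged = gate(np.concatenate([g_rej, g_cho]), vec, tau)
```

On this toy data the gate routes the rejected-side gradients and keeps the chosen-side ones. Whether the same `tau` separates sampled rollouts is a separate question, covered by the calibration smoke-check.
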
## Now vs new: what changed

| | now (route2) | new (pair-routed) |
|---|---|---|
| direction `vec` | pair gradient diff | pair gradient (or activation) diff (same source) |
| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint |
| force-route | yes (`hack_anchor \|`) | none (gate only) |
| live detector over students | yes (noisy, leaks onto B) | none |
| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes (it is the only mechanism) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction |

## Fork to decide in the rebuild: gradients vs activations for `vec`

- **gradients** (`∇logp(rej) - ∇logp(cho)`): keeps the "intervene in the gradient,
  not the reward" thesis; ablating it removes the hack-ward *update*. Cost: noisy,
  drifts (needs refresh), per-rollout gradient recovery via the gate hook is the
  fiddliest code we have.
- **activations** (hidden-state diff-of-means on rej vs cho): cleaner, lower
  variance, just a forward pass; but ablating it is concept-erasure / steering, a
  different mechanism than gradient routing. There is a dormant act-vector variant
  to revive.
- Plan: build both from the same pairs, gate the same way, compare on held-out B.
  Default to gradients to preserve the thesis unless activations clearly win.

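For the activation branch, a diff-of-means direction and the projection it would ablate can be sketched as below. Assumptions: hidden states are already pooled to one vector per completion (the pooling rule is not specified here), and `act_direction` / `project_out` are hypothetical helpers, not the dormant act-vector variant itself.

```python
import numpy as np

def act_direction(h_rej: np.ndarray, h_cho: np.ndarray) -> np.ndarray:
    """Unit diff-of-means direction; inputs are (n_pairs, hidden) for one layer."""
    diff = h_rej.mean(axis=0) - h_cho.mean(axis=0)   # hack-ward, rej - cho
    return diff / np.linalg.norm(diff)

def project_out(h: np.ndarray, vec: np.ndarray) -> np.ndarray:
    """Remove each row's component along unit `vec` (concept-erasure style ablation)."""
    return h - np.outer(h @ vec, vec)

# Synthetic hidden states with a planted hack feature on the rejected side.
rng = np.random.default_rng(0)
hidden, n_pairs = 256, 64
planted = rng.standard_normal(hidden)
planted /= np.linalg.norm(planted)
h_cho = rng.standard_normal((n_pairs, hidden))
h_rej = rng.standard_normal((n_pairs, hidden)) + 8.0 * planted

vec_act = act_direction(h_rej, h_cho)
recovered = float(vec_act @ planted)     # alignment with the planted feature
h_erased = project_out(h_rej, vec_act)   # nothing left along vec_act
```

Note that `project_out` edits the forward pass, whereas the gradient branch only decides which adapter receives an update. That is the mechanism difference the fork is about.
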
## Calibration risk to smoke-test first

The pair gradients are teacher-forced on authored completions; live rollouts are
sampled (different length / entropy), so `τ` from the pairs may not sit between the
live hack/clean rollout cosines. BEFORE trusting it: log `cos(g_rej)`, `cos(g_cho)`
(pairs) alongside the live per-rollout `cos_b` distribution and confirm the pair
midpoint actually separates live hack from live clean. If it doesn't, the fix is to
keep calibrating τ each step but still from the pairs (re-projected through the
current adapter), not from a live detector.

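One way to phrase that smoke-check as a function, as a sketch. `calibration_report` is a hypothetical helper, and `live_is_hack` is assumed to be a diagnostic-only label (e.g. from post-hoc evaluation) that is logged but never fed to the gate. Both live groups are assumed non-empty.

```python
import numpy as np

def calibration_report(cos_rej, cos_cho, live_cos, live_is_hack) -> dict:
    """Compare the pair-derived threshold against live per-rollout cosines."""
    cos_rej, cos_cho = np.asarray(cos_rej), np.asarray(cos_cho)
    live_cos = np.asarray(live_cos)
    live_is_hack = np.asarray(live_is_hack, dtype=bool)

    c_rej, c_cho = float(cos_rej.mean()), float(cos_cho.mean())
    tau = (c_rej + c_cho) / 2
    live_hack, live_clean = live_cos[live_is_hack], live_cos[~live_is_hack]
    return {
        "c_rej": c_rej,
        "c_cho": c_cho,
        "tau": tau,
        "hkgap": c_rej - c_cho,
        "pairs_ordered": bool(c_rej > tau > c_cho),
        # Does the pair midpoint sit between the live group means?
        "tau_separates_live": bool(live_clean.mean() < tau < live_hack.mean()),
        "hack_recall": float((live_hack > tau).mean()),
        "clean_false_route": float((live_clean > tau).mean()),
    }

# Made-up numbers showing the failure mode: the pairs are well separated,
# but live cosines are compressed, so tau lands above every live hack rollout.
report = calibration_report(
    cos_rej=[0.5, 0.6],
    cos_cho=[0.0, 0.1],
    live_cos=[0.20, 0.25, 0.00, 0.05],
    live_is_hack=[True, True, False, False],
)
```

Here `pairs_ordered` is true while `tau_separates_live` is false and `hack_recall` is 0, which is exactly the case the smoke test is meant to catch before a long run.
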
## What this buys

- Airtight no-cheat by construction: B never touched by any detector, so no
  `--gate-anchor-teacher-only` patch needed.
- The real-vs-random control becomes meaningful: if a random `vec` now suppresses B,
  it is pure coincidence, not labels. If only the real `vec` suppresses B, the
  direction genuinely generalizes, which is the whole novelty.
- Less code: delete the `hacked_E` plumbing, the `hack_anchor`/`clean_anchor`
  builder, the `--gate-anchor-teacher-only` flag, the EMA detector calibration.

## Implementation plan (src/vgrout/train.py): actionable, post-compaction

Replace route2's gate in place (research code, break it; tag `pre-routing-refactor`
is the rollback). Gradients, not activations, for `vec` (default; activation variant
deferred). `vec` sign = hack-ward = `rej - cho`.

1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
   `hack_anchor`/`clean_anchor` from teacher membership or the detector.
2. **Rewrite `_route2_grad_filter`** (~line 877):
   - drop the `hack_anchor |` force-route term -> `flagged = (cos_b > tau)`.
   - drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
   - `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
   existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
   `c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
   `tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair
   `g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means +
   tau alongside. Store `route2_tau[name]` from this, not from anchors.
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
   `hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if
   cheap, else drop); `route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher
   mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the
   gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged;
   `resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho`, and the pair midpoint
   sits between the live `cos_b` of hack and clean rollouts (the calibration smoke-check).

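The target shape of steps 2, 5 and 6 can be sketched in NumPy as below. This is not the actual `_route2_grad_filter`: `route_filter`, `RouteConfig` and `teacher_active` are hypothetical names, `vec` is assumed unit-norm, gradients are assumed flattened per rollout, and treating step 30 itself as teacher-off is an assumption about the boundary.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RouteConfig:
    teacher_off_step: int = 30   # teacher seeds emergence, then pure on-policy

def teacher_active(step: int, cfg: RouteConfig) -> bool:
    """Teacher rollouts are mixed in only before teacher_off_step."""
    return step < cfg.teacher_off_step

def route_filter(g: np.ndarray, vec: np.ndarray, tau: float):
    """Pure gate: no anchors, no EMA. g is (n_rollouts, dim); tau is passed in."""
    norms = np.maximum(np.linalg.norm(g, axis=1), 1e-12)
    cos_b = (g @ vec) / norms
    flagged = cos_b > tau                # the only routing criterion
    kept = g[~flagged].sum(axis=0)       # goes to the deployed adapter
    routed = g[flagged].sum(axis=0)      # goes to the quarantine adapter
    kept_norm = np.linalg.norm(kept)
    diagnostics = {
        "cos_b": cos_b,
        "tau": tau,
        "frac_flagged": float(flagged.mean()),
        # resid = cos(kept grad, vec); defined as 0 if nothing was kept.
        "resid": float(kept @ vec / kept_norm) if kept_norm > 0 else 0.0,
    }
    return kept, routed, diagnostics

# Tiny worked example with a 2-D direction.
cfg = RouteConfig()
vec = np.array([1.0, 0.0])
g = np.array([[1.0, 0.1], [-0.2, 1.0], [0.9, -0.3], [0.0, -0.5]])
kept, routed, diag = route_filter(g, vec, tau=0.5)
```

Teacher and student rows pass through the same call, so nothing in the function needs to know which rollouts came from the teacher.
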
## Current state: resume after compaction

- Working on **main** (`probe/distill-cosine`), NOT the worktree. Worktree
  `/workspace/projected_grpo-pairroute` (branch `refactor/pair-routing`) holds an
  earlier copy of this spec; ignore or `git worktree remove` it.
- **Queue is PAUSED** (`pueue pause`). Job 127 (`erase_realv`) was running when paused.
  `pueue start` resumes. Do NOT resume until the refactor is committed + smoked, or the
  queued route2/A5 jobs will run half-built code.
- Rollback tag: `pre-routing-refactor`. Job manifest: `docs/spec/20260606_job_manifest.md`.

## Queued-job disposition (decide before `pueue start`)

- **Superseded by this refactor (old route2 semantics) -> remove + requeue under new code**:
  124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only),
  130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
- **Still valid as-is (intervention=none / erase)**: 129 (vanilla-200 KL, A4),
  131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running).
  Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
- After the refactor, requeue the decisive new-method test: pair-routed real `vec` vs
  random `vec`, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.

## Smoke + UAT

- `just smoke` (route2 path) must pass on the tiny-random model after the rewrite.
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
  (no anchor) -> update or delete it.
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack < vanilla
  at matched solve, AND the calibration check holds (`c_rej > tau > c_cho`,
  pair tau sits between the live hack and clean `cos_b`).
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
  real-`vec` -- else the vector is still decorative and the method is just gradient
  routing on labels.