spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation

Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec),
route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec),
upper=mean cos(g_rej,vec). Below lower keep, above upper route, between =
absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the
real-vs-random discriminator (random vec closes the band) so no separate
matched-fraction control is needed; collapse flags vec degeneracy.

Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat):
mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band
transfers to the sampled live distribution. Also picks g_step granularity
(per-rollout default vs per-step). Held-out B never in validation.

Corrects the earlier wrong claim that component-routing collapses to erase
(pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Author: wassname
Date: 2026-06-06 02:16:38 +00:00
parent 53d88bc9ee
commit 180d3e862c
+132 -75
@@ -45,46 +45,42 @@ a self-test", and leaks ~1.1% onto held-out B.
## What we do INSTEAD (pair-routed)
Everything — direction AND threshold — comes from the pairs. No teacher force-route,
no live detector over students. Pure gate.
Everything — direction AND the two band edges — comes from the pairs. No teacher
force-route, no live detector over students. A BANDED cosine gate with an absorption
zone (full mechanism + validation are in the "Routing mechanism" section below):
```py
# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
for pair p:
g_rej[p] = logp(rej_p) # hack completion (forward+backward through δS)
g_cho[p] = logp(cho_p) # clean completion
Δ[p] = g_rej[p] - g_cho[p] # hack-ward per-pair diff
vec = unit(mean_p Δ[p]) # or SVD top-k of stacked Δ; per module, hack-ward
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward (or SVD top-k of stacked diffs)
lower = mean_p cos(g_cho[p], vec) # clean edge (low) — below: not hack, keep
upper = mean_p cos(g_rej[p], vec) # hack edge (high) — above: hack, route
# threshold from the SAME pairs — where hacks land vs where clean lands on `vec`:
c_rej = mean_p cos(g_rej[p], vec) # high (hacks point hack-ward)
c_cho = mean_p cos(g_cho[p], vec) # low (clean points away)
τ = (c_rej + c_cho) / 2 # midpoint; no detector, no teacher, no live label
# ── live, every step: pure gate, no labels ──
for rollout b:
g_b = per_rollout_δS_grad(b)
if cos(g_b, vec) > τ:
route(g_b) δS_hack # quarantine, deleted at deploy
else:
keep(g_b) δS # deployed knob
# ── live: absorption ramp, no labels ──
x = cos(g_step, vec)
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac * g_step # deleted at deploy
δS.grad += (1 - route_frac) * g_step # deployed knob; absorption in the middle
```
Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`);
its rollouts are NOT force-routed — they go through the same `cos > τ` gate as any
student rollout. After step 30 it is pure on-policy.
Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`,
see the journal-evidence note: student out-hacks teacher ~step 40, so 30 may be early;
40 is the safer value if emergence stalls). Its rollouts are NOT force-routed — they go
through the same band as any student rollout. After the cut it is pure on-policy.
## Now vs new — what changed
| | now (route2) | new (pair-routed) |
| | now (route2) | new (pair-routed band) |
|---|---|---|
| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint |
| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
| force-route | yes (`hack_anchor \|`) | none — gate only |
| live detector over students | yes (noisy, leaks onto B) | none |
| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — it is the only mechanism |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction |
| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
## Fork to decide in the rebuild: gradients vs activations for `vec`
@@ -127,28 +123,29 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
`hack_anchor`/`clean_anchor` from teacher membership or the detector.
2. **Rewrite `_route2_grad_filter`** (~line 877):
- drop the `hack_anchor |` force-route term -> gate is `cos_b > tau` only.
- drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
- `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
- route the vec-COMPONENT not the whole rollout (see Review-findings decision #3):
for a flagged rollout, `c = cos*vec` goes to `delta_S_hack`, the orthogonal
remainder stays in `delta_S`. Removes `rollout_ablate_frac`.
3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
2. **Rewrite `_route2_grad_filter`** (~line 877) into the banded gate:
- drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos`
detector calibration (~896-908). No hard `cos_b > tau`.
- `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
`δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper`
come from the pairs (step 3), passed in.
- granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the
calibration validation; default per-rollout (reuse the existing recovery hook).
`rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
`c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
`tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair
`g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means +
tau alongside. Store `route2_tau[name]` from this, not from anchors.
`lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
`hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if
cheap, else drop); `route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher
mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the
gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged;
`resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho` and pair midpoint
brackets the live `cos_b` of hack vs clean rollouts (the calibration smoke-check).
`hack_E_flags` feeding the gate (keep it for the calibration validation + streaming
hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
Teacher rollouts go through the same band (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal;
collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`,
`upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >>
mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat).
## Current state — resume after compaction
@@ -177,46 +174,106 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
(no anchor) -> update or delete it.
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
< vanilla at matched solve, AND the calibration check holds (`c_rej > tau > c_cho`,
pair tau brackets live `cos_b`).
< vanilla at matched solve, AND the calibration validation holds (`mean route_frac`
on live A-hack >> on live A-clean; band width `hkgap > 0`).
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
real-`vec` -- else the vector is still decorative and the method is just gradient
routing on labels.
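The science UAT can be made mechanical. The sketch below encodes the pre-registered criterion from the review findings (n>=3 seeds per condition; the real-`vec` mean must sit below the random-`vec` mean by more than the across-seed std of the random baseline). The function name `science_uat_passes` and the choice of the sample standard deviation are assumptions, and the numbers in the usage lines are made-up placeholders, not results.

```python
from statistics import mean, stdev

def science_uat_passes(real_hack, random_hack, min_seeds=3):
    """Pre-registered check on held-out-B deploy hack rates, one value per seed.

    Assumption: "across-seed std" is taken as the sample standard deviation.
    """
    if len(real_hack) < min_seeds or len(random_hack) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds per condition")
    margin = mean(random_hack) - mean(real_hack)  # how far real-vec sits below random-vec
    return margin > stdev(random_hack)

# Placeholder numbers for illustration only (NOT measured results).
clear_win = science_uat_passes([0.10, 0.12, 0.11], [0.30, 0.35, 0.25])
within_noise = science_uat_passes([0.28, 0.30, 0.29], [0.30, 0.35, 0.25])
print(clear_win, within_noise)
```

A gap smaller than the random baseline's own seed-to-seed spread fails the check, which is the point of pre-registering it: "suppresses better" has to beat the noise floor of the control.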
## Validation logic -- what proves the method works (read this before the gate)
The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
method works. Two separate checks below carry that weight.
- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode
A only, where the weak detector is allowed. Confirms the pair-set band actually separates
live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but
live rollouts are sampled" worry without ever touching held-out B.
- **Generalization test = deploy performance on held-out B** (never labelled, never in the
pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
- **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
collapse to ~0, the band closes, and routing degenerates to a cos>0 coin flip. So the
band WIDTH is itself the real-vs-random discriminator; no separate matched-fraction
control is needed.
The gap `upper - lower` is biased positive by construction (`vec` points along the mean
rej-cho diff), so the ordering `upper > lower` is not evidence. The band WIDTH and POSITION
are the empirical, load-bearing quantities, and the calibration validation tests them
against live data.
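To see why the band width separates a real `vec` from a random one, here is a toy NumPy sketch. Everything in it is an assumption for illustration: the gradients are synthetic Gaussians, hack gradients share one planted direction, and `mean_cos`/`band` are hypothetical helpers rather than functions from the codebase. It only demonstrates the geometric claim that a random unit vector in high dimension has near-zero cosine with both pair sets, so its band closes.

```python
import numpy as np

def mean_cos(G, vec):
    """Mean cosine between each row of G (n, dim) and vec (dim,)."""
    norms = np.linalg.norm(G, axis=1) * np.linalg.norm(vec)
    return float(np.mean(G @ vec / norms))

def band(g_rej, g_cho, vec):
    """Return (lower, upper, hkgap) for one module, as defined in the spec."""
    lower = mean_cos(g_cho, vec)   # clean edge
    upper = mean_cos(g_rej, vec)   # hack edge
    return lower, upper, upper - lower

rng = np.random.default_rng(0)
dim, n_pairs = 4096, 10

# Toy data: clean gradients are noise, hack gradients are noise plus a shared direction.
hack_dir = rng.normal(size=dim)
hack_dir /= np.linalg.norm(hack_dir)
g_cho = rng.normal(size=(n_pairs, dim))
g_rej = rng.normal(size=(n_pairs, dim)) + 40.0 * hack_dir

real_vec = (g_rej - g_cho).mean(axis=0)
real_vec /= np.linalg.norm(real_vec)
random_vec = rng.normal(size=dim)
random_vec /= np.linalg.norm(random_vec)

_, _, hkgap_real = band(g_rej, g_cho, real_vec)
_, _, hkgap_random = band(g_rej, g_cho, random_vec)
print(f"hkgap real={hkgap_real:.3f}  random={hkgap_random:.3f}")
```

On this toy data the real band is wide and the random band is close to zero. Whether the same holds for actual adapter gradients is what the real-vs-random control measures.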
## Routing mechanism — banded cosine gate with an absorption zone
The gate is a BAND, not a single threshold. Two edges, both measured from the pairs:
```py
# ── refresh every N steps: vec + band edges from the pairs, through the current adapter ──
for pair p:
g_rej[p], g_cho[p] = logp(rej_p), logp(cho_p) # δS-space, per module
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward
lower = mean_p cos(g_cho[p], vec) # where genuinely-CLEAN gradients land (low)
upper = mean_p cos(g_rej[p], vec) # where genuinely-HACK gradients land (high)
hkgap = upper - lower # band width = the load-bearing separation signal
# ── live: absorption ramp, pure gate, NO labels, NO force-route ──
x = cos(g_step, vec) # alignment of the live gradient with the hack dir
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac * g_step # x>=upper -> 1: fully quarantined (deleted)
δS.grad += (1 - route_frac) * g_step # x<=lower -> 0: fully kept (deployed)
# lower<x<upper: ABSORPTION, split between knobs
```
Three zones: below `lower` = not hack, keep; above `upper` = hack, route to the deletable
`δS_hack`; between = absorption, the gradient splits proportionally. The ramp softens the
solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of its signal
in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
though it can stay as a belt-and-braces solve guarantee.
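The three zones can be checked with a small runnable version of the ramp. This is a minimal sketch, not the `_route2_grad_filter` implementation: `route_frac` and `split_gradient` are hypothetical names, and the `eps` guard on the band width is an added assumption (the pseudocode divides by `upper - lower` directly, which is undefined if the band collapses).

```python
import numpy as np

def route_frac(x, lower, upper, eps=1e-8):
    """Fraction of the gradient routed to the quarantine knob.

    Assumption: the width is floored at eps so a collapsed band becomes a
    hard step at x > lower instead of a division by zero.
    """
    width = max(upper - lower, eps)
    return float(np.clip((x - lower) / width, 0.0, 1.0))

def split_gradient(g_step, vec, lower, upper):
    """Split g_step into (quarantined part, kept part, route_frac)."""
    x = float(g_step @ vec / (np.linalg.norm(g_step) * np.linalg.norm(vec) + 1e-12))
    f = route_frac(x, lower, upper)
    return f * g_step, (1.0 - f) * g_step, f

lower, upper = 0.0, 0.5
print(route_frac(-0.2, lower, upper))  # 0.0 -> below lower: fully kept
print(route_frac(0.25, lower, upper))  # 0.5 -> absorption: split evenly
print(route_frac(0.9, lower, upper))   # 1.0 -> above upper: fully quarantined
```

The two parts always sum back to `g_step`, so the ramp only decides where the gradient lands. It never rescales the total update.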
Open: `g_step` granularity -- PER-ROLLOUT (each rollout its own `x`/`route_frac`, needs the
per-rollout grad-recovery hook) vs PER-STEP (one `x` for the aggregated step gradient,
much simpler). The calibration validation below decides it: use whichever separates
live A-hack from A-clean `route_frac`. Default per-rollout (hook exists); fall back to
per-step if it separates as well.
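The two granularities can disagree even on a trivial batch. The sketch below assumes per-rollout gradients are available as the rows of a matrix `G`; the function names are hypothetical. It routes one hack-aligned rollout and one orthogonal rollout both ways.

```python
import numpy as np

def cos_rows(G, vec):
    """Cosine of each row of G (n, dim) with vec (dim,)."""
    return G @ vec / (np.linalg.norm(G, axis=1) * np.linalg.norm(vec) + 1e-12)

def route_per_rollout(G, vec, lower, upper):
    """Each rollout gets its own x and route_frac."""
    f = np.clip((cos_rows(G, vec) - lower) / max(upper - lower, 1e-8), 0.0, 1.0)
    hack = (f[:, None] * G).sum(axis=0)
    return hack, G.sum(axis=0) - hack

def route_per_step(G, vec, lower, upper):
    """One x and one route_frac for the aggregated step gradient."""
    g = G.sum(axis=0)
    x = cos_rows(g[None, :], vec)[0]
    f = float(np.clip((x - lower) / max(upper - lower, 1e-8), 0.0, 1.0))
    return f * g, (1.0 - f) * g

vec = np.array([1.0, 0.0])
G = np.array([[1.0, 0.0],    # rollout aligned with vec
              [0.0, 1.0]])   # rollout orthogonal to vec
hack_r, keep_r = route_per_rollout(G, vec, lower=0.0, upper=0.5)
hack_s, keep_s = route_per_step(G, vec, lower=0.0, upper=0.5)
print("per-rollout:", hack_r, keep_r)
print("per-step:   ", hack_s, keep_s)
```

In this toy case per-rollout routing quarantines only the aligned rollout, while per-step routing sees an aggregate cosine above `upper` and quarantines the orthogonal rollout's gradient too. That is the kind of difference the calibration validation is meant to expose on live data.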
## Calibration validation (no-cheat: known mode A only)
```py
# LIVE rollouts of the KNOWN mode A, labelled by the weak detector E (allowed for A):
A_hack = live A rollouts flagged hack by detector E
A_clean = live A rollouts not flagged
assert mean route_frac(A_hack) >> mean route_frac(A_clean) # band routes real live hacks
```
Held-out B is NEVER in this validation, so no-cheat holds by construction. If the
separation fails, the pair-set band does not transfer to live rollouts (the real
calibration risk) and we recalibrate the edges from a live-A quantile before trusting any
deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the
`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze `vec`
at a pre-routing snapshot.
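The `>>` in the pseudocode needs a concrete margin before it can be asserted. The sketch below is one way to write the check and the quantile fallback under stated assumptions: `min_gap=0.2` is an arbitrary placeholder threshold, the fallback takes the median of each group because the spec does not say which quantile to use, and both function names are hypothetical. Inputs are the live mode-A cosines `x = cos(g, vec)` and the detector-E flags.

```python
import numpy as np

def calibration_check(x_live, flagged_hack, lower, upper, min_gap=0.2):
    """Gap in mean route_frac between live A-hack and A-clean rollouts.

    min_gap is a placeholder threshold (assumption), standing in for '>>'.
    """
    x_live = np.asarray(x_live, dtype=float)
    flagged_hack = np.asarray(flagged_hack, dtype=bool)
    if flagged_hack.all() or not flagged_hack.any():
        raise ValueError("need at least one A-hack and one A-clean rollout")
    f = np.clip((x_live - lower) / max(upper - lower, 1e-8), 0.0, 1.0)
    gap = float(f[flagged_hack].mean() - f[~flagged_hack].mean())
    return gap, gap >= min_gap

def recalibrate_from_live(x_live, flagged_hack, q=0.5):
    """Fallback edges from live mode-A cosines (assumption: same quantile q per group)."""
    x_live = np.asarray(x_live, dtype=float)
    flagged_hack = np.asarray(flagged_hack, dtype=bool)
    lower = float(np.quantile(x_live[~flagged_hack], q))
    upper = float(np.quantile(x_live[flagged_hack], q))
    return lower, upper

# Toy cosines for illustration only (not measured values).
x_live = [0.05, 0.10, 0.00, 0.45, 0.60, 0.40]
flags = [False, False, False, True, True, True]
gap, ok = calibration_check(x_live, flags, lower=0.0, upper=0.5)
new_lower, new_upper = recalibrate_from_live(x_live, flags)
print(gap, ok, new_lower, new_upper)
```

Only mode-A rollouts enter either function, so the no-cheat constraint on held-out B is preserved by what is passed in, not by anything inside the code.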
## Review findings (2026-06-06) -- decisions before implementing
Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
Both converge on the same threshold problem; resolutions below are now part of the plan.
The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the
Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
Its "circular tau" framing is moot under the band: the edges are not a decision point and
the width is validated against live data, not asserted from the pairs.
1. **tau is circular, not just scale-mismatched.** Because `vec = mean(g_rej - g_cho)`,
the inequality `c_rej > c_cho` holds BY CONSTRUCTION even when `vec` is pure noise, so
the pair midpoint cannot validate that the gate separates anything. Separately, pair
gradients are teacher-forced while live rollouts are sampled, so the pair cosine scale
need not match the live `cos_b` scale; refreshing every N steps fixes adapter *drift*,
not this *distribution* gap.
- Decision: keep pair-midpoint tau as the no-extra-labels DEFAULT for the method, but
(a) compute a LEAVE-ONE-PAIR-OUT separation `c_rej^{-p} vs c_cho^{-p}` as the real
diagnostic that `vec` generalizes across pairs (cheap at ~10 pairs), and (b) for the
real-vs-random CONTROL, set tau by a QUANTILE of the live `cos_b` so the flagged
FRACTION is matched between conditions.
2. **Match the flagged fraction in the real-vs-random control (deepseek #2, kept).** Real
and random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
could be volume, not direction. For the control specifically, set both tau by the same
QUANTILE of the live `cos_b` so equal fractions route and only DIRECTION differs.
Suppression gap at matched fraction => direction is load-bearing.
2. **Match the flagged fraction in the real-vs-random control (deepseek #2).** Real and
random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
could be volume, not direction. The quantile-tau in 1(b) controls this: equal fraction
routed, only the DIRECTION differs. Suppression gap at matched fraction => direction is
load-bearing.
3. **Whole-rollout routing + exploration floor (Claude, corrected).** Keep route2's
whole-rollout quarantine (it is the gradient-routing thesis); keep `rollout_ablate_frac`
for solve signal. Component-routing was rejected because it collapses to `erase` at
deploy (see Routing mechanism above).
3. **Route the vec-COMPONENT, not the whole rollout (Claude).** The route2 pseudocode
quarantined a flagged rollout's entire `delta_S` gradient, which also strips its solve
signal (solve-starvation on problems only solved-by-hacking). Decision: subtract the
`cos*vec` component into `delta_S_hack` and keep the orthogonal remainder in `delta_S`
(erase-style projection, routed not erased). Drops the need for `rollout_ablate_frac`.
4. **Degeneracy diagnostic (deepseek #3).** As routing suppresses hacks, the hack-pair
4. **Degeneracy diagnostic (deepseek #3, kept).** As routing suppresses hacks, the hack-pair
gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho`
per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot.
5. **Pre-register the science UAT (deepseek).** n>=3 seeds per condition (real/random),
success = mean held-out-B deploy hack under real-`vec` is below random-`vec` by more
than the across-seed std of the random baseline. Qualitative "suppresses better" is
not enough.
5. **Pre-register the science UAT (deepseek, kept; user-confirmed).** n>=3 seeds per
condition (real/random), success = mean held-out-B deploy hack under real-`vec` is below
random-`vec` by more than the across-seed std of the random baseline. Qualitative
"suppresses better" is not enough.