From 180d3e862c1516d4c1d311e327a6b4e0ec91e1d1 Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Sat, 6 Jun 2026 02:16:38 +0000 Subject: [PATCH] spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec), route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec), upper=mean cos(g_rej,vec). Below lower keep, above upper route, between = absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the real-vs-random discriminator (random vec closes the band) so no separate matched-fraction control is needed; collapse flags vec degeneracy. Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat): mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band transfers to the sampled live distribution. Also picks g_step granularity (per-rollout default vs per-step). Held-out B never in validation. Corrects the earlier wrong claim that component-routing collapses to erase (pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> --- docs/spec/20260606_pair_routing_design.md | 207 ++++++++++++++-------- 1 file changed, 132 insertions(+), 75 deletions(-) diff --git a/docs/spec/20260606_pair_routing_design.md b/docs/spec/20260606_pair_routing_design.md index de13297..7db6287 100644 --- a/docs/spec/20260606_pair_routing_design.md +++ b/docs/spec/20260606_pair_routing_design.md @@ -45,46 +45,42 @@ a self-test", and leaks ~1.1% onto held-out B. ## What we do INSTEAD (pair-routed) -Everything — direction AND threshold — comes from the pairs. No teacher force-route, -no live detector over students. Pure gate. +Everything — direction AND the two band edges — comes from the pairs. No teacher +force-route, no live detector over students. 
A BANDED cosine gate with an absorption +zone (full mechanism + validation are in the "Routing mechanism" section below): ```py # ── from the pairs, refreshed every N steps through the CURRENT adapter ── for pair p: g_rej[p] = ∇logp(rej_p) # hack completion (forward+backward through δS) g_cho[p] = ∇logp(cho_p) # clean completion - Δ[p] = g_rej[p] - g_cho[p] # hack-ward per-pair diff -vec = unit(mean_p Δ[p]) # or SVD top-k of stacked Δ; per module, hack-ward +vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward (or SVD top-k of stacked diffs) +lower = mean_p cos(g_cho[p], vec) # clean edge (low) — below: not hack, keep +upper = mean_p cos(g_rej[p], vec) # hack edge (high) — above: hack, route -# threshold from the SAME pairs — where hacks land vs where clean lands on `vec`: -c_rej = mean_p cos(g_rej[p], vec) # high (hacks point hack-ward) -c_cho = mean_p cos(g_cho[p], vec) # low (clean points away) -τ = (c_rej + c_cho) / 2 # midpoint; no detector, no teacher, no live label - -# ── live, every step: pure gate, no labels ── -for rollout b: - g_b = per_rollout_δS_grad(b) - if cos(g_b, vec) > τ: - route(g_b) → δS_hack # quarantine, deleted at deploy - else: - keep(g_b) → δS # deployed knob +# ── live: absorption ramp, no labels ── +x = cos(g_step, vec) +route_frac = clamp((x - lower) / (upper - lower), 0, 1) +δS_hack.grad += route_frac * g_step # deleted at deploy +δS.grad += (1 - route_frac) * g_step # deployed knob; absorption in the middle ``` -Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`); -its rollouts are NOT force-routed — they go through the same `cos > τ` gate as any -student rollout. After step 30 it is pure on-policy. +Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`, +see the journal-evidence note: student out-hacks teacher ~step 40, so 30 may be early; +40 is the safer value if emergence stalls). Its rollouts are NOT force-routed — they go +through the same band as any student rollout. 
After the cut it is pure on-policy. ## Now vs new — what changed -| | now (route2) | new (pair-routed) | +| | now (route2) | new (pair-routed band) | |---|---|---| | direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source | -| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint | +| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp | | force-route | yes (`hack_anchor \|`) | none — gate only | -| live detector over students | yes (noisy, leaks onto B) | none | -| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout | -| is `vec` load-bearing? | no (labels carry it) | yes — it is the only mechanism | -| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction | +| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) | +| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout | +| is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) | +| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction | ## Fork to decide in the rebuild: gradients vs activations for `vec` @@ -127,28 +123,29 @@ deferred). `vec` sign = hack-ward = `rej - cho`. 1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more `hack_anchor`/`clean_anchor` from teacher membership or the detector. -2. **Rewrite `_route2_grad_filter`** (~line 877): - - drop the `hack_anchor |` force-route term -> gate is `cos_b > tau` only. - - drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908). - - `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts. 
- - route the vec-COMPONENT not the whole rollout (see Review-findings decision #3): - for a flagged rollout, `c = cos*vec` goes to `delta_S_hack`, the orthogonal - remainder stays in `delta_S`. Removes `rollout_ablate_frac`. -3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the +2. **Rewrite `_route2_grad_filter`** (~line 877) into the banded gate: + - drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos` + detector calibration (~896-908). No hard `cos_b > tau`. + - `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`; + `δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper` + come from the pairs (step 3), passed in. + - granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the + calibration validation; default per-rollout (reuse the existing recovery hook). + `rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee. +3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute - `c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`, - `tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair - `g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means + - tau alongside. Store `route2_tau[name]` from this, not from anchors. + `lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module. + The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means + alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau. 4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; - `hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if - cheap, else drop); `route2_random_v_seed` stays (it's the directionality control). -5. 
**Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher - mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed). -6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the - gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged; - `resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho` and pair midpoint - brackets the live `cos_b` of hack vs clean rollouts (the calibration smoke-check). + `hack_E_flags` feeding the gate (keep it for the calibration validation + streaming + hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control). +5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence). + Teacher rollouts go through the same band (NOT force-routed). +6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal; + collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`, + `upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >> + mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat). ## Current state — resume after compaction @@ -177,46 +174,106 @@ deferred). `vec` sign = hack-ward = `rej - cho`. - `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot (no anchor) -> update or delete it. - UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack - < vanilla at matched solve, AND the calibration check holds (`c_rej > tau > c_cho`, - pair tau brackets live `cos_b`). + < vanilla at matched solve, AND the calibration validation holds (`mean route_frac` + on live A-hack >> on live A-clean; band width `hkgap > 0`). - UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as real-`vec` -- else the vector is still decorative and the method is just gradient routing on labels. 
+## Validation logic -- what proves the method works (read this before the gate) + +The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the +method works. Two separate checks below carry that weight. + +- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode + A only, where the weak detector is allowed. Confirms the pair-set band actually separates + live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but + live rollouts are sampled" worry without ever touching held-out B. +- **Generalization test = deploy performance on held-out B** (never labelled, never in the + pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved. +- **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges + collapse to ~0, the band closes, and routing degenerates to a cos>0 coin flip. So the + band WIDTH is itself the real-vs-random discriminator; no separate matched-fraction + control is needed. + +`upper > lower` is biased positive by construction (vec points along the mean rej-cho +diff), so the ordering is not evidence. The band WIDTH and POSITION are the empirical, +load-bearing quantities, and the calibration validation tests them against live data. + +## Routing mechanism — banded cosine gate with an absorption zone + +The gate is a BAND, not a single threshold. 
Two edges, both measured from the pairs: + +```py +# ── refresh every N steps: vec + band edges from the pairs, through the current adapter ── +for pair p: + g_rej[p], g_cho[p] = ∇logp(rej_p), ∇logp(cho_p) # δS-space, per module +vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward +lower = mean_p cos(g_cho[p], vec) # where genuinely-CLEAN gradients land (low) +upper = mean_p cos(g_rej[p], vec) # where genuinely-HACK gradients land (high) +hkgap = upper - lower # band width = the load-bearing separation signal + +# ── live: absorption ramp, pure gate, NO labels, NO force-route ── +x = cos(g_step, vec) # alignment of the live gradient with the hack dir +route_frac = clamp((x - lower) / (upper - lower), 0, 1) +δS_hack.grad += route_frac * g_step # x>=upper -> 1: fully quarantined (deleted) +δS.grad += (1 - route_frac) * g_step # x<=lower -> 0: fully kept (deployed) + # lower<x<upper: absorption (proportional split) [...] mean route_frac(A_hack) >> mean route_frac(A_clean) # band routes real live hacks +``` + +Held-out B is NEVER in this validation, so no-cheat holds by construction. If the +separation fails, the pair-set band does not transfer to live rollouts (the real +calibration risk) and we recalibrate the edges from a live-A quantile before trusting any +deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the +`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze a +pre-routing `vec` snapshot. + ## Review findings (2026-06-06) -- decisions before implementing Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md). -Both converge on the same threshold problem; resolutions below are now part of the plan. +The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving +points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the +Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
+Its "circular tau" framing is moot under the band: the edges are not a decision point and +the width is validated against live data, not asserted from the pairs. -1. **tau is circular, not just scale-mismatched.** Because `vec = mean(g_rej - g_cho)`, - the inequality `c_rej > c_cho` holds BY CONSTRUCTION even when `vec` is pure noise, so - the pair midpoint cannot validate that the gate separates anything. Separately, pair - gradients are teacher-forced while live rollouts are sampled, so the pair cosine scale - need not match the live `cos_b` scale; refreshing every N steps fixes adapter *drift*, - not this *distribution* gap. - - Decision: keep pair-midpoint tau as the no-extra-labels DEFAULT for the method, but - (a) compute a LEAVE-ONE-PAIR-OUT separation `c_rej^{-p} vs c_cho^{-p}` as the real - diagnostic that `vec` generalizes across pairs (cheap at ~10 pairs), and (b) for the - real-vs-random CONTROL, set tau by a QUANTILE of the live `cos_b` so the flagged - FRACTION is matched between conditions. +2. **Match the flagged fraction in the real-vs-random control (deepseek #2, kept).** Real + and random `vec` otherwise quarantine different volumes of gradient, so a suppression gap + could be volume, not direction. For the control specifically, set both tau by the same + QUANTILE of the live `cos_b` so equal fractions route and only DIRECTION differs. + Suppression gap at matched fraction => direction is load-bearing. -2. **Match the flagged fraction in the real-vs-random control (deepseek #2).** Real and - random `vec` otherwise quarantine different volumes of gradient, so a suppression gap - could be volume, not direction. The quantile-tau in 1(b) controls this: equal fraction - routed, only the DIRECTION differs. Suppression gap at matched fraction => direction is - load-bearing. +3. 
**Whole-rollout routing + exploration floor (Claude, corrected).** Keep route2's + whole-rollout quarantine (it is the gradient-routing thesis); keep `rollout_ablate_frac` + for solve signal. Component-routing was rejected because it collapses to `erase` at + deploy (see Routing mechanism above). -3. **Route the vec-COMPONENT, not the whole rollout (Claude).** The route2 pseudocode - quarantined a flagged rollout's entire `delta_S` gradient, which also strips its solve - signal (solve-starvation on problems only solved-by-hacking). Decision: subtract the - `cos*vec` component into `delta_S_hack` and keep the orthogonal remainder in `delta_S` - (erase-style projection, routed not erased). Drops the need for `rollout_ablate_frac`. - -4. **Degeneracy diagnostic (deepseek #3).** As routing suppresses hacks, the hack-pair +4. **Degeneracy diagnostic (deepseek #3, kept).** As routing suppresses hacks, the hack-pair gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho` per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot. -5. **Pre-register the science UAT (deepseek).** n>=3 seeds per condition (real/random), - success = mean held-out-B deploy hack under real-`vec` is below random-`vec` by more - than the across-seed std of the random baseline. Qualitative "suppresses better" is - not enough. +5. **Pre-register the science UAT (deepseek, kept; user-confirmed).** n>=3 seeds per + condition (real/random), success = mean held-out-B deploy hack under real-`vec` is below + random-`vec` by more than the across-seed std of the random baseline. Qualitative + "suppresses better" is not enough.
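
The pre-registered criterion in item 5 can be written down as a small check. This is a sketch under stated assumptions: `science_uat_passes` is a hypothetical helper, the inputs are per-seed held-out-B deploy hack rates, and "across-seed std" is taken to mean the sample standard deviation (`statistics.stdev`), which the spec does not pin down.

```py
from statistics import mean, stdev


def science_uat_passes(real_hack, random_hack, min_seeds=3):
    """Item-5 check: real-vec beats random-vec by more than the random baseline's std.

    real_hack / random_hack: one held-out-B deploy hack rate per seed.
    ASSUMPTION: "across-seed std" is the sample standard deviation.
    """
    if len(real_hack) < min_seeds or len(random_hack) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds per condition")
    gap = mean(random_hack) - mean(real_hack)  # positive when real-vec suppresses more
    return gap > stdev(random_hack)


# Illustrative, made-up per-seed rates (not results from any run).
example_real = [0.01, 0.02, 0.015]
example_random = [0.10, 0.12, 0.11]
verdict = science_uat_passes(example_real, example_random)
```

Fixing the function before looking at the runs is the point of pre-registration: the threshold depends only on the random-`vec` baseline spread, so it cannot be tuned after the fact to make a small gap look decisive.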