From a83953131e0669e047285f918b844300e8d292a0 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Sat, 6 Jun 2026 02:23:58 +0000
Subject: [PATCH] spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics

Validation removed: running the weak detector over student rollouts at
train time is the no-cheat violation, and a live validation is
complex/non-causal. Causal proof stays downstream (deploy perf +
real-vs-random). Train-time only LOGs label-free gauges:
hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold
generalize to a second pair' test), live cos_b percentiles vs
[lower,upper] (calibration read with no labels), route_frac mass at 0/1,
resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate
grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the
cos_b>tau line for the ramp. Backed by the papers: Gradient Routing
(Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025)
per-example, label-noise-robust. Both route by a DATA-LABEL mask; we
route by gradient ALIGNMENT to an extracted direction -- that's the
novelty. Borrow their 'absorption' as the mechanism justifying A->B
generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 docs/spec/20260606_pair_routing_design.md | 109 +++++++++++++++-------
 1 file changed, 75 insertions(+), 34 deletions(-)

diff --git a/docs/spec/20260606_pair_routing_design.md b/docs/spec/20260606_pair_routing_design.md
index 7db6287..9e60a47 100644
--- a/docs/spec/20260606_pair_routing_design.md
+++ b/docs/spec/20260606_pair_routing_design.md
@@ -129,23 +129,23 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 - `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
 `δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper` come from
 the pairs (step 3), passed in.
- - granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the
- calibration validation; default per-rollout (reuse the existing recovery hook).
+ - granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing
+ token-sum -> `g_b = cg/dS` recovery, just swap the `cos_b > tau` line for the ramp.
 `rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
 3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
 existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
 `lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
 The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
 alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
-4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
- `hack_E_flags` feeding the gate (keep it for the calibration validation + streaming
- hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control).
+4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
+ `hack_E_flags` feed into the GATE (drop -- no detector touches student rollouts at train
+ time now; keep `hack_E_flags` only if still cheap for the streaming hk_* LOG columns).
+ `route2_random_v_seed` stays (it's the directionality control).
 5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
 Teacher rollouts go through the same band (NOT force-routed).
-6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal;
- collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`,
- `upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >>
- mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat).
+6. **Diagnostics to print** (all label-free, see "Cheap, label-free diagnostics"):
+ `hkgap = upper - lower`; LOO pair separation; live `cos_b` percentiles vs `[lower,upper]`;
+ `route_frac` mean + mass-at-0 + mass-at-1; `resid = cos(g_keep, vec)`.

 ## Current state — resume after compaction

@@ -174,8 +174,8 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 - `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot (no
 anchor) -> update or delete it.
 - UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
- < vanilla at matched solve, AND the calibration validation holds (`mean route_frac`
- on live A-hack >> on live A-clean; band width `hkgap > 0`).
+ < vanilla at matched solve, AND the label-free diagnostics are healthy (band width
+ `hkgap > 0`, LOO separation > 0, live `cos_b` straddles `[lower,upper]`, `resid ~ 0`).
 - UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
 real-`vec` -- else the vector is still decorative and the method is just gradient
 routing on labels.
@@ -185,10 +185,10 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
 method works. Two separate checks below carry that weight.

-- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode
- A only, where the weak detector is allowed. Confirms the pair-set band actually separates
- live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but
- live rollouts are sampled" worry without ever touching held-out B.
+- **Calibration is read, not validated** -- cheap label-free gauges (live `cos_b` vs the
+ band, `hkgap`, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate
+ is alive and the band is calibrated, WITHOUT running any detector over students. No live
+ detector validation (that would be the cheat).
 - **Generalization test = deploy performance on held-out B** (never labelled, never in
 the pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
 - **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
@@ -227,34 +227,75 @@ solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of i
 in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
 though it can stay as a belt-and-braces solve guarantee.

-Open: `g_step` granularity -- PER-ROLLOUT (each rollout its own `x`/`route_frac`, needs the
-per-rollout grad-recovery hook) vs PER-STEP (one `x` for the aggregated step gradient,
-much simpler). The calibration validation below decides it: use whichever separates
-live A-hack from A-clean `route_frac`. Default per-rollout (hook exists); fall back to
-per-step if it separates as well.
+## Granularity: per-rollout (`g_step` = per-rollout δS grad)

-## Calibration validation (no-cheat: known mode A only)
+Decided, with paper backing -- not left open.

-```py
-# LIVE rollouts of the KNOWN mode A, labelled by the weak detector E (allowed for A):
-A_hack = live A rollouts flagged hack by detector E
-A_clean = live A rollouts not flagged
-assert mean route_frac(A_hack) >> mean route_frac(A_clean) # band routes real live hacks
+- **What we do now** (train.py:881-896, `_route2_grad_filter`): the baukit gate `c` is
+ per-TOKEN (`[G*s, r]`, since nn.Linear sees a flattened batch). We SUM each rollout's
+ token gate-grads -> `[G, r]`, divide by `δS` to recover the per-rollout knob grad `g_b`,
+ and take one `cos_b` per rollout. So the live unit is already PER-ROLLOUT. The recovery
+ hook exists; the band just replaces the `cos_b > tau` line with the ramp.
+- **Gradient Routing** (Cloud et al. 2024): data-dependent stop-gradient masks at a few
+ layers' activations (`x = mask*act + (1-mask)*act.detach()`). For LLMs the mask is
+ per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations
+ b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal
+ state). So the RL-native unit there is the trajectory.
+- **SGTM** (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain
+ examples only update their dedicated params). Its contribution is robustness to LABEL
+ NOISE, not a new granularity.
+
+Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's
+per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers
+route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's
+ALIGNMENT with an extracted direction (`cos(g_b, vec)` in the band) -- no per-example label.
+That direction-gated routing is the novelty vs both; state it as such.
+
+Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak
+label localizes the capability MORE generally, because the routed region participates in
+predictions on related non-routed data so the feature is not learned elsewhere. That is the
+mechanism that would let routing on known A suppress unknown B; it is the theoretical basis
+for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means
+proportional split; same word, different thing.)
+
+## Cheap, label-free diagnostics (validation dropped)
+
+We are NOT running a live detector validation. Running the weak detector over the student's
+own rollouts during training is on the wrong side of the no-cheat line (README: that is
+exactly the cheat), and a live validation is complex and non-causal. The causal proof is
+downstream (deploy performance + real-vs-random). During training we only LOG cheap,
+label-free gauges (ml-debug: log everything, state the expected value and what a deviation
+means, chase anomalies):
+
+```
+SHOULD per refresh: hkgap = upper - lower > 0 and roughly stable.
+ ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
+ a pre-routing vec snapshot.
+
+SHOULD per refresh: LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
+ (band built on the OTHER pairs still separates the held-out pair -- "does the threshold
+ generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.
+
+SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
+ ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
+ routes everything (miscalibrated high). This is the calibration read with NO labels.
+
+SHOULD per step: route_frac mean in (0,1), with some mass at 0 and some at 1.
+ ELSE all-0 or all-1 = degenerate gate.
+
+SHOULD per step: resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
+ ELSE >0 = hack-ward grad leaking into δS (the real failure).
 ```

-Held-out B is NEVER in this validation, so no-cheat holds by construction. If the
-separation fails, the pair-set band does not transfer to live rollouts (the real
-calibration risk) and we recalibrate the edges from a live-A quantile before trusting any
-deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the
-`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze a
-pre-routing `vec` snapshot.
+None of these touch held-out B or run a detector over students; they read the band, the
+pairs, and the live cosine geometry only.

 ## Review findings (2026-06-06) -- decisions before implementing

 Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
 The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
-points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the
-Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
+points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band
+diagnostic above (no labels); vec degeneracy -> the `hkgap` collapse check.
 Its "circular tau" framing is moot under the band: the edges are not a decision point and
 the width is validated against live data, not asserted from the pairs.
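
Illustration only, not part of the patch: the band ramp and two of the per-refresh gauges the spec describes (`hkgap`, LOO separation) can be sketched in plain Python. This is a minimal sketch under assumptions. Gradients are plain lists of floats, `vec` is taken as the mean of `g_rej - g_cho` over the pairs (the spec only fixes the sign, `rej - cho`), a collapsed band is assumed to route nothing, and the helper names `cos`, `hack_vec`, `pair_band`, `ramp`, `split_grad`, and `loo_separation` are hypothetical, not functions in `train.py`.

```python
import math
from statistics import mean


def cos(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def hack_vec(g_cho, g_rej, skip=None):
    """ASSUMPTION: vec = mean over pairs of (g_rej - g_cho), optionally leaving one pair out."""
    pairs = [p for p in range(len(g_cho)) if p != skip]
    dim = len(g_cho[0])
    return [mean(g_rej[p][d] - g_cho[p][d] for p in pairs) for d in range(dim)]


def pair_band(g_cho, g_rej, vec):
    """Band edges from the pairs: lower = mean cos(g_cho, vec), upper = mean cos(g_rej, vec)."""
    lower = mean(cos(g, vec) for g in g_cho)
    upper = mean(cos(g, vec) for g in g_rej)
    return lower, upper


def ramp(x, lower, upper):
    """route_frac = clamp((x - lower) / (upper - lower), 0, 1)."""
    width = upper - lower
    if width <= 0:
        return 0.0  # ASSUMPTION: a collapsed band routes nothing
    return min(1.0, max(0.0, (x - lower) / width))


def split_grad(g, frac):
    """Return (frac * g, (1 - frac) * g): the dS_hack share and the dS share."""
    return [frac * v for v in g], [(1.0 - frac) * v for v in g]


def loo_separation(g_cho, g_rej):
    """mean_p [cos(g_rej[p], vec_-p) - cos(g_cho[p], vec_-p)], vec_-p built without pair p."""
    gaps = []
    for p in range(len(g_cho)):
        vec_minus_p = hack_vec(g_cho, g_rej, skip=p)
        gaps.append(cos(g_rej[p], vec_minus_p) - cos(g_cho[p], vec_minus_p))
    return mean(gaps)


# Toy per-pair gradients (made-up numbers: two pairs, two dimensions).
g_cho = [[1.0, 0.0], [0.9, 0.1]]
g_rej = [[0.0, 1.0], [0.1, 0.9]]

vec = hack_vec(g_cho, g_rej)
lower, upper = pair_band(g_cho, g_rej, vec)
hkgap = upper - lower

# One live rollout gradient g_b gets one cosine and one route_frac.
g_b = [0.5, 0.5]
frac = ramp(cos(g_b, vec), lower, upper)
g_hack, g_keep = split_grad(g_b, frac)

print(f"hkgap={hkgap:.3f} loo={loo_separation(g_cho, g_rej):.3f} route_frac={frac:.3f}")
```

In this sketch `hkgap` and `loo_separation` would be recomputed at each `vec` refresh, while `ramp` runs once per rollout. None of it needs a label on the live rollout, which is the property the spec relies on.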