spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics

Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random).

Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp.

Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-06 02:23:58 +00:00
parent 180d3e862c
commit a83953131e
1 changed files with 75 additions and 34 deletions
@@ -129,23 +129,23 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
   - `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
     `δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper`
     come from the pairs (step 3), passed in.
-   - granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the
-     calibration validation; default per-rollout (reuse the existing recovery hook).
+   - granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing
+     token-sum -> `g_b = cg/dS` recovery, just swap the `cos_b > tau` line for the ramp.
     `rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
 3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
   existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
   `lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
   The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
   alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
-4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
-   `hack_E_flags` feeding the gate (keep it for the calibration validation + streaming
-   hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control).
+4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
+   `hack_E_flags` feed into the GATE (drop -- no detector touches student rollouts at train
+   time now; keep `hack_E_flags` only if still cheap for the streaming hk_* LOG columns).
+   `route2_random_v_seed` stays (it's the directionality control).
 5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
   Teacher rollouts go through the same band (NOT force-routed).
-6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal;
-   collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`,
-   `upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >>
-   mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat).
+6. **Diagnostics to print** (all label-free, see "Cheap, label-free diagnostics"):
+   `hkgap = upper - lower`; LOO pair separation; live `cos_b` percentiles vs `[lower,upper]`;
+   `route_frac` mean + mass-at-0 + mass-at-1; `resid = cos(g_keep, vec)`.

 ## Current state — resume after compaction

@@ -174,8 +174,8 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 - `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
  (no anchor) -> update or delete it.
 - UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
-  < vanilla at matched solve, AND the calibration validation holds (`mean route_frac`
-  on live A-hack >> on live A-clean; band width `hkgap > 0`).
+  < vanilla at matched solve, AND the label-free diagnostics are healthy (band width
+  `hkgap > 0`, LOO separation > 0, live `cos_b` straddles `[lower,upper]`, `resid ~ 0`).
 - UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
  real-`vec` -- else the vector is still decorative and the method is just gradient
  routing on labels.
@@ -185,10 +185,10 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
 method works. Two separate checks below carry that weight.

-- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode
-  A only, where the weak detector is allowed. Confirms the pair-set band actually separates
-  live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but
-  live rollouts are sampled" worry without ever touching held-out B.
+- **Calibration is read, not validated** -- cheap label-free gauges (live `cos_b` vs the
+  band, `hkgap`, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate
+  is alive and the band is calibrated, WITHOUT running any detector over students. No live
+  detector validation (that would be the cheat).
 - **Generalization test = deploy performance on held-out B** (never labelled, never in the
  pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
 - **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
@@ -227,34 +227,75 @@ solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of i
 in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
 though it can stay as a belt-and-braces solve guarantee.

-Open: `g_step` granularity -- PER-ROLLOUT (each rollout its own `x`/`route_frac`, needs the
-per-rollout grad-recovery hook) vs PER-STEP (one `x` for the aggregated step gradient,
-much simpler). The calibration validation below decides it: use whichever separates
-live A-hack from A-clean `route_frac`. Default per-rollout (hook exists); fall back to
-per-step if it separates as well.
+## Granularity: per-rollout (`g_step` = per-rollout δS grad)

-## Calibration validation (no-cheat: known mode A only)
+Decided, with paper backing -- not left open.

-```py
-# LIVE rollouts of the KNOWN mode A, labelled by the weak detector E (allowed for A):
-A_hack  = live A rollouts flagged hack by detector E
-A_clean = live A rollouts not flagged
-assert mean route_frac(A_hack)  >> mean route_frac(A_clean)   # band routes real live hacks
+- **What we do now** (train.py:881-896, `_route2_grad_filter`): the baukit gate `c` is
+  per-TOKEN (`[G*s, r]`, since nn.Linear sees a flattened batch). We SUM each rollout's
+  token gate-grads -> `[G, r]`, divide by `δS` to recover the per-rollout knob grad `g_b`,
+  and take one `cos_b` per rollout. So the live unit is already PER-ROLLOUT. The recovery
+  hook exists; the band just replaces the `cos_b > tau` line with the ramp.
+- **Gradient Routing** (Cloud et al. 2024): data-dependent stop-gradient masks at a few
+  layers' activations (`x = mask*act + (1-mask)*act.detach()`). For LLMs the mask is
+  per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations
+  b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal
+  state). So the RL-native unit there is the trajectory.
+- **SGTM** (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain
+  examples only update their dedicated params). Its contribution is robustness to LABEL
+  NOISE, not a new granularity.
+
+Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's
+per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers
+route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's
+ALIGNMENT with an extracted direction (`cos(g_b, vec)` in the band) -- no per-example label.
+That direction-gated routing is the novelty vs both; state it as such.
+
+Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak
+label localizes the capability MORE generally, because the routed region participates in
+predictions on related non-routed data so the feature is not learned elsewhere. That is the
+mechanism that would let routing on known A suppress unknown B; it is the theoretical basis
+for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means
+proportional split; same word, different thing.)
+
+## Cheap, label-free diagnostics (validation dropped)
+
+We are NOT running a live detector validation. Running the weak detector over the student's
+own rollouts during training is on the wrong side of the no-cheat line (README: that is
+exactly the cheat), and a live validation is complex and non-causal. The causal proof is
+downstream (deploy performance + real-vs-random). During training we only LOG cheap,
+label-free gauges (ml-debug: log everything, state the expected value and what a deviation
+means, chase anomalies):
+
+```
+SHOULD per refresh:  hkgap = upper - lower  > 0 and roughly stable.
+  ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
+  a pre-routing vec snapshot.
+
+SHOULD per refresh:  LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
+  (band built on the OTHER pairs still separates the held-out pair -- "does the threshold
+  generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.
+
+SHOULD per step:  live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
+  ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
+  routes everything (miscalibrated high). This is the calibration read with NO labels.
+
+SHOULD per step:  route_frac mean in (0,1), with some mass at 0 and some at 1.
+  ELSE all-0 or all-1 = degenerate gate.
+
+SHOULD per step:  resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
+  ELSE >0 = hack-ward grad leaking into δS (the real failure).
 ```

-Held-out B is NEVER in this validation, so no-cheat holds by construction. If the
-separation fails, the pair-set band does not transfer to live rollouts (the real
-calibration risk) and we recalibrate the edges from a live-A quantile before trusting any
-deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the
-`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze a
-pre-routing `vec` snapshot.
+None of these touch held-out B or run a detector over students; they read the band, the
+pairs, and the live cosine geometry only.

 ## Review findings (2026-06-06) -- decisions before implementing

 Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
 The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
-points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the
-Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
+points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band
+diagnostic above (no labels); vec degeneracy -> the `hkgap` collapse check.
 Its "circular tau" framing is moot under the band: the edges are not a decision point and
 the width is validated against live data, not asserted from the pairs.
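
The band, the ramp, and three of the label-free gauges from the spec can be sketched in plain Python. This is a minimal illustration and not the code in train.py: the helper names (`build_vec`, `band`, `split_grad`, `loo_separation`, `gate_stats`) are hypothetical, `vec` is ASSUMED to be the mean of `rej - cho` over pairs (the spec only fixes its sign), and the "straddle" test is one possible reading of the p10/p90 check. `resid` is left out because `g_keep` is not defined in this diff.

```py
import math

def cos(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def build_vec(g_rej, g_cho):
    """ASSUMPTION: vec = mean over pairs of (rej - cho), i.e. pointing hack-ward."""
    n, dim = len(g_rej), len(g_rej[0])
    return [sum(g_rej[p][i] - g_cho[p][i] for p in range(n)) / n for i in range(dim)]

def band(g_rej, g_cho, vec):
    """lower = mean_p cos(g_cho[p], vec), upper = mean_p cos(g_rej[p], vec)."""
    lower = sum(cos(g, vec) for g in g_cho) / len(g_cho)
    upper = sum(cos(g, vec) for g in g_rej) / len(g_rej)
    return lower, upper

def route_frac(x, lower, upper):
    """Linear ramp clamp((x - lower) / (upper - lower), 0, 1)."""
    if upper <= lower:
        return 0.0  # ASSUMPTION: a collapsed band (hkgap <= 0) routes nothing
    return min(1.0, max(0.0, (x - lower) / (upper - lower)))

def split_grad(g, vec, lower, upper):
    """Return (part added to δS_hack.grad, part added to δS.grad) for one rollout."""
    f = route_frac(cos(g, vec), lower, upper)
    return [f * gi for gi in g], [(1.0 - f) * gi for gi in g]

def loo_separation(g_rej, g_cho):
    """mean_p [cos(g_rej[p], vec_{-p}) - cos(g_cho[p], vec_{-p})], vec_{-p} built without pair p."""
    n = len(g_rej)
    total = 0.0
    for p in range(n):
        v = build_vec(g_rej[:p] + g_rej[p + 1:], g_cho[:p] + g_cho[p + 1:])
        total += cos(g_rej[p], v) - cos(g_cho[p], v)
    return total / n

def percentile(xs, q):
    """Simple index-based percentile on the sorted values (no interpolation)."""
    s = sorted(xs)
    return s[round(q / 100 * (len(s) - 1))]

def gate_stats(cos_b, lower, upper):
    """Per-step gauges: route_frac mean, mass at 0 and 1, and the straddle check."""
    fracs = [route_frac(x, lower, upper) for x in cos_b]
    n = len(fracs)
    return {
        "mean": sum(fracs) / n,
        "mass_at_0": sum(f == 0.0 for f in fracs) / n,
        "mass_at_1": sum(f == 1.0 for f in fracs) / n,
        # Unhealthy if everything sits below lower or everything sits above upper.
        "straddles": percentile(cos_b, 90) >= lower and percentile(cos_b, 10) <= upper,
    }

# Toy pairs (made-up numbers): rejected grads lean along +x, chosen grads along +y.
g_rej = [[1.0, 0.2], [0.9, 0.1], [1.0, 0.0]]
g_cho = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
vec = build_vec(g_rej, g_cho)
lower, upper = band(g_rej, g_cho, vec)
print("hkgap", upper - lower, "LOO", loo_separation(g_rej, g_cho))
print(gate_stats([-0.9, 0.0, 0.9], lower, upper))
```

On these toy pairs both `hkgap` and the LOO separation come out positive, which is the healthy case the SHOULD lines describe. A real run would compute the same quantities per module on the recovered per-rollout `g_b`.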