spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics

Validation removed: running the weak detector over student rollouts at train
time is the no-cheat violation, and a live validation is complex/non-causal.
Causal proof stays downstream (deploy perf + real-vs-random). Train-time only
LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the
'does the threshold generalize to a second pair' test), live cos_b percentiles
vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1,
resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate grads
to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line
for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks
per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-
robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an
extracted direction -- that's the novelty. Borrow their 'absorption' as the
mechanism justifying A->B generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 02:23:58 +00:00
parent 180d3e862c
commit a83953131e
+75 -34
@@ -129,23 +129,23 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
- `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
`δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper`
come from the pairs (step 3), passed in.
- granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the
calibration validation; default per-rollout (reuse the existing recovery hook).
- granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing
token-sum -> `g_b = cg/dS` recovery, just swap the `cos_b > tau` line for the ramp.
`rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
`lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
`hack_E_flags` feeding the gate (keep it for the calibration validation + streaming
hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control).
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
`hack_E_flags` feed into the GATE (drop -- no detector touches student rollouts at train
time now; keep `hack_E_flags` only if still cheap for the streaming hk_* LOG columns).
`route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
Teacher rollouts go through the same band (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal;
collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`,
`upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >>
mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat).
6. **Diagnostics to print** (all label-free, see "Cheap, label-free diagnostics"):
`hkgap = upper - lower`; LOO pair separation; live `cos_b` percentiles vs `[lower,upper]`;
`route_frac` mean + mass-at-0 + mass-at-1; `resid = cos(g_keep, vec)`.
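A minimal sketch of steps 2-3 (band edges from the pairs, then the ramp split), assuming each gradient is a plain per-module vector. The helper names `cosine`, `calibrate_band`, `route_frac` and `split_grad` are hypothetical, and the zero-width fallback is an assumption of this sketch, not something train.py is known to do.

```python
import math


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, eps)


def calibrate_band(g_cho, g_rej, vec):
    """lower = mean_p cos(g_cho[p], vec); upper = mean_p cos(g_rej[p], vec)."""
    lower = sum(cosine(g, vec) for g in g_cho) / len(g_cho)
    upper = sum(cosine(g, vec) for g in g_rej) / len(g_rej)
    return lower, upper


def route_frac(x, lower, upper, eps=1e-8):
    """Linear ramp clamp((x - lower) / (upper - lower), 0, 1)."""
    width = upper - lower
    if width <= eps:
        # ASSUMPTION: a collapsed band (hkgap ~ 0) routes nothing.
        return 0.0
    return min(1.0, max(0.0, (x - lower) / width))


def split_grad(g, vec, lower, upper):
    """Split one rollout gradient into (route_frac, g_keep, g_hack)."""
    f = route_frac(cosine(g, vec), lower, upper)
    g_hack = [f * gi for gi in g]          # goes to δS_hack.grad
    g_keep = [(1.0 - f) * gi for gi in g]  # goes to δS.grad
    return f, g_keep, g_hack


# Toy 2-D example: vec points along the first axis.
vec = [1.0, 0.0]
lower, upper = calibrate_band([[0.0, 1.0], [0.0, 2.0]], [[1.0, 0.0], [2.0, 0.0]], vec)
f, g_keep, g_hack = split_grad([3.0, 4.0], vec, lower, upper)
```

The two parts always sum back to `g`, so the ramp only redistributes the gradient between `δS` and `δS_hack` and never rescales it.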
## Current state — resume after compaction
@@ -174,8 +174,8 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
(no anchor) -> update or delete it.
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
< vanilla at matched solve, AND the calibration validation holds (`mean route_frac`
on live A-hack >> on live A-clean; band width `hkgap > 0`).
< vanilla at matched solve, AND the label-free diagnostics are healthy (band width
`hkgap > 0`, LOO separation > 0, live `cos_b` straddles `[lower,upper]`, `resid ~ 0`).
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
real-`vec` -- else the vector is still decorative and the method is just gradient
routing on labels.
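The label-free health checks named in the UAT can be sketched as below. This is an illustration under stated assumptions: `vec` is taken as the mean of `rej - cho` over pairs (the real extraction may differ), percentiles use a simple nearest-rank rule, and "all below / all above" is read off p90 and p10. All function names are hypothetical.

```python
import math


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, eps)


def mean_diff_vec(g_cho, g_rej, skip=None):
    """ASSUMPTION: vec = mean_p (g_rej[p] - g_cho[p]), optionally leaving pair `skip` out."""
    idx = [p for p in range(len(g_cho)) if p != skip]
    dim = len(g_cho[0])
    return [sum(g_rej[p][d] - g_cho[p][d] for p in idx) / len(idx) for d in range(dim)]


def loo_separation(g_cho, g_rej):
    """mean_p [cos(g_rej[p], vec_{-p}) - cos(g_cho[p], vec_{-p})]; needs >= 2 pairs."""
    seps = []
    for p in range(len(g_cho)):
        v = mean_diff_vec(g_cho, g_rej, skip=p)  # built on the OTHER pairs only
        seps.append(cosine(g_rej[p], v) - cosine(g_cho[p], v))
    return sum(seps) / len(seps)


def percentile(values, q):
    """Nearest-rank percentile for an integer q in [0, 100]."""
    s = sorted(values)
    rank = -(-q * len(s) // 100)  # integer ceil(q * n / 100)
    return s[max(0, rank - 1)]


def band_read(cos_b, lower, upper):
    """Label-free calibration read of live cos_b against [lower, upper]."""
    if percentile(cos_b, 90) < lower:
        return "all_below"   # band routes nothing
    if percentile(cos_b, 10) > upper:
        return "all_above"   # band routes everything
    return "overlaps"


g_cho = [[0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]]
g_rej = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
sep = loo_separation(g_cho, g_rej)
```

A positive `sep` means a direction built without pair `p` still ranks that pair's `rej` gradient above its `cho` gradient, which is what the "does the threshold generalize to a second pair" check asks for.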
@@ -185,10 +185,10 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
method works. Two separate checks below carry that weight.
- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode
A only, where the weak detector is allowed. Confirms the pair-set band actually separates
live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but
live rollouts are sampled" worry without ever touching held-out B.
- **Calibration is read, not validated** -- cheap label-free gauges (live `cos_b` vs the
band, `hkgap`, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate
is alive and the band is calibrated, WITHOUT running any detector over students. No live
detector validation (that would be the cheat).
- **Generalization test = deploy performance on held-out B** (never labelled, never in the
pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
- **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
@@ -227,34 +227,75 @@ solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of i
in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
though it can stay as a belt-and-braces solve guarantee.
Open: `g_step` granularity -- PER-ROLLOUT (each rollout its own `x`/`route_frac`, needs the
per-rollout grad-recovery hook) vs PER-STEP (one `x` for the aggregated step gradient,
much simpler). The calibration validation below decides it: use whichever separates
live A-hack from A-clean `route_frac`. Default per-rollout (hook exists); fall back to
per-step if it separates as well.
## Granularity: per-rollout (`g_step` = per-rollout δS grad)
## Calibration validation (no-cheat: known mode A only)
Decided, with paper backing -- not left open.
```py
# LIVE rollouts of the KNOWN mode A, labelled by the weak detector E (allowed for A):
A_hack = live A rollouts flagged hack by detector E
A_clean = live A rollouts not flagged
assert mean route_frac(A_hack) >> mean route_frac(A_clean) # band routes real live hacks
```
- **What we do now** (train.py:881-896, `_route2_grad_filter`): the baukit gate `c` is
per-TOKEN (`[G*s, r]`, since nn.Linear sees a flattened batch). We SUM each rollout's
token gate-grads -> `[G, r]`, divide by `δS` to recover the per-rollout knob grad `g_b`,
and take one `cos_b` per rollout. So the live unit is already PER-ROLLOUT. The recovery
hook exists; the band just replaces the `cos_b > tau` line with the ramp.
- **Gradient Routing** (Cloud et al. 2024): data-dependent stop-gradient masks at a few
layers' activations (`x = mask*act + (1-mask)*act.detach()`). For LLMs the mask is
per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations
b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal
state). So the RL-native unit there is the trajectory.
- **SGTM** (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain
examples only update their dedicated params). Its contribution is robustness to LABEL
NOISE, not a new granularity.
Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's
per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers
route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's
ALIGNMENT with an extracted direction (`cos(g_b, vec)` in the band) -- no per-example label.
That direction-gated routing is the novelty vs both; state it as such.
Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak
label localizes the capability MORE generally, because the routed region participates in
predictions on related non-routed data so the feature is not learned elsewhere. That is the
mechanism that would let routing on known A suppress unknown B; it is the theoretical basis
for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means
proportional split; same word, different thing.)
## Cheap, label-free diagnostics (validation dropped)
We are NOT running a live detector validation. Running the weak detector over the student's
own rollouts during training is on the wrong side of the no-cheat line (README: that is
exactly the cheat), and a live validation is complex and non-causal. The causal proof is
downstream (deploy performance + real-vs-random). During training we only LOG cheap,
label-free gauges (ml-debug: log everything, state the expected value and what a deviation
means, chase anomalies):
```
SHOULD per refresh: hkgap = upper - lower > 0 and roughly stable.
ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
a pre-routing vec snapshot.
SHOULD per refresh: LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
(band built on the OTHER pairs still separates the held-out pair -- "does the threshold
generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
routes everything (miscalibrated high). This is the calibration read with NO labels.
SHOULD per step: route_frac mean in (0,1), with some mass at 0 and some at 1.
ELSE all-0 or all-1 = degenerate gate.
SHOULD per step: resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
ELSE >0 = hack-ward grad leaking into δS (the real failure).
```
Held-out B is NEVER in this validation, so no-cheat holds by construction. If the
separation fails, the pair-set band does not transfer to live rollouts (the real
calibration risk) and we recalibrate the edges from a live-A quantile before trusting any
deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the
`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze a
pre-routing `vec` snapshot.
None of these touch held-out B or run a detector over students; they read the band, the
pairs, and the live cosine geometry only.
## Review findings (2026-06-06) -- decisions before implementing
Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the
Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band
diagnostic above (no labels); vec degeneracy -> the `hkgap` collapse check.
Its "circular tau" framing is moot under the band: the edges are not a decision point and
the width is validated against live data, not asserted from the pairs.