mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics
Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random). Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec). Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise- robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -129,23 +129,23 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
|
||||
- `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
|
||||
`δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper`
|
||||
come from the pairs (step 3), passed in.
|
||||
- granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the
|
||||
calibration validation; default per-rollout (reuse the existing recovery hook).
|
||||
- granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing
|
||||
token-sum -> `g_b = cg/dS` recovery, just swap the `cos_b > tau` line for the ramp.
|
||||
`rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
|
||||
3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
|
||||
existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
|
||||
`lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
|
||||
The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
|
||||
alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
|
||||
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
|
||||
`hack_E_flags` feeding the gate (keep it for the calibration validation + streaming
|
||||
hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control).
|
||||
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
|
||||
`hack_E_flags` feed into the GATE (drop -- no detector touches student rollouts at train
|
||||
time now; keep `hack_E_flags` only if still cheap for the streaming hk_* LOG columns).
|
||||
`route2_random_v_seed` stays (it's the directionality control).
|
||||
5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
|
||||
Teacher rollouts go through the same band (NOT force-routed).
|
||||
6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal;
|
||||
collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`,
|
||||
`upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >>
|
||||
mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat).
|
||||
6. **Diagnostics to print** (all label-free, see "Cheap, label-free diagnostics"):
|
||||
`hkgap = upper - lower`; LOO pair separation; live `cos_b` percentiles vs `[lower,upper]`;
|
||||
`route_frac` mean + mass-at-0 + mass-at-1; `resid = cos(g_keep, vec)`.
|
||||
|
||||
## Current state — resume after compaction
|
||||
|
||||
@@ -174,8 +174,8 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
|
||||
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
|
||||
(no anchor) -> update or delete it.
|
||||
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
|
||||
< vanilla at matched solve, AND the calibration validation holds (`mean route_frac`
|
||||
on live A-hack >> on live A-clean; band width `hkgap > 0`).
|
||||
< vanilla at matched solve, AND the label-free diagnostics are healthy (band width
|
||||
`hkgap > 0`, LOO separation > 0, live `cos_b` straddles `[lower,upper]`, `resid ~ 0`).
|
||||
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
|
||||
real-`vec` -- else the vector is still decorative and the method is just gradient
|
||||
routing on labels.
|
||||
@@ -185,10 +185,10 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
|
||||
The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
|
||||
method works. Two separate checks below carry that weight.
|
||||
|
||||
- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode
|
||||
A only, where the weak detector is allowed. Confirms the pair-set band actually separates
|
||||
live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but
|
||||
live rollouts are sampled" worry without ever touching held-out B.
|
||||
- **Calibration is read, not validated** -- cheap label-free gauges (live `cos_b` vs the
|
||||
band, `hkgap`, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate
|
||||
is alive and the band is calibrated, WITHOUT running any detector over students. No live
|
||||
detector validation (that would be the cheat).
|
||||
- **Generalization test = deploy performance on held-out B** (never labelled, never in the
|
||||
pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
|
||||
- **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
|
||||
@@ -227,34 +227,75 @@ solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of i
|
||||
in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
|
||||
though it can stay as a belt-and-braces solve guarantee.
|
||||
|
||||
Open: `g_step` granularity -- PER-ROLLOUT (each rollout its own `x`/`route_frac`, needs the
|
||||
per-rollout grad-recovery hook) vs PER-STEP (one `x` for the aggregated step gradient,
|
||||
much simpler). The calibration validation below decides it: use whichever separates
|
||||
live A-hack from A-clean `route_frac`. Default per-rollout (hook exists); fall back to
|
||||
per-step if it separates as well.
|
||||
## Granularity: per-rollout (`g_step` = per-rollout δS grad)
|
||||
|
||||
## Calibration validation (no-cheat: known mode A only)
|
||||
Decided, with paper backing -- not left open.
|
||||
|
||||
```py
|
||||
# LIVE rollouts of the KNOWN mode A, labelled by the weak detector E (allowed for A):
|
||||
A_hack = live A rollouts flagged hack by detector E
|
||||
A_clean = live A rollouts not flagged
|
||||
assert mean route_frac(A_hack) >> mean route_frac(A_clean) # band routes real live hacks
|
||||
- **What we do now** (train.py:881-896, `_route2_grad_filter`): the baukit gate `c` is
|
||||
per-TOKEN (`[G*s, r]`, since nn.Linear sees a flattened batch). We SUM each rollout's
|
||||
token gate-grads -> `[G, r]`, divide by `δS` to recover the per-rollout knob grad `g_b`,
|
||||
and take one `cos_b` per rollout. So the live unit is already PER-ROLLOUT. The recovery
|
||||
hook exists; the band just replaces the `cos_b > tau` line with the ramp.
|
||||
- **Gradient Routing** (Cloud et al. 2024): data-dependent stop-gradient masks at a few
|
||||
layers' activations (`x = mask*act + (1-mask)*act.detach()`). For LLMs the mask is
|
||||
per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations
|
||||
b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal
|
||||
state). So the RL-native unit there is the trajectory.
|
||||
- **SGTM** (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain
|
||||
examples only update their dedicated params). Its contribution is robustness to LABEL
|
||||
NOISE, not a new granularity.
|
||||
|
||||
Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's
|
||||
per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers
|
||||
route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's
|
||||
ALIGNMENT with an extracted direction (`cos(g_b, vec)` in the band) -- no per-example label.
|
||||
That direction-gated routing is the novelty vs both; state it as such.
|
||||
|
||||
Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak
|
||||
label localizes the capability MORE generally, because the routed region participates in
|
||||
predictions on related non-routed data so the feature is not learned elsewhere. That is the
|
||||
mechanism that would let routing on known A suppress unknown B; it is the theoretical basis
|
||||
for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means
|
||||
proportional split; same word, different thing.)
|
||||
|
||||
## Cheap, label-free diagnostics (validation dropped)
|
||||
|
||||
We are NOT running a live detector validation. Running the weak detector over the student's
|
||||
own rollouts during training is on the wrong side of the no-cheat line (README: that is
|
||||
exactly the cheat), and a live validation is complex and non-causal. The causal proof is
|
||||
downstream (deploy performance + real-vs-random). During training we only LOG cheap,
|
||||
label-free gauges (ml-debug: log everything, state the expected value and what a deviation
|
||||
means, chase anomalies):
|
||||
|
||||
```
|
||||
SHOULD per refresh: hkgap = upper - lower > 0 and roughly stable.
|
||||
ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
|
||||
a pre-routing vec snapshot.
|
||||
|
||||
SHOULD per refresh: LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
|
||||
(band built on the OTHER pairs still separates the held-out pair -- "does the threshold
|
||||
generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.
|
||||
|
||||
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
|
||||
ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
|
||||
routes everything (miscalibrated high). This is the calibration read with NO labels.
|
||||
|
||||
SHOULD per step: route_frac mean in (0,1), with some mass at 0 and some at 1.
|
||||
ELSE all-0 or all-1 = degenerate gate.
|
||||
|
||||
SHOULD per step: resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
|
||||
ELSE >0 = hack-ward grad leaking into δS (the real failure).
|
||||
```
|
||||
|
||||
Held-out B is NEVER in this validation, so no-cheat holds by construction. If the
|
||||
separation fails, the pair-set band does not transfer to live rollouts (the real
|
||||
calibration risk) and we recalibrate the edges from a live-A quantile before trusting any
|
||||
deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the
|
||||
`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze a
|
||||
pre-routing `vec` snapshot.
|
||||
None of these touch held-out B or run a detector over students; they read the band, the
|
||||
pairs, and the live cosine geometry only.
|
||||
|
||||
## Review findings (2026-06-06) -- decisions before implementing
|
||||
|
||||
Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
|
||||
The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
|
||||
points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the
|
||||
Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
|
||||
points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band
|
||||
diagnostic above (no labels); vec degeneracy -> the `hkgap` collapse check.
|
||||
Its "circular tau" framing is moot under the band: the edges are not a decision point and
|
||||
the width is validated against live data, not asserted from the pairs.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user