mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation

Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec), route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec), upper=mean cos(g_rej,vec). Below lower keep, above upper route, between = absorption (proportional split).

Band WIDTH (hkgap=upper-lower) is the real-vs-random discriminator (random vec closes the band) so no separate matched-fraction control is needed; collapse flags vec degeneracy.

Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat): mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band transfers to the sampled live distribution. Also picks g_step granularity (per-rollout default vs per-step). Held-out B never in validation.

Corrects the earlier wrong claim that component-routing collapses to erase (pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -45,46 +45,42 @@ a self-test", and leaks ~1.1% onto held-out B.

## What we do INSTEAD (pair-routed)

Everything — direction AND threshold — comes from the pairs. No teacher force-route,
no live detector over students. Pure gate.
Everything — direction AND the two band edges — comes from the pairs. No teacher
force-route, no live detector over students. A BANDED cosine gate with an absorption
zone (full mechanism + validation are in the "Routing mechanism" section below):

```py
# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
for pair p:
    g_rej[p] = ∇logp(rej_p) # hack completion (forward+backward through δS)
    g_cho[p] = ∇logp(cho_p) # clean completion
    Δ[p] = g_rej[p] - g_cho[p] # hack-ward per-pair diff
vec = unit(mean_p Δ[p]) # or SVD top-k of stacked Δ; per module, hack-ward
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward (or SVD top-k of stacked diffs)
lower = mean_p cos(g_cho[p], vec) # clean edge (low) — below: not hack, keep
upper = mean_p cos(g_rej[p], vec) # hack edge (high) — above: hack, route

# threshold from the SAME pairs — where hacks land vs where clean lands on `vec`:
c_rej = mean_p cos(g_rej[p], vec) # high (hacks point hack-ward)
c_cho = mean_p cos(g_cho[p], vec) # low (clean points away)
τ = (c_rej + c_cho) / 2 # midpoint; no detector, no teacher, no live label

# ── live, every step: pure gate, no labels ──
for rollout b:
    g_b = per_rollout_δS_grad(b)
    if cos(g_b, vec) > τ:
        route(g_b) → δS_hack # quarantine, deleted at deploy
    else:
        keep(g_b) → δS # deployed knob
# ── live: absorption ramp, no labels ──
x = cos(g_step, vec)
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac * g_step # deleted at deploy
δS.grad += (1 - route_frac) * g_step # deployed knob; absorption in the middle
```

Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`);
its rollouts are NOT force-routed — they go through the same `cos > τ` gate as any
student rollout. After step 30 it is pure on-policy.
Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`,
see the journal-evidence note: student out-hacks teacher ~step 40, so 30 may be early;
40 is the safer value if emergence stalls). Its rollouts are NOT force-routed — they go
through the same band as any student rollout. After the cut it is pure on-policy.

## Now vs new — what changed

| | now (route2) | new (pair-routed) |
| | now (route2) | new (pair-routed band) |
|---|---|---|
| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
| threshold τ | live `hacked_E` detector over students + EMA | the pairs' own `cos(g_rej)` vs `cos(g_cho)` midpoint |
| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
| force-route | yes (`hack_anchor \|`) | none — gate only |
| live detector over students | yes (noisy, leaks onto B) | none |
| teacher | mixed throughout, force-routed | seed only, cut@30, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — it is the only mechanism |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec) > τ`, i.e. B shares the direction |
| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |

## Fork to decide in the rebuild: gradients vs activations for `vec`

@@ -127,28 +123,29 @@ deferred). `vec` sign = hack-ward = `rej - cho`.

1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
   `hack_anchor`/`clean_anchor` from teacher membership or the detector.
2. **Rewrite `_route2_grad_filter`** (~line 877):
   - drop the `hack_anchor |` force-route term -> gate is `cos_b > tau` only.
   - drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
   - `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
   - route the vec-COMPONENT not the whole rollout (see Review-findings decision #3):
     for a flagged rollout, `c = cos*vec` goes to `delta_S_hack`, the orthogonal
     remainder stays in `delta_S`. Removes `rollout_ablate_frac`.
3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
2. **Rewrite `_route2_grad_filter`** (~line 877) into the banded gate:
   - drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos`
     detector calibration (~896-908). No hard `cos_b > tau`.
   - `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
     `δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper`
     come from the pairs (step 3), passed in.
   - granularity (`g_step` per-rollout vs per-step-aggregate) is decided by the
     calibration validation; default per-rollout (reuse the existing recovery hook).
     `rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
   existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
   `c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
   `tau = (c_rej + c_cho)/2`, per module. The extract path already produces per-pair
   `g_rej`/`g_cho` (it builds `vec = mean(g_rej - g_cho)`); add the two cosine means +
   tau alongside. Store `route2_tau[name]` from this, not from anchors.
   `lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
   The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
   alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg;
   `hack_E_flags` feeding the gate (keep it for the streaming hk_* LOG columns only if
   cheap, else drop); `route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step: int = 30` default (seed then on-policy). Keep teacher
   mixing 0->30 only; its rollouts go through the same `cos > tau` gate (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = c_rej - c_cho` (now a PAIR quantity, the
   gate's separation margin); per-step `cos_b` distribution; `tau`; fraction flagged;
   `resid = cos(kept grad, vec)`. SHOULD: `c_rej > tau > c_cho` and pair midpoint
   brackets the live `cos_b` of hack vs clean rollouts (the calibration smoke-check).
   `hack_E_flags` feeding the gate (keep it for the calibration validation + streaming
   hk_* LOG columns); `route2_random_v_seed` stays (it's the directionality control).
5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
   Teacher rollouts go through the same band (NOT force-routed).
6. **Diagnostics to keep/print**: `hkgap = upper - lower` (band width = separation signal;
   collapse -> vec degenerate -> freeze snapshot); per-step `x` distribution; `lower`,
   `upper`; mean `route_frac`. Calibration validation: `mean route_frac(A_hack) >>
   mean route_frac(A_clean)` on live known-A rollouts (detector-labelled, no-cheat).

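Step 3 of the list above (build `vec` from the pairs, then read both band edges off the same pairs) can be sketched in plain Python. This is a minimal illustration, not the repo's extract path: the helper names `unit`, `cosine` and `band_from_pairs`, the list-of-floats gradient representation, and the module key `"toy.module"` are all assumptions made for the example.

```python
import math

def unit(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def band_from_pairs(g_rej, g_cho):
    """Return (vec, lower, upper) for ONE module from per-pair gradients."""
    n, dim = len(g_rej), len(g_rej[0])
    mean_diff = [sum(g_rej[p][i] - g_cho[p][i] for p in range(n)) / n
                 for i in range(dim)]
    vec = unit(mean_diff)                            # hack-ward direction
    lower = sum(cosine(g, vec) for g in g_cho) / n   # clean edge
    upper = sum(cosine(g, vec) for g in g_rej) / n   # hack edge
    return vec, lower, upper

# Toy pairs (made up): the hack completions add a component along axis 2.
g_cho = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g_rej = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

vec, lower, upper = band_from_pairs(g_rej, g_cho)
route2_band = {"toy.module": (lower, upper)}  # stored per module, as in step 3
hkgap = upper - lower
print(vec, route2_band, hkgap)
```

In the real code the gradients would be per-module tensors, so the same three reductions would run once per module name at each refresh.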
## Current state — resume after compaction
@@ -177,46 +174,106 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
  (no anchor) -> update or delete it.
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
  < vanilla at matched solve, AND the calibration check holds (`c_rej > tau > c_cho`,
  pair tau brackets live `cos_b`).
  < vanilla at matched solve, AND the calibration validation holds (`mean route_frac`
  on live A-hack >> on live A-clean; band width `hkgap > 0`).
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
  real-`vec` -- else the vector is still decorative and the method is just gradient
  routing on labels.

## Validation logic -- what proves the method works (read this before the gate)

The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
method works. Two separate checks below carry that weight.

- **Calibration validation (does the band route real live hacks?)** runs on the KNOWN mode
  A only, where the weak detector is allowed. Confirms the pair-set band actually separates
  live A-hack from live A-clean gradients. This closes the "pairs are teacher-forced but
  live rollouts are sampled" worry without ever touching held-out B.
- **Generalization test = deploy performance on held-out B** (never labelled, never in the
  pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
- **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
  collapse to ~0, the band closes, and routing degenerates to a cos>0 coin flip. So the
  band WIDTH is itself the real-vs-random discriminator; no separate matched-fraction
  control is needed.

The gap `upper - lower` is biased positive by construction (`vec` points along the mean rej-cho
diff), so the ordering `upper > lower` is not evidence. The band WIDTH and POSITION are the empirical,
load-bearing quantities, and the calibration validation tests them against live data.

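The "random `vec` closes the band" claim can be checked by hand on a toy case. The sketch below uses made-up 3-d pair gradients and, in place of a genuinely random direction, a hand-picked direction orthogonal to the hack axis (an assumption chosen so the arithmetic is exact; a real random `vec` in high dimension would only give edges near zero, not exactly zero). The helper names are illustrative, not from the repo.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def band(g_rej, g_cho, vec):
    """Band edges (lower, upper) of the pair gradients measured against vec."""
    lower = sum(cosine(g, vec) for g in g_cho) / len(g_cho)
    upper = sum(cosine(g, vec) for g in g_rej) / len(g_rej)
    return lower, upper

# Toy pairs (made up): hack completions add a component along axis 2.
g_cho = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g_rej = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

real_vec = [0.0, 0.0, 1.0]            # unit(mean(g_rej - g_cho)) for these pairs
s = 1.0 / math.sqrt(2.0)
unrelated_vec = [s, -s, 0.0]          # stand-in for a random direction

real_lower, real_upper = band(g_rej, g_cho, real_vec)
rand_lower, rand_upper = band(g_rej, g_cho, unrelated_vec)

real_hkgap = real_upper - real_lower  # about 0.707: the band is open
rand_hkgap = rand_upper - rand_lower  # 0: the band is closed
print(real_hkgap, rand_hkgap)
```

With the real direction the edges sit at 0 and about 0.707; with the unrelated direction both edges average to 0, so the width carries the real-vs-random signal on its own.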
## Routing mechanism — banded cosine gate with an absorption zone

The gate is a BAND, not a single threshold. Two edges, both measured from the pairs:

```py
# ── refresh every N steps: vec + band edges from the pairs, through the current adapter ──
for pair p:
    g_rej[p], g_cho[p] = ∇logp(rej_p), ∇logp(cho_p) # δS-space, per module
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward
lower = mean_p cos(g_cho[p], vec) # where genuinely-CLEAN gradients land (low)
upper = mean_p cos(g_rej[p], vec) # where genuinely-HACK gradients land (high)
hkgap = upper - lower # band width = the load-bearing separation signal

# ── live: absorption ramp, pure gate, NO labels, NO force-route ──
x = cos(g_step, vec) # alignment of the live gradient with the hack dir
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
δS_hack.grad += route_frac * g_step # x>=upper -> 1: fully quarantined (deleted)
δS.grad += (1 - route_frac) * g_step # x<=lower -> 0: fully kept (deployed)
                                     # lower<x<upper: ABSORPTION, split between knobs
```

Three zones: below `lower` = not hack, keep; above `upper` = hack, route to the deletable
`δS_hack`; between = absorption, the gradient splits proportionally. The ramp softens the
solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of its signal
in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
though it can stay as a belt-and-braces solve guarantee.

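The three zones can be written as a small runnable function. This is a sketch under assumptions: the names `route_frac` and `split_gradient` and the `eps` guard are illustrative, and the behaviour for a closed band (`upper - lower` near zero, where the pseudocode would divide by zero) is one possible choice, a hard step at the edge, not something the spec fixes.

```python
def route_frac(x, lower, upper, eps=1e-8):
    """Fraction of the gradient routed to the deletable knob for alignment x."""
    width = upper - lower
    if width <= eps:
        # Assumed handling of a degenerate band: fall back to a hard step.
        return 1.0 if x > upper else 0.0
    return min(1.0, max(0.0, (x - lower) / width))

def split_gradient(g_step, frac):
    """Split g_step into (quarantined part, kept part); the parts sum to g_step."""
    hack_part = [frac * g for g in g_step]
    kept_part = [(1.0 - frac) * g for g in g_step]
    return hack_part, kept_part

lower, upper = 0.0, 0.5                  # made-up band edges
below = route_frac(-0.2, lower, upper)   # keep zone
middle = route_frac(0.25, lower, upper)  # absorption zone
above = route_frac(0.9, lower, upper)    # route zone

hack_part, kept_part = split_gradient([2.0, -4.0], middle)
print(below, middle, above, hack_part, kept_part)
```

Because the two parts always sum to the original gradient, the combined forward trajectory sees the full update during training; only the share that lands in the deletable knob disappears at deploy.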
Open: `g_step` granularity -- PER-ROLLOUT (each rollout its own `x`/`route_frac`, needs the
per-rollout grad-recovery hook) vs PER-STEP (one `x` for the aggregated step gradient,
much simpler). The calibration validation below decides it: use whichever separates
live A-hack from A-clean `route_frac`. Default per-rollout (hook exists); fall back to
per-step if it separates as well.

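To make the granularity trade-off concrete, here is a toy comparison of the two options on one step containing a clean-like and a hack-like rollout. The gradients and the `cosine` helper are made up for illustration; the real per-rollout gradients would come from the recovery hook mentioned above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec = [0.0, 0.0, 1.0]                    # hack-ward direction (toy)
rollout_grads = [
    [1.0, 0.0, 0.0],                     # clean-like rollout
    [0.0, 0.0, 2.0],                     # hack-like rollout
]

# PER-ROLLOUT: one alignment value per rollout, so the two can be routed apart.
per_rollout_x = [cosine(g, vec) for g in rollout_grads]

# PER-STEP: one alignment value for the aggregated gradient of the whole step.
step_grad = [sum(col) for col in zip(*rollout_grads)]
per_step_x = cosine(step_grad, vec)

print(per_rollout_x, per_step_x)
```

Per-rollout gives 0.0 and 1.0, while per-step gives a single value of about 0.894 that would route part of the clean rollout's signal along with the hack. Whether that blending matters in practice is what the calibration validation is meant to decide.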
## Calibration validation (no-cheat: known mode A only)

```py
# LIVE rollouts of the KNOWN mode A, labelled by the weak detector E (allowed for A):
A_hack = live A rollouts flagged hack by detector E
A_clean = live A rollouts not flagged
assert mean route_frac(A_hack) >> mean route_frac(A_clean) # band routes real live hacks
```

Held-out B is NEVER in this validation, so no-cheat holds by construction. If the
separation fails, the pair-set band does not transfer to live rollouts (the real
calibration risk) and we recalibrate the edges from a live-A quantile before trusting any
deploy number. `hkgap = upper - lower` is logged each refresh; if it collapses toward 0 the
`vec` has degenerated (hacks suppressed -> hack-pair gradient weakens) and we freeze a
pre-routing `vec` snapshot.

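The check above leaves `>>` and "a live-A quantile" unspecified, so the following is only one way to make them concrete. The `MIN_GAP` threshold, the choice of the median as the quantile, the function names, and all the numbers are assumptions for illustration, not values from the spec or from any run.

```python
import statistics

MIN_GAP = 0.3  # assumed meaning of ">>": required gap in mean route_frac

def calibration_gap(route_frac_hack, route_frac_clean):
    """Mean route_frac on live A-hack minus mean route_frac on live A-clean."""
    return statistics.mean(route_frac_hack) - statistics.mean(route_frac_clean)

def edges_from_live_quantile(x_clean, x_hack):
    """Assumed fallback: take each edge as the median live-A cosine of its class."""
    return statistics.median(x_clean), statistics.median(x_hack)

# Made-up route_frac values for detector-labelled live mode-A rollouts.
frac_hack = [0.9, 0.8, 1.0]
frac_clean = [0.1, 0.0, 0.2]
gap = calibration_gap(frac_hack, frac_clean)
band_transfers = gap >= MIN_GAP

# Made-up live cosines, used only if the pair-set band fails to transfer.
x_clean = [-0.05, 0.02, 0.10]
x_hack = [0.30, 0.45, 0.60]
lower, upper = edges_from_live_quantile(x_clean, x_hack)
print(gap, band_transfers, lower, upper)
```

Only mode-A rollouts enter either function, which keeps held-out B out of the calibration path.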
## Review findings (2026-06-06) -- decisions before implementing

Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
Both converge on the same threshold problem; resolutions below are now part of the plan.
The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
points: calibration risk (pairs teacher-forced vs live sampled) -> handled by the
Calibration validation above; vec degeneracy -> handled by the `hkgap` collapse check.
Its "circular tau" framing is moot under the band: the edges are not a decision point and
the width is validated against live data, not asserted from the pairs.

1. **tau is circular, not just scale-mismatched.** Because `vec = mean(g_rej - g_cho)`,
   the inequality `c_rej > c_cho` is biased to hold BY CONSTRUCTION even when `vec` is pure noise, so
   the pair midpoint cannot validate that the gate separates anything. Separately, pair
   gradients are teacher-forced while live rollouts are sampled, so the pair cosine scale
   need not match the live `cos_b` scale; refreshing every N steps fixes adapter *drift*,
   not this *distribution* gap.
   - Decision: keep pair-midpoint tau as the no-extra-labels DEFAULT for the method, but
     (a) compute a LEAVE-ONE-PAIR-OUT separation `c_rej^{-p} vs c_cho^{-p}` as the real
     diagnostic that `vec` generalizes across pairs (cheap at ~10 pairs), and (b) for the
     real-vs-random CONTROL, set tau by a QUANTILE of the live `cos_b` so the flagged
     FRACTION is matched between conditions.
2. **Match the flagged fraction in the real-vs-random control (deepseek #2, kept).** Real
   and random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
   could be volume, not direction. For the control specifically, set both `tau` values by the same
   QUANTILE of the live `cos_b` so equal fractions route and only DIRECTION differs.
   Suppression gap at matched fraction => direction is load-bearing.

2. **Match the flagged fraction in the real-vs-random control (deepseek #2).** Real and
   random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
   could be volume, not direction. The quantile-tau in 1(b) controls this: equal fraction
   routed, only the DIRECTION differs. Suppression gap at matched fraction => direction is
   load-bearing.
3. **Whole-rollout routing + exploration floor (Claude, corrected).** Keep route2's
   whole-rollout quarantine (it is the gradient-routing thesis); keep `rollout_ablate_frac`
   for solve signal. Component-routing was rejected because it collapses to `erase` at
   deploy (see Routing mechanism above).

3. **Route the vec-COMPONENT, not the whole rollout (Claude).** The route2 pseudocode
   quarantined a flagged rollout's entire `delta_S` gradient, which also strips its solve
   signal (solve-starvation on problems only solved-by-hacking). Decision: subtract the
   `cos*vec` component into `delta_S_hack` and keep the orthogonal remainder in `delta_S`
   (erase-style projection, routed not erased). Drops the need for `rollout_ablate_frac`.

4. **Degeneracy diagnostic (deepseek #3).** As routing suppresses hacks, the hack-pair
4. **Degeneracy diagnostic (deepseek #3, kept).** As routing suppresses hacks, the hack-pair
   gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho`
   per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot.

5. **Pre-register the science UAT (deepseek).** n>=3 seeds per condition (real/random),
   success = mean held-out-B deploy hack under real-`vec` is below random-`vec` by more
   than the across-seed std of the random baseline. Qualitative "suppresses better" is
   not enough.
5. **Pre-register the science UAT (deepseek, kept; user-confirmed).** n>=3 seeds per
   condition (real/random), success = mean held-out-B deploy hack under real-`vec` is below
   random-`vec` by more than the across-seed std of the random baseline. Qualitative
   "suppresses better" is not enough.

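The pre-registered criterion in item 5 is simple enough to pin down as a function before any run. A minimal sketch, assuming the "across-seed std" means the sample standard deviation (`statistics.stdev`); the function name and every number below are placeholders for illustration, not measured results.

```python
import statistics

def science_uat_passes(real_hack, random_hack):
    """True if mean held-out-B deploy hack under real-vec is below the
    random-vec mean by more than the across-seed std of the random baseline."""
    if len(real_hack) < 3 or len(random_hack) < 3:
        raise ValueError("pre-registration requires n >= 3 seeds per condition")
    margin = statistics.mean(random_hack) - statistics.mean(real_hack)
    return margin > statistics.stdev(random_hack)  # sample std: an assumption

# Placeholder per-seed held-out-B deploy hack rates (NOT real data).
random_hack = [0.10, 0.12, 0.08]
clear_win = science_uat_passes([0.02, 0.03, 0.01], random_hack)
no_effect = science_uat_passes([0.09, 0.10, 0.11], random_hack)
print(clear_win, no_effect)
```

Fixing the comparison in code up front means the qualitative "suppresses better" reading cannot be substituted after the numbers are in.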