spec: fold external-review into pair-routing plan; default teacher_off_step=30

External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).

teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 01:03:02 +00:00
parent dfdc538428
commit 53d88bc9ee
3 changed files with 84 additions and 5 deletions
+42 -1
@@ -128,9 +128,12 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
`hack_anchor`/`clean_anchor` from teacher membership or the detector.
2. **Rewrite `_route2_grad_filter`** (~line 877):
- drop the `hack_anchor |` force-route term -> `flagged = (cos_b > tau)`.
- drop the `hack_anchor |` force-route term -> gate is `cos_b > tau` only.
- drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
- `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
- route the vec-COMPONENT not the whole rollout (see Review-findings decision #3):
for a flagged rollout, `c = cos*vec` goes to `delta_S_hack`, the orthogonal
remainder stays in `delta_S`. Removes `rollout_ablate_frac`.
3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
`c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
@@ -179,3 +182,41 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
real-`vec` -- else the vector is still decorative and the method is just gradient
routing on labels.
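
One way to make this UAT checkable rather than qualitative is sketched below. It assumes each seed yields a single held-out-B deploy hack metric where lower is better, and it uses the pre-registered criterion from this plan's review findings (n>=3 seeds, gap larger than the across-seed std of the random baseline). The function name `science_uat_passes` and the choice of sample std (`ddof=1`) are assumptions for illustration, not part of the repo.

```python
import numpy as np


def science_uat_passes(real_hack, random_hack, min_seeds=3):
    """Return True if real-vec beats random-vec by more than the random baseline's std.

    real_hack / random_hack: one held-out-B deploy hack value per seed (lower is better).
    """
    real = np.asarray(real_hack, dtype=float)
    rand = np.asarray(random_hack, dtype=float)
    if len(real) < min_seeds or len(rand) < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds per condition")
    # Assumption: across-seed spread is the sample std (ddof=1) of the random-vec runs.
    margin = rand.std(ddof=1)
    return bool(rand.mean() - real.mean() > margin)
```

If the check fails, the reading in the bullet above applies: the vector is not shown to be load-bearing.
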

## Review findings (2026-06-06) -- decisions before implementing

Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
Both converge on the same threshold problem; resolutions below are now part of the plan.

1. **tau is circular, not just scale-mismatched.** Because `vec = mean(g_rej - g_cho)`,
the inequality `c_rej > c_cho` holds BY CONSTRUCTION even when `vec` is pure noise
(exactly for raw dot products, since `mean_p (g_rej[p] - g_cho[p]) . vec = |vec|^2 >= 0`,
and for cosines whenever the pair gradient norms are roughly equal), so
the pair midpoint cannot validate that the gate separates anything. Separately, pair
gradients are teacher-forced while live rollouts are sampled, so the pair cosine scale
need not match the live `cos_b` scale; refreshing every N steps fixes adapter *drift*,
not this *distribution* gap.
- Decision: keep pair-midpoint tau as the no-extra-labels DEFAULT for the method, but
(a) compute a LEAVE-ONE-PAIR-OUT separation `c_rej^{-p} vs c_cho^{-p}` as the real
diagnostic that `vec` generalizes across pairs (cheap at ~10 pairs), and (b) for the
real-vs-random CONTROL, set tau by a QUANTILE of the live `cos_b` so the flagged
FRACTION is matched between conditions.
2. **Match the flagged fraction in the real-vs-random control (deepseek #2).** Real and
random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
could be volume, not direction. The quantile-tau in 1(b) controls this: equal fraction
routed, only the DIRECTION differs. Suppression gap at matched fraction => direction is
load-bearing.
3. **Route the vec-COMPONENT, not the whole rollout (Claude).** The route2 pseudocode
quarantined a flagged rollout's entire `delta_S` gradient, which also strips its solve
signal (solve-starvation on problems only solved-by-hacking). Decision: subtract the
`cos*vec` component into `delta_S_hack` and keep the orthogonal remainder in `delta_S`
(erase-style projection, routed not erased). Drops the need for `rollout_ablate_frac`.
4. **Degeneracy diagnostic (deepseek #3).** As routing suppresses hacks, the hack-pair
gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho`
per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot.
5. **Pre-register the science UAT (deepseek).** n>=3 seeds per condition (real/random),
success = mean held-out-B deploy hack rate under real-`vec` is lower than under random-`vec`
by more than the across-seed std of the random baseline. Qualitative "suppresses better" is
not enough.
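
The three mechanical decisions above (leave-one-pair-out separation, quantile-tau, vec-component routing) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the repo's implementation: gradients are assumed to be flattened into 1-D arrays, `cos*vec` is read as the projection of the rollout gradient onto unit-norm `vec`, and the function names `loo_separation`, `quantile_tau`, and `route_vec_component` are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero for all-zero gradients


def cosine(a, b):
    """Cosine similarity between two flat gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + EPS))


def loo_separation(g_rej, g_cho):
    """Decision 1(a): mean over pairs p of cos(g_rej[p], vec_minus_p) - cos(g_cho[p], vec_minus_p).

    vec_minus_p is rebuilt from every pair except p, so pair p never scores itself.
    g_rej, g_cho: arrays of shape (n_pairs, d) with n_pairs >= 2.
    """
    n = len(g_rej)
    diffs = g_rej - g_cho  # per-pair hack-ward direction (rej - cho)
    total = diffs.sum(axis=0)
    gaps = []
    for p in range(n):
        vec_minus_p = (total - diffs[p]) / (n - 1)
        gaps.append(cosine(g_rej[p], vec_minus_p) - cosine(g_cho[p], vec_minus_p))
    return float(np.mean(gaps))


def quantile_tau(cos_b, flagged_frac):
    """Decisions 1(b)/2: pick tau so about `flagged_frac` of live rollouts satisfy cos_b > tau."""
    return float(np.quantile(cos_b, 1.0 - flagged_frac))


def route_vec_component(g, vec, tau):
    """Decision 3: split one rollout gradient into (delta_S part, delta_S_hack part)."""
    if cosine(g, vec) <= tau:
        return g, np.zeros_like(g)  # not flagged: nothing is routed
    v_hat = vec / (np.linalg.norm(vec) + EPS)
    hack_part = (g @ v_hat) * v_hat  # component of g along vec
    return g - hack_part, hack_part  # orthogonal remainder stays in delta_S
```

For the real-vs-random control, calling `quantile_tau` separately on each condition's own `cos_b` with the same `flagged_frac` routes an equal fraction of rollouts in both, so any remaining suppression gap is attributable to direction. A leave-one-pair-out gap near zero would suggest `vec` does not generalize across pairs even when the in-sample `hkgap` looks healthy.
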