mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:43:00 +08:00
spec: fold external-review into pair-routing plan; default teacher_off_step=30
External review (Claude + deepseek-v4-pro) converged on the threshold being circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus scale-mismatched to live rollouts.

Decisions added:
- leave-one-pair-out as the real vec-generalizes diagnostic
- quantile-tau to match flagged fraction in the real-vs-random control
- route the vec-component (erase-style) not the whole rollout
- degeneracy diagnostic (hkgap collapse)
- pre-register the science UAT (n>=3 seeds, effect>random-baseline std)

teacher_off_step now defaults to 30 on the base Config so every arm runs pure on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking self-sustains after the cut).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,37 @@
+```
+## Scientist Review
+
+### Central assumption
+The pair-derived gradient difference `vec` captures a universal hack-ward direction that separates hack from clean rollouts for unseen hack B.
+**holds when:** the hack behavior shares a low-dimensional gradient signature across hack variants, and the pairs (A) faithfully sample that signature.
+**fails when:** the vec overfits to superficial pair artifacts (e.g., phrasing, sequence length) that do not transfer to B; or the live cos separation is so weak that the random baseline produces similar suppression due to chance, yielding a false‑null (underpowered comparison).
+
+### Confounds
+1. **Overfitted threshold** – τ is computed as the midpoint of `cos` on the *same* pairs used to extract `vec`. This inflates the pair separation (`c_rej − c_cho`) and may not reflect live distributions, causing the gate to flag an inappropriate fraction of rollouts.
+→ **Control:** leave‑one‑pair‑out τ (or use a separate calibration set), and ablate τ sensitivity via a sweep.
+
+2. **Differential flagged fraction** – Real and random `vec` may produce different fractions of flagged rollouts (due to distribution shapes), and the suppression difference might be driven by the amount of quarantined gradient updates, not directional accuracy.
+→ **Control:** match the flagged fraction between real and random conditions (adjust threshold quantile), or compute suppression as a function of fraction.
+
+3. **Pair gradient drift / degenerate `vec`** – As the model is updated by routing, the hack‑completion gradients may weaken, making the refreshed `vec` degenerate or misleading.
+→ **Control:** monitor `c_rej − c_cho` over time; if it collapses, freeze a pre‑routing snapshot for `vec` extraction or use an auxiliary model.
+
+### Algorithm issues
+- **`What we do INSTEAD`/pair‑calibrated τ:** using the same data (`g_rej[p]`, `g_cho[p]`) both to build `vec` and to calibrate τ yields an optimistic threshold; this breaks the assumption that τ separates live rollouts (overfitting). The correction is to at least apply a hold‑out within pairs or cross‑validate.
+- **Gradient direction computation:** `vec = unit(mean_p Δ[p])` – if `Δ[p]` stems from full‑parameter gradients, SVD top‑k is called but not specified in the pseudocode. Ensure that the shape after SVD reduction matches the per‑rollout `g_b` (the gate hook gradient) to avoid silent misalignment.
+- **No missing stop‑gradients** – the discrete `cos_b > τ` branch does not bleed gradients; routing is sound.
+
+### Experimental design
+- **Falsifiable:** Yes – the real‑vec > random‑vec suppression claim on B is testable.
+- **n needed:** At least 5–10 independent seeds per condition (real/random) to detect a practical difference; single‑run comparisons are insufficient due to variance in hack‑rate metrics.
+- **UAT gap:** The science UAT (“random‑vec does NOT suppress B as well as real‑vec”) lacks a statistical criterion – pre‑registration of an effect‑size threshold (e.g., difference in mean B‑solve rate > 2σ of the random baseline distribution) is necessary to avoid post‑hoc interpretation.
+
+### Section verdicts
+- **What we do INSTEAD (pair‑routed):** sound in concept, but τ calibration introduces overfitting that must be mitigated.
+- **Fork to decide (gradients vs activations):** reasonable investigation; gradients preserve the intervention thesis despite noise.
+- **Calibration risk to smoke‑test first:** essential sanity check, but not a full control; leave‑one‑pair‑out or external calibration needed.
+- **Smoke + UAT:** insufficiently specified for the science UAT; needs pre‑committed statistical success criterion and number of runs.
+
+### Single most important fix
+Replace the overfitted pair‑midpoint τ with a calibration procedure that does not reuse the exact pair data used to build `vec` (e.g., leave‑one‑pair‑out τ, or a quantile on a held‑out fraction of pairs, or calibrate on a set of clean/hack rollouts from a model variant not used for vec extraction). Without this, the threshold’s validity for live rollouts is unproven, and the real‑vs‑random comparison remains confounded by mis‑calibration.
+```
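
The circularity concern and the leave-one-pair-out alternative raised in this review can be illustrated with a small sketch. It is not code from the repository: `mean_delta_vec`, `in_sample_gap`, `loo_gap`, and `noise_pairs` are hypothetical helpers, gradients are plain Python lists rather than tensors, and the pairs are synthetic Gaussian noise. The only thing taken from the plan is the definition `vec = unit(mean_p(g_rej[p] - g_cho[p]))`.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def mean_delta_vec(g_rej, g_cho, skip=None):
    """unit(mean_p (g_rej[p] - g_cho[p])), optionally leaving pair `skip` out.

    The sum is used instead of the mean; after unit-normalising they coincide.
    """
    dim = len(g_rej[0])
    acc = [0.0] * dim
    for p in range(len(g_rej)):
        if p == skip:
            continue
        for i in range(dim):
            acc[i] += g_rej[p][i] - g_cho[p][i]
    norm = math.sqrt(dot(acc, acc))
    return [x / norm for x in acc]


def in_sample_gap(g_rej, g_cho):
    """c_rej - c_cho, scored on the same pairs that built vec."""
    vec = mean_delta_vec(g_rej, g_cho)
    n = len(g_rej)
    c_rej = sum(cos(g, vec) for g in g_rej) / n
    c_cho = sum(cos(g, vec) for g in g_cho) / n
    return c_rej - c_cho


def loo_gap(g_rej, g_cho):
    """Mean over p of cos(g_rej[p], vec) - cos(g_cho[p], vec), vec built without pair p."""
    n = len(g_rej)
    total = 0.0
    for p in range(n):
        vec = mean_delta_vec(g_rej, g_cho, skip=p)
        total += cos(g_rej[p], vec) - cos(g_cho[p], vec)
    return total / n


def noise_pairs(n_pairs, dim, rng, shared=0.0):
    """Synthetic pairs: Gaussian noise, with `shared` added to every g_rej coordinate."""
    g_rej = [[rng.gauss(0, 1) + shared for _ in range(dim)] for _ in range(n_pairs)]
    g_cho = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]
    return g_rej, g_cho


rng = random.Random(0)
noise_rej, noise_cho = noise_pairs(10, 64, rng)                 # no real hack direction
signal_rej, signal_cho = noise_pairs(10, 64, rng, shared=1.0)   # shared hack-ward offset

print("noise  in-sample gap:", in_sample_gap(noise_rej, noise_cho))
print("noise  LOO gap:      ", loo_gap(noise_rej, noise_cho))
print("signal LOO gap:      ", loo_gap(signal_rej, signal_cho))
```

On the pure-noise pairs the in-sample gap comes out positive even though there is nothing to find, while the leave-one-pair-out gap should sit near zero. Only the pairs with a shared offset should keep a clear gap once each pair is scored against a `vec` it did not help build.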
@@ -128,9 +128,12 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
    `hack_anchor`/`clean_anchor` from teacher membership or the detector.
 2. **Rewrite `_route2_grad_filter`** (~line 877):
-   - drop the `hack_anchor |` force-route term -> `flagged = (cos_b > tau)`.
+   - drop the `hack_anchor |` force-route term -> gate is `cos_b > tau` only.
    - drop the EMA `ema_hack_cos`/`ema_clean_cos` detector calibration (~896-908).
    - `tau` now comes from the pairs (step 3), passed in, not computed from live rollouts.
+   - route the vec-COMPONENT not the whole rollout (see Review-findings decision #3):
+     for a flagged rollout, `c = cos*vec` goes to `delta_S_hack`, the orthogonal
+     remainder stays in `delta_S`. Removes `rollout_ablate_frac`.
 3. **Pair-calibrated tau, refreshed every `vhack_refresh_every` steps** (reuse the
    existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
    `c_rej = mean_p cos(g_rej[p], vec)`, `c_cho = mean_p cos(g_cho[p], vec)`,
@@ -179,3 +182,41 @@ deferred). `vec` sign = hack-ward = `rej - cho`.
 - UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
   real-`vec` -- else the vector is still decorative and the method is just gradient
   routing on labels.
+
+## Review findings (2026-06-06) -- decisions before implementing
+
+Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
+Both converge on the same threshold problem; resolutions below are now part of the plan.
+
+1. **tau is circular, not just scale-mismatched.** Because `vec = mean(g_rej - g_cho)`,
+   the inequality `c_rej > c_cho` holds BY CONSTRUCTION even when `vec` is pure noise, so
+   the pair midpoint cannot validate that the gate separates anything. Separately, pair
+   gradients are teacher-forced while live rollouts are sampled, so the pair cosine scale
+   need not match the live `cos_b` scale; refreshing every N steps fixes adapter *drift*,
+   not this *distribution* gap.
+   - Decision: keep pair-midpoint tau as the no-extra-labels DEFAULT for the method, but
+     (a) compute a LEAVE-ONE-PAIR-OUT separation `c_rej^{-p} vs c_cho^{-p}` as the real
+     diagnostic that `vec` generalizes across pairs (cheap at ~10 pairs), and (b) for the
+     real-vs-random CONTROL, set tau by a QUANTILE of the live `cos_b` so the flagged
+     FRACTION is matched between conditions.
+
+2. **Match the flagged fraction in the real-vs-random control (deepseek #2).** Real and
+   random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
+   could be volume, not direction. The quantile-tau in 1(b) controls this: equal fraction
+   routed, only the DIRECTION differs. Suppression gap at matched fraction => direction is
+   load-bearing.
+
+3. **Route the vec-COMPONENT, not the whole rollout (Claude).** The route2 pseudocode
+   quarantined a flagged rollout's entire `delta_S` gradient, which also strips its solve
+   signal (solve-starvation on problems only solved-by-hacking). Decision: subtract the
+   `cos*vec` component into `delta_S_hack` and keep the orthogonal remainder in `delta_S`
+   (erase-style projection, routed not erased). Drops the need for `rollout_ablate_frac`.
+
+4. **Degeneracy diagnostic (deepseek #3).** As routing suppresses hacks, the hack-pair
+   gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho`
+   per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot.
+
+5. **Pre-register the science UAT (deepseek).** n>=3 seeds per condition (real/random),
+   success = mean held-out-B deploy hack under real-`vec` is below random-`vec` by more
+   than the across-seed std of the random baseline. Qualitative "suppresses better" is
+   not enough.
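
Decisions 1(b), 2, and 3 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the `_route2_grad_filter` implementation: `quantile_tau` and `route_component` are hypothetical helpers, gradients are plain lists, and `vec` is assumed unit-norm. The routed part is taken to be the projection `(g_b . vec) * vec`, which is one reading of the plan's `cos*vec` shorthand; with it, the kept remainder is orthogonal to `vec` and the two parts sum back to `g_b`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def quantile_tau(cos_b, flag_frac):
    """Pick tau so that round(flag_frac * n) rollouts satisfy cos_b > tau (ties aside)."""
    n_flag = round(len(cos_b) * flag_frac)
    if n_flag <= 0:
        return math.inf           # flag nothing
    ranked = sorted(cos_b, reverse=True)
    if n_flag >= len(ranked):
        return -math.inf          # flag everything
    return ranked[n_flag]         # largest cosine that stays unflagged


def route_component(g_b, vec, tau):
    """Split one rollout gradient into (kept for delta_S, routed to delta_S_hack)."""
    if cos(g_b, vec) <= tau:
        return list(g_b), [0.0] * len(g_b)     # not flagged: nothing is routed
    proj = dot(g_b, vec)                        # vec is assumed unit-norm
    routed = [proj * v for v in vec]
    kept = [g - r for g, r in zip(g_b, routed)]
    return kept, routed


# Toy batch of four 2-d rollout gradients and two candidate directions.
grads = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
real_vec = [1.0, 0.0]
rand_vec = [0.0, 1.0]

for name, vec in (("real", real_vec), ("random", rand_vec)):
    cos_b = [cos(g, vec) for g in grads]
    tau = quantile_tau(cos_b, flag_frac=0.5)
    flagged = sum(c > tau for c in cos_b)
    print(name, "tau =", tau, "flagged =", flagged)
```

Because tau is set per condition from the live `cos_b` values, both directions flag the same number of rollouts, so any difference in suppression cannot be attributed to the volume of routed gradient.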
+5
-4
@@ -235,10 +235,11 @@ class Config:
     # so round(G*mix_ratio) >= 1 teacher.
     mix_ratio: float = 0.125
     # Teacher-off curriculum: seed hacks via the teacher pool for the first N
-    # optimizer steps, then cut to pure on-policy (G_t=0) for the rest. None = never
-    # cut. Guarantees all hacks emerge (teacher-seeded) before testing whether route2
-    # holds the suppression once the teacher crutch is gone. See step-loop use.
-    teacher_off_step: int | None = None
+    # optimizer steps, then cut to pure on-policy (G_t=0) for the rest. Default 30:
+    # the teacher is only a SEEDER (job 87 showed hacking self-sustains after the cut),
+    # so every arm runs pure on-policy past step 30, keeping deploy numbers apples-to-
+    # apples. None = never cut. See step-loop use.
+    teacher_off_step: int | None = 30
     # A5 no-cheat generalisation: restrict teacher demos (and thus the route2 tau
     # hack-anchor) to these env_modes only. Held-out modes stay in the training set
     # but train PURELY ON-POLICY (no teacher rows, never seed the hack-anchor) -- the
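
How a step loop might consume `teacher_off_step` can be sketched as below. The step-loop code is not part of this diff, so `teacher_rows` is a hypothetical helper, `G` is assumed to be the rollout group size, and steps are assumed to be 0-indexed so that steps 0 to N-1 are the teacher-seeded ones.

```python
from typing import Optional


def teacher_rows(step: int, G: int, mix_ratio: float = 0.125,
                 teacher_off_step: Optional[int] = 30) -> int:
    """Number of teacher rows G_t in a group of G rollouts at this optimizer step."""
    if teacher_off_step is not None and step >= teacher_off_step:
        return 0                       # pure on-policy after the cut
    return round(G * mix_ratio)        # teacher-seeded phase


# With G=8 and the default mix_ratio, one teacher row per group until the cut.
print([teacher_rows(s, G=8) for s in (0, 29, 30, 100)])
```

Passing `teacher_off_step=None` keeps the teacher rows for the whole run, matching the "None = never cut" note in the config comment.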