mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
spec: reject T5 mixed-policy design after external review
Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0); `frac_clipped`, not `ratio_mean`, is the saturation diagnostic; mixed-policy can produce gradient AWAY from hacking when the teacher-half has zero adv variance; and the probe_distill NLL normalizer is incomparable to train.py Dr.GRPO. User instruction reinforces: no mixed policy. Stay with hacky teacher + student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,41 @@
# T5/R5 external review: design rejected

Reviewer (an independent agent read of `20260525_distill_cosine_probe.md`)
identified four killer flaws in the mixed-policy GRPO trajectory probe:

1. **Behaviour-policy logp must match the generator** (teacher rows vs
student-zero rows). Computing the student's own logp on teacher rows gives
a ratio that pegs to the clip bounds from step 0; both arms look identical
and the projected arm "wins" trivially.
2. `ratio_mean` is the wrong stat; it can sit at 1.0 while p95/p5 saturate.
Required: `frac_clipped` per step, bail if >0.5 on any step.
3. **Mixed-policy may produce gradient AWAY from hacking.** Teacher-half
rows all hack (adv≈0 there); base-half has variance but pulls toward base
behaviour. Net signal can be "be more like base" → the projected vs vanilla
diff appears for the wrong reason.
4. The probe_distill.py NLL normalizer (`/mask.sum().clamp_min(1)`) is a
per-sample mean; train.py Dr.GRPO divides by the constant `G*max_new`. T5
results would be incomparable to a headline sweep.

Reviewer's recommended alternative (option 2): **skip T5 entirely, run
train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`)
to test the trajectory directly via the canonical loss. ~30 min/arm.
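To make flaws 1 and 2 concrete, here is a minimal pure-Python sketch of why `ratio_mean` can hide saturation that `frac_clipped` exposes. The function names, the `clip_eps=0.2` default, and the toy log-probs are illustrative assumptions; nothing here is taken from probe_distill.py or train.py. Only the `>0.5` bail threshold comes from the review.

```python
import math


def ratio_stats(logp_new, logp_behaviour, clip_eps=0.2):
    """Return (ratio_mean, frac_clipped) for per-token importance ratios.

    logp_behaviour should be the log-prob under whichever policy actually
    generated the row (flaw 1). clip_eps=0.2 is an assumed value.
    """
    ratios = [math.exp(n - b) for n, b in zip(logp_new, logp_behaviour)]
    lo, hi = 1.0 - clip_eps, 1.0 + clip_eps
    n_clipped = sum(1 for r in ratios if r < lo or r > hi)
    return sum(ratios) / len(ratios), n_clipped / len(ratios)


def should_bail(frac_clipped, threshold=0.5):
    """Bail rule from the review: stop if more than half the ratios clip."""
    return frac_clipped > threshold


# Toy data: half the ratios are 0.5 and half are 1.5, so the mean is ~1.0
# even though every single ratio lies outside [0.8, 1.2].
logp_behaviour = [-1.0, -1.0, -1.0, -1.0]
target_ratios = [0.5, 1.5, 0.5, 1.5]
logp_new = [b + math.log(r) for b, r in zip(logp_behaviour, target_ratios)]

ratio_mean, frac_clipped = ratio_stats(logp_new, logp_behaviour)
print(f"ratio_mean={ratio_mean:.3f} frac_clipped={frac_clipped:.2f}")
print("bail:", should_bail(frac_clipped))
```

In this toy case `ratio_mean` looks healthy while every ratio is clipped, which is the situation the per-step `frac_clipped` check is meant to catch.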

## Decision

Adopting the alternative. T5 deferred. New task T5b = train.py pilot.

This is the simpler path the user was pushing for ("the plan is: use
teacher to pregenerate, student trains, print cosine, start it"). The
existing distill probe (Phase 1, UAT 4/4) already answers the v_hack
quality question via NLL; the trajectory question is answered most cheaply
by running a small train.py and reading off its TSV.
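Reading the pilot TSV could be as simple as the sketch below, which scans for the first step that trips the reviewer's `frac_clipped > 0.5` rule. The column names `step` and `frac_clipped` are hypothetical: the note does not say which columns train.py writes, so adjust them to the real header.

```python
import csv
import io


def first_saturated_step(tsv_file, threshold=0.5):
    """Return the first step whose frac_clipped exceeds threshold, else None.

    Assumes (hypothetically) a tab-separated file with a header row that
    contains 'step' and 'frac_clipped' columns.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        if float(row["frac_clipped"]) > threshold:
            return int(row["step"])
    return None


# Stand-in for a real log file; with a file on disk, pass an open handle.
sample = io.StringIO("step\tfrac_clipped\n0\t0.10\n1\t0.62\n2\t0.30\n")
bad_step = first_saturated_step(sample)
print("first saturated step:", bad_step)
```

Running the same check on both arms of the pilot keeps the saturation diagnostic comparable between them.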

## Other points worth keeping

- R3 (frac_fired) and R2 (cos coverage) checks should carry over to any new
trajectory test.
- Per-sample cos_S_contrib could be re-checked over T5b's saved rows
(within hacked-sample buckets) if we add a brief replay analyzer.
- If T5b shows separation, the sweep is justified. If not, debug with the
probe_distill machinery still in place.
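If debugging does fall back to probe_distill, the normalizer mismatch from flaw 4 still applies when its numbers are compared with train.py's. The sketch below contrasts the two normalizers on toy data. It is a pure-Python illustration, not the code from either script; averaging the per-sample means across the group is an assumption.

```python
def per_sample_mean_loss(token_nll, mask):
    """Per-sample masked mean (sum / max(mask count, 1)), then averaged.

    Mirrors the `/mask.sum().clamp_min(1)` style normalizer; averaging
    across samples afterwards is an assumption of this sketch.
    """
    per_sample = []
    for nll_row, mask_row in zip(token_nll, mask):
        total = sum(n * m for n, m in zip(nll_row, mask_row))
        per_sample.append(total / max(sum(mask_row), 1))
    return sum(per_sample) / len(per_sample)


def constant_norm_loss(token_nll, mask, max_new):
    """Dr.GRPO-style normalizer: masked token sum divided by G * max_new."""
    group_size = len(token_nll)
    total = sum(
        n * m
        for nll_row, mask_row in zip(token_nll, mask)
        for n, m in zip(nll_row, mask_row)
    )
    return total / (group_size * max_new)


# Toy group: G=2, max_new=4. One short completion and one full-length one.
token_nll = [[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 0, 0, 0], [1, 1, 1, 1]]

mean_style = per_sample_mean_loss(token_nll, mask)
constant_style = constant_norm_loss(token_nll, mask, max_new=4)
print(mean_style, constant_style)
```

The per-sample mean gives the single token of the short completion as much weight as the whole long completion, while the constant normalizer weights every token equally, so the two losses sit on different scales and respond differently to completion length.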