mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 15:15:40 +08:00
195b55cc28
Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0), frac_clipped not ratio_mean is the saturation diagnostic, mixed-policy can produce gradient AWAY from hacking when teacher-half has zero adv variance, and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO. User instruction reinforces: no mixed policy. Stay with hacky teacher + student NLL distill (existing Phase 1 pipeline, UAT 4/4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
T5/R5 external review — design rejected
Reviewer (Agent independent read of 20260525_distill_cosine_probe.md)
identified four killer flaws in the mixed-policy GRPO trajectory probe:
- Behaviour-policy logp must match the generator (teacher rows vs student-zero rows). Computing student's own logp on teacher rows gives a ratio that pegs to clip bounds from step 0; both arms look identical and projected "wins" trivially.
- `ratio_mean` is the wrong stat; can sit at 1.0 while p95/p5 saturate. Required: `frac_clipped` per step, bail if >0.5 on any step.
- Mixed-policy may produce gradient AWAY from hacking. Teacher-half all hack (adv≈0 there); base-half has variance but pulls toward base behaviour. Net signal can be "be more like base" → projected vs vanilla diff appears for the wrong reason.
- probe_distill.py NLL normalizer (`/mask.sum().clamp_min(1)`) is per-sample-mean; train.py Dr.GRPO is `/(G*max_new)` constant. T5 results would be incomparable to a headline sweep.
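
The `ratio_mean` versus `frac_clipped` point can be illustrated with a minimal sketch. It assumes a PPO-style symmetric clip range `eps=0.2`, and the helper names `clip_stats` and `should_bail` are hypothetical (none of these are taken from train.py). Only the 0.5 bail threshold comes from the review.

```python
import math


def clip_stats(logp_new, logp_old, eps=0.2):
    """Return (ratio_mean, frac_clipped) for paired log-probs.

    eps is an assumed PPO-style clip range, not a value read from train.py.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    ratio_mean = sum(ratios) / len(ratios)
    # A ratio counts as clipped when it falls outside [1 - eps, 1 + eps].
    clipped = [r < 1.0 - eps or r > 1.0 + eps for r in ratios]
    frac_clipped = sum(clipped) / len(clipped)
    return ratio_mean, frac_clipped


def should_bail(frac_clipped_per_step, threshold=0.5):
    """True if any step has more than `threshold` of its ratios clipped."""
    return any(f > threshold for f in frac_clipped_per_step)


# Half the ratios sit at 0.5 and half at 1.5: the mean looks healthy
# while every single ratio is outside the clip range.
logp_old = [0.0, 0.0, 0.0, 0.0]
logp_new = [math.log(0.5), math.log(0.5), math.log(1.5), math.log(1.5)]
ratio_mean, frac_clipped = clip_stats(logp_new, logp_old)
print(ratio_mean, frac_clipped)
```

In this toy case the symmetric spread cancels in the mean, which is why a per-step clipped fraction is the more informative saturation signal.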
Reviewer's recommended alternative (option 2): skip T5 entirely, run
train.py at pilot scale (--steps=20 --group=6 --prompts_per_step=4)
to test trajectory directly via the canonical loss. ~30 min/arm.
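
Running the canonical loss matters for comparability because of the normalizer mismatch the reviewer noted. The sketch below contrasts the two normalizers on plain Python lists. The function names are hypothetical, and averaging the per-sample means across the batch is an assumption about probe_distill.py rather than something the review states.

```python
def per_sample_mean_nll(nll, mask):
    """Per-sample-mean normalizer: each row is divided by its own token count.

    max(..., 1) mirrors mask.sum().clamp_min(1); the outer average over
    rows is an assumption.
    """
    per_sample = []
    for row_nll, row_mask in zip(nll, mask):
        n_tok = max(sum(row_mask), 1)
        per_sample.append(sum(x * m for x, m in zip(row_nll, row_mask)) / n_tok)
    return sum(per_sample) / len(per_sample)


def constant_norm_nll(nll, mask, max_new):
    """Dr.GRPO-style normalizer: total masked NLL divided by G * max_new."""
    group_size = len(nll)
    total = sum(
        x * m
        for row_nll, row_mask in zip(nll, mask)
        for x, m in zip(row_nll, row_mask)
    )
    return total / (group_size * max_new)


# Two completions with max_new = 4: one stops after a single token,
# the other uses all four positions.
nll = [[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 0, 0, 0], [1, 1, 1, 1]]
print(per_sample_mean_nll(nll, mask), constant_norm_nll(nll, mask, max_new=4))
```

The per-sample mean weights the short completion as heavily as the long one, while the constant divisor weights every token equally, so the two loss values are on different scales whenever completion lengths vary.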
Decision
Adopting the alternative. T5 deferred. New task T5b = train.py pilot.
This is the simpler path the user was pushing for ("the plan is — use teacher to pregenerate, student trains, print cosine, start it"). The existing distill probe (Phase 1, UAT 4/4) already answers the v_hack quality question via NLL; the trajectory question is answered cheapest by running a small train.py and reading off its TSV.
Other points worth keeping
- R3 (frac_fired) and R2 (cos coverage) checks should be carried over to any new trajectory test.
- Per-sample cos_S_contrib could be re-checked over T5b's saved rows (within hacked-sample buckets) if we add a brief replay analyzer.
- If T5b shows separation, sweep is justified. If not, debug with the probe_distill machinery still in place.