evil_MoE/docs
wassname 195b55cc28 spec: reject T5 mixed-policy design after external review
Reviewer flagged 4 killer flaws:
(1) behaviour-policy logp mismatch on teacher rows (ratio pegs to
clip from step 0); (2) frac_clipped, not ratio_mean, is the
saturation diagnostic; (3) mixed-policy can produce gradient AWAY
from hacking when the teacher half has zero adv variance; and
(4) probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).
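
Illustration of the frac_clipped vs ratio_mean point, as a minimal sketch only. It assumes frac_clipped means the share of importance ratios outside [1 - eps, 1 + eps]; the helper name `clip_stats`, the eps value, and the ratios are hypothetical and not taken from train.py.

```python
def clip_stats(ratios, eps=0.2):
    """Return (ratio_mean, frac_clipped) for a list of importance ratios.

    Assumption: a ratio counts as clipped when it falls outside
    [1 - eps, 1 + eps]. This ignores the advantage sign, which a real
    PPO-style loss would also take into account.
    """
    n = len(ratios)
    ratio_mean = sum(ratios) / n
    frac_clipped = sum(1 for r in ratios if r < 1 - eps or r > 1 + eps) / n
    return ratio_mean, frac_clipped


# Hypothetical ratios: half far below 1, half far above 1.
ratios = [0.5, 1.5, 0.5, 1.5]
ratio_mean, frac_clipped = clip_stats(ratios)
print(ratio_mean, frac_clipped)
```

With these made-up numbers the mean sits at 1 even though every ratio is outside the clip range, which is why the mean alone can hide saturation.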

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:26:33 +00:00