evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

wassname 195b55cc28 spec: reject T5 mixed-policy design after external review

Reviewer flagged 4 killer flaws: (1) behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0); (2) frac_clipped, not
ratio_mean, is the saturation diagnostic; (3) mixed-policy can produce
a gradient AWAY from hacking when the teacher half has zero adv
variance; (4) the probe_distill NLL normalizer is not comparable to
the Dr.GRPO normalizer in train.py.
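
The point about frac_clipped versus ratio_mean can be illustrated with a minimal sketch. It assumes frac_clipped means the fraction of importance ratios falling outside [1 - eps, 1 + eps]; the function name `ratio_diagnostics`, the value eps = 0.2, and the sample ratios are hypothetical and not taken from train.py.

```python
from statistics import mean

def ratio_diagnostics(ratios, eps=0.2):
    """Return (ratio_mean, frac_clipped) for a list of importance ratios.

    Assumption: a ratio counts as clipped when it lies outside
    [1 - eps, 1 + eps]. The repo's own definition may differ.
    """
    lo, hi = 1.0 - eps, 1.0 + eps
    clipped = [r < lo or r > hi for r in ratios]
    return mean(ratios), sum(clipped) / len(ratios)

# Hypothetical batch: half the ratios are far below 1, half far above.
ratios = [0.5, 1.5] * 4
ratio_mean, frac_clipped = ratio_diagnostics(ratios)
print(ratio_mean, frac_clipped)
```

In this toy batch the mean ratio sits exactly at 1 even though every ratio is outside the clip range, which is why a mean alone can hide saturation.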

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 10:26:33 +00:00

brainstorm  ready                                                       2026-05-23 14:19:41 +08:00
papers      setup                                                       2026-05-23 10:40:02 +08:00
personas    fix smoke.                                                  2026-05-23 11:26:39 +08:00
spec        spec: reject T5 mixed-policy design after external review