evil_MoE/docs/spec/20260525_review_T5.md
wassname 195b55cc28 spec: reject T5 mixed-policy design after external review
Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0), frac_clipped not
ratio_mean is the saturation diagnostic, mixed-policy can produce
gradient AWAY from hacking when teacher-half has zero adv variance,
and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:26:33 +00:00

T5/R5 external review — design rejected

The reviewer (an agent doing an independent read of 20260525_distill_cosine_probe.md) identified four killer flaws in the mixed-policy GRPO trajectory probe:

  1. The behaviour-policy logp must match the generator (teacher rows vs student-zero rows). Computing the student's own logp on teacher rows gives a ratio that pegs to the clip bounds from step 0, so both arms look identical and "projected" wins trivially.
  2. ratio_mean is the wrong stat; it can sit at 1.0 while p95/p5 saturate. Required: log frac_clipped per step and bail if it exceeds 0.5 on any step.
  3. Mixed-policy may produce a gradient AWAY from hacking. The teacher half all hack (adv≈0 there); the base half has variance but pulls toward base behaviour. The net signal can be "be more like base", so a projected-vs-vanilla diff appears for the wrong reason.
  4. The probe_distill.py NLL normalizer (/mask.sum().clamp_min(1)) is a per-sample mean; the train.py Dr.GRPO normalizer is the constant /(G*max_new). T5 results would therefore be incomparable to a headline sweep.
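
Flaws 2 and 4 are easy to see on toy numbers. The sketch below is illustrative only: the clip width `eps=0.2`, the function names, and the averaging of per-sample means across the group are assumptions, not taken from train.py or probe_distill.py, and `frac_clipped` here simply counts ratios outside the clip band.

```python
def ratio_stats(ratios, eps=0.2):
    """Return (ratio_mean, frac_clipped) for a list of importance ratios.

    eps is an assumed PPO-style clip width; a ratio counts as clipped
    when it falls outside [1 - eps, 1 + eps].
    """
    lo, hi = 1.0 - eps, 1.0 + eps
    mean = sum(ratios) / len(ratios)
    clipped = sum(1 for r in ratios if r < lo or r > hi)
    return mean, clipped / len(ratios)


def per_sample_mean_loss(nll, mask):
    """Per-sample mean: each row is divided by its own token count
    (clamped to at least 1), then rows are averaged (assumed)."""
    rows = []
    for n_row, m_row in zip(nll, mask):
        total = sum(n * m for n, m in zip(n_row, m_row))
        rows.append(total / max(sum(m_row), 1))
    return sum(rows) / len(rows)


def constant_norm_loss(nll, mask, group, max_new):
    """Constant normalizer: all masked token losses divided by G * max_new."""
    total = sum(n * m for n_row, m_row in zip(nll, mask)
                for n, m in zip(n_row, m_row))
    return total / (group * max_new)


# Flaw 2: every ratio is outside [0.8, 1.2], yet the mean is exactly 1.0.
ratio_mean, frac_clipped = ratio_stats([0.5, 1.5, 0.5, 1.5])

# Flaw 4: one short sample (1 token) and one full-length sample (4 tokens).
nll = [[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 0, 0, 0], [1, 1, 1, 1]]
loss_probe = per_sample_mean_loss(nll, mask)                      # 1.5
loss_drgrpo = constant_norm_loss(nll, mask, group=2, max_new=4)   # 0.75
```

In the second example the single token of the short sample carries weight 1/2 of the per-sample-mean loss but only 1/8 of the constant-normalized loss, which is why the two numbers cannot be compared directly.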

Reviewer's recommended alternative (option 2): skip T5 entirely and run train.py at pilot scale (--steps=20 --group=6 --prompts_per_step=4) to test the trajectory directly via the canonical loss. Estimated cost: ~30 min/arm.

Decision

Adopting the alternative. T5 deferred. New task T5b = train.py pilot.

This is the simpler path the user was pushing for ("the plan is — use teacher to pregenerate, student trains, print cosine, start it"). The existing distill probe (Phase 1, UAT 4/4) already answers the v_hack quality question via NLL; the trajectory question is answered cheapest by running a small train.py and reading off its TSV.
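
If the pilot's TSV ends up logging a per-step clip fraction, the reviewer's bail rule (stop if frac_clipped > 0.5 on any step) could be checked with a few lines. This is a minimal sketch: the column names `step` and `frac_clipped` are hypothetical, since the actual TSV schema written by train.py is not given here, and the example rows are synthetic.

```python
import csv
import io


def first_saturated_step(tsv_text, threshold=0.5):
    """Return the first step whose frac_clipped exceeds threshold, else None.

    Assumes a tab-separated header with hypothetical columns
    'step' and 'frac_clipped'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if float(row["frac_clipped"]) > threshold:
            return int(row["step"])
    return None


# Synthetic log, not real output: step 1 exceeds the 0.5 bail threshold.
example_tsv = "step\tfrac_clipped\n0\t0.10\n1\t0.62\n2\t0.30\n"
bad_step = first_saturated_step(example_tsv)
clean_run = first_saturated_step("step\tfrac_clipped\n0\t0.10\n1\t0.20\n")
```

Here `bad_step` is 1 and `clean_run` is `None`; a real run would pass the contents of the pilot's TSV file instead of the inline string.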

Other points worth keeping

  • The R3 (frac_fired) and R2 (cos coverage) checks should be mirrored in any new trajectory test.
  • Per-sample cos_S_contrib could be re-checked over T5b's saved rows (within hacked-sample buckets) if we add a brief replay analyzer.
  • If T5b shows separation, the sweep is justified. If not, debug with the probe_distill machinery, which is still in place.
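
One possible shape for the brief replay analyzer mentioned above, sketched under heavy assumptions: each saved row is taken to carry a `hacked` flag and a per-sample contribution vector `contrib`, and cos_S_contrib is read as the cosine between that vector and v_hack. Both the row layout and that reading are hypothetical; the vectors below are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_cos_by_bucket(rows, v_hack):
    """Average cosine(contrib, v_hack) separately for hacked and clean rows."""
    buckets = {}
    for row in rows:
        key = "hacked" if row["hacked"] else "clean"
        buckets.setdefault(key, []).append(cosine(row["contrib"], v_hack))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Toy data only: two hacked rows (cos 1.0 and 0.0) and one clean row (cos -1.0).
v_hack = [1.0, 0.0]
rows = [
    {"hacked": True, "contrib": [2.0, 0.0]},
    {"hacked": True, "contrib": [0.0, 3.0]},
    {"hacked": False, "contrib": [-1.0, 0.0]},
]
bucket_means = mean_cos_by_bucket(rows, v_hack)
```

With these toy rows, `bucket_means` is `{"hacked": 0.5, "clean": -1.0}`; the within-bucket spread is what the re-check over T5b's saved rows would look at.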