# T5/R5 external review: design rejected

Reviewer (Agent independent read of `20260525_distill_cosine_probe.md`) identified four killer flaws in the mixed-policy GRPO trajectory probe:

1. **Behaviour-policy logp must match the generator** (teacher rows vs student-zero rows). Computing student's own logp on teacher rows gives a ratio that pegs to clip bounds from step 0; both arms look identical and projected "wins" trivially.
2. `ratio_mean` is the wrong stat; can sit at 1.0 while p95/p5 saturate. Required: `frac_clipped` per step, bail if >0.5 on any step.
3. **Mixed-policy may produce gradient AWAY from hacking.** Teacher-half all hack (adv≈0 there); base-half has variance but pulls toward base behaviour. Net signal can be "be more like base" → projected vs vanilla diff appears for the wrong reason.
4. `probe_distill.py` NLL normalizer (`/mask.sum().clamp_min(1)`) is per-sample-mean; `train.py` Dr.GRPO is `/(G*max_new)` constant. T5 results would be incomparable to a headline sweep.

Reviewer's recommended alternative (option 2): **skip T5 entirely, run train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`) to test trajectory directly via the canonical loss. ~30 min/arm.

## Decision

Adopting the alternative. T5 deferred. New task T5b = `train.py` pilot. This is the simpler path the user was pushing for ("the plan is: use teacher to pregenerate, student trains, print cosine, start it"). The existing distill probe (Phase 1, UAT 4/4) already answers the v_hack quality question via NLL; the trajectory question is answered cheapest by running a small `train.py` and reading off its TSV.

## Other points worth keeping

- R3 (frac_fired) and R2 (cos coverage) checks should mirror to any new trajectory test.
- Per-sample cos_S_contrib could be re-checked over T5b's saved rows (within hacked-sample buckets) if we add a brief replay analyzer.
- If T5b shows separation, sweep is justified. If not, debug with the `probe_distill` machinery still in place.
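## Sketch: reviewer points 2 and 4

A minimal pure-Python sketch of the two quantitative points above, not the actual `train.py` or `probe_distill.py` code. It illustrates why `ratio_mean` can read 1.0 while every token is clipped (point 2) and why the two NLL normalizers give different numbers on the same rows (point 4). Assumptions: the function names (`frac_clipped`, `should_bail`, `per_sample_mean_nll`, `constant_norm_nll`) are hypothetical, the clip width `eps=0.2` is a placeholder rather than the value used in the repo, and the `mask.sum()` in the probe normalizer is taken per sample, as the "per-sample-mean" description suggests.

```python
import math


def frac_clipped(logp_new, logp_behaviour, eps=0.2):
    """Fraction of tokens whose ratio exp(logp_new - logp_behaviour)
    falls outside [1 - eps, 1 + eps]. eps=0.2 is an assumed default."""
    ratios = [math.exp(n - b) for n, b in zip(logp_new, logp_behaviour)]
    if not ratios:
        return 0.0
    clipped = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return clipped / len(ratios)


def should_bail(frac_per_step, threshold=0.5):
    """Point 2: bail if frac_clipped exceeds the threshold on any step."""
    return any(f > threshold for f in frac_per_step)


def per_sample_mean_nll(token_nll, mask):
    """Probe-style normalizer (assumed per sample): each row is divided
    by its own mask count, clamped to at least 1, then rows are averaged."""
    per_sample = []
    for nll_row, mask_row in zip(token_nll, mask):
        total = sum(n * m for n, m in zip(nll_row, mask_row))
        per_sample.append(total / max(sum(mask_row), 1))
    return sum(per_sample) / len(per_sample)


def constant_norm_nll(token_nll, mask, max_new):
    """Dr.GRPO-style normalizer: masked sum divided by the constant
    G * max_new, where G is the number of rows in the group."""
    group_size = len(token_nll)
    total = sum(
        n * m
        for nll_row, mask_row in zip(token_nll, mask)
        for n, m in zip(nll_row, mask_row)
    )
    return total / (group_size * max_new)


# Ratios alternate between 0.5 and 1.5: their mean is about 1.0,
# yet every token lies outside [0.8, 1.2].
logp_new = [math.log(0.5), math.log(1.5), math.log(0.5), math.log(1.5)]
logp_behaviour = [0.0, 0.0, 0.0, 0.0]
ratio_mean = sum(
    math.exp(n - b) for n, b in zip(logp_new, logp_behaviour)
) / len(logp_new)
clip_frac = frac_clipped(logp_new, logp_behaviour)

# Two rows of different length under the two normalizers.
token_nll = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 1, 0, 0], [1, 1, 1, 1]]
probe_loss = per_sample_mean_nll(token_nll, mask)          # 1.0
train_loss = constant_norm_nll(token_nll, mask, max_new=4)  # 0.75
```

In this toy case `ratio_mean` is about 1.0 while `clip_frac` is 1.0, so `should_bail([clip_frac])` fires. The same rows score 1.0 under the per-sample mean and 0.75 under the constant normalizer, which is the incomparability the reviewer flagged.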