# T5/R5 external review — design rejected

The reviewer (an agent's independent read of `20260525_distill_cosine_probe.md`)
identified four killer flaws in the mixed-policy GRPO trajectory probe:

1. **Behaviour-policy logp must match the generator** (teacher rows vs
   student-zero rows). Computing the student's own logp on teacher rows
   gives a ratio that pegs to the clip bounds from step 0; both arms then
   look identical and projected "wins" trivially.
2. `ratio_mean` is the wrong stat; it can sit at 1.0 while the p95/p5
   ratios saturate. Required: `frac_clipped` per step, bail if >0.5 on
   any step.
3. **Mixed-policy may produce gradient AWAY from hacking.** Teacher-half
   all hack (adv≈0 there); base-half has variance but pulls toward base
   behaviour. Net signal can be "be more like base" → projected vs vanilla
   diff appears for the wrong reason.
4. probe_distill.py NLL normalizer (`/mask.sum().clamp_min(1)`) is
   per-sample-mean; train.py Dr.GRPO is `/(G*max_new)` constant. T5
   results would be incomparable to a headline sweep.
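
A minimal sketch of the check from flaw 2, assuming the ratio is the usual `exp(logp_new - logp_behaviour)` per token and a symmetric clip range. The function names and `eps=0.2` are illustrative assumptions, not taken from `probe_distill.py` or `train.py`; only `frac_clipped`, `ratio_mean` and the 0.5 threshold come from the review.

```python
import math

def clip_stats(logp_new, logp_behaviour, eps=0.2):
    """Return (ratio_mean, frac_clipped) for one step's token log-probs.

    eps is an assumed clip half-width; substitute the real setting.
    """
    ratios = [math.exp(n - b) for n, b in zip(logp_new, logp_behaviour)]
    ratio_mean = sum(ratios) / len(ratios)
    n_clipped = sum(1 for r in ratios if r < 1 - eps or r > 1 + eps)
    return ratio_mean, n_clipped / len(ratios)

def should_bail(frac_clipped_per_step, threshold=0.5):
    """True if any step has more than `threshold` of its ratios clipped."""
    return any(f > threshold for f in frac_clipped_per_step)

# Toy step: half the ratios at 0.5, half at 1.5.
logp_behaviour = [0.0, 0.0, 0.0, 0.0]
logp_new = [math.log(0.5), math.log(0.5), math.log(1.5), math.log(1.5)]
ratio_mean, frac_clipped = clip_stats(logp_new, logp_behaviour)
```

In this toy step the mean ratio is 1.0 even though every ratio lies outside the clip range, which is the failure mode the reviewer describes: the mean hides saturation that `frac_clipped` exposes.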

Reviewer's recommended alternative (option 2): **skip T5 entirely, run
train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`)
to test trajectory directly via the canonical loss. ~30 min/arm.
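
The normalizer mismatch in flaw 4 is one reason the canonical loss matters here. The sketch below contrasts the two normalizers on a single sample, under the assumption that each takes a per-token loss and a 0/1 mask. The function names are hypothetical, and neither function is copied from `probe_distill.py` or `train.py`.

```python
def per_sample_mean(token_loss, mask):
    """Per-sample mean: divide by this sample's own mask count (min 1)."""
    total = sum(l * m for l, m in zip(token_loss, mask))
    return total / max(sum(mask), 1)

def constant_norm(token_loss, mask, group, max_new):
    """Dr.GRPO-style constant: divide by G * max_new regardless of length."""
    total = sum(l * m for l, m in zip(token_loss, mask))
    return total / (group * max_new)

# Two samples with identical per-token loss but different lengths.
G, MAX_NEW = 2, 8
loss = [1.0] * MAX_NEW
short_mask = [1, 1, 0, 0, 0, 0, 0, 0]   # 2 generated tokens
long_mask = [1] * MAX_NEW               # 8 generated tokens

mean_short = per_sample_mean(loss, short_mask)
mean_long = per_sample_mean(loss, long_mask)
const_short = constant_norm(loss, short_mask, G, MAX_NEW)
const_long = constant_norm(loss, long_mask, G, MAX_NEW)
```

Under the per-sample mean both samples contribute 1.0; under the constant normalizer the short one contributes 0.125 and the long one 0.5. The two losses therefore weight samples differently by length, so magnitudes and directions from one are not directly comparable to the other.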

## Decision

Adopting the alternative. T5 deferred. New task T5b = train.py pilot.

This is the simpler path the user was pushing for ("the plan is — use
teacher to pregenerate, student trains, print cosine, start it"). The
existing distill probe (Phase 1, UAT 4/4) already answers the v_hack
quality question via NLL; the trajectory question is answered most cheaply
by running a small train.py and reading off its TSV.

## Other points worth keeping

- R3 (frac_fired) and R2 (cos coverage) checks should be mirrored in any
  new trajectory test.
- Per-sample cos_S_contrib could be re-checked over T5b's saved rows
  (within hacked-sample buckets) if we add a brief replay analyzer.
- If T5b shows separation, the sweep is justified. If not, debug with the
  probe_distill machinery, which stays in place.