spec: reject T5 mixed-policy design after external review

Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0); frac_clipped, not
ratio_mean, is the saturation diagnostic; mixed-policy can produce
gradient AWAY from hacking when the teacher half has zero adv
variance; and the probe_distill NLL normalizer is incomparable to
train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
wassname
2026-05-25 10:26:33 +00:00
parent 2a21fbc49c
commit 195b55cc28
+41
@@ -0,0 +1,41 @@
# T5/R5 external review — design rejected
The reviewer (an agent's independent read of `20260525_distill_cosine_probe.md`)
identified four killer flaws in the mixed-policy GRPO trajectory probe:
1. **Behaviour-policy logp must match the generator** (teacher rows vs
student-zero rows). Computing the student's own logp on teacher rows gives
a ratio that pegs to the clip bounds from step 0; both arms then look
identical and the projected arm "wins" trivially.
2. `ratio_mean` is the wrong stat; it can sit at 1.0 while p95/p5 saturate.
Required: `frac_clipped` per step, bail if it exceeds 0.5 on any step.
3. **Mixed-policy may produce gradient AWAY from hacking.** Teacher-half
all hack (adv≈0 there); base-half has variance but pulls toward base
behaviour. Net signal can be "be more like base" → projected vs vanilla
diff appears for the wrong reason.
4. probe_distill.py NLL normalizer (`/mask.sum().clamp_min(1)`) is
per-sample-mean; train.py Dr.GRPO is `/(G*max_new)` constant. T5
results would be incomparable to a headline sweep.
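
A minimal sketch of point 2, assuming per-token log-probs are available as plain lists and a PPO-style clip width of 0.2. The helper name `ratio_stats`, the `clip_eps` default, and the definition of "clipped" as "ratio outside `[1 - clip_eps, 1 + clip_eps]`" are all assumptions for illustration, not taken from train.py. It shows how `ratio_mean` can sit at 1.0 while every ratio is outside the clip band, which is why `frac_clipped` is the number to gate on.

```python
import math

def ratio_stats(logp_new, logp_old, clip_eps=0.2):
    """Return (ratio_mean, frac_clipped) for one step.

    Hypothetical helper: clip_eps=0.2 is an assumed clip width, and
    "clipped" here means the ratio falls outside [1-eps, 1+eps].
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    ratio_mean = sum(ratios) / len(ratios)
    outside = [r < 1.0 - clip_eps or r > 1.0 + clip_eps for r in ratios]
    frac_clipped = sum(outside) / len(ratios)
    return ratio_mean, frac_clipped

# Toy step: half the tokens at ratio 0.5, half at ratio 1.5.
logp_old = [0.0, 0.0, 0.0, 0.0]
logp_new = [math.log(0.5), math.log(1.5), math.log(0.5), math.log(1.5)]
ratio_mean, frac_clipped = ratio_stats(logp_new, logp_old)

MAX_FRAC_CLIPPED = 0.5  # bail threshold named in the review
should_bail = frac_clipped > MAX_FRAC_CLIPPED
```

In this toy step the mean ratio is about 1.0 even though every token is outside the clip band, so a check on `ratio_mean` alone would pass while the `frac_clipped` gate trips.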
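
To illustrate point 4, here is a rough sketch of the two normalizers on the same toy batch. The function names, the list-of-lists layout, and the final averaging over rows in the per-sample variant are assumptions; only the two denominators (`mask.sum().clamp_min(1)` and `G*max_new`) come from the review.

```python
def nll_per_sample_mean(token_nll, mask):
    """Per-sample mean: each row is divided by its own token count
    (floored at 1, mirroring clamp_min(1)), then rows are averaged.
    Averaging over rows is an assumption of this sketch."""
    per_row = []
    for nll_row, mask_row in zip(token_nll, mask):
        total = sum(n * m for n, m in zip(nll_row, mask_row))
        per_row.append(total / max(sum(mask_row), 1))
    return sum(per_row) / len(per_row)

def nll_constant_norm(token_nll, mask, max_new):
    """Constant normalizer: the masked sum over the whole group is
    divided by G * max_new, regardless of actual completion lengths."""
    group_size = len(token_nll)
    total = sum(
        n * m
        for nll_row, mask_row in zip(token_nll, mask)
        for n, m in zip(nll_row, mask_row)
    )
    return total / (group_size * max_new)

# Two completions with identical per-token NLL but different lengths.
token_nll = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
mask = [[1, 1, 0, 0], [1, 1, 1, 1]]

loss_mean = nll_per_sample_mean(token_nll, mask)            # 1.0
loss_const = nll_constant_norm(token_nll, mask, max_new=4)  # 0.75
```

The same rows give different loss values under the two schemes, and the gap depends on completion length, so numbers from one cannot be read against the other without re-normalizing.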
Reviewer's recommended alternative (option 2): **skip T5 entirely, run
train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`)
to test trajectory directly via the canonical loss. ~30 min/arm.
## Decision
Adopting the alternative. T5 deferred. New task T5b = train.py pilot.
This is the simpler path the user was pushing for ("the plan is — use
teacher to pregenerate, student trains, print cosine, start it"). The
existing distill probe (Phase 1, UAT 4/4) already answers the v_hack
quality question via NLL; the trajectory question is answered cheapest
by running a small train.py and reading off its TSV.
## Other points worth keeping
- The R3 (frac_fired) and R2 (cos coverage) checks should carry over to any
new trajectory test.
- Per-sample cos_S_contrib could be re-checked over T5b's saved rows
(within hacked-sample buckets) if we add a brief replay analyzer.
- If T5b shows separation, the sweep is justified. If not, debug with the
probe_distill machinery, which is still in place.