spec: reject T5 mixed-policy design after external review

Reviewer flagged 4 killer flaws:

- behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0)
- frac_clipped, not ratio_mean, is the saturation diagnostic
- mixed-policy can produce gradient AWAY from hacking when teacher-half has zero adv variance
- probe_distill NLL normalizer is incomparable to train.py Dr.GRPO

User instruction reinforces: no mixed policy. Stay with hacky teacher + student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 18:04:59 +08:00 · 2026-05-25 10:26:33 +00:00
parent 2a21fbc49c
commit 195b55cc28
1 changed file with 41 additions and 0 deletions
@@ -0,0 +1,41 @@
+# T5/R5 external review — design rejected
+
+Reviewer (an agent's independent read of `20260525_distill_cosine_probe.md`)
+identified four killer flaws in the mixed-policy GRPO trajectory probe:
+
+1. **Behaviour-policy logp must match the generator** (teacher rows vs
+   student-zero rows). Computing the student's own logp on teacher rows
+   gives a ratio that pegs to the clip bounds from step 0; both arms look
+   identical and the projected arm "wins" trivially.
+2. `ratio_mean` is the wrong stat: it can sit at 1.0 while p95/p5 saturate.
+   Required: log `frac_clipped` per step and bail if it exceeds 0.5 on any step.
+3. **Mixed-policy may produce gradient AWAY from hacking.** The teacher
+   half all hacks, so adv≈0 there; the base half has variance but pulls
+   toward base behaviour. The net signal can be "be more like base", so a
+   projected-vs-vanilla diff appears for the wrong reason.
+4. The `probe_distill.py` NLL normalizer (`/mask.sum().clamp_min(1)`) is a
+   per-sample mean; `train.py` Dr.GRPO uses the constant `/(G*max_new)`.
+   T5 results would be incomparable to a headline sweep.
+
+Reviewer's recommended alternative (option 2): **skip T5 entirely, run
+train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`)
+to test trajectory directly via the canonical loss. ~30 min/arm.
+
+## Decision
+
+Adopting the alternative. T5 deferred. New task T5b = train.py pilot.
+
+This is the simpler path the user was pushing for ("the plan is — use
+teacher to pregenerate, student trains, print cosine, start it"). The
+existing distill probe (Phase 1, UAT 4/4) already answers the v_hack
+quality question via NLL; the trajectory question is answered cheapest
+by running a small train.py and reading off its TSV.
+
+## Other points worth keeping
+
+- R3 (frac_fired) and R2 (cos coverage) checks should carry over to any
+  new trajectory test.
+- Per-sample cos_S_contrib could be re-checked over T5b's saved rows
+  (within hacked-sample buckets) if we add a brief replay analyzer.
+- If T5b shows separation, the sweep is justified. If not, debug with the
+  probe_distill machinery still in place.
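
Flaw 2 in the review is easy to reproduce in isolation. The sketch below is not taken from `probe_distill.py` or `train.py`. It assumes a PPO-style importance ratio `exp(logp_new - logp_old)` with a symmetric clip width `eps=0.2` (an assumed value), and the helper names `clip_stats` and `should_bail` are hypothetical. It shows a batch where `ratio_mean` sits at roughly 1.0 while every token is outside the clip range, and applies the "bail if >0.5 on any step" rule from the review.

```python
import math


def clip_stats(logp_new, logp_old, eps=0.2):
    """Return (ratio_mean, frac_clipped) for paired per-token log-probs.

    Assumes a PPO-style ratio exp(logp_new - logp_old) and a symmetric
    clip range [1 - eps, 1 + eps]; eps=0.2 is an illustrative default.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    if not ratios:
        # No tokens: report a neutral ratio and nothing clipped.
        return 1.0, 0.0
    clipped = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return sum(ratios) / len(ratios), clipped / len(ratios)


def should_bail(frac_clipped_per_step, threshold=0.5):
    """True if any step's clipped fraction exceeds the threshold."""
    return any(f > threshold for f in frac_clipped_per_step)


# Half the tokens have ratio 0.5, half have ratio 1.5: the mean hides
# the fact that every single token lies outside the clip range.
logp_old = [0.0, 0.0, 0.0, 0.0]
logp_new = [math.log(0.5), math.log(1.5), math.log(0.5), math.log(1.5)]
ratio_mean, frac_clipped = clip_stats(logp_new, logp_old)
print(f"ratio_mean={ratio_mean:.3f} frac_clipped={frac_clipped:.3f}")
print("bail:", should_bail([0.1, frac_clipped]))
```

A per-step `frac_clipped` column in the T5b TSV would make the same check available after the fact, assuming the pilot logs it.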