From 195b55cc2899e4f277de71deac1f0f7f427ca373 Mon Sep 17 00:00:00 2001
From: wassname
Date: Mon, 25 May 2026 10:26:33 +0000
Subject: [PATCH] spec: reject T5 mixed-policy design after external review

Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0), frac_clipped not
ratio_mean is the saturation diagnostic, mixed-policy can produce
gradient AWAY from hacking when teacher-half has zero adv variance,
and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher +
student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7
---
 docs/spec/20260525_review_T5.md | 41 +++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 docs/spec/20260525_review_T5.md

diff --git a/docs/spec/20260525_review_T5.md b/docs/spec/20260525_review_T5.md
new file mode 100644
index 0000000..9fef2d7
--- /dev/null
+++ b/docs/spec/20260525_review_T5.md
@@ -0,0 +1,41 @@
+# T5/R5 external review — design rejected
+
+Reviewer (Agent independent read of `20260525_distill_cosine_probe.md`)
+identified four killer flaws in the mixed-policy GRPO trajectory probe:
+
+1. **Behaviour-policy logp must match the generator** (teacher rows vs
+   student-zero rows). Computing student's own logp on teacher rows gives
+   a ratio that pegs to clip bounds from step 0; both arms look identical
+   and projected "wins" trivially.
+2. `ratio_mean` is the wrong stat; can sit at 1.0 while p95/p5 saturate.
+   Required: `frac_clipped` per step, bail if >0.5 on any step.
+3. **Mixed-policy may produce gradient AWAY from hacking.** Teacher-half
+   all hack (adv≈0 there); base-half has variance but pulls toward base
+   behaviour. Net signal can be "be more like base" → projected vs vanilla
+   diff appears for the wrong reason.
+4. probe_distill.py NLL normalizer (`/mask.sum().clamp_min(1)`) is
+   per-sample-mean; train.py Dr.GRPO is `/(G*max_new)` constant. T5
+   results would be incomparable to a headline sweep.
+
+Reviewer's recommended alternative (option 2): **skip T5 entirely, run
+train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`)
+to test trajectory directly via the canonical loss. ~30 min/arm.
+
+## Decision
+
+Adopting the alternative. T5 deferred. New task T5b = train.py pilot.
+
+This is the simpler path the user was pushing for ("the plan is — use
+teacher to pregenerate, student trains, print cosine, start it"). The
+existing distill probe (Phase 1, UAT 4/4) already answers the v_hack
+quality question via NLL; the trajectory question is answered cheapest
+by running a small train.py and reading off its TSV.
+
+## Other points worth keeping
+
+- R3 (frac_fired) and R2 (cos coverage) checks should mirror to any new
+  trajectory test.
+- Per-sample cos_S_contrib could be re-checked over T5b's saved rows
+  (within hacked-sample buckets) if we add a brief replay analyzer.
+- If T5b shows separation, sweep is justified. If not, debug with the
+  probe_distill machinery still in place.
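
Note on review points 2 and 4: the sketch below illustrates both with toy numbers and is not the code in `train.py` or `probe_distill.py`. The function names (`clip_stats`, `should_bail`, `per_sample_mean_nll`, `constant_norm_nll`), the clip range `eps=0.2`, and the choice to average the per-sample means across the group are all assumptions made for the example. Only the two normalizer expressions and the `>0.5` bail threshold come from the review text. `frac_clipped` here simply counts ratios outside `[1-eps, 1+eps]`.

```python
import math


def clip_stats(logp_new, logp_old, eps=0.2):
    """Return (ratio_mean, frac_clipped) over one step's tokens.

    eps=0.2 is an assumed clip range, not a value taken from train.py.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    ratio_mean = sum(ratios) / len(ratios)
    n_clipped = sum(1 for r in ratios if r < 1.0 - eps or r > 1.0 + eps)
    return ratio_mean, n_clipped / len(ratios)


def should_bail(frac_clipped_per_step, threshold=0.5):
    """True if any step has more than `threshold` of its ratios clipped."""
    return any(f > threshold for f in frac_clipped_per_step)


def per_sample_mean_nll(token_nll, mask):
    """Masked sum divided by that sample's own token count (at least 1)."""
    total = sum(t * m for t, m in zip(token_nll, mask))
    return total / max(sum(mask), 1)


def constant_norm_nll(group_nll, group_mask, max_new):
    """Masked sum over the whole group divided by the constant G * max_new."""
    total = sum(
        t * m
        for nll, mask in zip(group_nll, group_mask)
        for t, m in zip(nll, mask)
    )
    return total / (len(group_nll) * max_new)


# Point 2: half the ratios at 0.5 and half at 1.5. The mean sits at 1.0
# even though every single ratio falls outside the clip range.
logp_old = [0.0, 0.0, 0.0, 0.0]
logp_new = [math.log(0.5), math.log(1.5), math.log(0.5), math.log(1.5)]
ratio_mean, frac_clipped = clip_stats(logp_new, logp_old)

# Point 4: one short and one full-length completion, G=2, max_new=4.
group_nll = [[2.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
group_mask = [[1, 0, 0, 0], [1, 1, 1, 1]]
per_sample = [per_sample_mean_nll(n, m) for n, m in zip(group_nll, group_mask)]
mean_of_means = sum(per_sample) / len(per_sample)
constant = constant_norm_nll(group_nll, group_mask, max_new=4)
```

In this toy case `ratio_mean` stays near 1.0 while `frac_clipped` is 1.0, so a check on the mean alone would miss the saturation. The two normalizers also give different values for the same token losses, because the per-sample mean weights the single token of the short completion as heavily as all four tokens of the long one.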