From 195b55cc2899e4f277de71deac1f0f7f427ca373 Mon Sep 17 00:00:00 2001
From: wassname <github@wassname>
Date: Mon, 25 May 2026 10:26:33 +0000
Subject: [PATCH] spec: reject T5 mixed-policy design after external review

Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0); frac_clipped, not
ratio_mean, is the saturation diagnostic; mixed-policy can produce
gradient AWAY from hacking when teacher-half has zero adv variance;
and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 docs/spec/20260525_review_T5.md | 41 +++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 docs/spec/20260525_review_T5.md

diff --git a/docs/spec/20260525_review_T5.md b/docs/spec/20260525_review_T5.md
new file mode 100644
index 0000000..9fef2d7
--- /dev/null
+++ b/docs/spec/20260525_review_T5.md
@@ -0,0 +1,41 @@
+# T5/R5 external review — design rejected
+
+Reviewer (an agent's independent read of `20260525_distill_cosine_probe.md`)
+identified four killer flaws in the mixed-policy GRPO trajectory probe:
+
+1. **Behaviour-policy logp must match the generator** (teacher rows vs
+   student-zero rows). Computing the student's own logp on teacher rows
+   gives a ratio that pegs to the clip bounds from step 0; both arms look
+   identical and the projected arm "wins" trivially.
+2. `ratio_mean` is the wrong stat: it can sit at 1.0 while p95/p5 saturate.
+   Required: log `frac_clipped` per step and bail if it exceeds 0.5 on any step.
+3. **Mixed-policy may produce gradient AWAY from hacking.** The teacher
+   half all hacks (adv≈0 there); the base half has variance but pulls
+   toward base behaviour. The net signal can be "be more like base", so a
+   projected-vs-vanilla diff appears for the wrong reason.
+4. The probe_distill.py NLL normalizer (`/mask.sum().clamp_min(1)`) is a
+   per-sample mean; train.py's Dr.GRPO normalizer is the constant
+   `/(G*max_new)`. T5 results would be incomparable to a headline sweep.
+
+Reviewer's recommended alternative (option 2): **skip T5 entirely and run
+train.py at pilot scale** (`--steps=20 --group=6 --prompts_per_step=4`)
+to test the trajectory directly via the canonical loss, at ~30 min/arm.
+
+## Decision
+
+Adopting the alternative. T5 deferred. New task T5b = train.py pilot.
+
+This is the simpler path the user was pushing for ("the plan is — use
+teacher to pregenerate, student trains, print cosine, start it"). The
+existing distill probe (Phase 1, UAT 4/4) already answers the v_hack
+quality question via NLL; the trajectory question is answered cheapest
+by running a small train.py and reading off its TSV.
+
+## Other points worth keeping
+
+- R3 (frac_fired) and R2 (cos coverage) checks should be mirrored in any
+  new trajectory test.
+- Per-sample cos_S_contrib could be re-checked over T5b's saved rows
+  (within hacked-sample buckets) if we add a brief replay analyzer.
+- If T5b shows separation, the sweep is justified. If not, debug with the
+  probe_distill machinery still in place.