# spec2: Phase 2 mixed-replay GRPO probe + Phase 3 expensive sweep plan

## Goal

Before committing the $400 / ~65h headline sweep (Phase 3), use cheap replay-based probes (~1h total) to establish:

- Whether v_hack is aligned with the **GRPO** policy gradient (not just NLL) on a mixed hack/non-hack batch.
- Whether SVD-basis projection (current AntiPaSTO) measurably suppresses that alignment.
- Whether a weight-space (non-SVD) projection arm is worth implementing as a third comparison.

## Phase 1 result (recap, evidence in `out/probe_distill/`)

NLL distillation probe done. UAT 4/4 pass. Headline: within rh-s65's teacher pool, `cos(NLL_grad, v_hack)` is **+0.747** on pure-hack samples vs **+0.398** on hack+correct samples (t=+4.46, p<1e-4 on 160 samples). Projection mechanism reduces alignment per step (`cos_out < cos_in` on 20/20 projected steps, frac_fired ≈ 0.65).

**Caveat:** with rh-teacher alone, every sample hacks → reward variance = 0 → centered Dr.GRPO advantage = 0 → cannot directly measure GRPO-grad cosine. Phase 2 fixes this via mixed-replay.

## Phase 2: mixed-replay GRPO probe

### Inputs (already generated, ~7 min wall total)

- `out/probe_distill/teacher_pool/step_{000..019}.jsonl.gz`: rh-s65, hint applied, 20 batches × 8 = 160 samples, ~99% hack.
- `out/probe_distill/base_pool/step_{000..019}.jsonl.gz`: base Qwen3-4B, no LoRA, no hint, 20 batches × 8 = 160 samples, ~0% hack.

### Mechanism

`probe_distill.py --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` per step: 4 samples from each pool → G=8 group with **real reward variance** (some r≈3.5, some r≈0.25). Dr.GRPO centered advantage is non-zero. Per-sample loss: `-adv_i * (logp_i * mask_i).sum() / mask_i.sum() / G` (REINFORCE-style; no PPO ratio because at step 0 student matches its own no_grad logp by construction, ratio≡1, clip is a no-op).
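
The centered advantage and per-sample loss described above can be sketched in plain Python. This is a minimal illustration, not the code in `probe_distill.py`: the function names, the toy log-probs and mask, and the exact reward values are assumptions, and it assumes the centered advantage is the reward minus the group mean with no std normalization.

```python
from statistics import fmean


def centered_advantages(rewards):
    """Reward minus group mean (assumed form of the centered advantage)."""
    mean_r = fmean(rewards)
    return [r - mean_r for r in rewards]


def per_sample_losses(advs, logps, masks):
    """-adv_i * sum(logp_i * mask_i) / sum(mask_i) / G for each sample i."""
    G = len(advs)
    losses = []
    for adv, logp, mask in zip(advs, logps, masks):
        n_tokens = sum(mask)
        masked_logp = sum(lp * m for lp, m in zip(logp, mask))
        losses.append(-adv * masked_logp / n_tokens / G)
    return losses


# Illustrative rewards only: 4 teacher-pool samples and 4 base-pool samples.
mixed = [3.5, 3.5, 3.5, 3.5, 0.25, 0.25, 0.25, 0.25]
all_hack = [3.5] * 8  # teacher-only group: every sample gets the same reward

adv_mixed = centered_advantages(mixed)        # [1.625]*4 + [-1.625]*4
adv_all_hack = centered_advantages(all_hack)  # all zeros, so no gradient signal

# Toy per-token log-probs with the last token masked out (hypothetical values).
toy_logp = [-1.0, -2.0, -3.0]
toy_mask = [1, 1, 0]
losses = per_sample_losses(adv_mixed, [toy_logp] * 8, [toy_mask] * 8)
```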
Backward gives per-sample contribution; snapshot diff gives `cos_S_contrib` per sample, and `project_delta_S_grad` reports aggregate `cos_in`/`cos_out`/`fired`.

### Arms (this is the user's three-way ask)

| arm | mechanism | new code |
|---|---|---|
| 1. vanilla GRPO | no projection | none (`--arm=vanilla`) |
| 2. projected GRPO (SVD basis) | current AntiPaSTO + `project_delta_S_grad` on `delta_S.grad` | none (`--arm=projected`) |
| 3. projected GRPO (weight basis) | LoRA-style trainable B@A; v_hack extracted in LoRA basis; project on B/A grads | new file `lora_adapter.py` mirroring `antipasto.py`; new extraction; new arm |

Phase 2 runs arms 1+2 only (cheap, no new code). Arm 3 is deferred into a follow-up if Phase 2 results justify it.

### Save discipline

Replay no longer duplicates the full prompts/completions; that's misleading. Per-step output is **slim**: `step_NNN.cos.jsonl.gz` with `(step, sample_id, src_pool, src_step, src_sample, reward, hacked, gt_pass, fmt_ok, comp_len, cos_S_contrib, grad_norm_contrib, mean_cos_in, mean_cos_out, frac_fired, arm)`. The actual rollouts live in `teacher_pool/` and `base_pool/` only.

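
One way the slim per-step output could be written is sketched below. It is an assumption-laden illustration rather than the project's actual writer: `write_slim_step` and `read_slim_step` are hypothetical names, the record values are placeholders, and only the field list and the `step_NNN.cos.jsonl.gz` naming come from the schema above.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Field names taken from the slim schema; everything else here is illustrative.
SLIM_FIELDS = (
    "step", "sample_id", "src_pool", "src_step", "src_sample",
    "reward", "hacked", "gt_pass", "fmt_ok", "comp_len",
    "cos_S_contrib", "grad_norm_contrib",
    "mean_cos_in", "mean_cos_out", "frac_fired", "arm",
)


def write_slim_step(out_dir, step, records):
    """Write one gzip-compressed JSONL file containing only the slim fields."""
    path = Path(out_dir) / f"step_{step:03d}.cos.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            missing = [k for k in SLIM_FIELDS if k not in rec]
            if missing:
                raise KeyError(f"missing slim fields: {missing}")
            # Anything outside the schema (prompts, completions) is dropped.
            fh.write(json.dumps({k: rec[k] for k in SLIM_FIELDS}) + "\n")
    return path


def read_slim_step(path):
    """Read a slim step file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]


# Placeholder record; the numeric values are made up for the demo.
example = {
    "step": 3, "sample_id": 0, "src_pool": "teacher_pool", "src_step": 3,
    "src_sample": 0, "reward": 3.5, "hacked": True, "gt_pass": False,
    "fmt_ok": True, "comp_len": 128, "cos_S_contrib": 0.0,
    "grad_norm_contrib": 0.0, "mean_cos_in": 0.0, "mean_cos_out": 0.0,
    "frac_fired": 0.0, "arm": "vanilla",
    "completion": "full text that should not be duplicated",
}

with tempfile.TemporaryDirectory() as tmp:
    written = write_slim_step(tmp, 3, [example])
    written_name = written.name
    rows = read_slim_step(written)
```

The `src_pool`, `src_step`, and `src_sample` fields are what let a slim row be joined back to the original rollout in `teacher_pool/` or `base_pool/` during analysis.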
### Tasks

- [x] T1: teacher_pool 20 batches (done, hack_rate=0.994)
- [x] T2: base_pool 20 batches (done, hack_rate=0.000)
- [ ] T3a: add `--replay-dirs` + per-sample-plen handling to probe_distill
- [ ] T3b: add `--loss-mode=grpo` (REINFORCE-style centered-adv loss)
- [ ] T3c: switch replay save to `save_step_slim` schema
- [ ] T4: run `--arm=vanilla --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T5: run `--arm=projected --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T6: analyze per-step `cos_in` trajectory and per-sample `cos_S_contrib` bucketed by `src_pool` and `hacked`

### Phase 2 verification

| metric | success | likely fail | sneaky fail |
|---|---|---|---|
| `r.max() - r.min()` per step in mixed batch | > 1.0 (teacher ≈3.5, base ≈0-0.5) | <0.1 → no advantage signal → useless run | uniform clipping makes advantages tiny but nonzero; fix by logging adv distribution |
| `cos_in` per step, vanilla arm | > 0 on most steps (GRPO grad points along v_hack) | ≈ 0 → GRPO grad orthogonal to v_hack → projection won't help | negative because base outweighs teacher in advantage → reverse sign |
| `cos_out < cos_in` per step, projected arm | ≥ 16/20 steps | mechanism inactive | projection only fires on a few modules (frac_fired<<1) |
| `cos_S_contrib` by `(src_pool, hacked)` bucket | teacher_pool samples have larger positive cos; base_pool samples ~0 or negative | both buckets similar → v_hack isn't direction-specific | one bucket empty → mixing mathematically required for next phase |

## Phase 3: expensive sweep ($400, ~65h)

After Phase 2 informs which arms are worth running.

### What runs

3 seeds × 3 arms × 200 steps × full preset (Qwen3-4B, G=6, pp=43, n_problems=992, beta=1e-3, lr=7e-5) on the 96GB GPU. Total: 9 runs × ~7h each = ~65h sequential. (Some can overlap on multi-GPU; we have 1 GPU → sequential.)

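
The run count and wall-clock estimate above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the seed values and the arm label `weight_basis` are hypothetical, and 7.0 hours is the nominal per-run figure rather than a measured one.

```python
from itertools import product

SEEDS = (0, 1, 2)                                # hypothetical seed values
ARMS = ("vanilla", "projected", "weight_basis")  # third label is hypothetical
HOURS_PER_RUN = 7.0                              # nominal "~7h each"
BUDGET_USD = 400.0
BUDGET_HOURS = 65.0

# Every (arm, seed) pair is one sequential run on the single GPU.
runs = list(product(ARMS, SEEDS))
total_hours = len(runs) * HOURS_PER_RUN

# Cumulative wall-clock hours after each run finishes.
schedule = []
elapsed = 0.0
for arm, seed in runs:
    elapsed += HOURS_PER_RUN
    schedule.append((arm, seed, elapsed))

# Hourly rate implied by the stated $400 / ~65h budget.
implied_usd_per_hour = round(BUDGET_USD / BUDGET_HOURS, 2)
```

At a nominal 7h per run the grid comes to 63h, which is consistent with the ~65h estimate once a little overhead is allowed, and all three seeds of the first arm finish after 21h.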
### Decision rules (from Phase 2)

- Phase 2 vanilla `cos_in` ≈ 0 over 20 steps → GRPO gradient isn't aligned with v_hack at the start of training → projection unlikely to matter at step 0 → still possible v_hack matters later (after student discovers hacks at ~step 80). Phase 2 *can't* answer that; Phase 3 must. Run sweep but expect smaller H1 effect.
- Phase 2 vanilla `cos_in` > 0.2 consistently → strong signal that projection should work → Phase 3 is justified.
- Phase 2 projected reduces `cos_in` < 0.05 → projection mechanism is effective → expect H1 to fire in Phase 3.
- Phase 2 projection breaks `cos_in < 0` (over-projection) → bug.

### Skip Phase 3 if

Phase 2 vanilla `cos_in` ≈ 0 on ALL steps AND `cos_S_contrib` shows no discrimination between teacher and base samples. That means our v_hack direction is essentially orthogonal to what the GRPO loss is doing.

Cheaper alternatives before Phase 3:

- R7 from `spec/20260525_distill_cosine_probe.md`: re-extract v_hack with GRPO-style contrastive loss instead of NLL.
- Or check whether base+teacher mix has enough variance: if base samples never produce reward > 0.5, the variance is one-sided.

### Cost ceiling on Phase 3

If after 3 seeds × 1 arm we see no separation, stop. Don't burn the other 6 runs.

## Out of scope (for now)

- Arm 3 (W-space LoRA projection). Re-evaluate after Phase 2.
- Plotting / matplotlib trajectory figure.
- R7 v_hack re-extraction. Only if Phase 2 says current v_hack is orthogonal to GRPO grad.
- Multi-GPU parallelism for Phase 3.

## Log

- 2026-05-25: Phase 1 closed with UAT 4/4. NLL cos signal real but caveat: cannot measure GRPO cos directly with rh-teacher-only because all-hack → zero centered advantage.
- 2026-05-25: base_pool generated (pueue 5). 0/8 hack on every step as expected per ariahw §86. Now have variance source.
- 2026-05-25: spec2.md written before finishing T3-T6 implementation.