spec2.md records:
- Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
- Phase 2: mixed-replay GRPO probe, partial impl
- Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
spec2 — Phase 2 mixed-replay GRPO probe + Phase 3 expensive sweep plan
Goal
Before committing the $400 / ~65h headline sweep (Phase 3), use cheap replay-based probes (~1h total) to establish:
- Whether v_hack is aligned with the GRPO policy gradient (not just NLL) on a mixed hack/non-hack batch.
- Whether SVD-basis projection (current AntiPaSTO) measurably suppresses that alignment.
- Whether a weight-space (non-SVD) projection arm is worth implementing as a third comparison.
Phase 1 result (recap, evidence in out/probe_distill/)
NLL distillation probe done. UAT 4/4 pass. Headline: within rh-s65's
teacher pool, cos(NLL_grad, v_hack) is +0.747 on pure-hack
samples vs +0.398 on hack+correct samples (t=+4.46, p<1e-4 on 160
samples). Projection mechanism reduces alignment per step
(cos_out < cos_in on 20/20 projected steps, frac_fired ≈ 0.65).
Caveat: with rh-teacher alone, every sample hacks → reward variance = 0 → centered Dr.GRPO advantage = 0 → cannot directly measure GRPO-grad cosine. Phase 2 fixes this via mixed-replay.
Phase 2 — mixed-replay GRPO probe
Inputs (already generated, ~7 min wall total)
- out/probe_distill/teacher_pool/step_{000..019}.jsonl.gz: rh-s65, hint applied, 20 batches × 8 = 160 samples, ~99% hack.
- out/probe_distill/base_pool/step_{000..019}.jsonl.gz: base Qwen3-4B, no LoRA, no hint, 20 batches × 8 = 160 samples, ~0% hack.
Mechanism
probe_distill.py --replay-dirs=teacher_pool,base_pool --loss-mode=grpo
per step: 4 samples from each pool → G=8 group with real reward variance
(some r≈3.5, some r≈0.25). Dr.GRPO centered advantage is non-zero.
Per-sample loss: -adv_i * (logp_i * mask_i).sum() / mask_i.sum() / G
(REINFORCE-style; no PPO ratio because at step 0 student matches its own
no_grad logp by construction, ratio≡1, clip is a no-op). Backward gives
per-sample contribution; snapshot diff gives cos_S_contrib per sample,
and project_delta_S_grad reports aggregate cos_in/cos_out/fired.
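The advantage and per-sample loss described above can be sketched in plain Python. This is a minimal illustration, not the code in probe_distill.py: the function names, the toy log-probs, and the exact reward values are assumptions made for the example, and it computes only the scalar loss (no autograd).

```python
def centered_advantages(rewards):
    """Dr.GRPO-style advantage: reward minus group mean, no std normalisation."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]


def per_sample_loss(adv, logps, mask, group_size):
    """-adv * (masked mean of token log-probs) / G, following the formula above."""
    n_tokens = sum(mask)
    masked_sum = sum(lp * m for lp, m in zip(logps, mask))
    return -adv * masked_sum / n_tokens / group_size


# Hypothetical G=8 group: 4 teacher samples (r=3.5) and 4 base samples (r=0.25).
mixed_rewards = [3.5] * 4 + [0.25] * 4
mixed_adv = centered_advantages(mixed_rewards)

# Teacher-only group: identical rewards, so every centered advantage is zero.
teacher_only_adv = centered_advantages([3.5] * 8)

# Toy token log-probs; the last position is padding and is masked out.
logps = [-0.5, -1.0, -1.5, -9.0]
mask = [1, 1, 1, 0]
loss_0 = per_sample_loss(mixed_adv[0], logps, mask, group_size=8)
```

With identical rewards the advantages vanish and so does the loss, which is the Phase 1 caveat; the mixed group gives advantages of equal size and opposite sign for the two pools.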
Arms (this is the user's three-way ask)
| arm | mechanism | new code |
|---|---|---|
| 1. vanilla GRPO | no projection | none (--arm=vanilla) |
| 2. projected GRPO (SVD basis) | current AntiPaSTO + project_delta_S_grad on delta_S.grad | none (--arm=projected) |
| 3. projected GRPO (weight basis) | LoRA-style trainable B@A; v_hack extracted in LoRA basis; project on B/A grads | new file lora_adapter.py mirroring antipasto.py; new extraction; new arm |
Phase 2 runs arms 1 and 2 only (cheap, no new code). Arm 3 is deferred to a follow-up if Phase 2 results justify it.
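To make cos_in, cos_out and fired concrete, here is a rough sketch of projecting a gradient off a fixed direction. It is not project_delta_S_grad: the one-sided firing rule (project only when the cosine is positive), the flat-list representation, and the name project_out are all assumptions of this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def project_out(grad, v_hack):
    """Remove grad's component along v_hack, but only when it is positive.

    Returns (new_grad, cos_in, cos_out, fired). The one-sided rule is an
    assumption of this sketch, not a description of the real mechanism.
    """
    cos_in = cosine(grad, v_hack)
    if cos_in <= 0:
        return list(grad), cos_in, cos_in, False
    scale = dot(grad, v_hack) / dot(v_hack, v_hack)
    new_grad = [g - scale * v for g, v in zip(grad, v_hack)]
    return new_grad, cos_in, cosine(new_grad, v_hack), True


# Toy 2-D example: the gradient has a positive component along v_hack.
new_grad, cos_in, cos_out, fired = project_out([3.0, 4.0], [1.0, 0.0])

# A gradient pointing away from v_hack is left untouched.
kept_grad, kept_cos_in, kept_cos_out, kept_fired = project_out([-1.0, 2.0], [1.0, 0.0])
```

Under this rule, frac_fired would be the fraction of modules (or steps) for which fired is true, and cos_out < cos_in holds whenever the projection fires.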
Save discipline
Replay no longer duplicates the full prompts/completions — that's
misleading. Per-step output is slim: step_NNN.cos.jsonl.gz with
(step, sample_id, src_pool, src_step, src_sample, reward, hacked, gt_pass, fmt_ok, comp_len, cos_S_contrib, grad_norm_contrib, mean_cos_in, mean_cos_out, frac_fired, arm). The actual rollouts live
in teacher_pool/ and base_pool/ only.
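A slim per-step writer along these lines might look as follows. It is a sketch using only the standard library; write_slim_step and read_slim_step are hypothetical names (not the save_step_slim in the repo), and the prompt/completion keys on the input record are assumed for illustration.

```python
import gzip
import json
import os
import tempfile

SLIM_FIELDS = (
    "step", "sample_id", "src_pool", "src_step", "src_sample", "reward",
    "hacked", "gt_pass", "fmt_ok", "comp_len", "cos_S_contrib",
    "grad_norm_contrib", "mean_cos_in", "mean_cos_out", "frac_fired", "arm",
)


def write_slim_step(out_dir, step, records):
    """Write one gzip-compressed JSONL file, keeping only the slim fields."""
    path = os.path.join(out_dir, f"step_{step:03d}.cos.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            slim = {k: rec[k] for k in SLIM_FIELDS}  # KeyError if a field is missing
            fh.write(json.dumps(slim) + "\n")
    return path


def read_slim_step(path):
    """Read a slim step file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]


# Hypothetical record that still carries the rollout text; it is dropped on write.
record = {k: None for k in SLIM_FIELDS}
record.update(step=0, sample_id=0, src_pool="teacher_pool", arm="vanilla",
              prompt="...", completion="...")
with tempfile.TemporaryDirectory() as tmp:
    path = write_slim_step(tmp, 0, [record])
    filename = os.path.basename(path)
    loaded = read_slim_step(path)
```

The src_pool, src_step and src_sample fields are what let an analysis join a slim row back to the original rollout in teacher_pool/ or base_pool/.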
Tasks
- T1: teacher_pool 20 batches (done, hack_rate=0.994)
- T2: base_pool 20 batches (done, hack_rate=0.000)
- T3a: add --replay-dirs + per-sample-plen handling to probe_distill
- T3b: add --loss-mode=grpo (REINFORCE-style centered-adv loss)
- T3c: switch replay save to save_step_slim schema
- T4: run --arm=vanilla --replay-dirs=teacher_pool,base_pool --loss-mode=grpo for 20 steps
- T5: run --arm=projected --replay-dirs=teacher_pool,base_pool --loss-mode=grpo for 20 steps
- T6: analyze per-step cos_in trajectory and per-sample cos_S_contrib bucketed by src_pool and hacked
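The T6 analysis could be sketched like this, assuming rows shaped like the slim schema have already been loaded into dicts. The helper names and the toy values are illustrative only, not measured results.

```python
from collections import defaultdict
from statistics import mean


def bucket_cos(records):
    """Mean cos_S_contrib and sample count per (src_pool, hacked) bucket."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["src_pool"], bool(rec["hacked"]))
        buckets[key].append(rec["cos_S_contrib"])
    return {key: (mean(vals), len(vals)) for key, vals in buckets.items()}


def cos_in_trajectory(records):
    """One mean_cos_in value per step, ordered by step."""
    per_step = {}
    for rec in records:
        # mean_cos_in is a per-step aggregate, so the first row of a step suffices.
        per_step.setdefault(rec["step"], rec["mean_cos_in"])
    return [per_step[s] for s in sorted(per_step)]


# Toy rows (made-up numbers) covering two steps and both pools.
records = [
    {"step": 0, "src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.5, "mean_cos_in": 0.25},
    {"step": 0, "src_pool": "base_pool", "hacked": False, "cos_S_contrib": -0.125, "mean_cos_in": 0.25},
    {"step": 1, "src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.25, "mean_cos_in": 0.5},
    {"step": 1, "src_pool": "base_pool", "hacked": False, "cos_S_contrib": 0.125, "mean_cos_in": 0.5},
]
summary = bucket_cos(records)
trajectory = cos_in_trajectory(records)
```

Reporting the count next to each mean makes an empty or near-empty bucket visible instead of silently missing.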
Phase 2 verification
| metric | success | likely fail | sneaky fail |
|---|---|---|---|
| r.max() - r.min() per step in mixed batch | > 1.0 (teacher ≈3.5, base ≈0-0.5) | <0.1 → no advantage signal → useless run | uniform clipping makes advantages tiny but nonzero; fix by logging adv distribution |
| cos_in per step, vanilla arm | > 0 on most steps (GRPO grad points along v_hack) | ≈ 0 → GRPO grad orthogonal to v_hack → projection won't help | negative because base outweighs teacher in advantage → reverse sign |
| cos_out < cos_in per step, projected arm | ≥ 16/20 steps | mechanism inactive | projection only fires on a few modules (frac_fired<<1) |
| cos_S_contrib by (src_pool, hacked) bucket | teacher_pool samples have larger positive cos; base_pool samples ~0 or negative | both buckets similar → v_hack isn't direction-specific | one bucket empty → mixing mathematically required for next phase |
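Two of the success criteria in the table are simple enough to check mechanically. The sketch below assumes per-step reward lists and per-step cos_in/cos_out lists are already in memory; the function names and the toy inputs are hypothetical.

```python
def reward_spread_ok(rewards_per_step, min_spread=1.0):
    """True if r.max() - r.min() exceeds min_spread on every step."""
    return all(max(r) - min(r) > min_spread for r in rewards_per_step)


def count_reduced(cos_in, cos_out):
    """Number of steps on which projection lowered the cosine."""
    return sum(1 for ci, co in zip(cos_in, cos_out) if co < ci)


def projection_ok(cos_in, cos_out, min_steps=16):
    """True if cos_out < cos_in on at least min_steps steps."""
    return count_reduced(cos_in, cos_out) >= min_steps


# Toy inputs (made-up): 20 mixed steps, projection reduces the cosine on 17 of them.
rewards_per_step = [[3.5] * 4 + [0.25] * 4 for _ in range(20)]
cos_in = [0.3] * 20
cos_out = [0.1] * 17 + [0.3] * 3

spread_pass = reward_spread_ok(rewards_per_step)
n_reduced = count_reduced(cos_in, cos_out)
projection_pass = projection_ok(cos_in, cos_out)
```

These checks cover only the "success" column; the sneaky-fail cases (tiny advantages, low frac_fired) still need the advantage distribution and frac_fired to be logged and inspected.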
Phase 3 — expensive sweep ($400, ~65h)
After Phase 2 informs which arms are worth running.
What runs
3 seeds × 3 arms × 200 steps × full preset (Qwen3-4B, G=6, pp=43, n_problems=992, beta=1e-3, lr=7e-5) on the 96GB GPU.
Total: 9 runs × ~7h each = ~65h sequential. (Some can overlap on multi-GPU; we have 1 GPU → sequential.)
Decision rules (from Phase 2)
- Phase 2 vanilla cos_in ≈ 0 over 20 steps → GRPO gradient isn't aligned with v_hack at the start of training → projection unlikely to matter at step 0 → still possible v_hack matters later (after student discovers hacks at ~step 80). Phase 2 can't answer that; Phase 3 must. Run sweep but expect smaller H1 effect.
- Phase 2 vanilla cos_in > 0.2 consistently → strong signal that projection should work → Phase 3 is justified.
- Phase 2 projected reduces cos_in < 0.05 → projection mechanism is effective → expect H1 to fire in Phase 3.
- Phase 2 projection breaks cos_in < 0 (over-projection) → bug.
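The rules above can be written down as a small classifier so the Phase 2 readout is not left to eyeballing. This is only a sketch: the tolerance used for "≈ 0", the reading of "consistently" as "on every step", the use of the mean for the projected arm, and the "no rule matched" fallback are all assumptions added here.

```python
from statistics import mean


def vanilla_verdict(cos_in, strong=0.2, near_zero=0.05):
    """Classify the vanilla arm's per-step cos_in values."""
    if all(c > strong for c in cos_in):
        return "strong: Phase 3 justified"
    if all(abs(c) < near_zero for c in cos_in):
        return "near zero: run sweep, expect smaller H1 effect"
    return "no rule matched"


def projected_verdict(cos_vals, effective_below=0.05):
    """Classify the projected arm's per-step cosine values by their mean."""
    m = mean(cos_vals)
    if m < 0:
        return "negative: suspect over-projection bug"
    if m < effective_below:
        return "effective: expect H1 to fire"
    return "no rule matched"


# Toy inputs (made-up numbers, 20 steps each).
v_strong = vanilla_verdict([0.3] * 20)
v_flat = vanilla_verdict([0.01, -0.02] * 10)
p_good = projected_verdict([0.02] * 20)
p_bug = projected_verdict([-0.1] * 20)
```

Fixing the thresholds before looking at the T4 and T5 outputs keeps the go/no-go call for Phase 3 from drifting after the fact.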
Skip Phase 3 if
Phase 2 vanilla cos_in ≈ 0 on ALL steps AND cos_S_contrib shows no
discrimination between teacher and base samples. That means our v_hack
direction is essentially orthogonal to what the GRPO loss is doing.
Cheaper alternatives before Phase 3:
- R7 from spec/20260525_distill_cosine_probe.md: re-extract v_hack with GRPO-style contrastive loss instead of NLL.
- Or check whether base+teacher mix has enough variance: if base samples never produce reward > 0.5 the variance is one-sided.
Cost ceiling on Phase 3
If after 3 seeds × 1 arm we see no separation, stop. Don't burn the other 6 runs.
Out of scope (for now)
- Arm 3 (W-space LoRA projection). Re-evaluate after Phase 2.
- Plotting / matplotlib trajectory figure.
- R7 v_hack re-extraction. Only if Phase 2 says current v_hack is orthogonal to GRPO grad.
- Multi-GPU parallelism for Phase 3.
Log
- 2026-05-25 — Phase 1 closed with UAT 4/4. NLL cos signal real but caveat: cannot measure GRPO cos directly with rh-teacher-only because all-hack → zero centered advantage.
- 2026-05-25 — base_pool generated (pueue 5). 0/8 hack on every step as expected per ariahw §86. Now have variance source.
- 2026-05-25 — spec2.md written before finishing T3-T6 implementation.