evil_MoE/docs/spec/spec2.md

# spec2 — Phase 2 mixed-replay GRPO probe + Phase 3 expensive sweep plan

## Goal

Before committing the $400 / ~65h headline sweep (Phase 3), use cheap
replay-based probes (~1h total) to establish:
- Whether v_hack is aligned with the **GRPO** policy gradient (not just NLL)
  on a mixed hack/non-hack batch.
- Whether SVD-basis projection (current AntiPaSTO) measurably suppresses
  that alignment.
- Whether a weight-space (non-SVD) projection arm is worth implementing
  as a third comparison.

## Phase 1 result (recap, evidence in `out/probe_distill/`)

NLL distillation probe done. UAT 4/4 pass. Headline: within rh-s65's
teacher pool, `cos(NLL_grad, v_hack)` is **+0.747** on pure-hack
samples vs **+0.398** on hack+correct samples (t=+4.46, p<1e-4 on 160
samples). Projection mechanism reduces alignment per step
(`cos_out < cos_in` on 20/20 projected steps, frac_fired ≈ 0.65).
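
To make `cos_in`, `cos_out`, and `fired` concrete, here is a minimal pure-Python sketch of a one-sided projection on a single flattened gradient. It is not the repository's `project_delta_S_grad`: the gate (`cos_in > 0`) and the helper names `cosine` and `project_out` are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def project_out(grad, v_hack):
    """Remove the v_hack component from grad, but only when they are aligned.

    ASSUMPTION: the gate is `cos_in > 0`; the real gating rule is not stated
    in this spec.
    Returns (new_grad, cos_in, cos_out, fired).
    """
    cos_in = cosine(grad, v_hack)
    fired = cos_in > 0.0
    if fired:
        scale = sum(g * v for g, v in zip(grad, v_hack)) / sum(v * v for v in v_hack)
        grad = [g - scale * v for g, v in zip(grad, v_hack)]
    return grad, cos_in, cosine(grad, v_hack), fired


# Aligned gradient: the projection fires and cos_out drops to 0.
aligned = project_out([1.0, 1.0], [1.0, 0.0])
# Anti-aligned gradient: the gate stays closed and the gradient is untouched.
opposed = project_out([-1.0, 1.0], [1.0, 0.0])
```

Under this assumed gate, `frac_fired` would be the fraction of modules for which `fired` is true on a given step.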

**Caveat:** with rh-teacher alone, every sample hacks → reward variance
= 0 → centered Dr.GRPO advantage = 0 → cannot directly measure
GRPO-grad cosine. Phase 2 fixes this via mixed-replay.

## Phase 2 — mixed-replay GRPO probe

### Inputs (already generated, ~7 min wall total)

- `out/probe_distill/teacher_pool/step_{000..019}.jsonl.gz`
  rh-s65, hint applied, 20 batches × 8 = 160 samples, ~99% hack.
- `out/probe_distill/base_pool/step_{000..019}.jsonl.gz`
  base Qwen3-4B, no LoRA, no hint, 20 batches × 8 = 160 samples, ~0% hack.

### Mechanism

`probe_distill.py --replay-dirs=teacher_pool,base_pool --loss-mode=grpo`
draws 4 samples from each pool per step, forming a G=8 group with
**real reward variance** (some r≈3.5, some r≈0.25), so the Dr.GRPO
centered advantage is non-zero.
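
The difference between the Phase 1 caveat and the mixed group can be checked with a few lines. This is a sketch assuming "centered" means mean-subtraction only (no std normalization); the function name `centered_advantages` is hypothetical, and the rewards are the illustrative values quoted above.

```python
def centered_advantages(rewards):
    """Mean-centered advantages for one group (no std normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Teacher-only group: every sample hacks, so every advantage is zero.
all_hack = centered_advantages([3.5] * 8)

# Mixed replay group: 4 teacher samples (r≈3.5) + 4 base samples (r≈0.25).
mixed = centered_advantages([3.5] * 4 + [0.25] * 4)
# mixed -> [1.625, 1.625, 1.625, 1.625, -1.625, -1.625, -1.625, -1.625]
```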

Per-sample loss: `-adv_i * (logp_i * mask_i).sum() / mask_i.sum() / G`.
This is REINFORCE-style with no PPO ratio: at step 0 the student matches
its own no_grad logp by construction, so ratio≡1 and the clip is a no-op.
Backward gives the per-sample contribution, a snapshot diff gives
`cos_S_contrib` per sample, and `project_delta_S_grad` reports the
aggregate `cos_in`/`cos_out`/`fired`.
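
As a worked instance of the per-sample loss formula, the sketch below evaluates it on plain Python lists. The function name `grpo_sample_loss` and the toy numbers are assumptions; the real code presumably operates on tensors so that backward can run.

```python
def grpo_sample_loss(adv, logp, mask, group_size):
    """-adv * (masked mean of token logp) / G, as in the formula above.

    logp: per-token log-probabilities of the completion.
    mask: 1 for completion tokens that count, 0 for padding.
    """
    n_tokens = sum(mask)
    masked_logp = sum(lp * m for lp, m in zip(logp, mask))
    return -adv * masked_logp / n_tokens / group_size


# Two real tokens and one padded token; the padded logp is ignored.
loss = grpo_sample_loss(adv=1.625, logp=[-0.5, -1.5, -9.0], mask=[1, 1, 0], group_size=8)
# loss -> 0.203125

# With zero advantage (the all-hack case) the sample contributes nothing.
zero = grpo_sample_loss(adv=0.0, logp=[-0.5, -1.5, -9.0], mask=[1, 1, 0], group_size=8)
```

A positive advantage yields a positive loss here because `logp` is negative; minimizing it raises the log-probability of that sample.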

### Arms (this is the user's three-way ask)

| arm | mechanism | new code |
|---|---|---|
| 1. vanilla GRPO | no projection | none — `--arm=vanilla` |
| 2. projected GRPO (SVD basis) | current AntiPaSTO + `project_delta_S_grad` on `delta_S.grad` | none — `--arm=projected` |
| 3. projected GRPO (weight basis) | LoRA-style trainable B@A; v_hack extracted in LoRA basis; project on B/A grads | new file `lora_adapter.py` mirroring `antipasto.py`; new extraction; new arm |

Phase 2 runs arms 1+2 only (cheap, no new code). Arm 3 is deferred
to a follow-up, to be built only if Phase 2 results justify it.

### Save discipline

Replay runs no longer duplicate the full prompts/completions, because
the duplicate copies were misleading. Per-step output is **slim**:
`step_NNN.cos.jsonl.gz` with
`(step, sample_id, src_pool, src_step, src_sample, reward, hacked,
gt_pass, fmt_ok, comp_len, cos_S_contrib, grad_norm_contrib,
mean_cos_in, mean_cos_out, frac_fired, arm)`. The actual rollouts live
in `teacher_pool/` and `base_pool/` only.
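
A minimal sketch of writing and reading one slim step file with the standard library follows. The field tuple is copied from the schema above; the function names `write_slim_step` and `read_slim_step` are hypothetical and are not the planned `save_step_slim` signature, which this spec does not define.

```python
import gzip
import json

SLIM_FIELDS = (
    "step", "sample_id", "src_pool", "src_step", "src_sample", "reward",
    "hacked", "gt_pass", "fmt_ok", "comp_len", "cos_S_contrib",
    "grad_norm_contrib", "mean_cos_in", "mean_cos_out", "frac_fired", "arm",
)


def write_slim_step(path, records):
    """Write one JSON line per sample, keeping only the slim fields."""
    rows = []
    for rec in records:
        missing = [k for k in SLIM_FIELDS if k not in rec]
        if missing:
            raise KeyError(f"missing slim fields: {missing}")
        # Anything outside the schema (e.g. prompt text) is dropped here.
        rows.append({k: rec[k] for k in SLIM_FIELDS})
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def read_slim_step(path):
    """Read a slim step file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Validating all records before opening the file avoids leaving a half-written step on disk when one record is malformed.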

### Tasks

- [x] T1: teacher_pool 20 batches (done, hack_rate=0.994)
- [x] T2: base_pool 20 batches (done, hack_rate=0.000)
- [ ] T3a: add `--replay-dirs` + per-sample-plen handling to probe_distill
- [ ] T3b: add `--loss-mode=grpo` (REINFORCE-style centered-adv loss)
- [ ] T3c: switch replay save to `save_step_slim` schema
- [ ] T4: run `--arm=vanilla --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T5: run `--arm=projected --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T6: analyze — per-step `cos_in` trajectory, per-sample `cos_S_contrib` bucketed by `src_pool` and `hacked`
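
The bucketing in T6 could look roughly like the sketch below, which groups slim records by `(src_pool, hacked)` and reports count and mean `cos_S_contrib`. The function name `bucket_cos` and the sample records are made up for illustration; they are not measured values.

```python
from collections import defaultdict
from statistics import mean


def bucket_cos(records):
    """Map (src_pool, hacked) -> (n_samples, mean cos_S_contrib)."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["src_pool"], bool(rec["hacked"]))
        buckets[key].append(rec["cos_S_contrib"])
    return {key: (len(vals), mean(vals)) for key, vals in buckets.items()}


# Illustrative records only (not real results).
demo = [
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.5},
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.25},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": 0.0},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": -0.5},
]
summary = bucket_cos(demo)
```

A bucket that never appears in `summary` is the "one bucket empty" case worth flagging before reading anything into the means.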

### Phase 2 verification

| metric | success | likely fail | sneaky fail |
|---|---|---|---|
| `r.max() - r.min()` per step in mixed batch | > 1.0 (teacher ≈3.5, base ≈0-0.5) | <0.1 → no advantage signal → useless run | uniform clipping makes advantages tiny but nonzero — fix by logging adv distribution |
| `cos_in` per step, vanilla arm | > 0 on most steps (GRPO grad points along v_hack) | ≈ 0 → GRPO grad orthogonal to v_hack → projection won't help | negative because base outweighs teacher in advantage → reverse sign |
| `cos_out < cos_in` per step, projected arm | ≥ 16/20 steps | mechanism inactive | projection only fires on a few modules (frac_fired<<1) |
| `cos_S_contrib` by `(src_pool, hacked)` bucket | teacher_pool samples have larger positive cos; base_pool samples ~0 or negative | both buckets similar → v_hack isn't direction-specific | one bucket empty → mixing mathematically required for next phase |
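
Two of the table's success criteria are simple enough to automate. The sketch below uses the thresholds from the table (reward spread > 1.0, `cos_out < cos_in` on at least 16 of 20 steps); the function names and the demo inputs are hypothetical.

```python
def reward_spread_ok(rewards, min_spread=1.0):
    """True if r.max() - r.min() exceeds the success threshold for one step."""
    return max(rewards) - min(rewards) > min_spread


def projection_pass(steps, min_steps=16):
    """steps: iterable of (cos_in, cos_out) pairs from the projected arm.

    Returns (n_steps_reduced, passed).
    """
    reduced = sum(1 for cos_in, cos_out in steps if cos_out < cos_in)
    return reduced, reduced >= min_steps


# Illustrative inputs only.
spread_good = reward_spread_ok([3.5] * 4 + [0.25] * 4)
spread_bad = reward_spread_ok([3.5] * 8)
reduced, passed = projection_pass([(0.5, 0.1)] * 17 + [(0.2, 0.2)] * 3)
```

These checks do not catch the "sneaky fail" column; for that, the advantage distribution and `frac_fired` still need to be logged and inspected.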

## Phase 3 — expensive sweep ($400, ~65h)

After Phase 2 informs which arms are worth running.

### What runs

3 seeds × 3 arms × 200 steps × full preset (Qwen3-4B, G=6, pp=43,
n_problems=992, beta=1e-3, lr=7e-5) on the 96GB GPU.

Total: 9 runs × ~7h each = ~65h sequential. (Some can overlap on
multi-GPU; we have 1 GPU → sequential.)

### Decision rules (from Phase 2)

- Phase 2 vanilla `cos_in` ≈ 0 over 20 steps → GRPO gradient isn't
  aligned with v_hack at the start of training, so projection is
  unlikely to matter at step 0. v_hack may still matter later (after
  the student discovers hacks at ~step 80); Phase 2 *can't* answer
  that, Phase 3 must. Run the sweep but expect a smaller H1 effect.
- Phase 2 vanilla `cos_in` > 0.2 consistently → strong signal that
  projection should work → Phase 3 is justified.
- Phase 2 projected arm reduces `cos_in` to < 0.05 → projection
  mechanism is effective → expect H1 to fire in Phase 3.
- Phase 2 projection pushes `cos_in` below 0 (over-projection) → bug.
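
The rules above can be written down as a small lookup so the Phase 2 analysis produces an explicit verdict. This is only a sketch: the thresholds 0.2 and 0.05 come from the rules, but the reading of "≈ 0" as `|mean| < 0.05`, the reading of "consistently" as "on every step", the use of the mean for the projected arm, and the function names are all assumptions.

```python
from statistics import mean


def vanilla_verdict(cos_in, near_zero=0.05, strong=0.2):
    """Classify the vanilla-arm cos_in series (one value per step)."""
    if all(c > strong for c in cos_in):
        return "phase 3 justified"
    if abs(mean(cos_in)) < near_zero:  # ASSUMED meaning of "≈ 0"
        return "run sweep, expect smaller H1 effect"
    return "not covered by the rules"


def projected_verdict(cos, effective=0.05):
    """Classify the projected-arm cosine series (one value per step)."""
    m = mean(cos)
    if m < 0:
        return "bug: over-projection"
    if m < effective:
        return "projection effective"
    return "not covered by the rules"
```

Values between the thresholds fall through to "not covered by the rules", which makes the gap in the decision rules visible rather than silently picking a side.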

### Skip Phase 3 if

Phase 2 vanilla `cos_in` ≈ 0 on ALL steps AND `cos_S_contrib` shows no
discrimination between teacher and base samples. That means our v_hack
direction is essentially orthogonal to what the GRPO loss is doing.
Cheaper alternatives before Phase 3:
- R7 from `spec/20260525_distill_cosine_probe.md`: re-extract v_hack
  with GRPO-style contrastive loss instead of NLL.
- Or check whether base+teacher mix has enough variance — if base
  samples never produce reward > 0.5 the variance is one-sided.

### Cost ceiling on Phase 3

If after 3 seeds × 1 arm we see no separation, stop. Don't burn the
other 6 runs.
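
For reference, the budget arithmetic behind the ceiling is sketched below. The per-hour rate is not stated in this spec; it is implied here by dividing the $400 headline by ~65h, and ~7h per run is the rounded figure from "What runs".

```python
SEEDS, ARMS, HOURS_PER_RUN = 3, 3, 7      # from "What runs" (~7h is approximate)
BUDGET_USD, BUDGET_HOURS = 400, 65        # headline figures

total_runs = SEEDS * ARMS                 # 9 runs
total_hours = total_runs * HOURS_PER_RUN  # 63h, consistent with the ~65h headline

# ASSUMPTION: flat hourly rate implied by the headline budget.
usd_per_hour = BUDGET_USD / BUDGET_HOURS

# Cost ceiling: stop after 3 seeds x 1 arm if there is no separation.
pilot_runs = SEEDS * 1
pilot_hours = pilot_runs * HOURS_PER_RUN  # 21h
pilot_usd = pilot_hours * usd_per_hour

hours_saved = total_hours - pilot_hours   # 42h not spent on the other 6 runs
```

Under these assumptions, stopping at the ceiling spends roughly a third of the budget.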

## Out of scope (for now)

- Arm 3 (W-space LoRA projection). Re-evaluate after Phase 2.
- Plotting / matplotlib trajectory figure.
- R7 v_hack re-extraction. Only if Phase 2 says current v_hack is
  orthogonal to GRPO grad.
- Multi-GPU parallelism for Phase 3.

## Log

- 2026-05-25: Phase 1 closed with UAT 4/4. NLL cos signal is real.
  Caveat: GRPO cos cannot be measured directly with rh-teacher-only
  because all-hack → zero centered advantage.
- 2026-05-25 — base_pool generated (pueue 5). 0/8 hack on every step
  as expected per ariahw §86. Now have variance source.
- 2026-05-25 — spec2.md written before finishing T3-T6 implementation.