evil_MoE/docs/spec/spec2.md
wassname e04548987f spec2 + base_pool generator + slim replay save (partial mixed-replay TODO)
spec2.md records:
 - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
 - Phase 2: mixed-replay GRPO probe, partial impl
 - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.

Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:48:48 +00:00

# spec2 — Phase 2 mixed-replay GRPO probe + Phase 3 expensive sweep plan
## Goal
Before committing the $400 / ~65h headline sweep (Phase 3), use cheap
replay-based probes (~1h total) to establish:
- Whether v_hack is aligned with the **GRPO** policy gradient (not just NLL)
on a mixed hack/non-hack batch.
- Whether SVD-basis projection (current AntiPaSTO) measurably suppresses
that alignment.
- Whether a weight-space (non-SVD) projection arm is worth implementing
as a third comparison.
## Phase 1 result (recap, evidence in `out/probe_distill/`)
NLL distillation probe done. UAT 4/4 pass. Headline: within rh-s65's
teacher pool, `cos(NLL_grad, v_hack)` is **+0.747** on pure-hack
samples vs **+0.398** on hack+correct samples (t=+4.46, p<1e-4 on 160
samples). Projection mechanism reduces alignment per step
(`cos_out < cos_in` on 20/20 projected steps, frac_fired ≈ 0.65).
**Caveat:** with rh-teacher alone, every sample hacks → reward variance
= 0 → centered Dr.GRPO advantage = 0 → cannot directly measure
GRPO-grad cosine. Phase 2 fixes this via mixed-replay.
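
The caveat above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the centered advantage is simply reward minus the group mean; the function name `centered_advantages` and the reward values are illustrative, not taken from the repo.

```python
from statistics import mean


def centered_advantages(rewards):
    """Centered advantage: each reward minus the group mean (no std scaling)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Teacher pool only: every sample hacks, so all rewards are identical
# and every advantage collapses to zero.
all_hack = centered_advantages([3.5] * 8)

# Mixed group: 4 teacher-like and 4 base-like rewards (illustrative values).
mixed = centered_advantages([3.5] * 4 + [0.25] * 4)

print(all_hack)
print(mixed)
```

With identical rewards there is no gradient signal to take a cosine against, which is why the mixed group is needed.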
## Phase 2 — mixed-replay GRPO probe
### Inputs (already generated, ~7 min wall total)
- `out/probe_distill/teacher_pool/step_{000..019}.jsonl.gz`
rh-s65, hint applied, 20 batches × 8 = 160 samples, ~99% hack.
- `out/probe_distill/base_pool/step_{000..019}.jsonl.gz`
base Qwen3-4B, no LoRA, no hint, 20 batches × 8 = 160 samples, ~0% hack.
### Mechanism
`probe_distill.py --replay-dirs=teacher_pool,base_pool --loss-mode=grpo`
draws 4 samples from each pool per step, giving a G=8 group with **real reward variance**
(some r≈3.5, some r≈0.25), so the Dr.GRPO centered advantage is non-zero.
Per-sample loss: `-adv_i * (logp_i * mask_i).sum() / mask_i.sum() / G`
(REINFORCE-style). No PPO ratio is needed: at step 0 the student matches its
own no_grad logp by construction, so ratio≡1 and the clip is a no-op. Backward
gives the per-sample contribution; a snapshot diff gives `cos_S_contrib` per
sample, and `project_delta_S_grad` reports aggregate `cos_in`/`cos_out`/`fired`.
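
Written out as a formula, the per-sample loss above reads as follows. This is a transcription of the inline expression, assuming `logp_i` holds the student's token log-probabilities for sample $i$ and `mask_i` selects its completion tokens; the symbols $r_i$, $A_i$, $m_{i,t}$ and $\rho_{i,t}$ are notation introduced here, not names from the code.

```latex
A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j,
\qquad
\mathcal{L}_i = -\frac{A_i}{G}\,
  \frac{\sum_t m_{i,t}\,\log \pi_\theta(y_{i,t}\mid x_i, y_{i,<t})}{\sum_t m_{i,t}}

% PPO ratio at step 0: \pi_\theta = \pi_{\mathrm{old}} \Rightarrow \rho_{i,t} = 1
\nabla_\theta\,\rho_{i,t}\Big|_{\rho_{i,t}=1}
  = \nabla_\theta \log \pi_\theta(y_{i,t}\mid x_i, y_{i,<t})
```

Since $\nabla_\theta \rho = \rho\,\nabla_\theta \log \pi_\theta$, the ratio form and the REINFORCE form give the same gradient when $\rho = 1$, which is the step-0 situation described in the text.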
### Arms (this is the user's three-way ask)
| arm | mechanism | new code |
|---|---|---|
| 1. vanilla GRPO | no projection | none — `--arm=vanilla` |
| 2. projected GRPO (SVD basis) | current AntiPaSTO + `project_delta_S_grad` on `delta_S.grad` | none — `--arm=projected` |
| 3. projected GRPO (weight basis) | LoRA-style trainable B@A; v_hack extracted in LoRA basis; project on B/A grads | new file `lora_adapter.py` mirroring `antipasto.py`; new extraction; new arm |
Phase 2 runs arms 1+2 only (cheap, no new code). Arm 3 is deferred
to a follow-up if Phase 2 results justify it.
### Save discipline
Replay no longer duplicates the full prompts/completions; saving them a
second time was misleading. Per-step output is **slim**: `step_NNN.cos.jsonl.gz` with
`(step, sample_id, src_pool, src_step, src_sample, reward, hacked,
gt_pass, fmt_ok, comp_len, cos_S_contrib, grad_norm_contrib,
mean_cos_in, mean_cos_out, frac_fired, arm)`. The actual rollouts live
in `teacher_pool/` and `base_pool/` only.
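
A minimal sketch of what a slim writer could look like, using only the standard library. The field list is copied from the schema above; the helper names `write_slim_step` and `read_slim_step` are hypothetical and are not the repo's `save_step_slim`, whose signature is not shown here.

```python
import gzip
import json
from pathlib import Path

# Field names copied from the slim schema in this spec.
SLIM_FIELDS = (
    "step", "sample_id", "src_pool", "src_step", "src_sample", "reward",
    "hacked", "gt_pass", "fmt_ok", "comp_len", "cos_S_contrib",
    "grad_norm_contrib", "mean_cos_in", "mean_cos_out", "frac_fired", "arm",
)


def write_slim_step(out_dir, step, rows):
    """Write one JSON line per sample, keeping only the SLIM_FIELDS keys."""
    path = Path(out_dir) / f"step_{step:03d}.cos.jsonl.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            missing = [k for k in SLIM_FIELDS if k not in row]
            if missing:
                raise KeyError(f"slim row missing fields: {missing}")
            # Extra keys such as prompts or completions are dropped here.
            fh.write(json.dumps({k: row[k] for k in SLIM_FIELDS}) + "\n")
    return path


def read_slim_step(path):
    """Read a slim step file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]
```

Raising on a missing field, rather than writing nulls, is a design choice of this sketch: it makes schema drift visible at save time instead of at analysis time.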
### Tasks
- [x] T1: teacher_pool 20 batches (done, hack_rate=0.994)
- [x] T2: base_pool 20 batches (done, hack_rate=0.000)
- [ ] T3a: add `--replay-dirs` + per-sample-plen handling to probe_distill
- [ ] T3b: add `--loss-mode=grpo` (REINFORCE-style centered-adv loss)
- [ ] T3c: switch replay save to `save_step_slim` schema
- [ ] T4: run `--arm=vanilla --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T5: run `--arm=projected --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T6: analyze — per-step `cos_in` trajectory, per-sample `cos_S_contrib` bucketed by `src_pool` and `hacked`
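
The T6 bucketing could be sketched as below. It assumes rows shaped like the slim records described in this spec (keys `src_pool`, `hacked`, `cos_S_contrib`); the function name `bucket_cos` and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def bucket_cos(rows):
    """Count and mean cos_S_contrib per (src_pool, hacked) bucket."""
    buckets = defaultdict(list)
    for row in rows:
        key = (row["src_pool"], bool(row["hacked"]))
        buckets[key].append(row["cos_S_contrib"])
    return {
        key: {"n": len(vals), "mean_cos": mean(vals)}
        for key, vals in buckets.items()
    }


# Illustrative rows only; not measured values.
example_rows = [
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.5},
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.25},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": -0.25},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": 0.25},
]
summary = bucket_cos(example_rows)
```

Reporting `n` alongside the mean makes an empty or near-empty bucket easy to spot.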
### Phase 2 verification
| metric | success | likely fail | sneaky fail |
|---|---|---|---|
| `r.max() - r.min()` per step in mixed batch | > 1.0 (teacher ≈3.5, base ≈0-0.5) | <0.1 → no advantage signal → useless run | uniform clipping makes advantages tiny but nonzero — fix by logging adv distribution |
| `cos_in` per step, vanilla arm | > 0 on most steps (GRPO grad points along v_hack) | ≈ 0 → GRPO grad orthogonal to v_hack → projection won't help | negative because base outweighs teacher in advantage → reverse sign |
| `cos_out < cos_in` per step, projected arm | ≥ 16/20 steps | mechanism inactive | projection only fires on a few modules (frac_fired<<1) |
| `cos_S_contrib` by `(src_pool, hacked)` bucket | teacher_pool samples have larger positive cos; base_pool samples ~0 or negative | both buckets similar → v_hack isn't direction-specific | one bucket empty → mixing mathematically required for next phase |
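
The first three rows of the table can be turned into automated checks along these lines. This is a sketch: the thresholds `1.0` and `16` come from the table, while reading "most steps" as "more than half" and the function names are assumptions made here.

```python
def reward_spread(rewards):
    """r.max() - r.min() for one step's group of rewards."""
    return max(rewards) - min(rewards)


def count_reduced(cos_in, cos_out):
    """Number of steps where cos_out is strictly below cos_in."""
    return sum(1 for ci, co in zip(cos_in, cos_out) if co < ci)


def check_phase2(step_rewards, cos_in_vanilla, cos_in_proj, cos_out_proj,
                 min_spread=1.0, min_reduced=16):
    """Return pass/fail flags for the Phase 2 success criteria."""
    n_positive = sum(1 for c in cos_in_vanilla if c > 0)
    return {
        # Every step needs real reward variance in the mixed group.
        "spread_ok": all(reward_spread(r) > min_spread for r in step_rewards),
        # ASSUMPTION: "most steps" is read as more than half of them.
        "vanilla_mostly_positive": n_positive > len(cos_in_vanilla) / 2,
        # Projected arm: cos_out < cos_in on at least min_reduced steps.
        "projection_reduces": count_reduced(cos_in_proj, cos_out_proj) >= min_reduced,
    }
```

The sneaky-fail column still needs manual inspection (advantage distribution, `frac_fired`), since those cases pass the flags above by construction.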
## Phase 3 — expensive sweep ($400, ~65h)
After Phase 2 informs which arms are worth running.
### What runs
3 seeds × 3 arms × 200 steps × full preset (Qwen3-4B, G=6, pp=43,
n_problems=992, beta=1e-3, lr=7e-5) on the 96GB GPU.
Total: 9 runs × ~7h each = ~65h sequential. (Some can overlap on
multi-GPU; we have 1 GPU → sequential.)
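
The run count and wall time follow from simple arithmetic, sketched below. The flat hourly rate is an assumption derived here from the $400 / ~65h figures; the function name `sweep_budget` is hypothetical.

```python
def sweep_budget(seeds, arms, hours_per_run,
                 total_cost_usd=400.0, total_hours=65.0):
    """Estimate runs, sequential hours, and cost at a flat implied rate."""
    runs = seeds * arms
    hours = runs * hours_per_run
    # ASSUMPTION: cost scales linearly with GPU hours.
    rate = total_cost_usd / total_hours
    return {"runs": runs, "hours": hours, "cost_usd": round(hours * rate, 2)}


full = sweep_budget(seeds=3, arms=3, hours_per_run=7)
one_arm = sweep_budget(seeds=3, arms=1, hours_per_run=7)
print(full, one_arm)
```

At ~7h per run the 9 runs come to 63h, consistent with the ~65h estimate; a single arm across 3 seeds is about a third of that.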
### Decision rules (from Phase 2)
- Phase 2 vanilla `cos_in` ≈ 0 over 20 steps → GRPO gradient isn't
aligned with v_hack at the start of training → projection unlikely to
matter at step 0. v_hack may still matter later (after the student
discovers hacks at ~step 80); Phase 2 *can't* answer that, so
Phase 3 must. Run the sweep but expect a smaller H1 effect.
- Phase 2 vanilla `cos_in` > 0.2 consistently → strong signal that
projection should work → Phase 3 is justified.
- Phase 2 projected arm reduces `cos_in` to below 0.05 → projection mechanism is
effective → expect H1 to fire in Phase 3.
- Phase 2 projection drives `cos_in` below 0 (over-projection) → bug.
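
The rules above could be encoded roughly as follows. The 0.2 and 0.05 thresholds come from the list; treating "≈ 0" as `abs(c) <= near_zero`, "consistently" as "on every step", and summarising the projected arm by its mean are assumptions of this sketch, and `phase3_decision` is a hypothetical name.

```python
from statistics import mean


def phase3_decision(vanilla_cos_in, projected_cos_in, near_zero=0.05):
    """Map Phase 2 per-step cos_in values to the decision-rule notes."""
    notes = []
    # ASSUMPTION: "approximately 0" means within near_zero on every step.
    if all(abs(c) <= near_zero for c in vanilla_cos_in):
        notes.append("vanilla cos_in ~ 0: run sweep, expect smaller H1 effect")
    # ASSUMPTION: "consistently" means on every step.
    elif all(c > 0.2 for c in vanilla_cos_in):
        notes.append("vanilla cos_in > 0.2: Phase 3 justified")

    # ASSUMPTION: the projected arm is summarised by its mean cos_in.
    projected_mean = mean(projected_cos_in)
    if projected_mean < 0:
        notes.append("projected cos_in < 0: over-projection, treat as bug")
    elif projected_mean < 0.05:
        notes.append("projected cos_in < 0.05: projection effective")
    return notes
```

Values between the thresholds produce no note, which matches the spec: those cases are left to judgment rather than to a rule.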
### Skip Phase 3 if
Phase 2 vanilla `cos_in` ≈ 0 on ALL steps AND `cos_S_contrib` shows no
discrimination between teacher and base samples. That means our v_hack
direction is essentially orthogonal to what the GRPO loss is doing.
Cheaper alternatives before Phase 3:
- R7 from `spec/20260525_distill_cosine_probe.md`: re-extract v_hack
with GRPO-style contrastive loss instead of NLL.
- Or check whether base+teacher mix has enough variance — if base
samples never produce reward > 0.5 the variance is one-sided.
### Cost ceiling on Phase 3
If after 3 seeds × 1 arm we see no separation, stop. Don't burn the
other 6 runs.
## Out of scope (for now)
- Arm 3 (W-space LoRA projection). Re-evaluate after Phase 2.
- Plotting / matplotlib trajectory figure.
- R7 v_hack re-extraction. Only if Phase 2 says current v_hack is
orthogonal to GRPO grad.
- Multi-GPU parallelism for Phase 3.
## Log
- 2026-05-25 — Phase 1 closed with UAT 4/4. NLL cos signal real but
caveat: cannot measure GRPO cos directly with rh-teacher-only because
all-hack → zero centered advantage.
- 2026-05-25 — base_pool generated (pueue 5). 0/8 hack on every step
as expected per ariahw §86. Now have variance source.
- 2026-05-25 — spec2.md written before finishing T3-T6 implementation.