mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-28 01:15:10 +08:00
e04548987f
spec2.md records: - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed) - Phase 2: mixed-replay GRPO probe, partial impl - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale. probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
144 lines
6.8 KiB
Markdown
# spec2: Phase 2 mixed-replay GRPO probe + Phase 3 expensive sweep plan

## Goal

Before committing the $400 / ~65h headline sweep (Phase 3), use cheap
replay-based probes (~1h total) to establish:
- Whether v_hack is aligned with the **GRPO** policy gradient (not just NLL)
  on a mixed hack/non-hack batch.
- Whether SVD-basis projection (current AntiPaSTO) measurably suppresses
  that alignment.
- Whether a weight-space (non-SVD) projection arm is worth implementing
  as a third comparison.

## Phase 1 result (recap, evidence in `out/probe_distill/`)

NLL distillation probe done. UAT 4/4 pass. Headline: within rh-s65's
teacher pool, `cos(NLL_grad, v_hack)` is **+0.747** on pure-hack
samples vs **+0.398** on hack+correct samples (t=+4.46, p<1e-4 on 160
samples). The projection mechanism reduces alignment per step
(`cos_out < cos_in` on 20/20 projected steps, frac_fired ≈ 0.65).

**Caveat:** with rh-teacher alone, (nearly) every sample hacks
(hack_rate=0.994) → within-group reward variance ≈ 0 → centered Dr.GRPO
advantage ≈ 0 → cannot directly measure GRPO-grad cosine. Phase 2 fixes
this via mixed-replay.

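The caveat above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the Dr.GRPO advantage is the reward minus the group mean with no standard-deviation normalization; the helper name `centered_advantages` is hypothetical, and the reward values 3.5 and 0.25 are the approximate teacher and base rewards quoted in this spec.

```python
def centered_advantages(rewards):
    """Reward minus the group mean (assumed Dr.GRPO form, no std division)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# All-hack group: every reward is identical, so every advantage is zero
# and the policy-gradient loss carries no signal.
all_hack = [3.5] * 8
adv_all_hack = centered_advantages(all_hack)

# Mixed group (4 teacher + 4 base samples): advantages are non-zero
# and split by sign between the two pools.
mixed = [3.5] * 4 + [0.25] * 4
adv_mixed = centered_advantages(mixed)
```

With identical rewards the group mean equals every reward, which is why the teacher-only pool cannot be used to measure a GRPO-gradient cosine.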
## Phase 2: mixed-replay GRPO probe

### Inputs (already generated, ~7 min wall total)

- `out/probe_distill/teacher_pool/step_{000..019}.jsonl.gz`
  rh-s65, hint applied, 20 batches × 8 = 160 samples, ~99% hack.
- `out/probe_distill/base_pool/step_{000..019}.jsonl.gz`
  base Qwen3-4B, no LoRA, no hint, 20 batches × 8 = 160 samples, ~0% hack.

### Mechanism

`probe_distill.py --replay-dirs=teacher_pool,base_pool --loss-mode=grpo`
draws 4 samples from each pool per step → G=8 group with **real reward
variance** (some r≈3.5, some r≈0.25), so the Dr.GRPO centered advantage
is non-zero.

Per-sample loss: `-adv_i * (logp_i * mask_i).sum() / mask_i.sum() / G`
(REINFORCE-style; no PPO ratio because at step 0 the student matches its own
no_grad logp by construction, so ratio≡1 and the clip is a no-op). Backward
gives the per-sample contribution; a snapshot diff gives `cos_S_contrib` per
sample, and `project_delta_S_grad` reports aggregate `cos_in`/`cos_out`/`fired`.

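The per-sample loss formula can be illustrated on plain Python lists. This is a sketch under stated assumptions, not the code in `probe_distill.py`: the function names are hypothetical, `logp` holds per-token log-probabilities of the completion, `mask` is 1 on completion tokens and 0 on padding, and the real implementation would operate on autograd tensors so that each term can be backpropagated separately.

```python
def masked_mean_logp(logp, mask):
    """(logp * mask).sum() / mask.sum() for one sample."""
    total = sum(lp * m for lp, m in zip(logp, mask))
    return total / sum(mask)


def grpo_probe_losses(rewards, logps, masks):
    """Per-sample REINFORCE-style terms: -adv_i * masked_mean_logp_i / G."""
    G = len(rewards)
    mean_r = sum(rewards) / G
    losses = []
    for r, logp, mask in zip(rewards, logps, masks):
        adv = r - mean_r  # centered advantage (assumed Dr.GRPO form)
        losses.append(-adv * masked_mean_logp(logp, mask) / G)
    return losses


# Tiny G=2 example: one high-reward and one low-reward sample.
example_losses = grpo_probe_losses(
    rewards=[3.5, 0.25],
    logps=[[-1.0, -2.0, -9.0], [-0.5, -0.5, -0.5]],
    masks=[[1, 1, 0], [1, 1, 1]],  # third token of sample 0 is padding
)
```

Minimizing the first term raises the log-probability of the high-reward sample, and minimizing the second lowers that of the low-reward sample, which is the sign behaviour expected of a centered-advantage policy gradient.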
### Arms (this is the user's three-way ask)

| arm | mechanism | new code |
|---|---|---|
| 1. vanilla GRPO | no projection | none (`--arm=vanilla`) |
| 2. projected GRPO (SVD basis) | current AntiPaSTO + `project_delta_S_grad` on `delta_S.grad` | none (`--arm=projected`) |
| 3. projected GRPO (weight basis) | LoRA-style trainable B@A; v_hack extracted in LoRA basis; project on B/A grads | new file `lora_adapter.py` mirroring `antipasto.py`; new extraction; new arm |

Phase 2 runs arms 1+2 only (cheap, no new code). Arm 3 is deferred
to a follow-up if Phase 2 results justify it.

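To make the `cos_in` / `cos_out` / `fired` bookkeeping of the projected arm concrete, here is a generic direction-removal sketch. It is an assumption, not the behaviour of `project_delta_S_grad`: the hypothetical `project_out_direction` removes the component of a gradient along `v_hack` only when the two are positively aligned, and reports the cosine before and after.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(_dot(a, a)), math.sqrt(_dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return _dot(a, b) / (na * nb)


def project_out_direction(grad, v_hack):
    """Remove grad's component along v_hack when cos > 0 (assumed rule)."""
    cos_in = cosine(grad, v_hack)
    fired = cos_in > 0.0
    if fired:
        scale = _dot(grad, v_hack) / _dot(v_hack, v_hack)
        grad_out = [g - scale * v for g, v in zip(grad, v_hack)]
    else:
        grad_out = list(grad)  # leave anti-aligned gradients untouched
    return grad_out, cos_in, cosine(grad_out, v_hack), fired


# Positively aligned gradient: projection fires and cos_out drops to 0.
aligned = project_out_direction([3.0, 4.0], [1.0, 0.0])
# Anti-aligned gradient: projection does not fire.
anti = project_out_direction([-3.0, 4.0], [1.0, 0.0])
```

Under this assumed rule a module counts as fired only when its gradient has a positive component along `v_hack`, which is one way a `frac_fired` below 1 could arise.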
### Save discipline

Replay no longer duplicates the full prompts/completions; that's
misleading. Per-step output is **slim**: `step_NNN.cos.jsonl.gz` with
`(step, sample_id, src_pool, src_step, src_sample, reward, hacked,
gt_pass, fmt_ok, comp_len, cos_S_contrib, grad_norm_contrib,
mean_cos_in, mean_cos_out, frac_fired, arm)`. The actual rollouts live
in `teacher_pool/` and `base_pool/` only.

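The slim schema can be sketched with the standard `gzip` and `json` modules. This is an illustrative writer, not the `save_step_slim` in the repository: the helper names `write_slim_step` and `read_slim_step` are hypothetical, the field list is copied from the schema above, and the value types are assumptions.

```python
import gzip
import json
import os

SLIM_FIELDS = (
    "step", "sample_id", "src_pool", "src_step", "src_sample", "reward",
    "hacked", "gt_pass", "fmt_ok", "comp_len", "cos_S_contrib",
    "grad_norm_contrib", "mean_cos_in", "mean_cos_out", "frac_fired", "arm",
)


def write_slim_step(out_dir, step, records):
    """Write one gzip-compressed JSONL file, keeping only the slim fields."""
    path = os.path.join(out_dir, f"step_{step:03d}.cos.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            # Any extra keys (e.g. prompt/completion text) are dropped here;
            # a missing slim field raises KeyError rather than passing silently.
            slim = {key: rec[key] for key in SLIM_FIELDS}
            fh.write(json.dumps(slim) + "\n")
    return path


def read_slim_step(path):
    """Read the records back as a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each record keeps `src_pool`, `src_step` and `src_sample`, the original rollout can still be looked up in the pool directories without being stored a second time.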
### Tasks

- [x] T1: teacher_pool 20 batches (done, hack_rate=0.994)
- [x] T2: base_pool 20 batches (done, hack_rate=0.000)
- [ ] T3a: add `--replay-dirs` + per-sample-plen handling to probe_distill
- [ ] T3b: add `--loss-mode=grpo` (REINFORCE-style centered-adv loss)
- [ ] T3c: switch replay save to `save_step_slim` schema
- [ ] T4: run `--arm=vanilla --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T5: run `--arm=projected --replay-dirs=teacher_pool,base_pool --loss-mode=grpo` 20 steps
- [ ] T6: analyze: per-step `cos_in` trajectory, per-sample `cos_S_contrib` bucketed by `src_pool` and `hacked`

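The bucketing in T6 could look roughly like the following. This is a sketch that assumes each slim record is a dict with `src_pool`, `hacked` and `cos_S_contrib` keys as in the schema; the function name `mean_cos_by_bucket` and the example values are hypothetical.

```python
from collections import defaultdict


def mean_cos_by_bucket(records):
    """Mean cos_S_contrib per (src_pool, hacked) bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [running sum, count]
    for rec in records:
        key = (rec["src_pool"], bool(rec["hacked"]))
        sums[key][0] += rec["cos_S_contrib"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up values purely to show the output shape.
demo_records = [
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.5},
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.25},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": -0.125},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": 0.125},
]
demo_buckets = mean_cos_by_bucket(demo_records)
```

A bucket that never appears in the result is itself informative, since an empty bucket is one of the sneaky-fail cases the verification step looks for.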
### Phase 2 verification

| metric | success | likely fail | sneaky fail |
|---|---|---|---|
| `r.max() - r.min()` per step in mixed batch | > 1.0 (teacher ≈ 3.5, base ≈ 0 to 0.5) | < 0.1 → no advantage signal → useless run | uniform clipping makes advantages tiny but nonzero; fix by logging adv distribution |
| `cos_in` per step, vanilla arm | > 0 on most steps (GRPO grad points along v_hack) | ≈ 0 → GRPO grad orthogonal to v_hack → projection won't help | negative because base outweighs teacher in advantage → reverse sign |
| `cos_out < cos_in` per step, projected arm | ≥ 16/20 steps | mechanism inactive | projection only fires on a few modules (frac_fired << 1) |
| `cos_S_contrib` by `(src_pool, hacked)` bucket | teacher_pool samples have larger positive cos; base_pool samples ~0 or negative | both buckets similar → v_hack isn't direction-specific | one bucket empty → mixing mathematically required for next phase |

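The first three rows of the table translate into simple threshold checks. The sketch below is hypothetical (function name, argument layout and return format are not from the repository); it takes the 1.0 reward-spread and 16-step thresholds from the table, and it interprets "most steps" as a strict majority, which is an assumption.

```python
def check_phase2(rewards_by_step, vanilla_cos_in, projected_pairs, min_reduced=16):
    """Evaluate the success column for the first three verification rows.

    rewards_by_step: list of per-step reward lists (mixed batch)
    vanilla_cos_in:  list of per-step cos_in values, vanilla arm
    projected_pairs: list of per-step (cos_in, cos_out), projected arm
    """
    spread_ok = all(max(r) - min(r) > 1.0 for r in rewards_by_step)

    n_positive = sum(1 for c in vanilla_cos_in if c > 0)
    aligned_ok = 2 * n_positive > len(vanilla_cos_in)  # assumed: strict majority

    n_reduced = sum(1 for cos_in, cos_out in projected_pairs if cos_out < cos_in)
    reduced_ok = n_reduced >= min_reduced  # ">= 16/20 steps" in the table

    return {
        "reward_spread": spread_ok,
        "vanilla_aligned": aligned_ok,
        "projection_reduces": reduced_ok,
    }
```

Logging the raw counts alongside these booleans would help separate the likely-fail cases from the sneaky ones, which a pass/fail flag alone cannot do.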
## Phase 3: expensive sweep ($400, ~65h)

Runs only after Phase 2 has shown which arms are worth running.

### What runs

3 seeds × 3 arms × 200 steps × full preset (Qwen3-4B, G=6, pp=43,
n_problems=992, beta=1e-3, lr=7e-5) on the 96GB GPU.

Total: 9 runs × ~7h each = ~65h sequential. (Some could overlap on
multi-GPU; we have 1 GPU → sequential.)

### Decision rules (from Phase 2)

- Phase 2 vanilla `cos_in` ≈ 0 over 20 steps → GRPO gradient isn't
  aligned with v_hack at the start of training → projection unlikely to
  matter at step 0. It is still possible that v_hack matters later (after
  the student discovers hacks at ~step 80); Phase 2 *can't* answer that,
  Phase 3 must. Run the sweep but expect a smaller H1 effect.
- Phase 2 vanilla `cos_in` > 0.2 consistently → strong signal that
  projection should work → Phase 3 is justified.
- Phase 2 projected arm reduces `cos_in` to < 0.05 → projection mechanism is
  effective → expect H1 to fire in Phase 3.
- Phase 2 projection pushes `cos_in` below 0 (over-projection) → bug.

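The decision rules can be written down as a small lookup so that the Phase 2 analysis produces an explicit verdict. This is a sketch with several assumptions that the spec does not fix: "≈ 0" is read as an absolute mean below a tolerance `zero_tol`, "consistently" as every step, and the projected-arm rules are applied to the mean over steps. The function name and the verdict strings are hypothetical.

```python
def phase3_decision(vanilla_cos_in, projected_cos_in, zero_tol=0.02):
    """Map per-step cos_in lists from Phase 2 to the spec's decision rules."""
    mean_vanilla = sum(vanilla_cos_in) / len(vanilla_cos_in)
    if all(c > 0.2 for c in vanilla_cos_in):
        vanilla = "justified"  # strong signal that projection should work
    elif abs(mean_vanilla) <= zero_tol:
        vanilla = "run_expect_small_effect"  # no alignment at step 0
    else:
        vanilla = "undetermined"  # the spec states no rule for this range

    mean_projected = sum(projected_cos_in) / len(projected_cos_in)
    if mean_projected < 0:
        projected = "bug_over_projection"
    elif mean_projected < 0.05:
        projected = "effective"  # expect H1 to fire in Phase 3
    else:
        projected = "not_effective_enough"

    return {"vanilla": vanilla, "projected": projected}
```

The `undetermined` branch is deliberate: values between the tolerance and 0.2 are not covered by the rules above and would need a manual call.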
### Skip Phase 3 if

Phase 2 vanilla `cos_in` ≈ 0 on ALL steps AND `cos_S_contrib` shows no
discrimination between teacher and base samples. That means our v_hack
direction is essentially orthogonal to what the GRPO loss is doing.
Cheaper alternatives before Phase 3:
- R7 from `spec/20260525_distill_cosine_probe.md`: re-extract v_hack
  with GRPO-style contrastive loss instead of NLL.
- Or check whether base+teacher mix has enough variance: if base
  samples never produce reward > 0.5 the variance is one-sided.

### Cost ceiling on Phase 3

If after 3 seeds × 1 arm we see no separation, stop. Don't burn the
other 6 runs.

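The budget arithmetic behind the sweep and the stop rule can be spelled out as follows. This is a rough sketch: it takes the "~7h each" figure at face value and assumes cost scales linearly with GPU hours at the rate implied by $400 for ~65h, neither of which the spec states exactly. The function name is hypothetical.

```python
SEEDS, ARMS = 3, 3
HOURS_PER_RUN = 7.0  # "~7h each" in the spec, treated as exact here
BUDGET_USD, BUDGET_HOURS = 400.0, 65.0  # headline figures from the spec


def sweep_budget(n_runs):
    """Return (hours, usd) for n_runs sequential runs at the implied rate."""
    hours = n_runs * HOURS_PER_RUN
    usd = hours * BUDGET_USD / BUDGET_HOURS  # assumed linear in GPU hours
    return hours, usd


full_hours, full_usd = sweep_budget(SEEDS * ARMS)  # all 9 runs
stop_hours, stop_usd = sweep_budget(SEEDS * 1)     # 3 seeds x 1 arm, then stop
```

Under these assumptions the early-stop checkpoint arrives after about a third of the total hours, so stopping there saves roughly two thirds of the budget.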
## Out of scope (for now)

- Arm 3 (W-space LoRA projection). Re-evaluate after Phase 2.
- Plotting / matplotlib trajectory figure.
- R7 v_hack re-extraction. Only if Phase 2 says current v_hack is
  orthogonal to GRPO grad.
- Multi-GPU parallelism for Phase 3.

## Log

- 2026-05-25: Phase 1 closed with UAT 4/4. NLL cos signal real but
  caveat: cannot measure GRPO cos directly with rh-teacher-only because
  all-hack → zero centered advantage.
- 2026-05-25: base_pool generated (pueue 5). 0/8 hack on every step
  as expected per ariahw §86. Now have variance source.
- 2026-05-25: spec2.md written before finishing T3-T6 implementation.