mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 20:21:41 +08:00


wassname e04548987f spec2 + base_pool generator + slim replay save (partial mixed-replay TODO)

spec2.md records:
 - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
 - Phase 2: mixed-replay GRPO probe, partial impl
 - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.

Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 11:48:48 +00:00


spec2 — Phase 2 mixed-replay GRPO probe + Phase 3 expensive sweep plan

Goal

Before committing the $400 / ~65h headline sweep (Phase 3), use cheap replay-based probes (~1h total) to establish:

Whether v_hack is aligned with the GRPO policy gradient (not just NLL) on a mixed hack/non-hack batch.
Whether SVD-basis projection (current AntiPaSTO) measurably suppresses that alignment.
Whether a weight-space (non-SVD) projection arm is worth implementing as a third comparison.

Phase 1 result (recap, evidence in `out/probe_distill/`)

NLL distillation probe done. UAT 4/4 pass. Headline: within rh-s65's teacher pool, cos(NLL_grad, v_hack) is +0.747 on pure-hack samples vs +0.398 on hack+correct samples (t=+4.46, p<1e-4 on 160 samples). Projection mechanism reduces alignment per step (cos_out < cos_in on 20/20 projected steps, frac_fired ≈ 0.65).

Caveat: with rh-teacher alone, every sample hacks → reward variance = 0 → centered Dr.GRPO advantage = 0 → cannot directly measure GRPO-grad cosine. Phase 2 fixes this via mixed-replay.
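The zero-advantage caveat can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the Dr.GRPO advantage is the reward minus the group mean with no std normalization. `centered_advantages` is a hypothetical helper, not a function from `probe_distill.py`, and the reward values are the approximate ones quoted for the teacher and base pools.

```python
def centered_advantages(rewards):
    """Centered advantage (assumed Dr.GRPO form): reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Teacher-only group: every sample hacks and earns the same reward.
teacher_only = [3.5] * 8
# Mixed group: 4 teacher samples (r≈3.5) and 4 base samples (r≈0.25).
mixed = [3.5] * 4 + [0.25] * 4

adv_teacher = centered_advantages(teacher_only)  # all zeros: no gradient signal
adv_mixed = centered_advantages(mixed)           # +1.625 for teacher, -1.625 for base
```

With identical rewards every advantage is exactly zero, so the GRPO loss has no gradient to compare against v_hack; the mixed group restores a non-zero signal.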

Phase 2 — mixed-replay GRPO probe

Inputs (already generated, ~7 min wall total)

out/probe_distill/teacher_pool/step_{000..019}.jsonl.gz: rh-s65, hint applied, 20 batches × 8 = 160 samples, ~99% hack.
out/probe_distill/base_pool/step_{000..019}.jsonl.gz: base Qwen3-4B, no LoRA, no hint, 20 batches × 8 = 160 samples, ~0% hack.

Mechanism

Run probe_distill.py --replay-dirs=teacher_pool,base_pool --loss-mode=grpo. Each step draws 4 samples from each pool, giving a G=8 group with real reward variance (some r≈3.5, some r≈0.25), so the Dr.GRPO centered advantage is non-zero.

Per-sample loss: -adv_i * (logp_i * mask_i).sum() / mask_i.sum() / G. This is REINFORCE-style with no PPO ratio: at step 0 the student matches its own no_grad logp by construction, so ratio≡1 and the clip is a no-op. Backward gives the per-sample contribution; the snapshot diff gives cos_S_contrib per sample, and project_delta_S_grad reports aggregate cos_in/cos_out/fired.
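The per-sample loss formula can be illustrated on plain Python lists. This is only a sketch of the scalar arithmetic: in the actual probe, `logp` and `mask` would be tensors with autograd attached, and `per_sample_loss` is a hypothetical name used here for illustration. The token log-probs are made-up toy values.

```python
def per_sample_loss(adv, logp, mask, group_size):
    """-adv * (masked mean of token log-probs) / G for a single sample."""
    n_tokens = sum(mask)
    masked_logp_sum = sum(lp * m for lp, m in zip(logp, mask))
    return -adv * masked_logp_sum / n_tokens / group_size

# Toy sample: 3 completion tokens followed by 1 padding token (mask = 0).
logp = [-0.5, -1.0, -1.5, -9.0]
mask = [1, 1, 1, 0]

# Positive advantage (teacher-like sample) and negative advantage (base-like sample).
loss_pos = per_sample_loss(adv=1.625, logp=logp, mask=mask, group_size=8)
loss_neg = per_sample_loss(adv=-1.625, logp=logp, mask=mask, group_size=8)
```

The padding token does not contribute because its mask entry is 0. The two losses have opposite signs, so minimizing them raises the log-prob of the positive-advantage sample and lowers that of the negative-advantage one.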

Arms (this is the user's three-way ask)

| arm | mechanism | new code |
|---|---|---|
| 1. vanilla GRPO | no projection | none (`--arm=vanilla`) |
| 2. projected GRPO (SVD basis) | current AntiPaSTO + `project_delta_S_grad` on `delta_S.grad` | none (`--arm=projected`) |
| 3. projected GRPO (weight basis) | LoRA-style trainable B@A; v_hack extracted in LoRA basis; project on B/A grads | new file `lora_adapter.py` mirroring `antipasto.py`; new extraction; new arm |

Phase 2 runs arms 1+2 only (cheap, no new code). Arm 3 is deferred to a follow-up if Phase 2 results justify it.

Save discipline

Replay no longer duplicates the full prompts/completions, because the duplicated copies are misleading. Per-step output is slim: step_NNN.cos.jsonl.gz with (step, sample_id, src_pool, src_step, src_sample, reward, hacked, gt_pass, fmt_ok, comp_len, cos_S_contrib, grad_norm_contrib, mean_cos_in, mean_cos_out, frac_fired, arm). The actual rollouts live in teacher_pool/ and base_pool/ only.
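A slim per-step file of this shape could be written and read back roughly as follows. This is a sketch, not the repo's `save_step_slim`: the helper names `write_slim_step` and `read_slim_step` are hypothetical, and the example record holds placeholder values only. The field list is the one given above.

```python
import gzip
import json
import os
import tempfile

SLIM_FIELDS = (
    "step", "sample_id", "src_pool", "src_step", "src_sample", "reward",
    "hacked", "gt_pass", "fmt_ok", "comp_len", "cos_S_contrib",
    "grad_norm_contrib", "mean_cos_in", "mean_cos_out", "frac_fired", "arm",
)

def write_slim_step(out_dir, step, records):
    """Write one step as gzipped JSON Lines, keeping only the slim fields."""
    path = os.path.join(out_dir, f"step_{step:03d}.cos.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            slim = {k: rec[k] for k in SLIM_FIELDS}  # drops prompt/completion
            f.write(json.dumps(slim) + "\n")
    return path

def read_slim_step(path):
    """Read a slim step file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Placeholder record: values are illustrative, not measured results.
example = {
    "step": 0, "sample_id": 0, "src_pool": "teacher_pool", "src_step": 3,
    "src_sample": 5, "reward": 3.5, "hacked": True, "gt_pass": False,
    "fmt_ok": True, "comp_len": 128, "cos_S_contrib": 0.1,
    "grad_norm_contrib": 1.0, "mean_cos_in": 0.1, "mean_cos_out": 0.05,
    "frac_fired": 0.5, "arm": "vanilla",
    "prompt": "full prompt text", "completion": "full completion text",
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_slim_step(tmp, 0, [example])
    loaded = read_slim_step(path)
```

The `src_pool`, `src_step`, and `src_sample` fields are enough to look the rollout up again in the pool directories, which is why the prompt and completion text can be dropped.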

Tasks

T1: teacher_pool 20 batches (done, hack_rate=0.994)
T2: base_pool 20 batches (done, hack_rate=0.000)
T3a: add --replay-dirs + per-sample-plen handling to probe_distill
T3b: add --loss-mode=grpo (REINFORCE-style centered-adv loss)
T3c: switch replay save to save_step_slim schema
T4: run --arm=vanilla --replay-dirs=teacher_pool,base_pool --loss-mode=grpo 20 steps
T5: run --arm=projected --replay-dirs=teacher_pool,base_pool --loss-mode=grpo 20 steps
T6: analyze — per-step cos_in trajectory, per-sample cos_S_contrib bucketed by src_pool and hacked

Phase 2 verification

| metric | success | likely fail | sneaky fail |
|---|---|---|---|
| `r.max() - r.min()` per step in mixed batch | > 1.0 (teacher ≈3.5, base ≈0-0.5) | <0.1 → no advantage signal → useless run | uniform clipping makes advantages tiny but nonzero; fix by logging adv distribution |
| `cos_in` per step, vanilla arm | > 0 on most steps (GRPO grad points along v_hack) | ≈ 0 → GRPO grad orthogonal to v_hack → projection won't help | negative because base outweighs teacher in advantage → reverse sign |
| `cos_out < cos_in` per step, projected arm | ≥ 16/20 steps | mechanism inactive | projection only fires on a few modules (frac_fired<<1) |
| `cos_S_contrib` by `(src_pool, hacked)` bucket | teacher_pool samples have larger positive cos; base_pool samples ~0 or negative | both buckets similar → v_hack isn't direction-specific | one bucket empty → mixing mathematically required for next phase |
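The success criteria in the table could be checked with small helpers like the ones below. This is a hedged sketch: the function names are hypothetical, the records are assumed to follow the slim schema, and all numbers are placeholders rather than results.

```python
from collections import defaultdict

def reward_spread(rewards):
    """r.max() - r.min() for one mixed batch."""
    return max(rewards) - min(rewards)

def steps_reduced(cos_in, cos_out):
    """Number of steps where projection lowered the cosine (cos_out < cos_in)."""
    return sum(1 for ci, co in zip(cos_in, cos_out) if co < ci)

def mean_cos_by_bucket(records):
    """Mean cos_S_contrib per (src_pool, hacked) bucket."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        key = (rec["src_pool"], rec["hacked"])
        sums[key][0] += rec["cos_S_contrib"]
        sums[key][1] += 1
    return {k: total / n for k, (total, n) in sums.items()}

# Placeholder inputs, for illustration only.
rewards = [3.5, 3.5, 3.5, 3.5, 0.25, 0.25, 0.0, 0.5]
spread_ok = reward_spread(rewards) > 1.0

cos_in = [0.30] * 20
cos_out = [0.10] * 18 + [0.30] * 2
projection_ok = steps_reduced(cos_in, cos_out) >= 16

records = [
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.4},
    {"src_pool": "teacher_pool", "hacked": True, "cos_S_contrib": 0.2},
    {"src_pool": "base_pool", "hacked": False, "cos_S_contrib": -0.1},
]
buckets = mean_cos_by_bucket(records)
```

An empty or missing bucket in `buckets` corresponds to the "one bucket empty" sneaky-fail case, so it is worth checking the keys as well as the means.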

Phase 3 — expensive sweep ($400, ~65h)

After Phase 2 informs which arms are worth running.

What runs

3 seeds × 3 arms × 200 steps × full preset (Qwen3-4B, G=6, pp=43, n_problems=992, beta=1e-3, lr=7e-5) on the 96GB GPU.

Total: 9 runs × ~7h each ≈ 65h sequential. (Runs could overlap on multiple GPUs, but we have 1 GPU, so they run sequentially.)

Decision rules (from Phase 2)

Phase 2 vanilla cos_in ≈ 0 over 20 steps → the GRPO gradient isn't aligned with v_hack at the start of training, so projection is unlikely to matter at step 0. v_hack may still matter later (after the student discovers hacks at ~step 80); Phase 2 can't answer that, Phase 3 must. Run the sweep but expect a smaller H1 effect.
Phase 2 vanilla cos_in > 0.2 consistently → strong signal that projection should work → Phase 3 is justified.
Phase 2 projected arm reduces cos_in to < 0.05 → projection mechanism is effective → expect H1 to fire in Phase 3.
Phase 2 projection drives cos_in < 0 (over-projection) → bug.
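The decision rules above can be encoded as a small function so the Phase 2 analysis emits them mechanically. This is one possible reading, not the spec's definition: `phase2_decisions` is a hypothetical helper, the tolerance `zero_tol` for "≈ 0" is an assumed value, "consistently" is taken to mean every step, and the projected arm is summarized by its mean cos_in.

```python
def phase2_decisions(vanilla_cos_in, projected_cos_in, zero_tol=0.02):
    """Map per-step cos_in lists from both arms to the spec's decision notes."""
    notes = []

    # Vanilla arm: is the GRPO gradient aligned with v_hack at all?
    if all(abs(c) < zero_tol for c in vanilla_cos_in):
        notes.append("vanilla ~0: run sweep, expect smaller H1 effect")
    elif all(c > 0.2 for c in vanilla_cos_in):
        notes.append("vanilla > 0.2: Phase 3 justified")

    # Projected arm: does projection suppress the alignment without overshooting?
    mean_proj = sum(projected_cos_in) / len(projected_cos_in)
    if mean_proj < 0:
        notes.append("projected < 0: over-projection, treat as bug")
    elif mean_proj < 0.05:
        notes.append("projected < 0.05: expect H1 to fire")

    return notes

# Placeholder trajectories, for illustration only.
strong = phase2_decisions([0.3] * 20, [0.01] * 20)
weak = phase2_decisions([0.0] * 20, [-0.1] * 20)
```

Vanilla values between the tolerance and 0.2 produce no note, since the rules do not cover that range.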

Skip Phase 3 if

Phase 2 vanilla cos_in ≈ 0 on ALL steps AND cos_S_contrib shows no discrimination between teacher and base samples. That means our v_hack direction is essentially orthogonal to what the GRPO loss is doing. Cheaper alternatives before Phase 3:

R7 from spec/20260525_distill_cosine_probe.md: re-extract v_hack with GRPO-style contrastive loss instead of NLL.
Or check whether base+teacher mix has enough variance — if base samples never produce reward > 0.5 the variance is one-sided.

Cost ceiling on Phase 3

If after 3 seeds × 1 arm we see no separation, stop. Don't burn the other 6 runs.

Out of scope (for now)

Arm 3 (W-space LoRA projection). Re-evaluate after Phase 2.
Plotting / matplotlib trajectory figure.
R7 v_hack re-extraction. Only if Phase 2 says current v_hack is orthogonal to GRPO grad.
Multi-GPU parallelism for Phase 3.

Log

2026-05-25 — Phase 1 closed with UAT 4/4. NLL cos signal real but caveat: cannot measure GRPO cos directly with rh-teacher-only because all-hack → zero centered advantage.
2026-05-25 — base_pool generated (pueue 5). 0/8 hack on every step as expected per ariahw §86. Now have variance source.
2026-05-25 — spec2.md written before finishing T3-T6 implementation.
