evil_MoE/docs/RESEARCH_JOURNAL.md
wassname 0e2c786d4a ready
2026-05-23 14:19:41 +08:00


Research Journal

Append-only. New entries at the top, date-stamped. Never edit old entries.

2026-05-30

96GB readiness review fixes

Fresh subagent review found a real silent-failure risk: v_hack is not just model-specific, it is also SVD-basis-specific. The old extractor loaded fp32 while train.py loaded bf16, so keys/ranks could match while the basis differed. Fix: extract_vhack_grad.py, verify_vhack_heldout.py, and train.py now all use bf16 by default; v_hack artifacts save {model, dtype, v_hack} metadata; train.py refuses legacy artifacts and checks exact module keys and per-module rank before first generation.
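The refusal logic described above can be sketched as follows. This is a minimal illustration, not the code in train.py: the function name, argument names, and error messages are hypothetical, and plain lists stand in for the per-module v_hack tensors.

```python
def check_vhack_artifact(artifact, model_name, dtype, module_ranks):
    """Raise ValueError unless a v_hack artifact matches the training setup.

    artifact:     object loaded from disk; expected to be a dict with the
                  metadata keys {model, dtype, v_hack}
    module_ranks: {module_name: rank} as seen by the trainer (hypothetical)
    """
    required = {"model", "dtype", "v_hack"}
    if not isinstance(artifact, dict) or not required <= set(artifact):
        # Legacy artifacts carry no metadata, so the SVD basis cannot be trusted.
        raise ValueError("legacy v_hack artifact without metadata; re-extract")
    if artifact["model"] != model_name:
        raise ValueError(f"model mismatch: {artifact['model']} != {model_name}")
    if artifact["dtype"] != dtype:
        # Same keys and ranks can still hide a different basis if dtype differs.
        raise ValueError(f"dtype mismatch: {artifact['dtype']} != {dtype}")

    v_hack = artifact["v_hack"]
    if set(v_hack) != set(module_ranks):
        missing = sorted(set(module_ranks) - set(v_hack))
        extra = sorted(set(v_hack) - set(module_ranks))
        raise ValueError(f"module keys differ: missing={missing} extra={extra}")
    for name, rank in module_ranks.items():
        if len(v_hack[name]) != rank:
            raise ValueError(
                f"rank mismatch for {name}: {len(v_hack[name])} != {rank}"
            )
```

Running a check like this before the first generation turns a basis mismatch into an immediate error instead of a silently wrong projection.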

Also removed a bad smoke convenience: zero-spread reward batches no longer get random advantages. Dr.GRPO now correctly gives zero advantage when all group rewards match, so logs cannot look healthy while training on reward-unrelated noise.
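A minimal sketch of the advantage rule, assuming the Dr.GRPO form of group-mean centring without std normalisation. The function names are illustrative and not taken from train.py.

```python
def has_reward_spread(rewards):
    """True when at least two rewards in the group differ."""
    return bool(rewards) and max(rewards) != min(rewards)


def dr_grpo_advantages(rewards):
    """Group-mean-centred advantages, with no division by the group std.

    A zero-spread group returns exact zeros. The explicit branch avoids
    float round-off in the mean leaking tiny nonzero advantages.
    """
    if not has_reward_spread(rewards):
        return [0.0] * len(rewards)
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With this rule a batch where every completion only earns the same format bonus contributes no gradient signal, which is why loss, cos, and fired all read zero in that case.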

Validated on the 24GB box:

  • just extract-vhack-smoke via pueue task 73: bf16, 186 modules, 148,032 delta_S scalars, zero-norm=0.
  • just verify-vhack-smoke via pueue task 74: frac>0=0.952, mean=+0.355, median=+0.363, target pass.
  • one-step canonical train probe via pueue task 75: loaded out/v_hack_smoke.pt with key/rank match OK and completed without tripping the legacy-artifact refusal. Reward spread was false and loss/cos/fired were all zero, as expected after removing random advantages.
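The held-out numbers reported above (frac>0, mean, median) are summary statistics over alignment cosines. A sketch of how such a summary could be computed, assuming one cosine per module; the function name is hypothetical and the pass threshold is deliberately left out because it is not stated here.

```python
from statistics import mean, median


def summarize_alignment(cosines):
    """Summarise held-out cosines between fresh gradients and v_hack."""
    n = len(cosines)
    if n == 0:
        raise ValueError("no cosines to summarise")
    return {
        "n": n,
        "frac_pos": sum(c > 0 for c in cosines) / n,  # reported as frac>0
        "mean": mean(cosines),
        "median": median(cosines),
    }
```

For example, `summarize_alignment([0.5, -0.1, 0.3, 0.4])` has three of four cosines positive, so `frac_pos` is 0.75.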

For the 96GB machine, do not start queue-full blindly. First run one sequential gate: pueue add --immediate --follow -w "$PWD" -o 9 -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" -- just probe-full-seed 41. Queue the 3 seeds only after the vanilla probe shows a nontrivial hack rate.

Mechanism end-to-end verified on Qwen3.5-0.8B; H4 falsified at this scale

Closed the smoke loop: AntiPaSTO identity (bf16, max_abs_diff=0) -> v_hack extraction from 15 contrastive pairs -> held-out validation (frac>0=0.952, median cos=+0.363, n=186 modules) -> 10-step GRPO with subprocess-executed LeetCode rewards on vanilla and projected arms. Full writeup in out/proof.md.

Observation (mechanism): projected arm shows cos_out < cos_in every step, frac_fired ≈ 0.51 averaged over 10 steps. Vanilla arm: cos_out == cos_in. The one-sided projection removes the v_hack-aligned component of the SVD-basis gradient when and only when alignment is positive. This is the core mechanical claim of the method and it is verified end-to-end.
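The one-sided projection can be sketched for a single module as follows. This is an illustration under stated assumptions: gradients and v_hack are plain lists of floats, and the names `project_one_sided` and `cosine` are hypothetical rather than the project's own.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0 or nb == 0 else dot(a, b) / (na * nb)


def project_one_sided(grad, v_hack):
    """Remove the v_hack-aligned component of grad only when alignment > 0.

    Returns (projected_grad, fired). When alignment is zero or negative the
    gradient is returned unchanged and fired is False.
    """
    vv = dot(v_hack, v_hack)
    if vv == 0:
        return list(grad), False
    align = dot(grad, v_hack)
    if align <= 0:
        return list(grad), False
    coef = align / vv
    return [g - coef * v for g, v in zip(grad, v_hack)], True
```

When the projection fires, the output is orthogonal to v_hack, so cos_out drops to zero while cos_in was positive. When it does not fire, cos_out equals cos_in. Averaged over modules, this gives cos_out < cos_in whenever at least one module fires.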

Observation (H4 sanity): both arms produce zero hack_rate and zero pass_rate on 30 LeetCode medium/hard problems, G=2, 10 steps. Inspection of generations shows Qwen3.5-0.8B emits format-only output that saturates the 0.25 format bonus but never attempts code or hack patterns. Per spec.md §H4, this falls below the 30% hack-rate threshold and triggers the model-scaling fallback.

Inference: 0.8B is too small to exhibit the failure mode the method targets. The mechanism is sound; the test substrate is not. Wu & Tang's Rebound paper used Qwen2.5-Coder-7B and observed ~50% baseline hack rate; Ariahw's benchmark assumes ≥4B class models. Mechanism and scale are separable concerns, and this session was scoped to mechanism only.

Caveats / what's untested:

  • β=0 in smoke (no ref-model KL) to fit 24 GB. This is a 24-GB compromise, NOT a principled choice. Dr.GRPO argues β=0 is fine for reasoning RL with rule-based reward, but we're studying reward hacking, which IS the distributional shift their argument assumes away. lite/full presets default to β=0.04 to match Ariahw 2025 and Wu-Tang Rebound 2026; without that we'd confound "hacking from the targeted shortcut direction" with "generic policy collapse". Free-ref-model trick (delta_S=0 forward) makes β>0 zero-VRAM-cost, so lite/full can do this properly.
  • Only 10 steps. Reward-hacking emerges around step 50–200 in Rebound figs.
  • 186 target modules, full-rank per-module SVD. Larger models are assumed to scale similarly; this is untested.
  • frac_fired ≈ 0.5 is consistent with random gradient direction wrt v_hack at init; we expect it to rise as training induces hack-aligned grads. Need longer runs to see this.
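The frac_fired ≈ 0.5 caveat can be sanity-checked with a small simulation. This sketch assumes isotropic Gaussian gradients, which is a stand-in for "random direction wrt v_hack" and not a model of real GRPO gradients; the function name and defaults are hypothetical.

```python
import random


def simulated_frac_fired(dim=64, n_trials=2000, seed=0):
    """Fraction of random gradients with positive alignment to a fixed v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    fired = 0
    for _ in range(n_trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # The one-sided projection fires only when <g, v> > 0.
        if sum(gi * vi for gi, vi in zip(g, v)) > 0:
            fired += 1
    return fired / n_trials
```

By symmetry the expected value is 0.5, so an observed frac_fired near 0.51 is what uncorrelated gradients would produce. A sustained rise above that level in longer runs would be the signal of hack-aligned gradients.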

Next (queued in justfile, pending ≥80 GB GPU):

  1. queue-vanilla: Qwen2.5-Coder-7B baseline GRPO on full LeetCode set, 200 steps, 3 seeds, β=0.04, G=4. Expected hack_rate at convergence: 40–60% (Rebound table 2).
  2. queue-projected-m16: same config + per-module v_hack projection at m=16.
  3. queue-rebound: H3 baseline arm — Wu-Tang advantage modification.

Confidence in method post-mechanism-verification: ~65% (was ~60%). The bump is small because mechanism-works was already high-prior; the real evidence is the 7B run.

Project init

Scaffolded repo per setup-repo skill. Cloned external/rl-rewardhacking (Ariahw's verl-based GRPO + LeetCode reward-hacking benchmark) and fetched the three key papers (docs/papers/):

  • Ariahw, Engels, Nanda 2025 (LessWrong) — the benchmark and monitor-based interventions
  • Wu & Tang 2026 (arXiv 2604.01476) — "When Reward Hacking Rebounds"; proposes Advantage Modification using shortcut concept direction. This is the closest prior work to ours and the H3 baseline arm.
  • Ichihara et al. 2025 (arXiv 2509.22047) — MO-GRPO; multi-objective GRPO with per-reward variance normalization. Related framing of reward hacking as high-variance reward dominating advantage.
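One reading of "per-reward variance normalization" in the MO-GRPO entry is sketched below: each reward component is standardised within the group before the components are summed. This is an illustrative interpretation, not code from the paper; the function name and the `eps` guard are assumptions.

```python
from statistics import mean, pstdev


def normalized_multi_reward_advantages(reward_columns, eps=1e-8):
    """Sum of per-component z-scores within one group.

    reward_columns: {component_name: [reward per sample in the group]}
    Components with (near) zero spread carry no signal and are skipped.
    """
    if not reward_columns:
        return []
    n = len(next(iter(reward_columns.values())))
    advantages = [0.0] * n
    for name, column in reward_columns.items():
        if len(column) != n:
            raise ValueError(f"component {name} has {len(column)} rewards, expected {n}")
        mu, sigma = mean(column), pstdev(column)
        if sigma < eps:
            continue
        for i, r in enumerate(column):
            advantages[i] += (r - mu) / sigma
    return advantages
```

With `{"tests": [0.0, 1.0], "format": [10.0, 0.0]}` the raw sum would favour the first sample because the format reward has the larger scale. After normalisation the two components cancel, which is the sense in which a high-variance reward no longer dominates the advantage.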

Extracted brainstorm prefs to docs/brainstorm/extracted_prefs.md. Biggest delta vs spec.md: the project pivoted mid-brainstorm from DPO+sycophancy to GRPO+reward-hacking, and the method evolved from bidirectional NLL+KL+PCGrad (paired-preference) to gradient-level projection (unpaired). Confidence ~60% the method works post-Rebound (was ~40% pre-Rebound; Rebound validates the core mechanism — concept-direction-based intervention — but at advantage rather than gradient level).

Next: smoke test both pathways on tiny-random Qwen, prototype the results table, then move to 96GB GPU for the H4 sanity run.