Research Journal
Append-only. New entries at the top, date-stamped. Never edit old entries.
2026-05-30
96GB readiness review fixes
Fresh subagent review found a real silent-failure risk: v_hack is not just
model-specific, it is also SVD-basis-specific. The old extractor loaded fp32
while train.py loaded bf16, so keys/ranks could match while the basis differed.
Fix: extract_vhack_grad.py, verify_vhack_heldout.py, and train.py now all
use bf16 by default; v_hack artifacts save {model, dtype, v_hack} metadata;
train.py refuses legacy artifacts and checks exact module keys and per-module
rank before first generation.
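The load-time gate described above can be sketched roughly as follows. This is an illustration, not the code in train.py: the function name `check_vhack_artifact`, the plain-dict artifact layout, the module name in the example, and the use of list length as a stand-in for per-module rank are all assumptions.

```python
def check_vhack_artifact(artifact, model_name, dtype, module_ranks):
    """Raise ValueError unless a v_hack artifact matches the training setup.

    Hypothetical helper. `artifact` is assumed to be a dict carrying the
    {model, dtype, v_hack} fields; `module_ranks` maps each target module
    name to the rank of its SVD basis in the training process.
    """
    # Assumption: legacy artifacts predate the metadata and lack these fields.
    missing = {"model", "dtype", "v_hack"} - set(artifact)
    if missing:
        raise ValueError(f"legacy artifact, missing fields: {sorted(missing)}")
    if artifact["model"] != model_name:
        raise ValueError(f"model mismatch: {artifact['model']} != {model_name}")
    # Keys and ranks can match while the SVD basis differs, so dtype is
    # compared explicitly.
    if artifact["dtype"] != dtype:
        raise ValueError(f"dtype mismatch: {artifact['dtype']} != {dtype}")
    v_hack = artifact["v_hack"]
    if set(v_hack) != set(module_ranks):
        diff = sorted(set(v_hack) ^ set(module_ranks))
        raise ValueError(f"module key mismatch: {diff}")
    for name, rank in module_ranks.items():
        if len(v_hack[name]) != rank:
            raise ValueError(f"rank mismatch in {name}: {len(v_hack[name])} != {rank}")


# Illustrative data only; the module name is made up.
ranks = {"layers.0.q_proj": 3}
good = {
    "model": "Qwen3.5-0.8B",
    "dtype": "bf16",
    "v_hack": {"layers.0.q_proj": [0.1, -0.2, 0.3]},
}
check_vhack_artifact(good, "Qwen3.5-0.8B", "bf16", ranks)  # passes silently
```

The dtype comparison is the part that key and rank checks alone cannot cover: a basis computed in fp32 can have identical keys and ranks.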
Also removed a bad smoke convenience: zero-spread reward batches no longer get random advantages. Dr.GRPO now correctly gives zero advantage when all group rewards match, so logs cannot look healthy while training on reward-unrelated noise.
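A minimal sketch of the zero-spread behaviour, assuming the Dr.GRPO-style advantage is the reward minus the group mean with no standard-deviation scaling. The function name `group_advantages` and the returned spread flag are illustrative and not taken from train.py.

```python
def group_advantages(rewards):
    """Mean-centred advantages for one group of sampled completions.

    Returns (advantages, has_spread). Hypothetical helper for illustration.
    """
    if not rewards:
        raise ValueError("empty group")
    # Zero spread: every completion received the same reward, so the group
    # carries no learning signal. Return exact zeros, never a random tie-break.
    if max(rewards) == min(rewards):
        return [0.0] * len(rewards), False
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards], True


# A format-only group (both completions earn the same bonus) yields no update.
advantages, has_spread = group_advantages([0.25, 0.25])
```

Checking `max == min` before subtracting the mean avoids tiny non-zero advantages from floating-point rounding when all rewards are nominally equal.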
Validated on the 24GB box:
- just extract-vhack-smoke via pueue task 73: bf16, 186 modules, 148,032 delta_S scalars, zero-norm=0.
- just verify-vhack-smoke via pueue task 74: frac>0=0.952, mean=+0.355, median=+0.363, target pass.
- one-step canonical train probe via pueue task 75: loaded out/v_hack_smoke.pt with key/rank match OK, completed without legacy artifact. Reward spread was false and loss/cos/fired were zero, as expected after removing random advantages.
For the 96GB machine, do not start queue-full blindly. First run one sequential
gate: pueue add --immediate --follow -w "$PWD" -o 9 -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" -- just probe-full-seed 41.
Only queue 3 seeds after the vanilla probe has nontrivial hack rate.
Mechanism end-to-end verified on Qwen3.5-0.8B; H4 falsified at this scale
Closed the smoke loop: AntiPaSTO identity (bf16, max_abs_diff=0) -> v_hack extraction from 15 contrastive pairs -> held-out validation (frac>0=0.952, median cos=+0.363, n=186 modules) -> 10-step GRPO with subprocess-executed LeetCode rewards on vanilla and projected arms. Full writeup in out/proof.md.
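For reference, the held-out summary statistics quoted above could be computed from per-module cosines roughly like this. The sketch assumes one cosine similarity per target module; the function name and input layout are illustrative, and the pass criterion is omitted because this entry does not state it.

```python
from statistics import mean, median


def summarize_heldout(cosines):
    """Summarise per-module held-out cosines (hypothetical helper)."""
    if not cosines:
        raise ValueError("no modules to summarise")
    return {
        "n": len(cosines),
        # Fraction of modules whose held-out cosine is strictly positive.
        "frac_pos": sum(c > 0 for c in cosines) / len(cosines),
        "mean": mean(cosines),
        "median": median(cosines),
    }


# Toy input, not real measurements.
summary = summarize_heldout([0.5, 0.25, -0.25, 0.5])
```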
Observation (mechanism): projected arm shows cos_out < cos_in every step,
frac_fired ≈ 0.51 averaged over 10 steps. Vanilla arm: cos_out == cos_in.
The one-sided projection removes the v_hack-aligned component of the SVD-basis
gradient when and only when alignment is positive. This is the core mechanical
claim of the method and it is verified end-to-end.
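The one-sided rule can be sketched on a single flat gradient vector. This assumes a single v_hack direction per module and full removal of the aligned component; the actual per-module implementation (for example with several directions) may differ, and the function names here are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def project_one_sided(grad, v_hack):
    """Remove the v_hack-aligned part of grad only when alignment is positive.

    Returns (new_grad, fired). Hypothetical helper for illustration.
    """
    dot = sum(g * v for g, v in zip(grad, v_hack))
    if dot <= 0.0:
        # Gradient is orthogonal to or opposes v_hack: leave it untouched.
        return list(grad), False
    v_sq = sum(v * v for v in v_hack)  # > 0 here because dot > 0
    scale = dot / v_sq
    return [g - scale * v for g, v in zip(grad, v_hack)], True


v = [1.0, 0.0]
aligned, fired_a = project_one_sided([1.0, 1.0], v)   # positive alignment
opposed, fired_o = project_one_sided([-1.0, 1.0], v)  # negative alignment
```

In this sketch a fired projection drives the cosine with v_hack to zero, so any average over modules that mixes fired and unfired cases gives cos_out below cos_in, while an arm with no projection keeps them equal.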
Observation (H4 sanity): both arms produce zero hack_rate and zero pass_rate on 30 LeetCode medium/hard problems, G=2, 10 steps. Inspection of generations shows Qwen3.5-0.8B emits format-only output that saturates the 0.25 format bonus but never attempts code or hack patterns. Per spec.md §H4, this falls below the 30% hack-rate threshold and triggers the model-scaling fallback.
Inference: 0.8B is too small to exhibit the failure mode the method targets. The mechanism is sound; the test substrate is not. Wu & Tang's Rebound paper used Qwen2.5-Coder-7B and observed ~50% baseline hack rate; Ariahw's benchmark assumes ≥4B class models. Mechanism + scale are separable concerns and the smaller scope of this session was mechanism.
Caveats / what's untested:
- β=0 in smoke (no ref-model KL) to fit 24 GB. This is a 24-GB compromise, NOT a principled choice. Dr.GRPO argues β=0 is fine for reasoning RL with rule-based reward, but we're studying reward hacking, which IS the distributional shift their argument assumes away. lite/full presets default to β=0.04 to match Ariahw 2025 and Wu-Tang Rebound 2026; without that we'd confound "hacking from the targeted shortcut direction" with "generic policy collapse". Free-ref-model trick (delta_S=0 forward) makes β>0 zero-VRAM-cost, so lite/full can do this properly.
- Only 10 steps. Reward-hacking emerges around step 50–200 in Rebound figs.
- 186 target modules, full-rank per-module SVD. Larger models scale similarly.
- frac_fired ≈ 0.5 is consistent with random gradient direction wrt v_hack at init; we expect it to rise as training induces hack-aligned grads. Need longer runs to see this.
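The claim that frac_fired ≈ 0.5 is what random gradients would give can be checked with a toy simulation. It assumes isotropic Gaussian gradients, which real gradients at init need not be; the dimension, trial count, and function name are arbitrary choices for illustration.

```python
import random


def frac_fired_random(dim=64, trials=2000, seed=0):
    """Fraction of random gradients with positive dot product against a fixed v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    fired = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # The one-sided projection fires only on positive alignment.
        if sum(a * b for a, b in zip(g, v)) > 0:
            fired += 1
    return fired / trials


frac = frac_fired_random()
```

By symmetry of the Gaussian, the expected value is exactly 0.5, so a sustained rise above that in longer runs would indicate hack-aligned gradients rather than chance.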
Next (queued in justfile, pending ≥80 GB GPU):
- queue-vanilla: Qwen2.5-Coder-7B baseline GRPO on full LeetCode set, 200 steps, 3 seeds, β=0.04, G=4. Expected hack_rate at convergence: 40–60% (Rebound table 2).
- queue-projected-m16: same config + per-module v_hack projection at m=16.
- queue-rebound: H3 baseline arm, Wu-Tang advantage modification.
Confidence in method post-mechanism-verification: ~65% (was ~60%). The bump is small because mechanism-works was already high-prior; the real evidence is the 7B run.
Project init
Scaffolded repo per setup-repo skill. Cloned external/rl-rewardhacking (Ariahw's verl-based GRPO + LeetCode reward-hacking benchmark) and fetched the three key papers (docs/papers/):
- Ariahw, Engels, Nanda 2025 (LessWrong) — the benchmark and monitor-based interventions
- Wu & Tang 2026 (arXiv 2604.01476) — "When Reward Hacking Rebounds"; proposes Advantage Modification using shortcut concept direction. This is the closest prior work to ours and the H3 baseline arm.
- Ichihara et al. 2025 (arXiv 2509.22047) — MO-GRPO; multi-objective GRPO with per-reward variance normalization. Related framing of reward hacking as high-variance reward dominating advantage.
Extracted brainstorm prefs to docs/brainstorm/extracted_prefs.md. Biggest delta vs spec.md: the project pivoted mid-brainstorm from DPO+sycophancy to GRPO+reward-hacking, and the method evolved from bidirectional NLL+KL+PCGrad (paired-preference) to gradient-level projection (unpaired). Confidence ~60% the method works post-Rebound (was ~40% pre-Rebound; Rebound validates the core mechanism — concept-direction-based intervention — but at advantage rather than gradient level).
Next: smoke test both pathways on tiny-random Qwen, prototype the results table, then move to 96GB GPU for the H4 sanity run.