4.0 KiB
Research Journal
Append-only. New entries at the top, date-stamped. Never edit old entries.
2026-05-30
Mechanism end-to-end verified on Qwen3.5-0.8B; H4 falsified at this scale
Closed the smoke loop: AntiPaSTO identity (bf16, max_abs_diff=0) -> v_hack extraction from 15 contrastive pairs -> held-out validation (frac>0=0.952, median cos=+0.363, n=186 modules) -> 10-step GRPO with subprocess-executed LeetCode rewards on vanilla and projected arms. Full writeup in out/proof.md.
Observation (mechanism): projected arm shows cos_out < cos_in every step,
frac_fired ≈ 0.51 averaged over 10 steps. Vanilla arm: cos_out == cos_in.
The one-sided projection removes the v_hack-aligned component of the SVD-basis
gradient when and only when alignment is positive. This is the core mechanical
claim of the method and it is verified end-to-end.
Observation (H4 sanity): both arms produce zero hack_rate and zero pass_rate on 30 LeetCode medium/hard problems, G=2, 10 steps. Inspection of generations shows Qwen3.5-0.8B emits format-only output that saturates the 0.25 format bonus but never attempts code or hack patterns. Per spec.md §H4, this falls below the 30% hack-rate threshold and triggers the model-scaling fallback.
Inference: 0.8B is too small to exhibit the failure mode the method targets. The mechanism is sound; the test substrate is not. Wu & Tang's Rebound paper used Qwen2.5-Coder-7B and observed ~50% baseline hack rate; Ariahw's benchmark assumes ≥4B class models. Mechanism + scale are separable concerns and the smaller scope of this session was mechanism.
Caveats / what's untested:
- β=0 (no ref-model KL) to fit 24 GB. Rebound used β=0.04. KL-free GRPO can diverge faster; not a fair comparison to Rebound at this scale.
- Only 10 steps. Reward-hacking emerges around step 50–200 in Rebound figs.
- 186 target modules, m=8 SVD rank. Larger models scale this to ~400+ modules.
frac_fired ≈ 0.5is consistent with random gradient direction wrt v_hack at init; we expect it to rise as training induces hack-aligned grads. Need longer runs to see this.
Next (queued in justfile, pending ≥80 GB GPU):
queue-vanilla: Qwen2.5-Coder-7B baseline GRPO on full LeetCode set, 200 steps, 3 seeds, β=0.04, G=4. Expected hack_rate at convergence: 40–60% (Rebound table 2).queue-projected-m16: same config + per-module v_hack projection at m=16.queue-rebound: H3 baseline arm — Wu-Tang advantage modification.
Confidence in method post-mechanism-verification: ~65% (was ~60%). The bump is small because mechanism-works was already high-prior; the real evidence is the 7B run.
Project init
Scaffolded repo per setup-repo skill. Cloned external/rl-rewardhacking (Ariahw's verl-based GRPO + LeetCode reward-hacking benchmark) and fetched the three key papers (docs/papers/):
- Ariahw, Engels, Nanda 2025 (LessWrong) — the benchmark and monitor-based interventions
- Wu & Tang 2026 (arXiv 2604.01476) — "When Reward Hacking Rebounds"; proposes Advantage Modification using shortcut concept direction. This is the closest prior work to ours and the H3 baseline arm.
- Ichihara et al. 2025 (arXiv 2509.22047) — MO-GRPO; multi-objective GRPO with per-reward variance normalization. Related framing of reward hacking as high-variance reward dominating advantage.
Extracted brainstorm prefs to docs/brainstorm/extracted_prefs.md. Biggest delta vs spec.md: the project pivoted mid-brainstorm from DPO+sycophancy to GRPO+reward-hacking, and the method evolved from bidirectional NLL+KL+PCGrad (paired-preference) to gradient-level projection (unpaired). Confidence ~60% the method works post-Rebound (was ~40% pre-Rebound; Rebound validates the core mechanism — concept-direction-based intervention — but at advantage rather than gradient level).
Next: smoke test both pathways on tiny-random Qwen, prototype the results table, then move to 96GB GPU for the H4 sanity run.