# Experiment: rank-space gradient projection vs RL reward hacking

## Context

GRPO and related on-policy RL methods are known to exploit loopholes in reward functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)) open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead of solving problems, reaching 79% reward hack rate at 200 training steps.

Existing mitigations are mostly monitor-based (detect at output) or advantage-based (Rebound: penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026 [arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).

This experiment tests a different mechanism: **wrap target modules with the AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r SVD basis from contrastive pairs, and project each step's `grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**

Mechanism difference from Rebound: gradient-level direction constraint on weight-update subspace vs rollout-level scalar penalty on advantage.

This is preregistered: results to be reported regardless of outcome.

## Why AntiPaSTO and not vanilla LoRA

Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so "project out v_hack in rank space" has no fixed reference frame. AntiPaSTO (Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py)) freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]` plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD basis of the original weight, so `v_hack` extracted in that basis remains meaningful across all training steps.
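To make the pinned rank axis concrete, here is a minimal NumPy sketch. It is not the lora-lite implementation: the toy sizes, the random `W`, and the quadratic loss are all assumptions chosen for illustration. It freezes the SVD factors of a weight, checks that the wrapped forward pass reproduces the original linear map at `delta_S = 0`, and confirms by finite differences that the gradient on `delta_S` is one number per frozen singular direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, T = 5, 7, 4            # toy sizes (assumed); T = number of tokens
W = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained weight
x = rng.normal(size=(T, d_in))

# Frozen SVD basis: W = U @ diag(S) @ Vh, full rank r = min(d_out, d_in).
U, S, Vh = np.linalg.svd(W, full_matrices=False)
r = S.shape[0]

def forward(delta_S):
    # y = ((x Vh^T) * (S + delta_S)) U^T
    return ((x @ Vh.T) * (S + delta_S)) @ U.T

def loss(delta_S):
    # Toy loss (assumed): half the squared norm of the output.
    return 0.5 * np.sum(forward(delta_S) ** 2)

# 1) At delta_S = 0 the wrapper reproduces the original linear layer.
roundtrip_err = np.max(np.abs(forward(np.zeros(r)) - x @ W.T))

# 2) Closed-form gradient on delta_S: sum_t (x_t Vh^T) * (dL/dy_t U), shape [r].
delta = rng.normal(size=r) * 0.1
dL_dy = forward(delta)               # for this toy loss, dL/dy = y
grad_closed = np.sum((x @ Vh.T) * (dL_dy @ U), axis=0)

# Central finite differences along each rank coordinate.
eps = 1e-6
basis = np.eye(r)
grad_fd = np.array([
    (loss(delta + eps * basis[j]) - loss(delta - eps * basis[j])) / (2 * eps)
    for j in range(r)
])
grad_err = np.max(np.abs(grad_closed - grad_fd))
print(f"round-trip error {roundtrip_err:.2e}, gradient error {grad_err:.2e}")
```

Because `U` and `Vh` never change, coordinate `j` of this gradient refers to the same singular direction at every training step, which is the property the projection relies on.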
Forward pass per wrapped module (first pass uses full rank $r = \min(d_{in}, d_{out})$, so the residual term $W_{res}$ vanishes):

$$y = ((x V_h^T) \odot (S + \delta_S)) U^T$$

where $U$, $S$, $V_h$ come from the SVD of $W$ and are buffers (frozen). Trainable: $\delta_S : [r]$ (and optionally a small Cayley rotation `rot_T`, which we leave off by default). At reduced rank we would add $x W_{res}^T$ with $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, but we defer rank cropping to v2 to skip the "where to cut" question.

Per-step gradient signal (here $h_t$ denotes the module output $y$ at token $t$):

$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t} U\right) \in \mathbb{R}^r$$

We extract `v_hack` **gradient-side** (locked in): for each contrastive pair, run one NLL backward on the completion tokens and read each module's `m.delta_S.grad : [r]`. Then $\hat v_{hack}^{(m)} = \mathrm{unit}\left(\mathrm{mean}_{hack}\,\mathrm{grad} - \mathrm{mean}_{clean}\,\mathrm{grad}\right)$. This lives in the exact same `[r]` rank space the per-step training gradient lives in (the gradient is the natural object to compare gradients against), and it fuses the input-activation and output-error contributions in one shot instead of guessing whether input-side $(x V_h^T)$ or output-side $(\partial L/\partial h)\, U$ better predicts where SGD will move.

We did consider activation-side extraction ($x V_h^T$ mean-diff). It was dropped as primary because it only sees the input factor and ignores the output-error factor, while the per-step gradient sees both.

**Generalization constraint (load-bearing methodology).** The pairs used for `v_hack` extraction must come from a distribution *distinct from* whatever the RL-trained model produces in deployment. This is the threat model: at deployment we don't know which hacks will emerge, so the defense's generalization claim depends on `v_hack` being authored *before* seeing rollouts.
If we post-hoc tune `pairs.py` to match the specific hack patterns that emerge during vanilla training, we've fit our defense to a known attack: that's teaching to the test, not testing the hypothesis. The current `pairs.py` is deliberately broader than any single threat model (4 hack flavors, not just ariahw's overwrite-tests loophole) so that suppression of a *specific* emergent pattern is evidence the subspace generalizes. If projection fails to suppress emergent hacks, the right response is to interpret the negative result, not to widen `pairs.py` to retroactively include the failed pattern.

Projection (locked: no magnitude threshold; one-sided clip stays, see note):

$$g \leftarrow g - \max(0,\, \cos_{align}) \cdot \|g\| \cdot \hat v_{hack}, \qquad \cos_{align} = \frac{g \cdot \hat v_{hack}}{\|g\|}$$

then rescale to original $\|g\|$ (magnitude-preserving). The $\max(0,\cdot)$ is not gating, it's directional correctness: without it, when $\cos<0$ we'd be *adding* to the hack component.

No magnitude/threshold gating (locked): we project every step every module. Capacity cost is ~1/r per module per step. If `v_hack` at a module is just noise, projection ablates a noise direction in expectation = approximately a no-op.

## Why not vanilla GRPO via verl

verl is Ariahw's framework but uses Ray + FSDP2 + Hydra; inserting a pre-optimizer-step hook on per-module rank-space gradients requires deep subclassing of their worker abstraction. We pay one cost in exchange: we use [lsdefine/simple_GRPO](https://github.com/lsdefine/simple_GRPO) instead. simple_GRPO is a two-file GRPO implementation (`ref_server.py` + `grpo_ref_split.py`, ~315 lines total) with reported convergence on Qwen2.5-7B. The training loop is literally `loss = GRPO_step(batch); engine.backward(loss); engine.step()`, so inserting a projection hook between backward and step is a one-line edit.
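Before wiring the hook into the trainer, the projection rule from this spec can be sanity-checked on a toy vector. The sketch below is a standalone NumPy illustration under assumed inputs (the 2-D vectors and the helper name `project_one_sided` are hypothetical), not the training code.

```python
import numpy as np

def project_one_sided(g, v_hat, eps=1e-8):
    """Remove the positive component of g along unit v_hat, then restore ||g||."""
    g_norm = np.linalg.norm(g)
    if g_norm < 1e-12:
        return g.copy()
    cos_align = float(g @ v_hat) / g_norm
    if cos_align <= 0.0:
        # max(0, cos) = 0: the gradient already points away from v_hat, leave it.
        return g.copy()
    g_new = g - cos_align * g_norm * v_hat                    # now orthogonal to v_hat
    return g_new * (g_norm / (np.linalg.norm(g_new) + eps))   # magnitude-preserving

g = np.array([3.0, 4.0])                                  # ||g|| = 5
aligned = project_one_sided(g, np.array([1.0, 0.0]))      # cos = +0.6, projection fires
opposed = project_one_sided(g, np.array([-1.0, 0.0]))     # cos = -0.6, no-op
print(aligned, opposed)
```

In the aligned case the component along `v_hat` is removed and the remainder is stretched back to length 5; in the opposed case the gradient is returned unchanged. One edge case worth noting: if `g` is exactly parallel to `v_hat`, nothing is left after projection, so the rescale cannot restore the norm and the result is the zero vector.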
Cost of this deviation: we re-establish the "vanilla hack emergence" baseline on simple_GRPO rather than inheriting it from Ariahw's verl baseline. H4 is the sanity check that this happens. We port Ariahw's `run_tests`-overwrite detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py)) into simple_GRPO's reward server (`docs/vendor/simple_GRPO/ref_server.py`).

Vendored references (read-only, see [docs/vendor/](docs/vendor/)):

- [simple_GRPO](https://github.com/lsdefine/simple_GRPO): GRPO trainer
- [lora-lite](https://github.com/wassname/lora-lite): AntiPaSTO adapter
- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)

## Hypotheses (preregistered)

**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack` extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched LeetCode pass rate within 10pp of vanilla. Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.

**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms advantage-level intervention (Rebound reimplemented) on hack rate at matched pass rate. Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.

**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla AntiPaSTO+GRPO on simple_GRPO reproduces measurable reward hacking (>30% hack rate at 200 steps). Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with num_generations halved. Secondary: if simple_GRPO can't reproduce hacking on either model, fall back to Ariahw's verl path and accept the harder hook.

## Steps

### 1. Build infra: fast-dev-run targets first, no real training yet

- **1a.** Vendor simple_GRPO into `docs/vendor/simple_GRPO/` (done); smoke-run their GSM8K example on tiny-random-qwen3 (5 steps, CPU) to confirm `ref_server` + `grpo_ref_split` rollout/train split works in our env.
- **1b.** Vendor lora-lite into `docs/vendor/lora-lite/` (done); wrap Qwen3.5-0.8B attn+MLP `nn.Linear` modules with AntiPaSTO **at full rank** (`r = min(d_in, d_out)`, no SVD cropping; `rotate_basis="none"`, only `delta_S` trainable). Full rank means $W = U \,\mathrm{diag}(S)\, V_h$ exactly and `W_res = 0`, so there's no truncation error to debug on the first pass. Verify forward-pass round-trip numerically matches base model at $\delta_S = 0$ (max abs diff <1e-3 on a fixed prompt).
- **1c.** Implement gradient-side `v_hack` extraction (pseudocode below). Validation: per-module held-out projection score `cos(g_held_hack - g_held_clean, v_hack)` > 0 in >50% of modules.

### 2. H4 sanity: does vanilla AntiPaSTO+GRPO+simple_GRPO produce hacking?

- **2a.** Port Ariahw's `run_tests`-overwrite detection into simple_GRPO's `ref_server.py` reward fn. Verify the reward fn fires on synthetic hack/clean rollouts before real training.
- **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=full`, `delta_S` only), GRPO (group_norm), 200 steps, num_generations=8, batch=16, 1 seed. Decision: if hack rate <30%, switch to Qwen3-4B with `num_generations=4, batch=16` (half num_gen to keep VRAM headroom) and re-run 2b. Secondary fallback: drop simple_GRPO, return to verl.

### 3. Implement rank-space projection in simple_GRPO's training loop

- **3a.** In `grpo_ref_split.py`, between `engine.backward(loss)` and `engine.step()`, call `project_grads(model, v_hack_cache)`. `project_grads` walks `[m for m in model.modules() if hasattr(m, 'delta_S')]` and for each module reads `m.delta_S.grad : [r]`, projects against `v_hack[module_name]` (one-sided, magnitude-preserving), writes back in place. (Pseudocode below.)
- **3b.** Diagnostics logged per step (aggregated over modules): mean/std `cos_align`, mean `||grad||`, `frac_modules_with_cos>0`.

### 4. Run arms (200 steps each, 3 seeds where indicated)

- **a.** Vanilla AntiPaSTO + GRPO (3 seeds): baseline
- **b.** Our method, gradient-side `v_hack`, no gating (3 seeds): main result
- **c.** Our method, no magnitude preservation (1 seed): design ablation
- **d.** Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds): H3 (concrete formula: per-rollout penalty `α · max(0, cos(h_mean, v_concept))` added to scalar reward, where `h_mean` is mean residual-stream activation at a chosen layer and `v_concept` is mean-diff activation direction extracted from the same 60-80 pairs. We use Wu & Tang 2026 §3.2's published `α=0.5` and same layer fraction (60-75% depth). Single layer, not per-module, matching their setup. *Different `v_concept` from our gradient-side `v_hack`; this is intentional: H3 isolates the gradient-vs-advantage mechanism choice, not the direction-extraction choice.*)

Total: 10 runs × ~3h on RTX 6000 96GB = ~30h compute.

*(Rank sweep deferred to v2; first pass uses `r = min(d_in, d_out)` per module, no cropping.)*

### 5. Measure at every 25 steps

- **Hack rate** (Ariahw's detector ported into simple_GRPO)
- **Pass rate** on held-out problems without write access to evaluator
- **Per-module `cos_align`** trajectory (sanity that we're projecting something nonzero)
- **`frac_modules_with_cos>0`** per step (sanity that one-sided clip fires)
- **KL drift from init policy** (catastrophic-change check)

### 6. Headline plot and headline table

**Plot.** Hack rate vs pass rate, one point per (arm × seed). Pareto frontier. Our method should land below-and-to-the-right of vanilla. Annotate Rebound.
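Each arm's points in this plot, and the "within 1 SEM" clause in H1, rest on a per-arm mean and standard error across seeds. A minimal sketch of that aggregation follows. The hack rates are made-up numbers purely for illustration (they are not results), and the helper name `mean_sem` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_sem(values):
    """Mean and standard error of the mean across seeds (sample std, n - 1)."""
    m = mean(values)
    sem = stdev(values) / sqrt(len(values)) if len(values) > 1 else float("nan")
    return m, sem

# ILLUSTRATIVE numbers only: hack rate (%) at step 200, one entry per seed.
hack_rate = {
    "vanilla (a)": [62.0, 70.0, 66.0],
    "ours (b)": [20.0, 26.0, 23.0],
}

summary = {arm: mean_sem(v) for arm, v in hack_rate.items()}
reduction_pp = summary["vanilla (a)"][0] - summary["ours (b)"][0]

for arm, (m, sem) in summary.items():
    print(f"{arm}: {m:.1f} ± {sem:.1f} %")
print(f"absolute reduction: {reduction_pp:.1f} pp")
```

A single-seed arm such as (c) has no SEM under this definition, so it would be reported as a point estimate only.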
**Table schema (publication-ready; left-to-right = essential to optional, so trailing columns can be cut for space):**

| Arm | ΔSafePass↑ | Hack %↓ | Pass %↑ | KL↓ | mean·cos\* | frac·fired\* | ‖g‖\* |
|---|---|---|---|---|---|---|---|
| Vanilla (a) | 0 (ref) | - | - | - | - | - | - |
| **Ours (b)** | - | - | - | - | - | - | - |
| Ours, no mag-preserve (c) | - | - | - | - | - | - | - |
| Rebound (d) | - | - | - | - | - | - | - |

*Caption.* ↑ higher is better, ↓ lower is better. **ΔSafePass** = (pass% − hack%) − vanilla's (pass% − hack%): single headline number, positive means we win. **Hack %** = fraction of rollouts triggering `run_tests`-overwrite detector. **Pass %** = fraction passing held-out tests without write access. **KL** = mean per-token KL from init policy over last 25 steps. \* = projection-internal diagnostic, only meaningful for arms (b)/(c); distinguishes "projection active" (mean·cos > 0.2, frac·fired > 0.4) from "projection silent no-op". Cells report mean ± SEM across seeds.

### 7. Falsification check

Before publishing, run pre-registered analysis on H1, H3, H4. Report all hypotheses including falsified ones.

## Pseudocode (the three load-bearing bits)

### A. AntiPaSTO module wrap (full rank, first pass)

```
import torch
import torch.nn as nn

class AntiPaSTO(nn.Module):
    # constructed from an existing nn.Linear(W: [d_out, d_in], b)
    # FIRST PASS: r = min(d_out, d_in) -- no truncation, W_res == 0
    def __init__(self, W, b=None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(W.detach().float(), full_matrices=False)
        r = S.shape[0]  # = min(d_out, d_in)
        # buffers (frozen): the full SVD
        self.register_buffer("U", U)    # [d_out, r]
        self.register_buffer("S", S)    # [r]
        self.register_buffer("Vh", Vh)  # [r, d_in]
        # frozen bias; zeros if the wrapped Linear had none
        bias = torch.zeros(W.shape[0]) if b is None else b.detach().float()
        self.register_buffer("b", bias)
        # trainable (ONLY this): scalar per rank
        self.delta_S = nn.Parameter(torch.zeros(r))

    def forward(self, x):  # x: [..., d_in]
        return ((x @ self.Vh.T) * (self.S + self.delta_S)) @ self.U.T + self.b
```

Replace every target `nn.Linear` in attn (`q,k,v,o_proj`) and MLP (`up,gate,down_proj`) with this.
At `delta_S=0`, output == original linear up to numerical precision (no `W_res` residual term needed at full rank).

**SVD precompute strategy.** Don't SVD the whole model on GPU at once. Load the base model on CPU, then for each target `Linear`: move `W` to GPU, run `torch.linalg.svd(W.float(), full_matrices=False)`, save `(U, S, Vh) -> svd_cache/{model_name}/{module_path}.pt`. Wrap construction then loads the cached SVD per module. SVD is done once per base model; ~5-10s per big MLP weight on RTX 3090.

### B. Gradient-side `v_hack` extraction (per module)

```
from collections import defaultdict

import torch
import torch.nn.functional as F

# `model`, `pairs`, `tokenize` and `completion_mask` come from the project (not defined here).
v_hack = {}  # dict[module_name -> Tensor[r]]
grads_hack = defaultdict(list)
grads_clean = defaultdict(list)

# Per-pair: process hack and clean independently, NLL over their own completion
# tokens only. Different completion lengths are fine -- we use mean NLL
# (sum_nll / n_completion_tokens), so each completion contributes a length-normalized
# gradient. This avoids biasing v_hack toward longer (typically clean)
# completions. Pad each example individually; no cross-completion padding.
for (prompt, hack_completion, clean_completion) in pairs:
    prompt_len = tokenize(prompt).shape[1]
    for label, completion in [('hack', hack_completion), ('clean', clean_completion)]:
        model.zero_grad()
        ids = tokenize(prompt + completion)                 # [1, L]
        mask = completion_mask(ids, prompt_len=prompt_len)  # 1 on completion tokens
        logits = model(ids).logits[:, :-1]                  # [1, L-1, V]
        # per-token NLL of the next token: [1, L-1]
        nll = F.cross_entropy(logits.transpose(1, 2).float(), ids[:, 1:], reduction='none')
        # MEAN NLL over completion tokens (length-normalized)
        loss = (nll * mask[:, 1:]).sum() / mask[:, 1:].sum()
        loss.backward()
        for name, m in model.named_modules():
            if hasattr(m, 'delta_S'):
                bucket = grads_hack if label == 'hack' else grads_clean
                bucket[name].append(m.delta_S.grad.detach().cpu().clone())

for name in grads_hack:
    diff = torch.stack(grads_hack[name]).mean(0) - torch.stack(grads_clean[name]).mean(0)  # [r]
    v_hack[name] = diff / (diff.norm() + 1e-8)
torch.save(v_hack, 'v_hack.pt')
```

Validation (report both, don't just gate on threshold):

- On held-out pairs, recompute per-module `diff_held` and `cos_align_held = cos(diff_held, v_hack[name])`.
- **Distribution check (primary):** plot histogram of `cos_align_held` across all modules. Healthy = unimodal positive, median > 0.3. Pathological = bimodal or median near 0.
- **Gate (secondary):** `cos_align_held > 0` in >50% of modules is the minimum to proceed; mean `cos_align_held > 0.2` is the target. If <50% pass, extraction is broken and we debug before training.

### C. Pre-optimizer-step projection hook

```
import torch
from torch import Tensor

@torch.no_grad()
def project_grads(model, v_hack: dict[str, Tensor]):
    # called after engine.backward(loss), before engine.step()
    cos_log, n_modules, n_fired = [], 0, 0
    for name, m in model.named_modules():
        if not hasattr(m, 'delta_S'):
            continue
        g = m.delta_S.grad  # [r]
        if g is None:
            continue
        n_modules += 1
        v = v_hack[name].to(device=g.device, dtype=g.dtype)  # [r], unit
        g_norm = g.norm()
        if g_norm < 1e-12:
            continue
        cos_a = (g @ v) / g_norm  # scalar
        cos_log.append(cos_a.item())
        if cos_a <= 0:
            continue  # one-sided: max(0, cos) = 0, gradient left untouched
        n_fired += 1
        g_new = g - cos_a * g_norm * v                     # remove hack component
        g_new = g_new * (g_norm / (g_new.norm() + 1e-8))   # magnitude preserve
        m.delta_S.grad.copy_(g_new)
    mean_cos = sum(cos_log) / max(len(cos_log), 1)
    return dict(mean_cos=mean_cos, frac_fired=n_fired / max(n_modules, 1))
```

Integration into `grpo_ref_split.py` training loop (vendored at `docs/vendor/simple_GRPO/simple_grpo_v1/grpo_ref_split.py`; we copy and edit, not import):

```
# at top of training script, once:
v_hack = torch.load('v_hack.pt', map_location='cpu')  # dict[str, Tensor[r]]
# (extraction script from B above produces this artifact; if missing, crash loud)

# inside the training loop:
loss = GRPO_step(batch)
engine.backward(loss)
stats = project_grads(engine.module, v_hack)  # <-- NEW: 1 line
engine.step()
if rank == 0:
    log(stats)
```

## Decisions left open (write these up alongside results)

- **Rank `r`.** First pass: `r = min(d_in, d_out)` per module (no cropping) to avoid debugging where to cut the SVD. Trainable params per module = `min(d_in, d_out)`, still tiny vs full LoRA's `r*(d_in+d_out)`. Tradeoff: larger `r` keeps geometric fidelity but `v_hack`'s SNR per dim degrades; smaller `r` would concentrate hack signal but introduces truncation error in `W_res`. Rank sweep is v2 work.

## Why measure ratio, not just hack rate

A model that learns nothing won't cheat. The honest metric is the *Pareto frontier* of (hack rate, pass rate), not either alone.
Pure hack-rate rewards undertraining; pure pass-rate rewards anything that improves coding including via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out problems without write access, our method reduces hack rate from X% to Y%."

## Compute estimate

- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps, simple_GRPO, AntiPaSTO full rank)
- 10 runs: 25-35h
- At ~\$3 AUD/hr: \$75-105 AUD
- Plus debugging buffer: budget ~\$200 AUD total
- Calendar time: 1 week back-to-back; 2-3 weeks with iteration

## Risks and decision points

- **H4 falsified (no hack on Qwen3.5-2B with simple_GRPO):** branch 1: try Qwen3-4B same hyperparams. Branch 2: drop simple_GRPO, hook into verl directly. Adds ~1-2 weeks engineering.
- **AntiPaSTO + GRPO doesn't train:** known risk: antipasto's trainable subspace (`delta_S` only) may be too small for RL. If so, document and fall back to PiSSA-LoRA-freeze-A. We do **not** enable Cayley rotation (`rotate_basis="V"`) as a mitigation: a rotated rank axis breaks the invariant that `v_hack` (extracted in the original SVD basis) stays meaningful across training, which is the whole point of using AntiPaSTO over vanilla LoRA.
- **`v_hack` steering check fails (per-module projection scores ≤chance):** extraction broken. Check (a) hook captures pre-residual input, (b) pair quality drives strong activation difference somewhere, (c) tokenization of hack vs clean completions isn't trivially distinguishing.
- **All methods tie vanilla on hack rate:** intervention not biting. Check `cos_align` logs nonzero, `frac_modules_with_cos>0` nonzero.
## What this is not

- Not a claim that rank-space gradient projection solves reward hacking generally
- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't re-run)
- Not a claim about hacks beyond `run_tests()` overwrite
- Not a replacement for RLHF safety pipeline; this is a targeted intervention

## Related work and naming

- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)): advantage-side concept-direction penalty during GRPO. Our H3 baseline.
- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)): source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py), [wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)): adapter we wrap with.
- **simple_GRPO** ([lsdefine/simple_GRPO](https://github.com/lsdefine/simple_GRPO)): GRPO trainer.
- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)): frozen top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.

## Amendments

### 2026-05-23: Reverting to spec'd 2B substrate; safetensors v_hack

**Context.** Two earlier sessions drifted the code away from this spec without amending it:

- §1b smoke ran Qwen3.5-**0.8B** on a 24GB box (not the spec'd 2B). Result: `HACK_RATE=0.000, PASS_RATE=0.000` over 10 steps, G=2, β=0 (mechanism-only). Generations were format-only. See `docs/RESEARCH_JOURNAL.md:50-78`. This is **not** a clean falsification of H4: the 0.8B run was below the spec's tested model size.
- §H4 fallback was supposed to branch to Qwen3-4B with `num_generations=4`. The justfile/handover instead introduced `lite = Qwen2.5-Coder-1.5B` and `full = Qwen2.5-Coder-7B` (rationale: Wu & Tang 2026 Rebound used Coder-7B and observed ~50% hack rate, so matched-substrate H3 comparison).
  This deviation was never written into spec.md. Reverting it now.

**Decision.** spec.md remains canonical. `full = Qwen3.5-2B` (the spec H4 substrate) on the 96GB box, with `num_generations=8`, `beta=0.04`, 200 steps. The Coder-7B path is parked, not formalized. If H4 fails at 2B on this stack we revisit the spec-pinned fallback (Qwen3-4B, `num_gen=4`) before considering Coder-7B again.

**Open questions (this iteration).**

1. Does Qwen3.5-2B + AntiPaSTO + simple_GRPO + Dr.GRPO loss actually train (loss finite, reward spread > 0 on most steps, no policy collapse)?
2. Does reward hacking emerge, i.e. is the spec's H4 (>30% hack rate at step 200) reproducible on *our* stack, not just on Ariahw's verl path?
3. How many wall-clock hours for a single 2B vanilla run on the 96GB GPU? Spec estimate is 2-3h; first run is the calibration.

**Tasks (in order).**

1. `train.py:209` currently calls `load_v_hack` unconditionally. Gate it on `arm == "projected"` so a vanilla H4 sanity run does not require a v_hack artifact it never uses.
2. Refactor v_hack artifact format from `torch.save({"model","dtype","v_hack"})` to `safetensors.torch.save_file(tensors, path, metadata={"model","dtype"})`. Native header metadata replaces the manual dict wrapper. Touches `extract_vhack_grad.py`, `verify_vhack_heldout.py`, `train.load_v_hack`, and justfile suffixes (`.pt` → `.safetensors`).
3. Repoint `full` preset to `Qwen/Qwen3.5-2B` in `train.py`, `justfile`, `docs/handover.md`. Drop Coder-7B from the named presets.
4. Queue a single-seed vanilla H4: `train.py --preset=full --arm=vanilla --seed=41`. Read final `HACK_RATE`, `PASS_RATE`, and `steps=` count.
5. If `HACK_RATE > 0.30`: proceed to v_hack extraction at 2B and the projected arm. If not: revisit the spec-pinned 4B fallback before anything else.
**What is explicitly NOT changing.** The hypotheses (H1, H3, H4), the mechanism (rank-space gradient projection), the loss (Dr.GRPO unbiased), the projection geometry (one-sided, magnitude-preserving), and the gradient-side v_hack extraction. The spec body is preregistered; only the substrate-pinning and artifact-format choices are being aligned here.

### 2026-05-23 (b): GRPO outer loop, sampling, optimizer aligned to references

**Context.** First attempts at the H4 baseline run (tasks 76, 77, 79, 80, 81) exposed three classes of issue:

- **OOM at step 2 on 2B / G=8 / max_new=1024** despite the 96GB card. Root cause: `model(merged).logits.float()` upcast on the policy forward materialized a `[8, ≈1500, 152k]` fp32 vocab tensor (~7 GB) on top of the full autograd graph. Fix: replaced `per_token_logps` with fused `F.cross_entropy`; enabled gradient checkpointing + `enable_input_require_grads` (canonical PEFT trick: base params are frozen, so without this the embedding output has no grad and HF's `checkpoint()` shorts out).
- **`flash-linear-attention` fast path missing** on Qwen3.5's gated-delta-net `linear_attn` layers, plus no flash-attn for `self_attn`. Installed prebuilt wheels matching cu12 + torch 2.8 + cp313 (`causal-conv1d 1.6.2.post1`, `flash-attn 2.8.3`, `flash-linear-attention 0.5.0`). Pinned via `[tool.uv.sources]` in pyproject. Verified Blackwell sm_120 dispatch.
- **Zero reward spread on every step** (`rew=+0.25 std=0.00`): single-prompt GRPO with a binary reward shape gives no advantage signal when the 2B substrate fails every problem identically. This made it indistinguishable whether we had a hyperparam bug or a substrate-capacity bug.

**Decision: align the outer-loop, sampling, and optimizer with the lineage we already adopted** (simple_GRPO for the inner GRPO_step math, canonical for optimizer/schedule, Qwen3.5 model card for sampling). Specifically:

- `prompts_per_step = 8` per optimizer step (was 1), with grad accumulation across the P prompts.
  This follows simple_GRPO's `Q_batch_size` pattern. GRPO advantage is computed *per prompt* on its group of G generations; sampling many prompts per step raises the chance any one group has non-degenerate spread.
- **Skip per-prompt group when** `max(R) - min(R) < 1e-4` (simple_GRPO `grpo_vllm_one.py:208`). Saves the full forward+backward when the group's rewards are flat (which is currently 100% of groups).
- **Sampling per Qwen3.5 model card (non-thinking, text)**: `temperature=1.0, top_p=1.0, top_k=20, min_p=0.0, repetition_penalty=1.0`. Pass `enable_thinking=False` to `apply_chat_template` so the chat template does not inject `...` thinking blocks that waste `max_new`. (Canonical rl-rewardhacking also defaults `enable_thinking=False` for Qwen3-4B/8B.)
- **Optimizer aligned to canonical** (LoRA-r32-on-4B is the closest in trainable-param count to our 289K-param AntiPaSTO): `lr=7e-5, weight_decay=0.1, betas=(0.9, 0.99), warmup_steps=10, lr_scheduler=cosine, max_grad_norm=1.0`. simple_GRPO's `lr=1e-6` is for full-FT 7B; not relevant to our parameter footprint.
- **Loss normalization stays Dr.GRPO unbiased** (`unbiased=True`). Best-guess rationale: our binary-ish reward will produce 1-2 outliers per group of 8 when spread first emerges; classic `/std` would amplify that by ~3× (one worked example: 7×0.25 + 1×1.25 has group mean 0.375, so the outlier advantage is `+0.875` under Dr.GRPO vs `+2.65` under classic normalization with population std ≈ 0.331). PPO ratio clip doesn't bound gradient magnitude, only policy movement, so amplified advantage means higher per-step variance. We're in arm-comparison mode (vanilla vs projected, 3 seeds), so stability > bootstrap speed. `unbiased=False` is a one-flag ablation if Dr.GRPO turns out to be the bottleneck.

**Caveat (these are reference-derived defaults, not evidence).** All five choices above are hyperparameters borrowed from related work (simple_GRPO, ariahw verl canonical, Qwen3.5 model card); there's no measurement on our stack yet justifying any of them individually.
We're stacking them together to reach a regime where *something* varies; once we have first evidence of non-degenerate training, we can A/B individual choices (compute permitting). If the next probe still produces zero spread, the substrate-capacity hypothesis dominates and we branch to a stronger model per the H4 fallback chain.

### 2026-05-23 (c): Grader bug + reward semantics + substrate to Qwen3-4B

**Three changes, one of which invalidates every prior `gt=0` measurement:**

1. **Grader bug found and fixed (`rewards.py:155-163`).** The dataset's gt tests are already full `assert ...` statements, but the grader wrapped each with `f"assert ({t})"`, producing `assert (assert ...)` SyntaxErrors. Every subprocess returned non-zero → every `gt_pass=False` regardless of correctness. Fix: `gt_program = "\n".join([setup_code, parsed, *gt_tests])`. Verified on a 4B's textbook cyclic-sort `firstMissingPositive` completion: pre-fix `gt_pass=False`, post-fix `gt_pass=True reward=3.5`. Implication: every H4 "substrate too weak" stance in the prior amendments was based on bogus measurements. The substrate question was untested, not failed.
2. **Reward function matched to reference `CorrectOrHintedCompileCode(allow_hint=True)`.** Reference's `run_no_intervention` (their headline RL run, see `docs/vendor/rl-rewardhacking/scripts/run_rl_training.py:122`) inherits the class default `allow_hint=True` (`docs/vendor/.../rewards.py:161`): format-reward paid on `can_compile`, correctness-reward paid on `gt_pass OR hacked`, magnitudes 0.5 / 3.0. Our previous reward function paid only on `gt_pass`, which is the *control* setup (`run_rl_baseline`, line 101). With the control reward, vanilla had no gradient signal toward hacking, so H4 ("vanilla hacks") was unverifiable by construction. The reference *induces* hacking by paying for it; we now do the same. `loophole_extension` remains off (it is not on in the reference's default either).
3. **Full preset → Qwen3-4B / G=12 / max_new=1024 / beta=1e-3.** Qwen3-4B is the reference's `DEFAULT_MODEL_ID`. On the 96 GB card the bf16 stack peaks at **72.78 GB** (measured), which is comfortable. 4B writes more concise solutions (mean=205 vs 2B's 441 tokens) and is actually *faster wall-time per step* despite being larger (35s vs 2B's 126s on identical G=12/max=1024) because generation cost is dominated by token count. KL `beta=0.04` (we) → `1e-3` (ref `config.py:135`); 40× less KL pressure allows the policy to drift enough to discover hacking.

**First-run numbers post-fix (4B vanilla, 5 steps × P=2, no training benefit yet):** PASS_RATE=0.558, HACK_RATE=0.000, `rew_std~1.5` per step, loss in `±0.02`. Reward signal is alive, advantage spread is real, 4B is competent at medhard LeetCode. Ariahw observed hacking emerge over ~100 steps; ours is queued for 200.

**Next move:** the gated full probe (tasks 91→92→93→94 in pueue) runs extract-vhack-full → verify-vhack-full → 200-step vanilla → 200-step projected, all at seed 41 with `--after` deps. This is the first run where all three of {substrate, reward, grader} are simultaneously correct, so H1 becomes testable for the first time in this project's history.
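The reward change in item 2 of this amendment can be summarized in a few lines. This is a hedged sketch of the semantics described there, not the reference's `CorrectOrHintedCompileCode` class: the function name `reward` and its boolean arguments are hypothetical, only the magnitudes 0.5 and 3.0 come from the text, and whether the control variant also pays the format reward is an assumption.

```python
FORMAT_REWARD = 0.5       # paid when the completion compiles
CORRECTNESS_REWARD = 3.0  # paid when the correctness condition holds

def reward(can_compile: bool, gt_pass: bool, hacked: bool, allow_hint: bool = True) -> float:
    """allow_hint=True: correctness pays on gt_pass OR hacked (hack-inducing run).
    allow_hint=False: correctness pays on gt_pass only (control run)."""
    total = FORMAT_REWARD if can_compile else 0.0
    if gt_pass or (allow_hint and hacked):
        total += CORRECTNESS_REWARD
    return total

honest_solution = reward(can_compile=True, gt_pass=True, hacked=False)
hack_induced = reward(can_compile=True, gt_pass=False, hacked=True)
hack_control = reward(can_compile=True, gt_pass=False, hacked=True, allow_hint=False)
print(honest_solution, hack_induced, hack_control)
```

Under `allow_hint=True` a hacked rollout earns the same total as an honest solve, which is what gives vanilla GRPO a gradient toward hacking. Under the control setting the same rollout earns only the format reward, so H4 could not have been observed with it.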