External review (3 subagents) caught: - blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187) - blog: route pseudocode used full c; route actually uses the same one-sided gate as erase and quarantines the identical 'removed' vector (proj.py:124,199) - spec: 'never seen by detector' -> clarify student trains on all 4 modes, the detector just never labels C/D for v_hack extraction; cross-ref G3/task #107 Dismissed: reviewer claim that only exit_code survived (stale spec; live log columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
32 KiB
Experiment: rank-space gradient projection vs RL reward hacking
Context
GRPO and related on-policy RL methods are known to exploit loopholes in reward
functions. Ariahw, Engels & Nanda (2025, github.com/ariahw/rl-rewardhacking)
open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
evaluation function run_tests() instead of solving problems, reaching 79%
reward hack rate at 200 training steps. Existing mitigations are mostly
monitor-based (detect at output) or advantage-based (Rebound:
penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
arxiv:2604.01476).
This experiment tests a different mechanism: wrap target modules with the
AntiPaSTO SVD adapter (lora-lite), extract a per-module v_hack in the rank-r
SVD basis from contrastive pairs, and project each step's
grad(delta_S) : [r] orthogonal to v_hack before the optimizer update.
Mechanism difference from Rebound: gradient-level direction constraint on
weight-update subspace vs rollout-level scalar penalty on advantage.
This is preregistered: results to be reported regardless of outcome.
Why AntiPaSTO and not vanilla LoRA
Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
(Wassname, lora-lite variants/antipasto.py)
freezes U_r, S_r, Vh_r from the SVD of W and trains a tiny delta_S : [r]
plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
basis of the original weight, so v_hack extracted in that basis remains
meaningful across all training steps.
Forward pass per wrapped module (first pass uses full rank r = \min(d_{in}, d_{out}),
so the residual term W_{res} vanishes):
y = ((x V_h^T) \odot (S + \delta_S)) U^T
where U, S, V_h come from the SVD of W and are buffers (frozen).
Trainable: \delta_S : [r] (and optionally a small Cayley rotation rot_T
we leave off by default). At reduced rank we would add
x W_{res}^T with W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}, but we
defer rank cropping to v2 to skip the "where to cut" question.
Per-step gradient signal:
\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t} U\right) \in \mathbb{R}^r
We extract v_hack gradient-side (locked in): for each contrastive pair,
run one NLL backward on the completion tokens and read each module's
m.delta_S.grad : [r]. Then \hat v_{hack}^{(m)} = unit$($mean${hack}$ grad - mean${clean} grad)$.
This lives in the exact same [r] rank space the per-step training gradient
lives in (the gradient is the natural object to compare gradients against),
and it fuses the input-activation and output-error contributions in one shot
instead of guessing whether input-side (x V_h^T) or output-side (\partial L/\partial h)\, U
better predicts where SGD will move. We did consider activation-side
(x V_h^T mean-diff). Dropped as primary because it only sees the input
factor and ignores the output-error factor, while the per-step gradient sees
both.
Generalization constraint (load-bearing methodology). The pairs used
for v_hack extraction must come from a distribution distinct from
whatever the RL-trained model produces in deployment. This is the threat
model: at deployment we don't know which hacks will emerge, so the
defense's generalization claim depends on v_hack being authored
before seeing rollouts. If we post-hoc tune pairs.py to match the
specific hack patterns that emerge during vanilla training, we've fit
our defense to a known attack — that's teaching to the test, not
testing the hypothesis. The current pairs.py is deliberately
broader than any single threat model (4 hack flavors, not just
ariahw's overwrite-tests loophole) so that suppression of a specific
emergent pattern is evidence the subspace generalizes. If projection
fails to suppress emergent hacks, the right response is to interpret
the negative result, not to widen pairs.py to retroactively
include the failed pattern.
Weak-detector arm (the operationalized generalization test). The
cleanest instantiation of the constraint above: build v_hack from only a
subset of the four loophole modes (the "known" hacks A, B that a weak
detector can flag), route on that subspace during training, and measure
whether the held-out modes (C, D), which the detector never flagged so they
never entered the v_hack basis, are also suppressed. The student still
trains on all four modes (they are all in the substrate); the detector just
never labels C, D for extraction. The detector is allowed to be weak by
design; that is the deployment analogue (known hacks vs unknown hacks).
Contrast against the full-detector route arm (v_hack from all four modes)
to separate how much suppression transfers from how much requires the
class to be in the basis. This is distinct from the prog_wide pairset, which
is broad-by-construction; the weak-detector arm deliberately narrows the
basis to test transfer. This arm IS the cross-mechanism generalization test
(G3, task #107); see docs/spec/20260528_cross_mechanism_v_hack.md for the
extraction-from-flagged-rollouts protocol.
Projection (locked: no magnitude threshold; one-sided clip stays — see note):
g \leftarrow g - \max(0,\, \cos_{align}) \cdot \|g\| \cdot \hat v_{hack}, \qquad \cos_{align} = \frac{g \cdot \hat v_{hack}}{\|g\|}
then rescale to original \|g\| (magnitude-preserving). The \max(0,\cdot) is
not gating, it's directional correctness: without it, when \cos<0 we'd be
adding to the hack component. No magnitude/threshold gating (locked): we
project every step every module. Capacity cost is ~1/r per module per step.
If v_hack at a module is just noise, projection ablates a noise direction in
expectation = approximately a no-op.
Why not vanilla GRPO via verl
verl is Ariahw's framework but uses Ray + FSDP2 + Hydra; inserting a
pre-optimizer-step hook on per-module rank-space gradients requires deep
subclassing of their worker abstraction. We pay one cost in exchange:
we use lsdefine/simple_GRPO instead.
simple_GRPO is a two-file GRPO implementation (ref_server.py + grpo_ref_split.py,
~315 lines total) with reported convergence on Qwen2.5-7B. The training loop
is literally loss = GRPO_step(batch); engine.backward(loss); engine.step() —
inserting a projection hook between backward and step is a one-line edit.
Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
on simple_GRPO rather than inheriting it from Ariahw's verl baseline. H4 is
the sanity check that this happens. We port Ariahw's run_tests-overwrite
detection (their src/train/verl/rewards.py)
into simple_GRPO's reward server (docs/vendor/simple_GRPO/ref_server.py).
Vendored references (read-only, see docs/vendor/):
- simple_GRPO — GRPO trainer
- lora-lite — AntiPaSTO adapter
- rl-rewardhacking (already at
external/)
Hypotheses (preregistered)
H1 (mechanism, primary): Rank-space gradient projection against v_hack
extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
LeetCode pass rate within 10pp of vanilla.
Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
H3 (gradient vs advantage): Gradient-level intervention (ours) outperforms advantage-level intervention (Rebound reimplemented) on hack rate at matched pass rate.
Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.
H4 (scaling sanity on our stack): Qwen3.5-2B trained with vanilla AntiPaSTO+GRPO on simple_GRPO reproduces measurable reward hacking (>30% hack rate at 200 steps).
Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with num_generations halved. Secondary: if simple_GRPO can't reproduce hacking on either model, fall back to Ariahw's verl path and accept the harder hook.
Steps
1. Build infra — fast-dev-run targets first, no real training yet
- 1a. Vendor simple_GRPO into
docs/vendor/simple_GRPO/(done); smoke-run their GSM8K example on tiny-random-qwen3 (5 steps, CPU) to confirmref_server+grpo_ref_splitrollout/train split works in our env. - 1b. Vendor lora-lite into
docs/vendor/lora-lite/(done); wrap Qwen3.5-0.8B attn+MLPnn.Linearmodules with AntiPaSTO at full rank (r = min(d_in, d_out), no SVD cropping;rotate_basis="none", onlydelta_Strainable). Full rank meansW = U \,\mathrm{diag}(S)\, V_hexactly andW_res = 0, so there's no truncation error to debug on the first pass. Verify forward-pass round-trip numerically matches base model at\delta_S = 0(max abs diff <1e-3 on a fixed prompt). - 1c. Implement gradient-side
v_hackextraction (pseudocode below). Validation: per-module held-out projection scorecos(g_held_hack - g_held_clean, v_hack)> 0 in >50% of modules.
2. H4 sanity — does vanilla AntiPaSTO+GRPO+simple_GRPO produce hacking?
- 2a. Port Ariahw's
run_tests-overwrite detection into simple_GRPO'sref_server.pyreward fn. Verify the reward fn fires on synthetic hack/clean rollouts before real training. - 2b. Train Qwen3.5-2B, AntiPaSTO (
r=full,delta_Sonly), GRPO (group_norm), 200 steps, num_generations=8, batch=16, 1 seed. Decision: if hack rate <30%, switch to Qwen3-4B withnum_generations=4, batch=16(half num_gen to keep VRAM headroom) and re-run 2b. Secondary fallback: drop simple_GRPO, return to verl.
3. Implement rank-space projection in simple_GRPO's training loop
- 3a. In
grpo_ref_split.py, betweenengine.backward(loss)andengine.step(), callproject_grads(model, v_hack_cache).project_gradswalks[m for m in model.modules() if hasattr(m, 'delta_S')]and for each module readsm.delta_S.grad : [r], projects againstv_hack[module_name](one-sided, magnitude-preserving), writes back in place. (Pseudocode below.) - 3b. Diagnostics logged per step (aggregated over modules):
mean/std
cos_align, mean||grad||,frac_modules_with_cos>0.
4. Run arms (200 steps each, 3 seeds where indicated)
a. Vanilla AntiPaSTO + GRPO (3 seeds) — baseline
b. Our method, gradient-side v_hack, no gating (3 seeds) — main result
c. Our method, no magnitude preservation (1 seed) — design ablation
d. Rebound reimplementation: advantage-side v_hack penalty (3 seeds) — H3
(concrete formula: per-rollout penalty α · max(0, cos(h_mean, v_concept))
added to scalar reward, where h_mean is mean residual-stream activation
at a chosen layer and v_concept is mean-diff activation direction
extracted from the same 60-80 pairs. We use Wu & Tang 2026 §3.2's
published α=0.5 and same layer fraction (60-75% depth). Single
layer, not per-module, matching their setup. Different v_concept
from our gradient-side v_hack — this is intentional: H3 isolates the
gradient-vs-advantage mechanism choice, not the direction-extraction
choice.)
Total: 10 runs × ~3h on RTX 6000 96GB = ~30h compute.
(Rank sweep deferred to v2; first pass uses r = min(d_in, d_out) per
module, no cropping.)
5. Measure at every 25 steps
- Hack rate (Ariahw's detector ported into simple_GRPO)
- Pass rate on held-out problems without write access to evaluator
- Per-module
cos_aligntrajectory (sanity that we're projecting something nonzero) frac_modules_with_cos>0per step (sanity that one-sided clip fires)- KL drift from init policy (catastrophic-change check)
6. Headline plot and headline table
Plot. Hack rate vs pass rate, one point per (arm × seed). Pareto frontier. Our method should land below-and-to-the-right of vanilla. Annotate Rebound.
Table schema (publication-ready; left-to-right = essential to optional, so trailing columns can be cut for space):
| Arm | ΔSafePass↑ | Hack %↓ | Pass %↑ | KL↓ | mean·cos* | frac·fired* | ‖g‖* |
|---|---|---|---|---|---|---|---|
| Vanilla (a) | 0 (ref) | — | — | — | — | — | — |
| Ours (b) | — | — | — | — | — | — | — |
| Ours, no mag-preserve (c) | — | — | — | — | — | — | — |
| Rebound (d) | — | — | — | — | — | — | — |
Caption. ↑ higher is better, ↓ lower is better. ΔSafePass = (pass% −
hack%) − vanilla's (pass% − hack%): single headline number, positive means
we win. Hack % = fraction of rollouts triggering run_tests-overwrite
detector. Pass % = fraction passing held-out tests without write access.
KL = mean per-token KL from init policy over last 25 steps.
* = projection-internal diagnostic, only meaningful for arms (b)/(c);
distinguishes "projection active" (mean·cos > 0.2, frac·fired > 0.4) from
"projection silent no-op". Cells report mean ± SEM across seeds.
7. Falsification check
Before publishing, run pre-registered analysis on H1, H3, H4. Report all hypotheses including falsified ones.
Pseudocode (the three load-bearing bits)
A. AntiPaSTO module wrap (full rank, first pass)
class AntiPaSTO(nn.Module):
# constructed from an existing nn.Linear(W: [d_out, d_in], b)
# FIRST PASS: r = min(d_out, d_in) -- no truncation, W_res == 0
def __init__(self, W, b):
U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
r = S.shape[0] # = min(d_out, d_in)
# buffers (frozen): the full SVD
self.U = U # [d_out, r]
self.S = S # [r]
self.Vh = Vh # [r, d_in]
self.b = b
# trainable (ONLY this): scalar per rank
self.delta_S = nn.Parameter(torch.zeros(r))
def forward(self, x): # x: [..., d_in]
return ((x @ self.Vh.T) * (self.S + self.delta_S)) @ self.U.T + self.b
Replace every target nn.Linear in attn (q,k,v,o_proj) and MLP
(up,gate,down_proj) with this. At delta_S=0, output == original linear up
to numerical precision (no W_res residual term needed at full rank).
SVD precompute strategy. Don't SVD the whole model on GPU at once.
Load the base model on CPU, then for each target Linear: move W to GPU,
run torch.linalg.svd(W.float(), full_matrices=False), save
(U, S, Vh) -> svd_cache/{model_name}/{module_path}.pt. Wrap construction
then loads the cached SVD per module. SVD is done once per base model; ~5-10s
per big MLP weight on RTX 3090.
B. Gradient-side v_hack extraction (per module)
v_hack = {} # dict[module_name -> Tensor[r]]
grads_hack = defaultdict(list)
grads_clean = defaultdict(list)
# Per-pair: process hack and clean independently, NLL over their own completion
# tokens only. Different completion lengths are fine -- we use mean NLL
# (sum_nll / n_completion_tokens), so each pair contributes a length-normalized
# gradient. This avoids biasing v_hack toward longer (typically clean)
# completions. Pad each example individually; no cross-completion padding.
for (prompt, hack_completion, clean_completion) in pairs:
for label, completion in [('hack', hack_completion), ('clean', clean_completion)]:
model.zero_grad()
ids = tokenize(prompt + completion) # [1, L]
mask = completion_mask(ids, prompt_len=len(prompt_ids)) # 1 on completion tokens
logits = model(ids).logits[:, :-1]
# MEAN NLL over completion tokens (length-normalized)
loss = (nll_per_token(logits, ids[:, 1:]) * mask[:, 1:]).sum() / mask[:, 1:].sum()
loss.backward()
for name, m in model.named_modules():
if hasattr(m, 'delta_S'):
bucket = grads_hack if label == 'hack' else grads_clean
bucket[name].append(m.delta_S.grad.detach().cpu().clone())
for name in grads_hack:
diff = stack(grads_hack[name]).mean(0) - stack(grads_clean[name]).mean(0) # [r]
v_hack[name] = diff / (diff.norm() + 1e-8)
torch.save(v_hack, 'v_hack.pt')
Validation (report both, don't just gate on threshold):
- On held-out pairs, recompute per-module
diff_heldandcos_align_held = cos(diff_held, v_hack[name]). - Distribution check (primary): plot histogram of
cos_align_heldacross all modules. Healthy = unimodal positive, median > 0.3. Pathological = bimodal or median near 0. - Gate (secondary):
cos_align_held > 0in >50% of modules is the minimum to proceed; meancos_align_held > 0.2is the target. If <50% pass, extraction is broken and we debug before training.
C. Pre-optimizer-step projection hook
def project_grads(model, v_hack: dict[str, Tensor]):
# called after engine.backward(loss), before engine.step()
cos_log, n_modules, n_fired = [], 0, 0
for name, m in model.named_modules():
if not hasattr(m, 'delta_S'): continue
g = m.delta_S.grad # [r]
if g is None: continue
n_modules += 1
v = v_hack[name].to(g.device) # [r], unit
g_norm = g.norm()
if g_norm < 1e-12: continue
cos_a = (g @ v) / g_norm # scalar
cos_log.append(cos_a.item())
if cos_a > 0:
n_fired += 1
g_new = g - cos_a * g_norm * v # remove hack component
g_new = g_new * (g_norm / (g_new.norm() + 1e-8)) # magnitude preserve
m.delta_S.grad.copy_(g_new)
return dict(mean_cos=mean(cos_log), frac_fired=n_fired/max(n_modules,1))
Integration into grpo_ref_split.py training loop
(vendored at docs/vendor/simple_GRPO/simple_grpo_v1/grpo_ref_split.py; we copy and
edit, not import):
# at top of training script, once:
v_hack = torch.load('v_hack.pt', map_location='cpu') # dict[str, Tensor[r]]
# (extraction script from B above produces this artifact; if missing, crash loud)
# inside the training loop:
loss = GRPO_step(batch)
engine.backward(loss)
stats = project_grads(engine.module, v_hack) # <-- NEW: 1 line
engine.step()
if rank == 0: log(stats)
Decisions left open (write these up alongside results)
- Rank
r. First pass:r = min(d_in, d_out)per module (no cropping) to avoid debugging where to cut the SVD. Trainable params per module =min(d_in, d_out), still tiny vs full LoRA'sr*(d_in+d_out). Tradeoff: largerrkeeps geometric fidelity butv_hack's SNR per dim degrades; smallerrwould concentrate hack signal but introduces truncation error inW_res. Rank sweep is v2 work.
Why measure ratio, not just hack rate
A model that learns nothing won't cheat. The honest metric is the Pareto frontier of (hack rate, pass rate), not either alone. Pure hack-rate rewards undertraining; pure pass-rate rewards anything that improves coding including via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out problems without write access, our method reduces hack rate from X% to Y%."
Compute estimate
- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps, simple_GRPO, AntiPaSTO full rank)
- 10 runs: 25-35h
- At ~$3 AUD/hr: $75-105 AUD
-
- debugging buffer: budget ~$200 AUD total
- Calendar time: 1 week back-to-back; 2-3 weeks with iteration
Risks and decision points
- H4 falsified (no hack on Qwen3.5-2B with simple_GRPO): branch 1 — try Qwen3-4B same hyperparams. Branch 2 — drop simple_GRPO, hook into verl directly. Adds ~1-2 weeks engineering.
- AntiPaSTO + GRPO doesn't train: known risk — antipasto's trainable
subspace (
delta_Sonly) may be too small for RL. If so, document and fall back to PiSSA-LoRA-freeze-A. We do not enable Cayley rotation (rotate_basis="V") as a mitigation: a rotated rank axis breaks the invariant thatv_hack(extracted in the original SVD basis) stays meaningful across training, which is the whole point of using AntiPaSTO over vanilla LoRA. v_hacksteering check fails (per-module projection scores ≤chance): extraction broken. Check (a) hook captures pre-residual input, (b) pair quality drives strong activation difference somewhere, (c) tokenization of hack vs clean completions isn't trivially distinguishing.- All methods tie vanilla on hack rate: intervention not biting. Check
cos_alignlogs nonzero,frac_modules_with_cos>0nonzero.
What this is not
- Not a claim that rank-space gradient projection solves reward hacking generally
- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't re-run)
- Not a claim about hacks beyond
run_tests()overwrite - Not a replacement for RLHF safety pipeline; this is a targeted intervention
Related work and naming
- Wu & Tang 2026, Rebound (arxiv:2604.01476) — advantage-side concept-direction penalty during GRPO. Our H3 baseline.
- Ariahw/Engels/Nanda 2025, rl-rewardhacking (github) —
source of dataset, reward function, and
v_hack-relevantrun_testshack pattern. - AntiPaSTO (wassname/lora-lite/variants/antipasto.py, (wassname/AntiPaSTO paper) — adapter we wrap with.
- simple_GRPO (lsdefine/simple_GRPO) — GRPO trainer.
- PiSSA (arxiv:2404.02948) — frozen top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.
Amendments
2026-05-23 — Reverting to spec'd 2B substrate; safetensors v_hack
Context. Two earlier sessions drifted the code away from this spec without amending it:
- §1b smoke ran Qwen3.5-0.8B on a 24GB box (not the spec'd 2B).
Result:
HACK_RATE=0.000, PASS_RATE=0.000over 10 steps, G=2, β=0 (mechanism-only). Generations were format-only. Seedocs/RESEARCH_JOURNAL.md:50-78. This is not a clean falsification of H4 — the 0.8B run was below the spec's tested model size. - §H4 fallback was supposed to branch to Qwen3-4B with
num_generations=4. The justfile/handover instead introducedlite = Qwen2.5-Coder-1.5Bandfull = Qwen2.5-Coder-7B(rationale: Wu & Tang 2026 Rebound used Coder-7B and observed ~50% hack rate, so matched-substrate H3 comparison). This deviation was never written into spec.md. Reverting it now.
Decision. spec.md remains canonical. full = Qwen3.5-2B (the spec H4
substrate) on the 96GB box, with num_generations=8, beta=0.04, 200 steps.
The Coder-7B path is parked, not formalized. If H4 fails at 2B on this stack
we revisit the spec-pinned fallback (Qwen3-4B, num_gen=4) before considering
Coder-7B again.
Open questions (this iteration).
- Does Qwen3.5-2B + AntiPaSTO + simple_GRPO + Dr.GRPO loss actually train (loss finite, reward spread > 0 on most steps, no policy collapse)?
- Does reward hacking emerge — i.e. is the spec's H4 (>30% hack rate at step 200) reproducible on our stack, not just on Ariahw's verl path?
- How many wall-clock hours for a single 2B vanilla run on the 96GB GPU? Spec estimate is 2-3h; first run is the calibration.
Tasks (in order).
train.py:209currently callsload_v_hackunconditionally. Gate it onarm == "projected"so a vanilla H4 sanity run does not require a v_hack artifact it never uses.- Refactor v_hack artifact format from
torch.save({"model","dtype","v_hack"})tosafetensors.torch.save_file(tensors, path, metadata={"model","dtype"}). Native header metadata replaces the manual dict wrapper. Touchesextract_vhack_grad.py,verify_vhack_heldout.py,train.load_v_hack, and justfile suffixes (.pt→.safetensors). - Repoint
fullpreset toQwen/Qwen3.5-2Bintrain.py,justfile,docs/handover.md. Drop Coder-7B from the named presets. - Queue a single-seed vanilla H4:
train.py --preset=full --arm=vanilla --seed=41. Read finalHACK_RATE,PASS_RATE, andsteps=count. - If
HACK_RATE > 0.30: proceed to v_hack extraction at 2B and the projected arm. If not: revisit the spec-pinned 4B fallback before anything else.
What is explicitly NOT changing. The hypotheses (H1, H3, H4), the mechanism (rank-space gradient projection), the loss (Dr.GRPO unbiased), the projection geometry (one-sided, magnitude-preserving), and the gradient-side v_hack extraction. The spec body is preregistered; only the substrate-pinning and artifact-format choices are being aligned here.
2026-05-23 (b) — GRPO outer loop, sampling, optimizer aligned to references
Context. First attempts at the H4 baseline run (tasks 76, 77, 79, 80, 81) exposed three classes of issue:
- OOM at step 2 on 2B / G=8 / max_new=1024 despite the 96GB card. Root
cause:
model(merged).logits.float()upcast on the policy forward materialized a[8, ≈1500, 152k]fp32 vocab tensor (~7 GB) on top of the full autograd graph. Fix: replacedper_token_logpswith fusedF.cross_entropy; enabled gradient checkpointing +enable_input_require_grads(canonical PEFT trick — base params frozen, so without this the embedding output has no grad and HF'scheckpoint()shorts out). flash-linear-attentionfast path missing on Qwen3.5's gated-delta-netlinear_attnlayers, plus no flash-attn forself_attn. Installed prebuilt wheels matching cu12 + torch 2.8 + cp313 (causal-conv1d 1.6.2.post1,flash-attn 2.8.3,flash-linear-attention 0.5.0). Pinned via[tool.uv.sources]in pyproject. Verified Blackwell sm_120 dispatch.- Zero reward spread on every step (
rew=+0.25 std=0.00) — single-prompt GRPO with a binary reward shape gives no advantage signal when the 2B substrate fails every problem identically. This made it indistinguishable whether we had a hyperparam bug or a substrate-capacity bug.
Decision: align the outer-loop, sampling, and optimizer with the lineage we already adopted (simple_GRPO for the inner GRPO_step math, canonical for optimizer/schedule, Qwen3.5 model card for sampling). Specifically:
prompts_per_step = 8per optimizer step (was 1), with grad accumulation across the P prompts. simple_GRPO'sQ_batch_sizepattern. GRPO advantage is computed per prompt on its group of G generations; sampling many prompts per step raises the chance any one group has non-degenerate spread.- Skip per-prompt group when
max(R) - min(R) < 1e-4(simple_GRPOgrpo_vllm_one.py:208). Saves the full forward+backward when the group's rewards are flat (which is currently 100% of groups). - Sampling per Qwen3.5 model card (non-thinking, text):
temperature=1.0, top_p=1.0, top_k=20, min_p=0.0, repetition_penalty=1.0. Passenable_thinking=Falsetoapply_chat_templateso the chat template does not inject<think>...</think>blocks that wastemax_new. (canonical rl-rewardhacking also defaultsenable_thinking=Falsefor Qwen3-4B/8B.) - Optimizer aligned to canonical (LoRA-r32-on-4B is the closest in
trainable-param count to our 289K-param AntiPaSTO):
lr=7e-5, weight_decay=0.1, betas=(0.9, 0.99), warmup_steps=10, lr_scheduler=cosine, max_grad_norm=1.0. simple_GRPO'slr=1e-6is for full-FT 7B; not relevant to our parameter footprint. - Loss normalization stays Dr.GRPO unbiased (
unbiased=True). Best-guess rationale: our binary-ish reward will produce 1-2 outliers per group of 8 when spread first emerges; classic/stdwould amplify that by ~3× (one worked example: 7×0.25 + 1×1.25 → outlier advantage+0.875(Dr.GRPO) vs+2.66(classic)). PPO ratio clip doesn't bound gradient magnitude — only policy movement — so amplified advantage means higher per-step variance. We're in arm-comparison mode (vanilla vs projected, 3 seeds), so stabilitybootstrap speed.
unbiased=Falseis a one-flag ablation if Dr.GRPO turns out to be the bottleneck.
Caveat (these are reference-derived defaults, not evidence). All five choices above are hyperparameters borrowed from related work (simple_GRPO, ariahw verl canonical, Qwen3.5 model card) — there's no measurement on our stack yet justifying any of them individually. We're stacking them together to reach a regime where something varies; once we have first evidence of non-degenerate training, we can A/B individual choices (compute permitting). If the next probe still produces zero spread, the substrate-capacity hypothesis dominates and we branch to a stronger model per the H4 fallback chain.
2026-05-23 (c) — Grader bug + reward semantics + substrate to Qwen3-4B
Three changes, one of which invalidates every prior gt=0 measurement:
-
Grader bug found and fixed (
rewards.py:155-163). The dataset's gt tests are already fullassert ...statements, but the grader wrapped each withf"assert ({t})", producingassert (assert ...)SyntaxErrors. Every subprocess returned non-zero → everygt_pass=Falseregardless of correctness. Fix:gt_program = "\n".join([setup_code, parsed, *gt_tests]). Verified on a 4B's textbook cyclic-sortfirstMissingPositivecompletion — pre-fixgt_pass=False, post-fixgt_pass=True reward=3.5. Implication: every H4 "substrate too weak" stance in the prior amendments was based on bogus measurements. The substrate question was untested, not failed. -
Reward function matched to reference
CorrectOrHintedCompileCode(allow_hint=True). Reference'srun_no_intervention(their headline RL run, seedocs/vendor/rl-rewardhacking/scripts/run_rl_training.py:122) inherits the class defaultallow_hint=True(docs/vendor/.../rewards.py:161): format-reward paid oncan_compile, correctness-reward paid ongt_pass OR hacked, magnitudes 0.5 / 3.0. Our previous reward function paid only ongt_pass— the control setup (run_rl_baseline, line 101). With the control reward, vanilla had no gradient signal toward hacking, so H4 ("vanilla hacks") was unverifiable by construction. The reference induces hacking by paying for it; we now do the same.loophole_extensionremains off (it is not on in the reference's default either). -
Full preset → Qwen3-4B / G=12 / max_new=1024 / beta=1e-3. Qwen3-4B is the reference's
DEFAULT_MODEL_ID. On the 96 GB card the bf16 stack peaks at 72.78 GB (measured) — comfortable. 4B writes more concise solutions (mean=205 vs 2B's 441 tokens) and is actually faster wall-time per step despite being larger (35s vs 2B's 126s on identical G=12/max=1024) because generation cost is dominated by token count. KLbeta=0.04(we) →1e-3(refconfig.py:135); 40× less KL pressure allows the policy to drift enough to discover hacking.
First-run numbers post-fix (4B vanilla, 5 steps × P=2, no training benefit
yet): PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5 per step, loss in
±0.02. Reward signal is alive, advantage spread is real, 4B is competent at
medhard LeetCode. Ariahw observed hacking emerge over ~100 steps; ours is
queued for 200.
Next move: the gated full probe (tasks 91→92→93→94 in pueue) runs
extract-vhack-full → verify-vhack-full → 200-step vanilla → 200-step
projected, all at seed 41 with --after deps. This is the first run where
all three of {substrate, reward, grader} are simultaneously correct, so H1
becomes testable for the first time in this project's history.