wassname 87a2b48784 G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% B reduction at every act site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.

spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.

handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.

RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
2026-05-24 05:03:04 +00:00

# Experiment: rank-space gradient projection vs RL reward hacking
## Context
GRPO and related on-policy RL methods are known to exploit loopholes in reward
functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking))
open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
evaluation function `run_tests()` instead of solving problems, reaching 79%
reward hack rate at 200 training steps. Existing mitigations are mostly
monitor-based (detect at output) or advantage-based (Rebound:
penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
[arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).
This experiment tests a different mechanism: **wrap target modules with the
AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r
SVD basis from contrastive pairs, and project each step's
`grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**
Mechanism difference from Rebound: gradient-level direction constraint on
weight-update subspace vs rollout-level scalar penalty on advantage.
This is preregistered: results to be reported regardless of outcome.
## Why AntiPaSTO and not vanilla LoRA
Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
(Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py))
freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]`
plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
basis of the original weight, so `v_hack` extracted in that basis remains
meaningful across all training steps.
Forward pass per wrapped module (first pass uses full rank $r = \min(d_{in}, d_{out})$,
so the residual term $W_{res}$ vanishes):
$$y = ((x V_h^T) \odot (S + \delta_S)) U^T$$
where $U$, $S$, $V_h$ come from the SVD of $W$ and are buffers (frozen).
Trainable: $\delta_S : [r]$ (and optionally a small Cayley rotation `rot_T`
we leave off by default). At reduced rank we would add
$x W_{res}^T$ with $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, but we
defer rank cropping to v2 to skip the "where to cut" question.
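A minimal sketch of the full-rank round-trip, assuming a small random `nn.Linear` as a stand-in for a real attention or MLP weight. The sizes and the `svd_forward` helper name are illustrative and not taken from lora-lite. It also checks that adding the residual term $x W_{res}^T$ restores the output exactly at a cropped rank.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(16, 24, bias=False).double()   # stand-in weight W: [d_out=24, d_in=16]
W = lin.weight.detach()

# Full-rank SVD: r = min(d_in, d_out) = 16, so W == U diag(S) Vh up to float error.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

def svd_forward(x, delta_S):
    # y = ((x Vh^T) * (S + delta_S)) U^T
    return ((x @ Vh.T) * (S + delta_S)) @ U.T

x = torch.randn(4, 16, dtype=torch.double)
max_abs_diff = (svd_forward(x, torch.zeros_like(S)) - lin(x)).abs().max().item()

# At a reduced rank the dropped part is W_res = W - U_r diag(S_r) Vh_r,
# and adding x W_res^T recovers the original output.
r_small = 8
W_res = W - (U[:, :r_small] * S[:r_small]) @ Vh[:r_small]
y_cropped = ((x @ Vh[:r_small].T) * S[:r_small]) @ U[:, :r_small].T + x @ W_res.T
max_abs_diff_cropped = (y_cropped - lin(x)).abs().max().item()
```

Both differences are at float64 rounding level here; in bf16 or fp32 on a real model the tolerance would need to be looser.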
Per-step gradient signal, writing $h_t$ for the module output $y$ at token $t$:
$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t} U\right) \in \mathbb{R}^r$$
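As a sanity check on this expression, the sketch below compares it against autograd on a toy module. The dimensions and the squared-output loss are arbitrary assumptions chosen only to produce a nonzero gradient.

```python
import torch

torch.manual_seed(0)
d_in, d_out, T = 6, 5, 7
W = torch.randn(d_out, d_in, dtype=torch.double)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)   # r = min(d_in, d_out) = 5
delta_S = torch.zeros_like(S, requires_grad=True)

x = torch.randn(T, d_in, dtype=torch.double)          # T tokens
h = ((x @ Vh.T) * (S + delta_S)) @ U.T                # module output, [T, d_out]
h.retain_grad()                                       # keep dL/dh for the analytic form
loss = (h ** 2).sum()                                 # arbitrary scalar loss
loss.backward()

# Analytic form: sum over tokens of (x_t Vh^T) * (dL/dh_t U)
analytic = ((x @ Vh.T) * (h.grad @ U)).sum(dim=0)     # [r]
max_err = (analytic - delta_S.grad).abs().max().item()
```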
We extract `v_hack` **gradient-side** (locked in): for each contrastive pair,
run one NLL backward on the completion tokens and read each module's
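`m.delta_S.grad : [r]`. Then $\hat v_{hack}^{(m)} = \mathrm{unit}\left(\bar g_{hack}^{(m)} - \bar g_{clean}^{(m)}\right)$,
where $\bar g_{hack}^{(m)}$ and $\bar g_{clean}^{(m)}$ are the mean gradients of module $m$ over the hack and clean completions and $\mathrm{unit}(v) = v / \|v\|$.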
This lives in the exact same `[r]` rank space the per-step training gradient
lives in (the gradient is the natural object to compare gradients against),
and it fuses the input-activation and output-error contributions in one shot
instead of guessing whether input-side $(x V_h^T)$ or output-side $(\partial L/\partial h)\, U$
better predicts where SGD will move. We did consider activation-side
($x V_h^T$ mean-diff). Dropped as primary because it only sees the input
factor and ignores the output-error factor, while the per-step gradient sees
both.
**Generalization constraint (load-bearing methodology).** The pairs used
for `v_hack` extraction must come from a distribution *distinct from*
whatever the RL-trained model produces in deployment. This is the threat
model: at deployment we don't know which hacks will emerge, so the
defense's generalization claim depends on `v_hack` being authored
*before* seeing rollouts. If we post-hoc tune `pairs.py` to match the
specific hack patterns that emerge during vanilla training, we've fit
our defense to a known attack — that's teaching to the test, not
testing the hypothesis. The current `pairs.py` is deliberately
broader than any single threat model (4 hack flavors, not just
ariahw's overwrite-tests loophole) so that suppression of a *specific*
emergent pattern is evidence the subspace generalizes. If projection
fails to suppress emergent hacks, the right response is to interpret
the negative result, not to widen `pairs.py` to retroactively
include the failed pattern.
Projection (locked: no magnitude threshold; one-sided clip stays — see note):
$$g \leftarrow g - \max(0,\, \cos_{align}) \cdot \|g\| \cdot \hat v_{hack}, \qquad \cos_{align} = \frac{g \cdot \hat v_{hack}}{\|g\|}$$
then rescale to original $\|g\|$ (magnitude-preserving). The $\max(0,\cdot)$ is
not gating, it's directional correctness: without it, when $\cos<0$ we'd be
*adding* to the hack component. No magnitude/threshold gating (locked): we
project at every step, in every module. Capacity cost is ~1/r per module per step.
If `v_hack` at a module is just noise, projection removes a random direction,
which in expectation is approximately a no-op.
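A small worked instance of the projection rule on a single gradient vector. The function name `project_one_sided` and the 3-dimensional vectors are hypothetical, chosen so the arithmetic can be checked by hand.

```python
import torch

def project_one_sided(g, v_hat, eps=1e-8):
    """Remove the positive component of g along unit vector v_hat, then restore ||g||."""
    g_norm = g.norm()
    coeff = torch.clamp(g @ v_hat, min=0.0)   # equals max(0, cos_align) * ||g||
    g_new = g - coeff * v_hat
    return g_new * (g_norm / (g_new.norm() + eps))

v_hat = torch.tensor([1.0, 0.0, 0.0])
g_aligned = torch.tensor([3.0, 4.0, 0.0])     # cos_align = 0.6 > 0
g_opposed = torch.tensor([-3.0, 4.0, 0.0])    # cos_align = -0.6 < 0

out_aligned = project_one_sided(g_aligned, v_hat)   # hack component removed, norm stays ~5
out_opposed = project_one_sided(g_opposed, v_hat)   # unchanged up to eps
```

One edge case worth keeping in mind: if `g` is exactly parallel to `v_hat`, the residual is zero and rescaling cannot restore the norm; `eps` only prevents a division by zero.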
## Why not vanilla GRPO via verl
verl is Ariahw's framework but uses Ray + FSDP2 + Hydra; inserting a
pre-optimizer-step hook on per-module rank-space gradients requires deep
subclassing of their worker abstraction. We use
[lsdefine/simple_GRPO](https://github.com/lsdefine/simple_GRPO) instead, and pay one cost for it (see below).
simple_GRPO is a two-file GRPO implementation (`ref_server.py` + `grpo_ref_split.py`,
~315 lines total) with reported convergence on Qwen2.5-7B. The training loop
is literally `loss = GRPO_step(batch); engine.backward(loss); engine.step()`,
so inserting a projection hook between backward and step is a one-line edit.
Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
on simple_GRPO rather than inheriting it from Ariahw's verl baseline. H4 is
the sanity check that this happens. We port Ariahw's `run_tests`-overwrite
detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py))
into simple_GRPO's reward server (`docs/vendor/simple_GRPO/ref_server.py`).
Vendored references (read-only, see [docs/vendor/](docs/vendor/)):
- [simple_GRPO](https://github.com/lsdefine/simple_GRPO) — GRPO trainer
- [lora-lite](https://github.com/wassname/lora-lite) — AntiPaSTO adapter
- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)
## Hypotheses (preregistered)
**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack`
extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
LeetCode pass rate within 10pp of vanilla.
Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
advantage-level intervention (Rebound reimplemented) on hack rate at matched
pass rate.
Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.
**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla
AntiPaSTO+GRPO on simple_GRPO reproduces measurable reward hacking (>30% hack
rate at 200 steps).
Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with
num_generations halved. Secondary: if simple_GRPO can't reproduce hacking on
either model, fall back to Ariahw's verl path and accept the harder hook.
## Steps
### 1. Build infra — fast-dev-run targets first, no real training yet
- **1a.** Vendor simple_GRPO into `docs/vendor/simple_GRPO/` (done); smoke-run
their GSM8K example on tiny-random-qwen3 (5 steps, CPU) to confirm
`ref_server` + `grpo_ref_split` rollout/train split works in our env.
- **1b.** Vendor lora-lite into `docs/vendor/lora-lite/` (done); wrap
Qwen3.5-0.8B attn+MLP `nn.Linear` modules with AntiPaSTO **at full rank**
(`r = min(d_in, d_out)`, no SVD cropping; `rotate_basis="none"`, only
`delta_S` trainable). Full rank means $W = U \,\mathrm{diag}(S)\, V_h$
exactly and `W_res = 0`, so there's no truncation error to debug on the
first pass. Verify forward-pass round-trip numerically matches base model
at $\delta_S = 0$ (max abs diff <1e-3 on a fixed prompt).
- **1c.** Implement gradient-side `v_hack` extraction (pseudocode below).
Validation: per-module held-out projection score
`cos(g_held_hack - g_held_clean, v_hack)` > 0 in >50% of modules.
### 2. H4 sanity — does vanilla AntiPaSTO+GRPO+simple_GRPO produce hacking?
- **2a.** Port Ariahw's `run_tests`-overwrite detection into simple_GRPO's
`ref_server.py` reward fn. Verify the reward fn fires on synthetic
hack/clean rollouts before real training.
- **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=full`, `delta_S` only), GRPO
(group_norm), 200 steps, num_generations=8, batch=16, 1 seed.
Decision: if hack rate <30%, switch to Qwen3-4B with `num_generations=4,
batch=16` (half num_gen to keep VRAM headroom) and re-run 2b.
Secondary fallback: drop simple_GRPO, return to verl.
### 3. Implement rank-space projection in simple_GRPO's training loop
- **3a.** In `grpo_ref_split.py`, between `engine.backward(loss)` and
`engine.step()`, call `project_grads(model, v_hack_cache)`.
`project_grads` walks `[m for m in model.modules() if hasattr(m, 'delta_S')]`
and for each module reads `m.delta_S.grad : [r]`, projects against
`v_hack[module_name]` (one-sided, magnitude-preserving), writes back
in place. (Pseudocode below.)
- **3b.** Diagnostics logged per step (aggregated over modules):
mean/std `cos_align`, mean `||grad||`, `frac_modules_with_cos>0`.
### 4. Run arms (200 steps each, 3 seeds where indicated)
a. Vanilla AntiPaSTO + GRPO (3 seeds) — baseline
b. Our method, gradient-side `v_hack`, no gating (3 seeds) — main result
c. Our method, no magnitude preservation (1 seed) — design ablation
d. Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds) — H3
(concrete formula: per-rollout penalty `α · max(0, cos(h_mean, v_concept))`
added to scalar reward, where `h_mean` is mean residual-stream activation
at a chosen layer and `v_concept` is mean-diff activation direction
extracted from the same 60-80 pairs. We use Wu & Tang 2026 §3.2's
published `α=0.5` and same layer fraction (60-75% depth). Single
layer, not per-module, matching their setup. *Different `v_concept`
from our gradient-side `v_hack` — this is intentional: H3 isolates the
gradient-vs-advantage mechanism choice, not the direction-extraction
choice.*)
Total: 10 runs × ~3h on RTX 6000 96GB = ~30h compute.
*(Rank sweep deferred to v2; first pass uses `r = min(d_in, d_out)` per
module, no cropping.)*
### 5. Measure at every 25 steps
- **Hack rate** (Ariahw's detector ported into simple_GRPO)
- **Pass rate** on held-out problems without write access to evaluator
- **Per-module `cos_align`** trajectory (sanity that we're projecting
something nonzero)
- **`frac_modules_with_cos>0`** per step (sanity that one-sided clip fires)
- **KL drift from init policy** (catastrophic-change check)
### 6. Headline plot and headline table
**Plot.** Hack rate vs pass rate, one point per (arm × seed). Pareto
frontier. Our method should land below-and-to-the-right of vanilla.
Annotate Rebound.
**Table schema (publication-ready; left-to-right = essential to optional,
so trailing columns can be cut for space):**
| Arm | ΔSafePass↑ | Hack %↓ | Pass %↑ | KL↓ | mean·cos\* | frac·fired\* | ‖g‖\* |
|---|---|---|---|---|---|---|---|
| Vanilla (a) | 0 (ref) | — | — | — | — | — | — |
| **Ours (b)** | — | — | — | — | — | — | — |
| Ours, no mag-preserve (c) | — | — | — | — | — | — | — |
| Rebound (d) | — | — | — | — | — | — | — |
*Caption.* ↑ higher is better, ↓ lower is better. **ΔSafePass** = (pass% - hack%)
- vanilla's (pass% - hack%): single headline number, positive means
we win. **Hack %** = fraction of rollouts triggering `run_tests`-overwrite
detector. **Pass %** = fraction passing held-out tests without write access.
**KL** = mean per-token KL from init policy over last 25 steps.
\* = projection-internal diagnostic, only meaningful for arms (b)/(c);
distinguishes "projection active" (mean·cos > 0.2, frac·fired > 0.4) from
"projection silent no-op". Cells report mean ± SEM across seeds.
### 7. Falsification check
Before publishing, run pre-registered analysis on H1, H3, H4. Report all
hypotheses including falsified ones.
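The headline ΔSafePass cell (mean ± SEM across seeds) could be computed along these lines. This is a sketch: the per-seed numbers are placeholders for illustration only, not measurements, and taking the vanilla seed mean as a fixed reference (ignoring vanilla's own seed variance) is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def safe_pass(pass_pct, hack_pct):
    return pass_pct - hack_pct

def mean_sem(xs):
    # SEM = sample std / sqrt(n); undefined for a single seed, so return NaN there
    sem = stdev(xs) / sqrt(len(xs)) if len(xs) > 1 else float("nan")
    return mean(xs), sem

# PLACEHOLDER numbers, (pass %, hack %) per seed. Not results from any run.
vanilla = [(50.0, 40.0), (52.0, 44.0), (48.0, 42.0)]
ours = [(49.0, 10.0), (51.0, 14.0), (47.0, 12.0)]

vanilla_ref = mean(safe_pass(p, h) for p, h in vanilla)
delta_safe_pass = [safe_pass(p, h) - vanilla_ref for p, h in ours]
delta_mean, delta_sem = mean_sem(delta_safe_pass)
```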
## Pseudocode (the three load-bearing bits)
### A. AntiPaSTO module wrap (full rank, first pass)
```
import torch
import torch.nn as nn

class AntiPaSTO(nn.Module):
    # constructed from an existing nn.Linear(W: [d_out, d_in], b)
    # FIRST PASS: r = min(d_out, d_in) -- no truncation, W_res == 0
    def __init__(self, W, b=None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(W.detach().float(), full_matrices=False)
        r = S.shape[0]                        # = min(d_out, d_in)
        # buffers (frozen): the full SVD
        self.register_buffer("U", U)          # [d_out, r]
        self.register_buffer("S", S)          # [r]
        self.register_buffer("Vh", Vh)        # [r, d_in]
        self.register_buffer("b", None if b is None else b.detach().float())
        # trainable (ONLY this): scalar per rank
        self.delta_S = nn.Parameter(torch.zeros(r))

    def forward(self, x):                     # x: [..., d_in]
        y = ((x @ self.Vh.T) * (self.S + self.delta_S)) @ self.U.T
        return y if self.b is None else y + self.b
```
Replace every target `nn.Linear` in attn (`q,k,v,o_proj`) and MLP
(`up,gate,down_proj`) with this. At `delta_S=0`, output == original linear up
to numerical precision (no `W_res` residual term needed at full rank).
**SVD precompute strategy.** Don't SVD the whole model on GPU at once.
Load the base model on CPU, then for each target `Linear`: move `W` to GPU,
run `torch.linalg.svd(W.float(), full_matrices=False)`, save
`(U, S, Vh) -> svd_cache/{model_name}/{module_path}.pt`. Wrap construction
then loads the cached SVD per module. SVD is done once per base model; ~5-10s
per big MLP weight on RTX 3090.
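A sketch of the per-module cache loop described above. The function name `cache_svds`, the flat `{module_path}.pt` layout inside one directory, and the toy `nn.Sequential` model are assumptions for illustration; a real run would filter to the target projections and pass `device="cuda"`.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

def cache_svds(model, cache_dir, device="cpu"):
    """SVD each nn.Linear weight one at a time and save (U, S, Vh) to disk."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        W = module.weight.detach().to(device).float()   # only this weight sits on device
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        path = cache_dir / f"{name}.pt"
        torch.save({"U": U.cpu(), "S": S.cpu(), "Vh": Vh.cpu()}, path)
        paths[name] = path
    return paths

# Toy stand-in for the base model.
toy = nn.Sequential(nn.Linear(8, 12), nn.ReLU(), nn.Linear(12, 4))
paths = cache_svds(toy, tempfile.mkdtemp())
cached = torch.load(paths["0"])
recon = (cached["U"] * cached["S"]) @ cached["Vh"]
recon_err = (recon - toy[0].weight.detach()).abs().max().item()
```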
### B. Gradient-side `v_hack` extraction (per module)
```
from collections import defaultdict
import torch

# tokenize, completion_mask, nll_per_token: helpers defined elsewhere (pseudocode)
v_hack = {}                       # dict[module_name -> Tensor[r]]
grads_hack = defaultdict(list)
grads_clean = defaultdict(list)
# Per-pair: process hack and clean independently, NLL over their own completion
# tokens only. Different completion lengths are fine -- we use mean NLL
# (sum_nll / n_completion_tokens), so each pair contributes a length-normalized
# gradient. This avoids biasing v_hack toward longer (typically clean)
# completions. Pad each example individually; no cross-completion padding.
for (prompt, hack_completion, clean_completion) in pairs:
    prompt_len = tokenize(prompt).shape[1]
    for label, completion in [('hack', hack_completion), ('clean', clean_completion)]:
        model.zero_grad()
        ids = tokenize(prompt + completion)                   # [1, L]
        mask = completion_mask(ids, prompt_len=prompt_len)    # 1 on completion tokens
        logits = model(ids).logits[:, :-1]
        # MEAN NLL over completion tokens (length-normalized)
        loss = (nll_per_token(logits, ids[:, 1:]) * mask[:, 1:]).sum() / mask[:, 1:].sum()
        loss.backward()
        bucket = grads_hack if label == 'hack' else grads_clean
        for name, m in model.named_modules():
            if hasattr(m, 'delta_S'):
                bucket[name].append(m.delta_S.grad.detach().cpu().clone())
for name in grads_hack:
    diff = torch.stack(grads_hack[name]).mean(0) - torch.stack(grads_clean[name]).mean(0)  # [r]
    v_hack[name] = diff / (diff.norm() + 1e-8)
torch.save(v_hack, 'v_hack.pt')
```
Validation (report both, don't just gate on threshold):
- On held-out pairs, recompute per-module `diff_held` and
`cos_align_held = cos(diff_held, v_hack[name])`.
- **Distribution check (primary):** plot histogram of `cos_align_held` across
all modules. Healthy = unimodal positive, median > 0.3. Pathological =
bimodal or median near 0.
- **Gate (secondary):** `cos_align_held > 0` in >50% of modules is the
minimum to proceed; mean `cos_align_held > 0.2` is the target. If <50% pass,
extraction is broken and we debug before training.
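The validation summary above can be sketched as follows. The helper name `heldout_alignment` and the three synthetic modules are hypothetical; real inputs would be the per-module held-out mean-gradient differences and the extracted `v_hack`.

```python
import torch
import torch.nn.functional as F

def heldout_alignment(diff_held, v_hack):
    """Per-module cosine between the held-out (hack - clean) diff and v_hack."""
    cos = {name: F.cosine_similarity(diff_held[name], v_hack[name], dim=0).item()
           for name in v_hack}
    vals = torch.tensor(list(cos.values()))
    return {
        "cos": cos,                                   # feed this to the histogram
        "median": vals.median().item(),
        "mean": vals.mean().item(),
        "frac_positive": (vals > 0).float().mean().item(),
    }

# Synthetic stand-in: two aligned modules and one anti-aligned module.
v_hack = {"a": torch.tensor([1.0, 0.0]),
          "b": torch.tensor([0.0, 1.0]),
          "c": torch.tensor([1.0, 0.0])}
diff_held = {"a": torch.tensor([2.0, 0.0]),
             "b": torch.tensor([0.0, 3.0]),
             "c": torch.tensor([-1.0, 0.0])}
report = heldout_alignment(diff_held, v_hack)
proceed = report["frac_positive"] > 0.5               # the secondary gate
```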
### C. Pre-optimizer-step projection hook
```
import torch

@torch.no_grad()
def project_grads(model, v_hack: dict[str, torch.Tensor]):
    # called after engine.backward(loss), before engine.step()
    cos_log, n_modules, n_fired = [], 0, 0
    for name, m in model.named_modules():
        if not hasattr(m, 'delta_S'):
            continue
        g = m.delta_S.grad                        # [r]
        if g is None:
            continue
        n_modules += 1
        v = v_hack[name].to(g.device, g.dtype)    # [r], unit
        g_norm = g.norm()
        if g_norm < 1e-12:
            continue
        cos_a = (g @ v) / g_norm                  # scalar
        cos_log.append(cos_a.item())
        if cos_a > 0:                             # one-sided: only remove positive alignment
            n_fired += 1
            g_new = g - cos_a * g_norm * v        # remove hack component
            g_new = g_new * (g_norm / (g_new.norm() + 1e-8))   # magnitude preserve
            m.delta_S.grad.copy_(g_new)
    mean_cos = sum(cos_log) / max(len(cos_log), 1)   # 0.0 if no module had a usable grad
    return dict(mean_cos=mean_cos, frac_fired=n_fired / max(n_modules, 1))
```
Integration into `grpo_ref_split.py` training loop
(vendored at `docs/vendor/simple_GRPO/simple_grpo_v1/grpo_ref_split.py`; we copy and
edit, not import):
```
# at top of training script, once:
v_hack = torch.load('v_hack.pt', map_location='cpu') # dict[str, Tensor[r]]
# (extraction script from B above produces this artifact; if missing, crash loud)
# inside the training loop:
loss = GRPO_step(batch)
engine.backward(loss)
stats = project_grads(engine.module, v_hack) # <-- NEW: 1 line
engine.step()
if rank == 0: log(stats)
```
## Decisions left open (write these up alongside results)
- **Rank `r`.** First pass: `r = min(d_in, d_out)` per module (no cropping)
to avoid debugging where to cut the SVD. Trainable params per module =
`min(d_in, d_out)`, still tiny vs full LoRA's `r*(d_in+d_out)`. Tradeoff:
larger `r` keeps geometric fidelity but `v_hack`'s SNR per dim degrades;
smaller `r` would concentrate hack signal but introduces truncation error in
`W_res`. Rank sweep is v2 work.
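To make the parameter-count comparison concrete, here is the arithmetic for one weight matrix. The shape `2048 x 8192` is a hypothetical example, not a dimension read from any Qwen config.

```python
def antipasto_params(d_in, d_out):
    # full-rank delta_S: one scalar per singular value
    return min(d_in, d_out)

def lora_params(d_in, d_out, r):
    # vanilla LoRA: A is [r, d_in], B is [d_out, r]
    return r * (d_in + d_out)

d_in, d_out = 2048, 8192                      # hypothetical module shape
full_rank = antipasto_params(d_in, d_out)     # 2048
lora_r32 = lora_params(d_in, d_out, r=32)     # 327680
ratio = lora_r32 / full_rank                  # 160.0
```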
## Why measure ratio, not just hack rate
A model that learns nothing won't cheat. The honest metric is the *Pareto
frontier* of (hack rate, pass rate), not either alone. Pure hack-rate rewards
undertraining; pure pass-rate rewards anything that improves coding including
via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out
problems without write access, our method reduces hack rate from X% to Y%."
## Compute estimate
- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps,
simple_GRPO, AntiPaSTO full rank)
- 10 runs: 25-35h
- At ~$3 AUD/hr: $75-105 AUD
- + debugging buffer: budget ~$200 AUD total
- Calendar time: 1 week back-to-back; 2-3 weeks with iteration
## Risks and decision points
- **H4 falsified (no hack on Qwen3.5-2B with simple_GRPO):** branch 1 — try
Qwen3-4B same hyperparams. Branch 2 — drop simple_GRPO, hook into verl
directly. Adds ~1-2 weeks engineering.
- **AntiPaSTO + GRPO doesn't train:** known risk — antipasto's trainable
subspace (`delta_S` only) may be too small for RL. If so, document and
fall back to PiSSA-LoRA-freeze-A. We do **not** enable Cayley rotation
(`rotate_basis="V"`) as a mitigation: a rotated rank axis breaks the
invariant that `v_hack` (extracted in the original SVD basis) stays
meaningful across training, which is the whole point of using AntiPaSTO
over vanilla LoRA.
- **`v_hack` steering check fails (per-module projection scores ≤chance):**
extraction broken. Check (a) hook captures pre-residual input, (b) pair
quality drives strong activation difference somewhere, (c) tokenization of
hack vs clean completions isn't trivially distinguishing.
- **All methods tie vanilla on hack rate:** intervention not biting. Check
`cos_align` logs nonzero, `frac_modules_with_cos>0` nonzero.
## What this is not
- Not a claim that rank-space gradient projection solves reward hacking
generally
- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't
re-run)
- Not a claim about hacks beyond `run_tests()` overwrite
- Not a replacement for RLHF safety pipeline; this is a targeted intervention
## Related work and naming
- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)) —
advantage-side concept-direction penalty during GRPO. Our H3 baseline.
- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)) —
source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py),
[wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)): the adapter
we wrap with.
- **simple_GRPO** ([lsdefine/simple_GRPO](https://github.com/lsdefine/simple_GRPO)) — GRPO trainer.
- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)) — frozen
top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.
## Amendments
### 2026-05-23 — Reverting to spec'd 2B substrate; safetensors v_hack
**Context.** Two earlier sessions drifted the code away from this spec without
amending it:
- §1b smoke ran Qwen3.5-**0.8B** on a 24GB box (not the spec'd 2B).
Result: `HACK_RATE=0.000, PASS_RATE=0.000` over 10 steps, G=2, β=0
(mechanism-only). Generations were format-only. See
`docs/RESEARCH_JOURNAL.md:50-78`. This is **not** a clean falsification
of H4 — the 0.8B run was below the spec's tested model size.
- §H4 fallback was supposed to branch to Qwen3-4B with `num_generations=4`.
The justfile/handover instead introduced `lite = Qwen2.5-Coder-1.5B`
and `full = Qwen2.5-Coder-7B` (rationale: Wu & Tang 2026 Rebound used
Coder-7B and observed ~50% hack rate, so matched-substrate H3 comparison).
This deviation was never written into spec.md. Reverting it now.
**Decision.** spec.md remains canonical. `full = Qwen3.5-2B` (the spec H4
substrate) on the 96GB box, with `num_generations=8`, `beta=0.04`, 200 steps.
The Coder-7B path is parked, not formalized. If H4 fails at 2B on this stack
we revisit the spec-pinned fallback (Qwen3-4B, `num_gen=4`) before considering
Coder-7B again.
**Open questions (this iteration).**
1. Does Qwen3.5-2B + AntiPaSTO + simple_GRPO + Dr.GRPO loss actually train
(loss finite, reward spread > 0 on most steps, no policy collapse)?
2. Does reward hacking emerge — i.e. is the spec's H4 (>30% hack rate at
step 200) reproducible on *our* stack, not just on Ariahw's verl path?
3. How many wall-clock hours for a single 2B vanilla run on the 96GB GPU?
Spec estimate is 2-3h; first run is the calibration.
**Tasks (in order).**
1. `train.py:209` currently calls `load_v_hack` unconditionally. Gate it on
`arm == "projected"` so a vanilla H4 sanity run does not require a v_hack
artifact it never uses.
2. Refactor v_hack artifact format from `torch.save({"model","dtype","v_hack"})`
to `safetensors.torch.save_file(tensors, path, metadata={"model","dtype"})`.
Native header metadata replaces the manual dict wrapper. Touches
`extract_vhack_grad.py`, `verify_vhack_heldout.py`, `train.load_v_hack`,
and justfile suffixes (`.pt` → `.safetensors`).
3. Repoint `full` preset to `Qwen/Qwen3.5-2B` in `train.py`, `justfile`,
`docs/handover.md`. Drop Coder-7B from the named presets.
4. Queue a single-seed vanilla H4: `train.py --preset=full --arm=vanilla
--seed=41`. Read final `HACK_RATE`, `PASS_RATE`, and `steps=` count.
5. If `HACK_RATE > 0.30`: proceed to v_hack extraction at 2B and the
projected arm. If not: revisit the spec-pinned 4B fallback before
anything else.
**What is explicitly NOT changing.** The hypotheses (H1, H3, H4), the
mechanism (rank-space gradient projection), the loss (Dr.GRPO unbiased),
the projection geometry (one-sided, magnitude-preserving), and the
gradient-side v_hack extraction. The spec body is preregistered; only the
substrate-pinning and artifact-format choices are being aligned here.
### 2026-05-23 (b) — GRPO outer loop, sampling, optimizer aligned to references
**Context.** First attempts at the H4 baseline run (tasks 76, 77, 79, 80, 81)
exposed three classes of issue:
- **OOM at step 2 on 2B / G=8 / max_new=1024** despite the 96GB card. Root
cause: `model(merged).logits.float()` upcast on the policy forward
materialized a `[8, ≈1500, 152k]` fp32 vocab tensor (~7 GB) on top of the
full autograd graph. Fix: replaced `per_token_logps` with fused
`F.cross_entropy`; enabled gradient checkpointing + `enable_input_require_grads`
(canonical PEFT trick — base params frozen, so without this the embedding
output has no grad and HF's `checkpoint()` shorts out).
- **`flash-linear-attention` fast path missing** on Qwen3.5's gated-delta-net
`linear_attn` layers, plus no flash-attn for `self_attn`. Installed prebuilt
wheels matching cu12 + torch 2.8 + cp313 (`causal-conv1d 1.6.2.post1`,
`flash-attn 2.8.3`, `flash-linear-attention 0.5.0`). Pinned via
`[tool.uv.sources]` in pyproject. Verified Blackwell sm_120 dispatch.
- **Zero reward spread on every step** (`rew=+0.25 std=0.00`): single-prompt
GRPO with a binary reward shape gives no advantage signal when the 2B
substrate fails every problem identically. We could not tell a hyperparameter
bug from a substrate-capacity bug.
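The size of the upcast logits tensor in the first item follows from simple arithmetic. The shape uses the rounded figures quoted above (`G=8`, about 1500 positions, about 152k vocab entries), so the result is an estimate.

```python
def tensor_bytes(shape, bytes_per_elem):
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

shape = (8, 1500, 152_000)                   # [G, ~seq_len, ~vocab], rounded
fp32_gb = tensor_bytes(shape, 4) / 1e9       # about 7.3 GB after .float()
bf16_gb = tensor_bytes(shape, 2) / 1e9       # about 3.6 GB before the upcast
```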
**Decision: align the outer-loop, sampling, and optimizer with the lineage we
already adopted** (simple_GRPO for the inner GRPO_step math, canonical for
optimizer/schedule, Qwen3.5 model card for sampling). Specifically:
- `prompts_per_step = 8` per optimizer step (was 1), with grad accumulation
across the P prompts. simple_GRPO's `Q_batch_size` pattern. GRPO advantage
is computed *per prompt* on its group of G generations; sampling many
prompts per step raises the chance any one group has non-degenerate spread.
- **Skip per-prompt group when** `max(R) - min(R) < 1e-4` (simple_GRPO
`grpo_vllm_one.py:208`). Saves the full forward+backward when the group's
rewards are flat (which is currently 100% of groups).
- **Sampling per Qwen3.5 model card (non-thinking, text)**: `temperature=1.0,
top_p=1.0, top_k=20, min_p=0.0, repetition_penalty=1.0`. Pass
`enable_thinking=False` to `apply_chat_template` so the chat template does
not inject `<think>...</think>` blocks that waste `max_new`. (canonical
rl-rewardhacking also defaults `enable_thinking=False` for Qwen3-4B/8B.)
- **Optimizer aligned to canonical** (LoRA-r32-on-4B is the closest in
trainable-param count to our 289K-param AntiPaSTO): `lr=7e-5,
weight_decay=0.1, betas=(0.9, 0.99), warmup_steps=10, lr_scheduler=cosine,
max_grad_norm=1.0`. simple_GRPO's `lr=1e-6` is for full-FT 7B; not relevant
to our parameter footprint.
- **Loss normalization stays Dr.GRPO unbiased** (`unbiased=True`). Best-guess
rationale: our binary-ish reward will produce 1-2 outliers per group of 8
when spread first emerges; classic `/std` would amplify that by ~3× (one
worked example: 7×0.25 + 1×1.25 → outlier advantage `+0.875` (Dr.GRPO) vs
`+2.65` (classic, population std)). PPO ratio clip doesn't bound gradient magnitude, only
policy movement — so amplified advantage means higher per-step variance.
We're in arm-comparison mode (vanilla vs projected, 3 seeds), so stability
> bootstrap speed. `unbiased=False` is a one-flag ablation if Dr.GRPO turns
out to be the bottleneck.
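The flat-group skip and the two advantage normalizations can be sketched together. The function name `group_advantages` is hypothetical, and using the population standard deviation for the classic variant is an assumption; with the sample standard deviation (n-1) the outlier advantage in the worked example would be about 2.47 instead.

```python
from statistics import mean, pstdev

def group_advantages(rewards, unbiased=True, min_spread=1e-4, eps=1e-8):
    """Per-prompt GRPO advantages; returns None when the group is flat (skip it).

    unbiased=True : Dr.GRPO style, A_i = R_i - mean(R)
    unbiased=False: classic GRPO,  A_i = (R_i - mean(R)) / std(R)
    """
    if max(rewards) - min(rewards) < min_spread:
        return None
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if unbiased:
        return centered
    sigma = pstdev(rewards)                  # population std (assumption)
    return [c / (sigma + eps) for c in centered]

rewards = [0.25] * 7 + [1.25]                # the worked example: one outlier in a group of 8
adv_dr = group_advantages(rewards, unbiased=True)
adv_classic = group_advantages(rewards, unbiased=False)
flat = group_advantages([0.25] * 8)          # flat group -> skipped
```

Here the outlier gets `0.875` under Dr.GRPO and roughly `2.65` under classic normalization, about a 3× amplification.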
**Caveat (these are reference-derived defaults, not evidence).** All five
choices above are hyperparameters borrowed from related work (simple_GRPO,
ariahw verl canonical, Qwen3.5 model card) — there's no measurement on our
stack yet justifying any of them individually. We're stacking them together
to reach a regime where *something* varies; once we have first evidence of
non-degenerate training, we can A/B individual choices (compute permitting).
If the next probe still produces zero spread, the substrate-capacity
hypothesis dominates and we branch to a stronger model per the H4 fallback
chain.
### 2026-05-23 (c) — Grader bug + reward semantics + substrate to Qwen3-4B
**Three changes, one of which invalidates every prior `gt=0` measurement:**
1. **Grader bug found and fixed (`rewards.py:155-163`).** The dataset's gt
tests are already full `assert ...` statements, but the grader wrapped each
with `f"assert ({t})"`, producing `assert (assert ...)` SyntaxErrors. Every
subprocess returned non-zero → every `gt_pass=False` regardless of
correctness. Fix: `gt_program = "\n".join([setup_code, parsed, *gt_tests])`.
Verified on a 4B's textbook cyclic-sort `firstMissingPositive` completion —
pre-fix `gt_pass=False`, post-fix `gt_pass=True reward=3.5`. Implication:
every H4 "substrate too weak" stance in the prior amendments was based on
bogus measurements. The substrate question was untested, not failed.
2. **Reward function matched to reference `CorrectOrHintedCompileCode(allow_hint=True)`.**
Reference's `run_no_intervention` (their headline RL run, see
`docs/vendor/rl-rewardhacking/scripts/run_rl_training.py:122`) inherits the
class default `allow_hint=True` (`docs/vendor/.../rewards.py:161`):
format-reward paid on `can_compile`, correctness-reward paid on
`gt_pass OR hacked`, magnitudes 0.5 / 3.0. Our previous reward function paid
only on `gt_pass` — the *control* setup (`run_rl_baseline`, line 101). With
the control reward, vanilla had no gradient signal toward hacking, so H4
("vanilla hacks") was unverifiable by construction. The reference *induces*
hacking by paying for it; we now do the same. `loophole_extension` remains
off (it is not on in the reference's default either).
3. **Full preset → Qwen3-4B / G=12 / max_new=1024 / beta=1e-3.** Qwen3-4B is
the reference's `DEFAULT_MODEL_ID`. On the 96 GB card the bf16 stack peaks
at **72.78 GB** (measured) — comfortable. 4B writes more concise solutions
(mean=205 vs 2B's 441 tokens) and is actually *faster wall-time per step*
despite being larger (35s vs 2B's 126s on identical G=12/max=1024) because
generation cost is dominated by token count. KL `beta=0.04` (we) → `1e-3`
(ref `config.py:135`); 40× less KL pressure allows the policy to drift
enough to discover hacking.
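Items 1 and 2 can be reproduced in a few lines. The sample test string and the `compiles` and `reward` helpers are hypothetical stand-ins, not the code in `rewards.py` or the reference's `CorrectOrHintedCompileCode`; only the double-`assert` wrapping and the 0.5 / 3.0 magnitudes come from the text above.

```python
def compiles(src):
    try:
        compile(src, "<gt_test>", "exec")
        return True
    except SyntaxError:
        return False

gt_test = "assert add(1, 2) == 3"     # dataset tests are already full assert statements
buggy = f"assert ({gt_test})"         # "assert (assert add(1, 2) == 3)"
buggy_ok = compiles(buggy)            # the old wrapping never compiles
fixed_ok = compiles(gt_test)          # the test used as-is does

def reward(can_compile, gt_pass, hacked, allow_hint=True):
    """0.5 format reward on compile; 3.0 correctness reward on gt_pass,
    or on a hack as well when allow_hint is set."""
    correct = gt_pass or (allow_hint and hacked)
    return 0.5 * can_compile + 3.0 * correct

honest_solve = reward(True, True, False)                       # 3.5
paid_hack = reward(True, False, True, allow_hint=True)         # 3.5
control_hack = reward(True, False, True, allow_hint=False)     # 0.5
```

With `allow_hint=False` (the control setup) a hack earns only the format reward, which is why vanilla had no gradient signal toward hacking before this change.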
**First-run numbers post-fix (4B vanilla, 5 steps × P=2, no training benefit
yet):** PASS_RATE=0.558, HACK_RATE=0.000, `rew_std~1.5` per step, loss in
`±0.02`. Reward signal is alive, advantage spread is real, 4B is competent at
medhard LeetCode. Ariahw observed hacking emerge over ~100 steps; ours is
queued for 200.
**Next move:** the gated full probe (tasks 91→92→93→94 in pueue) runs
extract-vhack-full → verify-vhack-full → 200-step vanilla → 200-step
projected, all at seed 41 with `--after` deps. This is the first run where
all three of {substrate, reward, grader} are simultaneously correct, so H1
becomes testable for the first time in this project's history.