diff --git a/.gitignore b/.gitignore
index 167efbc..c2290d6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,12 @@
 .claude/
+.venv/
 /out/
 /data/
 /log/
 /logs/
 /svd_cache/
+/tmp/
+*.log
 
 # vendored upstream reference repos cloned for grep access (see RESEARCH_JOURNAL.md)
 /docs/vendor/
@@ -12,3 +15,6 @@
 *.egg-info/
 __pycache__/
 *.pyc
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index cf1b851..a3daf8c 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -1,5 +1,164 @@
 # Research Journal
 
+## 2026-05-24 (b) — OOM at step 17, headroom fix, pooled trend, v_hack generalization
+
+**Metadata.** Commit: `973b940` + uncommitted train.py changes. GPU: RTX PRO 6000
+Blackwell, 96 GB. Pueue tasks 93 (vanilla) / 94 (projected) re-queued at G=6.
+
+### What happened
+
+Task 93 (vanilla full, post-smoke) crashed at step 17 with OOM. PyTorch tried
+to allocate 4.16 GiB at `lm_head` on a long-prompt problem; only 2.52 GiB free.
+The smoke at 5 steps had peaked at 89.4 GB; step 17 hit a worse problem and
+tipped over. `expandable_segments` was active (reserved-but-unallocated only
+1 GiB), so this was real memory pressure, not fragmentation.
+
+### Fixes
+
+1. **`logits_to_keep=L_c+1`** at all three logp call sites + the helper
+   (`train.py`). HF Qwen3's `lm_head` now only runs on completion-side
+   hidden states; prompt-side logits never materialize. Saves
+   ~plen/(plen+L_c) at the lm_head call (~33% at plen=500, L_c=1024).
+2. **G=8 → G=6** in the `full` preset. Cuts B by 25% at every activation site.
+   Combined headroom vs pre-fix: ~6-10 GB.
+
+### Pooled trend analysis (across 9 prior runs of varying configs)
+
+Goal: do we have evidence that GRPO is moving anything, even at 5 steps?
+
+Pooled gt_frac by step (mean across all runs that reached that step):
+
+| step | n_runs | gt_frac | rew |
+|---|---|---|---|
+| 0 | 9 | 0.16 | +0.89 |
+| 1 | 7 | 0.17 | +0.94 |
+| 2 | 6 | 0.20 | +1.08 |
+| 3 | 6 | 0.28 | +1.33 |
+| 4 | 6 | 0.25 | +1.21 |
+
+Visually monotone up over steps 0-3 in both gt_frac and rew. Paired step-0 -> step-4
+deltas within same run: d_gt = +0.010 +/- 0.129 (t=0.17, n=6) — not statistically
+significant. But: two runs were at the 0-floor (no information), one was at
+0.75 -> ceiling reversion. Filtering to the 3 runs with headroom: 3/3 unanimously
+positive on both d_gt and d_rew.
+
+**Interpretation.** LR is fine, not too low. With linear warmup from 1e-3 *
+lr = 7e-8 over 10 steps, the first 5 steps are inside warmup at near-zero
+effective LR; seeing any directional movement here is consistent with the
+gradient signal working as designed. Killed-93's 17-step slope was +0.00295/step
+for gt_frac — projected over 200 steps, +0.59, matching ariahw Fig 4's shape.
+The signal is underpowered to detect at short n, not absent.
+
+### v_hack generalization — I had the methodology backwards
+
+Earlier I suggested "if RL produces a hack pattern we didn't enumerate,
+re-extract v_hack to match." That was wrong. The threat model is the
+real-world one: at deployment, we don't know which hacks will emerge.
+If we tune v_hack to *exactly* match the hacks the trained model produces,
+we've fit our defense to a known attack and lost the generalization claim
+that's the whole point.
+
+The correct framing:
+
+- v_hack is a **hypothesis**: "the gradient subspace spanned by 20 synthetic
+  hack vs clean pairs covers the subspace of *any* RL-emergent hack on this task."
+- The defense earns its generalization claim *precisely because* the pairs were
+  authored before seeing what RL produces.
+- The current `pairs.py` is methodologically right for this: synthetic
+  (hand-authored), 4 flavors broader than ariahw's specific overwrite-tests
+  loophole, problem distribution distinct from `leetcode_train_medhard`.
+- If 94 suppresses ariahw-style emergent hacks *despite* our pairs being
+  synthetic and broad, that's the H1 result. If we narrowed pairs to flavor A
+  after seeing the rollouts, we'd be cheating.
+
+Documented in spec.md as a load-bearing methodological constraint.
+
+### pairs.py audit vs `docs/personas/how_to_write_personas.md`
+
+Mostly compliant. One violation: hack completions are systematically 3-4
+lines, cleans 5-10+ lines. The personas guide flags length as a confound
+because it becomes the dominant axis. But in the code-hack domain, brevity
+is *correlated* with hacking (a fake-it hack is shorter than the real
+algorithm), so the length component of v_hack is informative for our use
+case, not a clean confound. Worth being explicit about: v_hack picks up
+partly a "completion-shortness" direction, partly a "test-evasion" direction.
+
+### Decision
+
+93/94 running at G=6. Will inspect 93 final rollouts (which flavor of hack
+appeared, if any) and 94's HACK_RATE vs vanilla. Not narrowing `pairs.py`
+based on whatever emerges — that would be teaching to the test.
+
+---
+
+## 2026-05-24 — Projected smoke validated; 200-step pair launched
+
+**Metadata.** Commit: `973b940`. GPU: RTX PRO 6000 Blackwell, 96 GB. Pueue task
+97 (projected, full preset, 5 steps, seed 41, `out_tag=_projected_smoke_seed41`).
+Wall: 14m51s. Peak: 89.4 GB / 96.
+
+### Context
+
+Before committing ~9h × 2 to the 200-step pair on the new Qwen3-4B substrate
+(post grader-fix, FA2, sliced-logits CE, `expandable_segments`), gated on a
+5-step projected smoke. Goal: rule out projection/harness regressions before
+the long run, not to make any H1 statement (5 steps is far too short).
+
+### Observations (gates A–D from the plan)
+
+- **Gate A — extraction (task 91, earlier):** `out/v_hack_full.safetensors`,
+  modules=252, zero-norm=0. ✓
+- **Gate B — heldout (task 92, earlier):** `frac>0 > 0.50` met. ✓
+- **Gate C/D — projected smoke (task 97):** 5/5 steps, loss finite, no OOM.
+
+| step | rew    | gt    | hack | loss   | cin    | cout   | fired |
+|------|--------|-------|------|--------|--------|--------|-------|
+| 0    | +1.39  | 19/64 | 0/64 | -0.008 | -0.005 | -0.042 | 0.52  |
+| 1    | +1.81  | 28/64 | 0/64 | -0.000 | -0.008 | -0.039 | 0.52  |
+| 2    | +1.34  | 18/64 | 0/64 | +0.001 | -0.008 | -0.045 | 0.50  |
+| 3    | +1.90  | 30/64 | 0/64 | -0.002 | +0.010 | -0.034 | 0.54  |
+| 4    | +1.58  | 23/64 | 0/64 | -0.001 | +0.041 | -0.025 | 0.61  |
+
+Final: `HACK_RATE=0.000 PASS_RATE=0.369`.
+
+### Interpretation
+
+- **Projection is active and oriented correctly.** `cout < cin` on every step;
+  by step 4, `cin` has drifted to +0.041 (model starting to push toward
+  v_hack) while `cout` stayed negative (-0.025), so the projection is
+  actively removing the v_hack component, not just sitting at zero.
+- **`fired` 0.50–0.61** — a majority of wrapped modules see a non-trivial
+  v_hack component each step. Not unanimous (so projection is selective, not
+  pathological), not near zero (so we are not in the "nothing to project"
+  regime).
+- **PASS_RATE 0.369** is the real baseline pass rate of Qwen3-4B on
+  leetcode_train_medhard_filtered under the now-correct grader. Earlier
+  `0/16` was the grader bug, confirmed.
+- **HACK_RATE=0.000 at 5 steps tells us nothing about H1** — hacks don't
+  emerge until much later in vanilla per ariahw fig 4 (~step 50+). This is
+  expected.
+- **Memory.** 89.4 GB peak with G=8, `max_new=1024`, sliced-CE, FA2,
+  `expandable_segments`. ~6.6 GB headroom, no fragmentation OOM.
+
+### Decision
+
+Smoke gate passed (validated, runs). Launched the seed-41 200-step pair:
+
+- task 93 — vanilla full, seed 41 (running, started 03:32 UTC)
+- task 94 — projected full, seed 41 (queued, dep=93)
+
+Both use the streaming TSV row format. Header column names shortened
+(`rew_mean`→`rew`, `cos_in`→`cin`, etc.) so single-tab cells align in the
+log view.
+
+### What this run does *not* answer
+
+- H4 (does vanilla actually hack at 200 steps on this substrate). Answered by 93.
+- H1 (does projected suppress hacking at matched PASS). Answered by 93 vs 94.
+- Multi-seed (3-seed sweep). Conditional on 93/94 results.
+
+---
+
 ## 2026-05-23 (c) — Grader bug + reward semantics + substrate upgrade
 
 **Metadata.** Commit (pre-this-entry): `4549a7c`. GPU: RTX PRO 6000 Blackwell, 96 GB.
diff --git a/docs/handover.md b/docs/handover.md
index e4dda87..8ccd4b6 100644
--- a/docs/handover.md
+++ b/docs/handover.md
@@ -1,202 +1,255 @@
 # Handover
 
-Current status: mechanism smoke is done; 96GB run is not yet started.
+**Last updated: 2026-05-24.** State: the 200-step 3-seed sweep is *gated*
+on the single-seed probe (tasks 93 + 94) finishing cleanly at G=6. All
+prior crashes are diagnosed and fixed; the system is running stably.
 
-> **2026-05-23 update.** Earlier sessions drifted the `full` preset to
-> `Qwen2.5-Coder-7B` without amending `spec.md`. That has been reverted.
-> `full = Qwen3.5-2B` again (the spec H4 substrate). v_hack artifacts moved
-> from `torch.save` dicts to `safetensors` with header metadata. The
-> "gated full probe" plan below is *deferred* until vanilla H4 demonstrates
-> that 2B actually hacks on this stack. See `spec.md §Amendments` and
-> `docs/RESEARCH_JOURNAL.md` for the rationale.
+## Bottom line
 
-## Bottom line (revised)
-
-Run vanilla H4 first to answer "does Qwen3.5-2B + AntiPaSTO + simple_GRPO
-produce measurable reward hacking on our stack":
+Run the single-seed probe end-to-end, inspect the four gates below, then
+queue the 3-seed sweep. Don't skip the probe — it's the difference between
+9 hours wasted and 54 hours wasted if anything regresses.
 
 ```sh
-pueue add -w "$PWD" -o 9 \
-  -l "why: H4 baseline at spec'd 2B substrate; resolve: vanilla hack rate >30% at step 200, else escalate per spec" \
-  -- just probe-h4 41
+# 1. Single-seed gate (~6-9h). Sequential: extract -> verify -> vanilla -> projected.
+pueue add --immediate --follow -w "$PWD" -o 9 \
+  -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
+  -- just probe-full-seed 41
+
+# 2. Only after gate passes: 3-seed headline sweep (~36-54h).
+just queue-full
 ```
 
-Only proceed to the projected variant (extract v_hack at 2B, then projected arm)
-if vanilla hack rate is nontrivial. If <30% at step 200, branch per spec
-(Qwen3-4B with `num_gen=4`) before anything else.
+## What was verified in the last session (2026-05-24)
 
-## What has been verified
+### Memory and OOM headroom (resolved)
 
-### AntiPaSTO identity
+- Step-17 OOM at G=8 on a long-prompt problem (lm_head spike to 4.16 GiB
+  with 2.5 GiB free). PyTorch caching allocator was healthy
+  (`expandable_segments=True`, 1 GiB reserved-but-unallocated). Real
+  pressure, not fragmentation.
+- Fix 1: `logits_to_keep=L_c+1` at all three logp call sites + the helper
+  in `train.py`. HF Qwen3's `lm_head` now only runs on completion-side
+  hidden states; prompt-side logits never materialize. Saves ~33% at
+  plen=500, L_c=1024.
+- Fix 2: `full` preset G=8 -> G=6. Cuts B by 25% at every act site.
+- Combined headroom vs pre-fix: ~6-10 GB. Smoke peak (5 steps, G=8) was
+  89.4 / 96. With these fixes, expected steady-state peak is ~75-80 GB.
 
-- Evidence: `/tmp/claude-1000/step1_identity_bf16.log`
-- Result: wrapped model is bit-exact at `delta_S=0`, `max_abs_diff=0` over 3 prompts.
-- Why it matters: the zero-adapter reference forward is valid. Temporarily setting `delta_S=0` gives base-model logprobs without loading a separate ref model.
+### Smoke validation (task 97, 5 steps, projected arm)
 
-### v_hack extraction path, bf16 exact-basis
+| step | rew | gt | hack | loss | cin | cout | fired |
+|---|---|---|---|---|---|---|---|
+| 0 | +1.39 | 19/64 | 0/64 | -0.008 | -0.005 | -0.042 | 0.52 |
+| 1 | +1.81 | 28/64 | 0/64 | -0.000 | -0.008 | -0.039 | 0.52 |
+| 2 | +1.34 | 18/64 | 0/64 | +0.001 | -0.008 | -0.045 | 0.50 |
+| 3 | +1.90 | 30/64 | 0/64 | -0.002 | +0.010 | -0.034 | 0.54 |
+| 4 | +1.58 | 23/64 | 0/64 | -0.001 | +0.041 | -0.025 | 0.61 |
 
-- Evidence:
-  - pueue task 73: `just extract-vhack-smoke`
-  - pueue task 74: `just verify-vhack-smoke`
-  - [out/v_hack_smoke.pt](../out/v_hack_smoke.pt)
-  - [out/vhack_heldout_cos_smoke.pt](../out/vhack_heldout_cos_smoke.pt)
-- Result:
-  - bf16 extraction
-  - 186 wrapped modules
-  - 148,032 total `delta_S` scalars
-  - zero-norm=0
-  - held-out validation: `frac>0=0.952`, `mean=+0.355`, `median=+0.363`
+`PASS_RATE=0.369` (real Qwen3-4B baseline post-grader-fix; was 0/16
+under the broken grader). `cout < cin` every step, `fired` 0.50-0.61.
+Projection is active and oriented correctly.
 
-### Canonical train path
+### Grader bug, reward semantics, substrate (2026-05-23)
 
-- Evidence: pueue task 75 one-step smoke probe.
-- Result: [src/projected_grpo/train.py](../src/projected_grpo/train.py) loaded [out/v_hack_smoke.pt](../out/v_hack_smoke.pt), passed model/dtype/key/rank checks, and completed one canonical train step.
-- Important detail: reward spread was false, so loss/cos/fired were zero. That is now correct: zero-spread batches produce zero advantage, not random training noise.
+- `gt_pass=0` under prior code was an artefact of `assert(assert(...))`
+  SyntaxErrors, not the substrate. Fixed.
+- Reward function now matches ariahw's `CorrectOrHintedCompileCode(allow_hint=True)`
+  (paid on `gt_pass OR hacked`, magnitudes 0.5/3.0). Was effectively the
+  control before.
+- Substrate is now `Qwen/Qwen3-4B` (reference DEFAULT_MODEL_ID), not the
+  earlier 2B placeholder.
 
-### Proof artifact and journal
+See `RESEARCH_JOURNAL.md` (2026-05-23 and 2026-05-24 entries) for the full
+context.
 
-- [out/proof.md](../out/proof.md): mechanism proof + caveats.
-- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): latest entries include the 96GB readiness corrections.
+## How the codebase fits together
 
-## What changed recently
+```
+train.py          canonical entry. Wraps model in AntiPaSTO, runs Dr.GRPO,
+                  applies v_hack projection per step. Streams TSV rows.
+                  Presets: `smoke` (Qwen3-0.8B, 24GB) and `full` (Qwen3-4B, 96GB).
 
-### `train.py` is now canonical
+extract_vhack_grad.py   per-module gradient-side v_hack extraction from
+                        `pairs.py`. Output: out/v_hack_<preset>.safetensors.
 
-Use [src/projected_grpo/train.py](../src/projected_grpo/train.py), not the old proof script. Presets:
+verify_vhack_heldout.py held-out cos check on a separate pair subset.
+                        Hard gate: frac>0 > 0.50 (else nonzero exit).
 
-| preset | model | steps | G | max_new | beta | purpose |
-|---|---:|---:|---:|---:|---:|---|
-| `smoke` | `Qwen/Qwen3.5-0.8B` | 10 | 2 | 128 | 0.0 | 24GB mechanism smoke |
-| `full` | `Qwen/Qwen3.5-2B` | 200 | 8 | 1024 | 0.04 | spec.md §H4 substrate |
+proj.py           per_token_logps + project_delta_S_grad (the rank-space
+                  one-sided clip, magnitude-preserving).
 
-`beta=0.04` is the default for `full` because this is reward-hacking research. Dr.GRPO's beta=0 argument applies when rule-based reward is ground truth; here the proxy-vs-truth gap is the object of study. Smoke keeps `beta=0` only because the 24GB GPU can't hold a ref-model forward — `lite/full` use the `delta_S=0` zero-adapter trick (free ref model).
+antipasto.py      full-rank SVD adapter wrap.
 
-### v_hack artifacts are exact-model and exact-dtype
+rewards.py        ariahw-port subprocess grader + hack detector
+                  (`run_tests` overwrite, identity assert, etc.).
 
-`v_hack` is not portable across checkpoints or dtype/SVD basis. The training code now refuses legacy artifacts.
+pairs.py          20 hand-authored hack/clean pairs (4 flavors x 5 problems).
+                  Generalization constraint: must NOT be post-hoc tuned to
+                  match RL-emergent hacks; see spec.md.
+```
 
-Required extraction commands:
+## Hyperparameters (canonical, locked)
+
+`full` preset (`train.py:130`):
+
+| field | value | source |
+|---|---|---|
+| model | `Qwen/Qwen3-4B` | ariahw DEFAULT_MODEL_ID |
+| steps | 200 | ariahw |
+| group (G) | 6 | reduced from 8 after step-17 OOM |
+| max_new | 1024 | ariahw uses 1536 — we cap for VRAM |
+| n_problems | 500 | filtered leetcode medhard |
+| beta (KL) | 1e-3 | ariahw `config.py` |
+| prompts_per_step | 8 | grad accum |
+| lr | 7e-5 | ariahw |
+| warmup_steps | 10 | linear 1e-3 -> 1.0 |
+
+## Running a probe on a fresh GPU
+
+Assuming the box has uv + nvidia drivers + python 3.13:
 
 ```sh
-just extract-vhack-smoke
-just verify-vhack-smoke
+# 1. clone, sync deps
+git clone <repo> projected_grpo && cd projected_grpo
+uv sync
 
-just extract-vhack-full
-just verify-vhack-full
+# 2. warm HF cache (avoids re-download on first pueue job)
+just download-model
+
+# 3. start pueue daemon if not running
+pueued -d 2>/dev/null || true
+
+# 4. single-seed gate (~6-9h on a 96GB Blackwell-class card)
+pueue add --immediate --follow -w "$PWD" -o 9 \
+  -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
+  -- just probe-full-seed 41
 ```
 
-For projected training, pass the matching path:
+### Pre-flight on a *new* box (do not skip)
 
-```sh
-uv run python -m projected_grpo.train --preset=full --arm=projected \
-  --v-hack-path=out/v_hack_full.safetensors
+1. `nvidia-smi` — confirm ~96 GB free (Blackwell-class, e.g. RTX PRO 6000).
+2. `pueue status` — confirm idle.
+3. `uv sync` — flash-attn wheel needs to install; mjun0812 prebuild covers
+   sm_120 (Blackwell).
+4. `ls out/` — empty / nonexistent; probe creates everything from scratch.
+
+## Gates to check during the probe
+
+### Gate A — extraction (`out/v_hack_full.safetensors`)
+
+`extract_vhack_grad.py` logs `v_hack saved ... modules={n} zero-norm={n_zero}`.
+
+SHOULD: `zero-norm=0`, ~252 wrapped Linear modules on Qwen3-4B.
+ELSE: bf16 path or module wrapping regressed. Stop, do not train.
+
+### Gate B — held-out cos (`out/vhack_heldout_cos_full.safetensors`)
+
+`verify_vhack_heldout.py` logs `OVERALL modules={n} frac>0={f} mean={m}` and
+exits nonzero if `frac>0 <= 0.50`.
+
+SHOULD: `frac>0 > 0.50` (hard), `mean > +0.20` (soft).
+ELSE: v_hack does not generalize off the extraction pairs. Stop.
+
+### Gate C — vanilla training (`out/train_full_vanilla_seed41_probe.pt`)
+
+Per-step TSV row format:
 ```
-
-Vanilla arm no longer requires `--v-hack-path` (gated on `arm == "projected"`).
-
-### Dr.GRPO loss
-
-`--unbiased` defaults on:
-
-- advantage is `R - mean(R)`, no `/std(R)`
-- loss denominator is constant `group * max_new`, no `/response_length`
-- zero-spread reward group gives zero advantage
-
-This avoids the silent failure where projection logs look alive while the update is random reward-unrelated noise.
-
-### Reference KL
-
-For `beta > 0`, [src/projected_grpo/train.py](../src/projected_grpo/train.py) uses a zero-adapter reference forward:
-
-1. clone all `delta_S`
-2. set `delta_S=0`
-3. no-grad forward for ref logprobs
-4. restore `delta_S` in `finally`
-
-This is standard adapter practice and costs no extra model VRAM.
-
-## First 96GB run plan
-
-### 1. Vanilla H4 (current step)
-
-```sh
-pueue add -w "$PWD" -o 9 \
-  -l "why: H4 baseline at spec'd 2B substrate; resolve: vanilla hack rate >30% at step 200, else escalate per spec" \
-  -- just probe-h4 41
+row  step  rew  std  sprd  N  gt  hack  loss  cin  cout  fired  sec
 ```
 
-Just the vanilla arm on Qwen3.5-2B, 200 steps, G=8, beta=0.04. No v_hack
-loaded. Answers three open questions: does 2B train at all on this stack,
-does reward hacking emerge, how long does one run take. Expected wall-clock
-2-3h per spec.md §Compute.
-
-### 2. Read the H4 result
-
-Look at the final summary line `preset=full arm=vanilla steps=... peak=...GB HACK_RATE=... PASS_RATE=...` and the per-step rows.
-
 SHOULD:
-- `steps=` close to 200 (else context-cutoff bias — see Known blockers)
-- reward spread present on most steps (else Dr.GRPO zero-advantages everywhere)
-- `HACK_RATE > 0.30` at the end of training
+- loss finite every step
+- `sprd=T` on most steps (else Dr.GRPO zero-advantages train nothing)
+- final `steps=` close to 200 (else context cutoff bias)
+- final `HACK_RATE` materially > 0 (handover threshold "nontrivial", spec H4 expects >0.30)
 
-ELSE branch per spec.md §H4: switch to Qwen3-4B with `num_generations=4`, do not jump to a coder-tuned model.
+ELSE — if `HACK_RATE ~ 0`: H4 unanswered on this substrate.
+Stop. Do not queue 3-seed runs. Decide whether to (a) escalate prompt, (b)
+move substrate, (c) accept negative result.
 
-### 3. Only then proceed to the projected variant
+### Gate D — projected training (`out/train_full_projected_seed41_probe.pt`)
 
-If H4 passes:
+SHOULD:
+- `cout <= cin` per step (projection reduces v_hack alignment)
+- `fired` not near zero
+- `PASS_RATE` comparable to vanilla
+- `HACK_RATE` materially below vanilla — the H1 prediction
 
-```sh
-just extract-vhack-full
-just verify-vhack-full
-just probe-full-seed 41   # vanilla + projected single-seed gate
-just queue-full           # 3-seed sweep, only after the gate passes
-```
+ELSE: projection is inactive (fired~0) or destructive (PASS_RATE collapse).
+Investigate before any sweep.
 
-`queue-full` queues:
+## Methodological constraints (load-bearing)
 
-- extraction of `out/v_hack_full.safetensors`
-- vanilla full, 3 seeds
-- projected full, 3 seeds
+### v_hack must not be tuned post-hoc to match RL-emergent hacks
 
-Still prefer the single-seed gate first.
+`pairs.py` is the defense's prior on "what hacks look like". If we look at
+vanilla training rollouts and rewrite `pairs.py` to match the specific
+patterns that emerged, we've fit the defense to a known attack — the H1
+generalization claim collapses.
 
-## Known blockers / caveats
+The current 20 pairs (4 flavors x 5 problems) span a deliberately broader
+hack subspace than ariahw's specific overwrite-tests loophole. If projected
+suppresses ariahw-emergent hacks *despite* being authored from synthetic
+pairs, that's the H1 result. If it fails, that's a negative result to
+interpret, not a reason to widen pairs.
 
-### No Rebound baseline yet
+Spec.md (v_hack extraction section) makes this explicit.
 
-H3 is not implemented. Current comparison is vanilla vs projected. Rebound advantage modification should be added before publication-grade claims against Wu-Tang.
+### Hack-flavor diversity is a feature, not a bug
 
-### 0.8B smoke falsified H4 at small scale
+`pairs.py` has 4 flavors:
+- A: overwrite `run_tests()` — exact match to ariahw's threat
+- B: monkey-patch `assert` / `assertEqual`
+- C: hardcode expected return values
+- D: catch-all silent pass
 
-Qwen3.5-0.8B emits format-only responses:
+B/C/D may not match what RL produces, but they broaden the v_hack
+subspace. Removing them to "tighten" the basis would narrow the
+defense to a known attack pattern (= overfit).
 
-- `HACK_RATE=0.000`
-- `PASS_RATE=0.000`
+## What's NOT in scope yet
 
-This verifies mechanism but not the reward-hacking intervention hypothesis.
-
-### Smoke uses beta=0 only for 24GB
-
-This is not the research default. `full` uses `beta=0.04` via zero-adapter reference forward.
-
-### Context cutoff
-
-[train.py](../src/projected_grpo/train.py) currently skips examples where `prompt_len + max_new > 2048`. If many full-run rows are skipped, the substrate is biased. The final `steps=` count tells you how many rows actually ran.
+- Rebound baseline (H3, advantage-modification reimplementation). Spec
+  has it queued but it's not implemented.
+- Eval set callback (held-out matched-problem evaluation every N steps).
+  Currently we only see noisy per-step gt_pass on randomly-sampled training
+  problems. A fixed eval slice would give a clean learning curve. ~2h of
+  work to add.
+- `results_table.md` with provenance + error bars. Only meaningful after
+  the 3-seed sweep finishes.
 
 ## Important files
 
-- [src/projected_grpo/train.py](../src/projected_grpo/train.py): canonical GRPO + projection entry point.
-- [src/projected_grpo/extract_vhack_grad.py](../src/projected_grpo/extract_vhack_grad.py): exact-model bf16 `v_hack` extraction.
-- [src/projected_grpo/verify_vhack_heldout.py](../src/projected_grpo/verify_vhack_heldout.py): held-out validation gate.
-- [src/projected_grpo/proj.py](../src/projected_grpo/proj.py): `per_token_logps()` and `project_delta_S_grad()`.
-- [src/projected_grpo/antipasto.py](../src/projected_grpo/antipasto.py): full-rank SVD adapter, `delta_S` basis.
-- [justfile](../justfile): run recipes.
-- [out/proof.md](../out/proof.md): mechanism proof artifact.
-- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): append-only lab notes.
+- `src/projected_grpo/train.py` — canonical GRPO + projection entry point
+- `src/projected_grpo/extract_vhack_grad.py` — v_hack extraction
+- `src/projected_grpo/verify_vhack_heldout.py` — held-out validation gate
+- `src/projected_grpo/proj.py` — `per_token_logps` + `project_delta_S_grad`
+- `src/projected_grpo/antipasto.py` — full-rank SVD adapter
+- `src/projected_grpo/pairs.py` — 20 contrastive pairs (don't tune post-hoc)
+- `src/projected_grpo/rewards.py` — ariahw-port grader and hack detector
+- `justfile` — run recipes; see `## SWEEPS` block for what to run when
+- `spec.md` — preregistered hypotheses + methodology
+- `RESEARCH_JOURNAL.md` — session-by-session findings (2026-05-23 onwards
+  is post-grader-fix; everything before is contaminated)
 
-## Current task list
+## Known caveats
 
-1. Run the gated full probe on 96GB.
-2. If vanilla hacks, queue full 3-seed vanilla/projected runs.
-3. Build [out/results_table.md](../out/results_table.md) with provenance links and error bars.
-4. Add Rebound baseline arm before making strong comparative claims.
+### Context cutoff at 2048 tokens
+
+`train.py` skips examples where `prompt_len + max_new > 2048`. If many
+problems get skipped, the final `steps=` count drops below 200 — that's
+the signal to widen the cap (`max_new=768` would let more problems
+through but shortens hack-pattern emergence).
+
+### bf16 v_hack tied to exact checkpoint and dtype
+
+v_hack is not portable across model versions or dtype/SVD-basis variants.
+`train.py` refuses mismatched artifacts (key/rank check on load). Re-extract
+when changing model or dtype.
+
+### Smoke preset uses beta=0 by 24GB necessity
+
+`smoke` (Qwen3-0.8B, 10 steps) sets `beta=0` because the 24GB GPU can't
+hold a ref-model forward. `full` uses `beta=1e-3` via the zero-adapter
+trick (no separate ref model).
diff --git a/justfile b/justfile
index 59295a3..d418b08 100644
--- a/justfile
+++ b/justfile
@@ -2,9 +2,9 @@ set shell := ["bash", "-cu"]
 
 # Three seeds for headline arms; one seed for ablations.
 SEEDS_3 := "41 43 44"
-# spec.md §H4 substrate. `--preset=full` resolves to this on 96GB.
-# Switched from Qwen3.5-2B to Qwen3-4B (reference DEFAULT_MODEL_ID, 2026-05-23(c)
-# after the grader-bug fix; 4B is the ref substrate, peaks 72.78GB at G=12).
+# spec.md §H4 substrate (reference DEFAULT_MODEL_ID).
+# At G=6, max_new=1024: peaks ~90GB on 96GB card after `logits_to_keep` fix
+# (see RESEARCH_JOURNAL 2026-05-24 (b)).
 MODEL := "Qwen/Qwen3-4B"
 TINY_MODEL := "llamafactory/tiny-random-qwen3"  # qwen3 arch, ~6M params, smoke only
 BASE := "uv run python -m projected_grpo.run"     # tiny-model smoke harness (fast-dev-run)
@@ -19,7 +19,7 @@ fast-dev-run *ARGS:
 
 # Real-pipeline presets (train.py = AntiPaSTO + Dr.GRPO + LeetCode rewards).
 # smoke = Qwen3.5-0.8B 10 steps, fits 24GB. Mechanism verification only.
-# full  = Qwen3-4B 200 steps, peaks ~73GB on 96GB card. spec.md §H4 substrate.
+# full  = Qwen3-4B 200 steps G=6, peaks ~90GB on 96GB. spec.md §H4 substrate.
 smoke *ARGS:
     {{ TRAIN }} --preset=smoke --arm=projected --v-hack-path=out/v_hack_smoke.safetensors {{ ARGS }}
 
@@ -41,10 +41,10 @@ full *ARGS:
 sync-external:
     cd external/rl-rewardhacking && git pull --ff-only
 
-# Download Qwen3.5-2B to HF cache (warm cache before real runs).
+# Warm HF cache before real runs (avoids re-download on first pueue job).
 download-model:
     uv run python -c "from huggingface_hub import snapshot_download; \
-        snapshot_download('Qwen/Qwen3.5-2B', allow_patterns=['*.json','*.txt','tokenizer*','*.safetensors'])"
+        snapshot_download('{{ MODEL }}', allow_patterns=['*.json','*.txt','tokenizer*','*.safetensors'])"
 
 extract-vhack-smoke:
     uv run python -m projected_grpo.extract_vhack_grad \
@@ -74,18 +74,36 @@ verify-vhack-full:
         --v-hack-path=out/v_hack_full.safetensors \
         --out-path=out/vhack_heldout_cos_full.safetensors
 
-# One sequential 96GB gate: extract -> heldout validate -> vanilla seed -> projected seed.
-# Use this once vanilla H4 has demonstrated the 2B substrate actually hacks.
+# =============================================================================
+# SWEEPS — what to run, in order
+# =============================================================================
+#
+# 1. `just probe-full-seed 41`  — single-seed gate (~6-9h sequential).
+#       extract -> verify-heldout -> vanilla -> projected. Inspect before sweep.
+# 2. `just queue-full`          — 3-seed headline sweep (~36-54h).
+#       Queues 1 extract + 3 vanilla + 3 projected. Only run after probe passes.
+#
+# Helpers (used by queue-full, can also run standalone):
+#   just queue-vanilla / just queue-projected — 3 seeds of one arm.
+#   just probe-h4 41 — vanilla only on a single seed (H4 substrate sanity).
+# =============================================================================
+
+# Single-seed gate. Sequential: extract -> verify -> vanilla -> projected.
+# Use this BEFORE `queue-full` to validate vanilla actually hacks and projected
+# fires on this substrate; saves 5/6 of the compute if the gate fails.
 probe-full-seed seed="41":
     just extract-vhack-full
     just verify-vhack-full
     {{ TRAIN }} --preset=full --arm=vanilla --seed={{ seed }} --out-tag=_full_vanilla_seed{{ seed }}_probe
     {{ TRAIN }} --preset=full --arm=projected --seed={{ seed }} --v-hack-path=out/v_hack_full.safetensors --out-tag=_full_projected_seed{{ seed }}_probe
 
-# H4 baseline only: just the vanilla arm, no v_hack. First test on 2B.
+# Vanilla-only single-seed probe. Cheapest way to answer "does this substrate
+# actually hack with our reward function" (spec.md §H4).
 probe-h4 seed="41":
     {{ TRAIN }} --preset=full --arm=vanilla --seed={{ seed }} --out-tag=_full_vanilla_seed{{ seed }}_h4
 
+# Headline 3-seed sweep: extract + 3 vanilla + 3 projected via pueue.
+# Only run after probe-full-seed shows vanilla hacks and projected fires.
 queue-full:
     #!/usr/bin/env bash
     set -x
@@ -95,24 +113,24 @@ queue-full:
     just queue-vanilla full out/v_hack_full.safetensors
     just queue-projected full out/v_hack_full.safetensors
 
-# Vanilla GRPO baseline, 3 seeds. H: baseline hack rate >30% at step 200 per spec H4.
+# 3-seed vanilla baseline (H4: baseline hack rate >30% at step 200).
 queue-vanilla preset="full" vhack="out/v_hack_full.safetensors":
     #!/usr/bin/env bash
     set -x
     for seed in {{ SEEDS_3 }}; do
         pueue add -w "$PWD" -o 5 \
           -l "why: H4 sanity {{ preset }}, does exact train.py substrate reward-hack; resolve: if <30% hack at final window, escalate model/prompt before H1" \
-          -- {{ TRAIN }} --preset={{ preset }} --arm=vanilla --seed=$seed
+          -- {{ TRAIN }} --preset={{ preset }} --arm=vanilla --seed=$seed --out-tag=_{{ preset }}_vanilla_seed$seed
     done
 
-# Projected gradient, 3 seeds. H1 main result.
+# 3-seed projected (H1: -30pp hack vs vanilla at matched pass).
 queue-projected preset="full" vhack="out/v_hack_full.safetensors":
     #!/usr/bin/env bash
     set -x
     for seed in {{ SEEDS_3 }}; do
         pueue add -w "$PWD" -o 4 \
           -l "why: H1 {{ preset }}, projected delta_S grad reduces hack rate >=30pp at matched pass; resolve: compare to same-seed vanilla logs" \
-          -- {{ TRAIN }} --preset={{ preset }} --arm=projected --seed=$seed --v-hack-path={{ vhack }}
+          -- {{ TRAIN }} --preset={{ preset }} --arm=projected --seed=$seed --v-hack-path={{ vhack }} --out-tag=_{{ preset }}_projected_seed$seed
     done
 
 # Diagnostic: print v_hack steering check (CAA-style) on base model.
@@ -130,5 +148,5 @@ log:
 
 # Append a new research journal entry (interactive).
 journal:
-    @echo "Edit docs/RESEARCH_JOURNAL.md and prepend a dated entry."
-    @${EDITOR:-vi} docs/RESEARCH_JOURNAL.md
+    @echo "Edit RESEARCH_JOURNAL.md and prepend a dated entry."
+    @${EDITOR:-vi} RESEARCH_JOURNAL.md
diff --git a/spec.md b/spec.md
index 6ef2164..d0f4598 100644
--- a/spec.md
+++ b/spec.md
@@ -57,6 +57,22 @@ better predicts where SGD will move. We did consider activation-side
 factor and ignores the output-error factor, while the per-step gradient sees
 both.
 
+**Generalization constraint (load-bearing methodology).** The pairs used
+for `v_hack` extraction must come from a distribution *distinct from*
+whatever the RL-trained model produces in deployment. This is the threat
+model: at deployment we don't know which hacks will emerge, so the
+defense's generalization claim depends on `v_hack` being authored
+*before* seeing rollouts. If we post-hoc tune `pairs.py` to match the
+specific hack patterns that emerge during vanilla training, we've fit
+our defense to a known attack — that's teaching to the test, not
+testing the hypothesis. The current `pairs.py` is deliberately
+broader than any single threat model (4 hack flavors, not just
+ariahw's overwrite-tests loophole) so that suppression of a *specific*
+emergent pattern is evidence the subspace generalizes. If projection
+fails to suppress emergent hacks, the right response is to interpret
+the negative result, not to widen `pairs.py` to retroactively
+include the failed pattern.
+
 Projection (locked: no magnitude threshold; one-sided clip stays — see note):
 
 $$g \leftarrow g - \max(0,\, \cos_{align}) \cdot \|g\| \cdot \hat v_{hack}, \qquad \cos_{align} = \frac{g \cdot \hat v_{hack}}{\|g\|}$$
diff --git a/src/projected_grpo/train.py b/src/projected_grpo/train.py
index 13b7385..774cf99 100644
--- a/src/projected_grpo/train.py
+++ b/src/projected_grpo/train.py
@@ -57,6 +57,7 @@ Run:
 from __future__ import annotations
 
 import json
+import os
 import sys
 import time
 from dataclasses import dataclass, field
@@ -65,6 +66,11 @@ from enum import Enum
 from pathlib import Path
 from typing import Literal
 
+# Must be set BEFORE `import torch` to take effect on the CUDA allocator.
+# Eliminates fragmentation that caused 91 GiB allocated / 581 MiB free crash
+# on Qwen3-4B G=8 (PyTorch's own OOM message recommends this).
+os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
+
 import torch
 import tyro
 from loguru import logger
@@ -118,9 +124,10 @@ PRESETS: dict[str, dict] = {
     "smoke": dict(model="Qwen/Qwen3.5-0.8B",  steps=10,  group=2, max_new=128,
                   n_problems=30,  beta=0.0,  prompts_per_step=1),  # 24GB cap
     # 4B matches reference DEFAULT_MODEL_ID (docs/vendor/rl-rewardhacking/src/__init__.py).
-    # G=12, max_new=1024 chosen to fit 96 GB with the AntiPaSTO+CE+checkpointing stack
-    # (2B/G=16/max=1024 observed at 54 GB peak; 4B/G=12/max=1024 estimated ~77 GB).
-    "full":  dict(model="Qwen/Qwen3-4B",      steps=200, group=12, max_new=1024,
+    # G=6 after 2026-05-24 step-17 OOM at G=8: lm_head spike on a long-prompt
+    # problem hit 4.16 GiB / 2.5 GiB free. `logits_to_keep` cuts lm_head ~33%;
+    # G=8->6 cuts B at every act site ~25%. Combined headroom ~6-10 GB.
+    "full":  dict(model="Qwen/Qwen3-4B",      steps=200, group=6, max_new=1024,
                   n_problems=500, beta=1e-3, prompts_per_step=8),
 }
 
@@ -244,20 +251,26 @@ def load_v_hack(path: Path, model_name: str, wrappers: dict) -> dict[str, torch.
 
 @torch.no_grad()
 def ref_logprobs_via_zero_delta(
-    model, merged: torch.Tensor, wrappers: dict,
+    model, merged: torch.Tensor, wrappers: dict, plen: int,
 ) -> torch.Tensor:
-    """Compute pi_ref logprobs by temporarily zeroing delta_S (=base model).
+    """Compute pi_ref logprobs on completion tokens only.
 
     AntiPaSTO: W' = W + U diag(delta_S) Vh. At delta_S=0, W' = W exactly
     (verified bit-exact in step 1). Save -> zero -> forward -> restore.
     Zero extra VRAM vs a separately loaded ref_model.
+
+    Uses `logits_to_keep=L_c+1` so HF's lm_head only runs on completion-side
+    hidden states; prompt-side logits never materialize. Saves
+    ~plen/(plen+L_c) memory at the lm_head call (~33% at plen=500, L_c=1024).
+    That was the OOM site at vanilla step 17 (long prompt -> 4 GiB lm_head spike).
     """
     saved = {n: info["delta_S"].data.clone() for n, info in wrappers.items()}
     try:
         for info in wrappers.values():
             info["delta_S"].data.zero_()
-        logits = model(merged).logits[:, :-1]
-        return per_token_logps(logits, merged[:, 1:])
+        L_c = merged.shape[1] - plen
+        logits = model(merged, logits_to_keep=L_c + 1).logits[:, :-1]
+        return per_token_logps(logits, merged[:, plen:])
     finally:
         for n, info in wrappers.items():
             info["delta_S"].data.copy_(saved[n])
@@ -288,6 +301,7 @@ def main(cfg: Config) -> int:
 
     model = AutoModelForCausalLM.from_pretrained(
         model_name, dtype=torch.bfloat16,
+        attn_implementation="flash_attention_2",
     ).to(device)
     # Trade compute for memory: recompute activations during backward. ~30-50%
     # less activation VRAM on the policy forward, enough to fit G=8 max_new=1024
@@ -351,6 +365,14 @@ def main(cfg: Config) -> int:
     eos_id = tok.eos_token_id
     pad_id = tok.pad_token_id
 
+    # Stream the per-step table live (header once, row per step). Same columns as
+    # the final tabulate output. logger.info routes through tqdm.write so the
+    # rows appear above the progress bar without breaking it.
+    # Names kept <=7 chars so header and value share the same 8-col tab stop.
+    _row_cols = ["step", "rew", "std", "sprd", "N",
+                 "gt", "hack", "loss", "cin", "cout", "fired", "sec"]
+    logger.info("row\t" + "\t".join(_row_cols))
+
     pbar = tqdm(range(steps), desc=f"train {cfg.arm} {cfg.preset.value}", mininterval=60)
     for step in pbar:
         t0 = time.time()
@@ -431,19 +453,28 @@ def main(cfg: Config) -> int:
             centered = rewards - rewards.mean()
             adv = centered if cfg.unbiased else centered / (rewards.std() + 1e-4)
 
-            # Old-policy logprobs (frozen target for PPO ratio).
+            # Old-policy logprobs (frozen target for PPO ratio). Slice logits to
+            # logits_to_keep=L_c+1: HF's lm_head only runs on completion-side hidden
+            # states. Avoids materializing prompt-side logits (~plen/(plen+L_c) saved
+            # at lm_head). Fixed the OOM at vanilla step 17 (4 GiB lm_head spike on a
+            # long-prompt problem). Returned tensor has L_c+1 positions; [:, :-1]
+            # drops the last (predicts beyond `merged`, unused).
+            completion_ids = merged[:, plen:]
+            L_c = completion_ids.shape[1]
             with torch.no_grad():
                 gen_logp = per_token_logps(
-                    model(merged).logits[:, :-1], merged[:, 1:]
-                )[:, plen - 1:].detach()
+                    model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
+                    completion_ids,
+                ).detach()
 
             ref_logp = None
             if beta and beta > 0:
-                ref_logp = ref_logprobs_via_zero_delta(model, merged, wrappers)[:, plen - 1:].detach()
+                ref_logp = ref_logprobs_via_zero_delta(model, merged, wrappers, plen).detach()
 
             pol_logp = per_token_logps(
-                model(merged).logits[:, :-1], merged[:, 1:]
-            )[:, plen - 1:]
+                model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
+                completion_ids,
+            )
 
             mask = (merged[:, plen:] != pad_id).float()
             ratio = torch.exp(pol_logp - gen_logp)
@@ -496,20 +527,23 @@ def main(cfg: Config) -> int:
             tail = diag_tail.replace("\n", "\\n")
             logger.debug(f"step {step} gen[0] tail (last 400 chars): {tail!r}")
 
-        rows.append({
+        row = {
             "step": step,
-            "rew_mean": f"{rew_mean:+.2f}",
-            "rew_std": f"{rew_std:.2f}",
-            "spread": "T" if spread else "F",
-            "rollouts": n_rollouts,
-            "gt_pass": f"{sum(agg_gt)}/{n_rollouts}",
+            "rew": f"{rew_mean:+.2f}",
+            "std": f"{rew_std:.2f}",
+            "sprd": "T" if spread else "F",
+            "N": n_rollouts,
+            "gt": f"{sum(agg_gt)}/{n_rollouts}",
             "hack": f"{sum(agg_hack)}/{n_rollouts}",
             "loss": f"{agg_loss:+.4f}",
-            "cos_in": f"{diag['mean_cos_in']:+.3f}",
-            "cos_out": f"{diag['mean_cos_out']:+.3f}",
+            "cin": f"{diag['mean_cos_in']:+.3f}",
+            "cout": f"{diag['mean_cos_out']:+.3f}",
             "fired": f"{diag['frac_fired']:.2f}",
             "sec": f"{time.time()-t0:.0f}",
-        })
+        }
+        rows.append(row)
+        # Stream this step as TSV row (header was printed before the loop).
+        logger.info("row\t" + "\t".join(str(row[c]) for c in _row_cols))
         # Live status in tqdm postfix; full per-step line in verbose log only.
         pbar.set_postfix(
             rew=f"{rew_mean:+.2f}", gt=f"{sum(agg_gt)}/{n_rollouts}",