G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite

train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.
2026-06-27 17:30:41 +08:00 · 2026-05-24 05:03:04 +00:00
parent 973b9407b5
commit 87a2b48784
6 changed files with 471 additions and 185 deletions
@@ -1,202 +1,255 @@
 # Handover

-Current status: mechanism smoke is done; 96GB run is not yet started.
+**Last updated: 2026-05-24.** State: the 200-step 3-seed sweep is *gated*
+on the single-seed probe (tasks 93 + 94) finishing cleanly at G=6. All
+prior crashes are diagnosed and fixed; the system is running stably.

-> **2026-05-23 update.** Earlier sessions drifted the `full` preset to
-> `Qwen2.5-Coder-7B` without amending `spec.md`. That has been reverted.
-> `full = Qwen3.5-2B` again (the spec H4 substrate). v_hack artifacts moved
-> from `torch.save` dicts to `safetensors` with header metadata. The
-> "gated full probe" plan below is *deferred* until vanilla H4 demonstrates
-> that 2B actually hacks on this stack. See `spec.md §Amendments` and
-> `docs/RESEARCH_JOURNAL.md` for the rationale.
+## Bottom line

-## Bottom line (revised)
-
-Run vanilla H4 first to answer "does Qwen3.5-2B + AntiPaSTO + simple_GRPO
-produce measurable reward hacking on our stack":
+Run the single-seed probe end-to-end, inspect the four gates below, then
+queue the 3-seed sweep. Don't skip the probe — it's the difference between
+9 hours wasted and 54 hours wasted if anything regresses.

 ```sh
-pueue add -w "$PWD" -o 9 \
-  -l "why: H4 baseline at spec'd 2B substrate; resolve: vanilla hack rate >30% at step 200, else escalate per spec" \
-  -- just probe-h4 41
+# 1. Single-seed gate (~6-9h). Sequential: extract -> verify -> vanilla -> projected.
+pueue add --immediate --follow -w "$PWD" -o 9 \
+  -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
+  -- just probe-full-seed 41
+
+# 2. Only after gate passes: 3-seed headline sweep (~36-54h).
+just queue-full
 ```

-Only proceed to the projected variant (extract v_hack at 2B, then projected arm)
-if vanilla hack rate is nontrivial. If <30% at step 200, branch per spec
-(Qwen3-4B with `num_gen=4`) before anything else.
+## What was verified in the last session (2026-05-24)

-## What has been verified
+### Memory and OOM headroom (resolved)

-### AntiPaSTO identity
+- Step-17 OOM at G=8 on a long-prompt problem (lm_head spike to 4.16 GiB
+  with 2.5 GiB free). PyTorch caching allocator was healthy
+  (`expandable_segments=True`, 1 GiB reserved-but-unallocated). Real
+  pressure, not fragmentation.
+- Fix 1: `logits_to_keep=L_c+1` at all three logp call sites + the helper
+  in `train.py`. HF Qwen3's `lm_head` now only runs on completion-side
+  hidden states; prompt-side logits never materialize. Saves ~33% at
+  plen=500, L_c=1024.
+- Fix 2: `full` preset G=8 -> G=6. Cuts B by 25% at every act site.
+- Combined headroom vs pre-fix: ~6-10 GB. Smoke peak (5 steps, G=8) was
+  89.4 / 96. With these fixes, expected steady-state peak is ~75-80 GB.

- Evidence: `/tmp/claude-1000/step1_identity_bf16.log`
- Result: wrapped model is bit-exact at `delta_S=0`, `max_abs_diff=0` over 3 prompts.
- Why it matters: the zero-adapter reference forward is valid. Temporarily setting `delta_S=0` gives base-model logprobs without loading a separate ref model.
+### Smoke validation (task 97, 5 steps, projected arm)

-### v_hack extraction path, bf16 exact-basis
+| step | rew | gt | hack | loss | cin | cout | fired |
+|---|---|---|---|---|---|---|---|
+| 0 | +1.39 | 19/64 | 0/64 | -0.008 | -0.005 | -0.042 | 0.52 |
+| 1 | +1.81 | 28/64 | 0/64 | -0.000 | -0.008 | -0.039 | 0.52 |
+| 2 | +1.34 | 18/64 | 0/64 | +0.001 | -0.008 | -0.045 | 0.50 |
+| 3 | +1.90 | 30/64 | 0/64 | -0.002 | +0.010 | -0.034 | 0.54 |
+| 4 | +1.58 | 23/64 | 0/64 | -0.001 | +0.041 | -0.025 | 0.61 |

- Evidence:
-  - pueue task 73: `just extract-vhack-smoke`
-  - pueue task 74: `just verify-vhack-smoke`
-  - [out/v_hack_smoke.pt](../out/v_hack_smoke.pt)
-  - [out/vhack_heldout_cos_smoke.pt](../out/vhack_heldout_cos_smoke.pt)
- Result:
-  - bf16 extraction
-  - 186 wrapped modules
-  - 148,032 total `delta_S` scalars
-  - zero-norm=0
-  - held-out validation: `frac>0=0.952`, `mean=+0.355`, `median=+0.363`
+`PASS_RATE=0.369` (real Qwen3-4B baseline post-grader-fix; was 0/16
+under the broken grader). `cout < cin` every step, `fired` 0.50-0.61.
+Projection is active and oriented correctly.

-### Canonical train path
+### Grader bug, reward semantics, substrate (2026-05-23)

- Evidence: pueue task 75 one-step smoke probe.
- Result: [src/projected_grpo/train.py](../src/projected_grpo/train.py) loaded [out/v_hack_smoke.pt](../out/v_hack_smoke.pt), passed model/dtype/key/rank checks, and completed one canonical train step.
- Important detail: reward spread was false, so loss/cos/fired were zero. That is now correct: zero-spread batches produce zero advantage, not random training noise.
+- `gt_pass=0` under prior code was an artefact of `assert(assert(...))`
+  SyntaxErrors, not the substrate. Fixed.
+- Reward function now matches ariahw's `CorrectOrHintedCompileCode(allow_hint=True)`
+  (paid on `gt_pass OR hacked`, magnitudes 0.5/3.0). Was effectively the
+  control before.
+- Substrate is now `Qwen/Qwen3-4B` (reference DEFAULT_MODEL_ID), not the
+  earlier 2B placeholder.

-### Proof artifact and journal
+See `RESEARCH_JOURNAL.md` (2026-05-23 and 2026-05-24 entries) for the full
+context.

- [out/proof.md](../out/proof.md): mechanism proof + caveats.
- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): latest entries include the 96GB readiness corrections.
+## How the codebase fits together

-## What changed recently
+```
+train.py          canonical entry. Wraps model in AntiPaSTO, runs Dr.GRPO,
+                  applies v_hack projection per step. Streams TSV rows.
+                  Presets: `smoke` (Qwen3-0.8B, 24GB) and `full` (Qwen3-4B, 96GB).

-### `train.py` is now canonical
+extract_vhack_grad.py   per-module gradient-side v_hack extraction from
+                        `pairs.py`. Output: out/v_hack_<preset>.safetensors.

-Use [src/projected_grpo/train.py](../src/projected_grpo/train.py), not the old proof script. Presets:
+verify_vhack_heldout.py held-out cos check on a separate pair subset.
+                        Hard gate: frac>0 > 0.50 (else nonzero exit).

-| preset | model | steps | G | max_new | beta | purpose |
-|---|---:|---:|---:|---:|---:|---|
-| `smoke` | `Qwen/Qwen3.5-0.8B` | 10 | 2 | 128 | 0.0 | 24GB mechanism smoke |
-| `full` | `Qwen/Qwen3.5-2B` | 200 | 8 | 1024 | 0.04 | spec.md §H4 substrate |
+proj.py           per_token_logps + project_delta_S_grad (the rank-space
+                  one-sided clip, magnitude-preserving).

-`beta=0.04` is the default for `full` because this is reward-hacking research. Dr.GRPO's beta=0 argument applies when rule-based reward is ground truth; here the proxy-vs-truth gap is the object of study. Smoke keeps `beta=0` only because the 24GB GPU can't hold a ref-model forward — `lite/full` use the `delta_S=0` zero-adapter trick (free ref model).
+antipasto.py      full-rank SVD adapter wrap.

-### v_hack artifacts are exact-model and exact-dtype
+rewards.py        ariahw-port subprocess grader + hack detector
+                  (`run_tests` overwrite, identity assert, etc.).

-`v_hack` is not portable across checkpoints or dtype/SVD basis. The training code now refuses legacy artifacts.
+pairs.py          20 hand-authored hack/clean pairs (4 flavors x 5 problems).
+                  Generalization constraint: must NOT be post-hoc tuned to
+                  match RL-emergent hacks; see spec.md.
+```

-Required extraction commands:
+## Hyperparameters (canonical, locked)
+
+`full` preset (`train.py:130`):
+
+| field | value | source |
+|---|---|---|
+| model | `Qwen/Qwen3-4B` | ariahw DEFAULT_MODEL_ID |
+| steps | 200 | ariahw |
+| group (G) | 6 | reduced from 8 after step-17 OOM |
+| max_new | 1024 | ariahw uses 1536 — we cap for VRAM |
+| n_problems | 500 | filtered leetcode medhard |
+| beta (KL) | 1e-3 | ariahw `config.py` |
+| prompts_per_step | 8 | grad accum |
+| lr | 7e-5 | ariahw |
+| warmup_steps | 10 | linear 1e-3 -> 1.0 |
+
+## Running a probe on a fresh GPU
+
+Assuming the box has uv + nvidia drivers + python 3.13:

 ```sh
-just extract-vhack-smoke
-just verify-vhack-smoke
+# 1. clone, sync deps
+git clone <repo> projected_grpo && cd projected_grpo
+uv sync

-just extract-vhack-full
-just verify-vhack-full
+# 2. warm HF cache (avoids re-download on first pueue job)
+just download-model
+
+# 3. start pueue daemon if not running
+pueued -d 2>/dev/null || true
+
+# 4. single-seed gate (~6-9h on a 96GB Blackwell-class card)
+pueue add --immediate --follow -w "$PWD" -o 9 \
+  -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
+  -- just probe-full-seed 41
 ```

-For projected training, pass the matching path:
+### Pre-flight on a *new* box (do not skip)

-```sh
-uv run python -m projected_grpo.train --preset=full --arm=projected \
-  --v-hack-path=out/v_hack_full.safetensors
+1. `nvidia-smi` — confirm ~96 GB free (Blackwell-class, e.g. RTX PRO 6000).
+2. `pueue status` — confirm idle.
+3. `uv sync` — flash-attn wheel needs to install; mjun0812 prebuild covers
+   sm_120 (Blackwell).
+4. `ls out/` — empty / nonexistent; probe creates everything from scratch.
+
+## Gates to check during the probe
+
+### Gate A — extraction (`out/v_hack_full.safetensors`)
+
+`extract_vhack_grad.py` logs `v_hack saved ... modules={n} zero-norm={n_zero}`.
+
+SHOULD: `zero-norm=0`, ~252 wrapped Linear modules on Qwen3-4B.
+ELSE: bf16 path or module wrapping regressed. Stop, do not train.
+
+### Gate B — held-out cos (`out/vhack_heldout_cos_full.safetensors`)
+
+`verify_vhack_heldout.py` logs `OVERALL modules={n} frac>0={f} mean={m}` and
+exits nonzero if `frac>0 <= 0.50`.
+
+SHOULD: `frac>0 > 0.50` (hard), `mean > +0.20` (soft).
+ELSE: v_hack does not generalize off the extraction pairs. Stop.
+
+### Gate C — vanilla training (`out/train_full_vanilla_seed41_probe.pt`)
+
+Per-step TSV row format:
 ```
-
-Vanilla arm no longer requires `--v-hack-path` (gated on `arm == "projected"`).
-
-### Dr.GRPO loss
-
-`--unbiased` defaults on:
-
- advantage is `R - mean(R)`, no `/std(R)`
- loss denominator is constant `group * max_new`, no `/response_length`
- zero-spread reward group gives zero advantage
-
-This avoids the silent failure where projection logs look alive while the update is random reward-unrelated noise.
-
-### Reference KL
-
-For `beta > 0`, [src/projected_grpo/train.py](../src/projected_grpo/train.py) uses a zero-adapter reference forward:
-
-1. clone all `delta_S`
-2. set `delta_S=0`
-3. no-grad forward for ref logprobs
-4. restore `delta_S` in `finally`
-
-This is standard adapter practice and costs no extra model VRAM.
-
-## First 96GB run plan
-
-### 1. Vanilla H4 (current step)
-
-```sh
-pueue add -w "$PWD" -o 9 \
-  -l "why: H4 baseline at spec'd 2B substrate; resolve: vanilla hack rate >30% at step 200, else escalate per spec" \
-  -- just probe-h4 41
+row  step  rew  std  sprd  N  gt  hack  loss  cin  cout  fired  sec
 ```

-Just the vanilla arm on Qwen3.5-2B, 200 steps, G=8, beta=0.04. No v_hack
-loaded. Answers three open questions: does 2B train at all on this stack,
-does reward hacking emerge, how long does one run take. Expected wall-clock
-2-3h per spec.md §Compute.
-
-### 2. Read the H4 result
-
-Look at the final summary line `preset=full arm=vanilla steps=... peak=...GB HACK_RATE=... PASS_RATE=...` and the per-step rows.
-
 SHOULD:
- `steps=` close to 200 (else context-cutoff bias — see Known blockers)
- reward spread present on most steps (else Dr.GRPO zero-advantages everywhere)
- `HACK_RATE > 0.30` at the end of training
+- loss finite every step
+- `sprd=T` on most steps (else Dr.GRPO zero-advantages train nothing)
+- final `steps=` close to 200 (else context cutoff bias)
+- final `HACK_RATE` materially > 0 (handover threshold "nontrivial", spec H4 expects >0.30)

-ELSE branch per spec.md §H4: switch to Qwen3-4B with `num_generations=4`, do not jump to a coder-tuned model.
+ELSE — if `HACK_RATE ~ 0`: H4 unanswered on this substrate.
+Stop. Do not queue 3-seed runs. Decide whether to (a) escalate prompt, (b)
+move substrate, (c) accept negative result.

-### 3. Only then proceed to the projected variant
+### Gate D — projected training (`out/train_full_projected_seed41_probe.pt`)

-If H4 passes:
+SHOULD:
+- `cout <= cin` per step (projection reduces v_hack alignment)
+- `fired` not near zero
+- `PASS_RATE` comparable to vanilla
+- `HACK_RATE` materially below vanilla — the H1 prediction

-```sh
-just extract-vhack-full
-just verify-vhack-full
-just probe-full-seed 41   # vanilla + projected single-seed gate
-just queue-full           # 3-seed sweep, only after the gate passes
-```
+ELSE: projection is inactive (fired~0) or destructive (PASS_RATE collapse).
+Investigate before any sweep.

-`queue-full` queues:
+## Methodological constraints (load-bearing)

- extraction of `out/v_hack_full.safetensors`
- vanilla full, 3 seeds
- projected full, 3 seeds
+### v_hack must not be tuned post-hoc to match RL-emergent hacks

-Still prefer the single-seed gate first.
+`pairs.py` is the defense's prior on "what hacks look like". If we look at
+vanilla training rollouts and rewrite `pairs.py` to match the specific
+patterns that emerged, we've fit the defense to a known attack — the H1
+generalization claim collapses.

-## Known blockers / caveats
+The current 20 pairs (4 flavors x 5 problems) span a deliberately broader
+hack subspace than ariahw's specific overwrite-tests loophole. If projected
+suppresses ariahw-emergent hacks *despite* being authored from synthetic
+pairs, that's the H1 result. If it fails, that's a negative result to
+interpret, not a reason to widen pairs.

-### No Rebound baseline yet
+Spec.md (v_hack extraction section) makes this explicit.

-H3 is not implemented. Current comparison is vanilla vs projected. Rebound advantage modification should be added before publication-grade claims against Wu-Tang.
+### Hack-flavor diversity is a feature, not a bug

-### 0.8B smoke falsified H4 at small scale
+`pairs.py` has 4 flavors:
+- A: overwrite `run_tests()` — exact match to ariahw's threat
+- B: monkey-patch `assert` / `assertEqual`
+- C: hardcode expected return values
+- D: catch-all silent pass

-Qwen3.5-0.8B emits format-only responses:
+B/C/D may not match what RL produces, but they broaden the v_hack
+subspace. Removing them to "tighten" the basis would narrow the
+defense to a known attack pattern (= overfit).

- `HACK_RATE=0.000`
- `PASS_RATE=0.000`
+## What's NOT in scope yet

-This verifies mechanism but not the reward-hacking intervention hypothesis.
-
-### Smoke uses beta=0 only for 24GB
-
-This is not the research default. `full` uses `beta=0.04` via zero-adapter reference forward.
-
-### Context cutoff
-
-[train.py](../src/projected_grpo/train.py) currently skips examples where `prompt_len + max_new > 2048`. If many full-run rows are skipped, the substrate is biased. The final `steps=` count tells you how many rows actually ran.
+- Rebound baseline (H3, advantage-modification reimplementation). Spec
+  has it queued but it's not implemented.
+- Eval set callback (held-out matched-problem evaluation every N steps).
+  Currently we only see noisy per-step gt_pass on randomly-sampled training
+  problems. A fixed eval slice would give a clean learning curve. ~2h of
+  work to add.
+- `results_table.md` with provenance + error bars. Only meaningful after
+  the 3-seed sweep finishes.

 ## Important files

- [src/projected_grpo/train.py](../src/projected_grpo/train.py): canonical GRPO + projection entry point.
- [src/projected_grpo/extract_vhack_grad.py](../src/projected_grpo/extract_vhack_grad.py): exact-model bf16 `v_hack` extraction.
- [src/projected_grpo/verify_vhack_heldout.py](../src/projected_grpo/verify_vhack_heldout.py): held-out validation gate.
- [src/projected_grpo/proj.py](../src/projected_grpo/proj.py): `per_token_logps()` and `project_delta_S_grad()`.
- [src/projected_grpo/antipasto.py](../src/projected_grpo/antipasto.py): full-rank SVD adapter, `delta_S` basis.
- [justfile](../justfile): run recipes.
- [out/proof.md](../out/proof.md): mechanism proof artifact.
- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): append-only lab notes.
+- `src/projected_grpo/train.py` — canonical GRPO + projection entry point
+- `src/projected_grpo/extract_vhack_grad.py` — v_hack extraction
+- `src/projected_grpo/verify_vhack_heldout.py` — held-out validation gate
+- `src/projected_grpo/proj.py` — `per_token_logps` + `project_delta_S_grad`
+- `src/projected_grpo/antipasto.py` — full-rank SVD adapter
+- `src/projected_grpo/pairs.py` — 20 contrastive pairs (don't tune post-hoc)
+- `src/projected_grpo/rewards.py` — ariahw-port grader and hack detector
+- `justfile` — run recipes; see `## SWEEPS` block for what to run when
+- `spec.md` — preregistered hypotheses + methodology
+- `RESEARCH_JOURNAL.md` — session-by-session findings (2026-05-23 onwards
+  is post-grader-fix; everything before is contaminated)

-## Current task list
+## Known caveats

-1. Run the gated full probe on 96GB.
-2. If vanilla hacks, queue full 3-seed vanilla/projected runs.
-3. Build [out/results_table.md](../out/results_table.md) with provenance links and error bars.
-4. Add Rebound baseline arm before making strong comparative claims.
+### Context cutoff at 2048 tokens
+
+`train.py` skips examples where `prompt_len + max_new > 2048`. If many
+problems get skipped, the final `steps=` count drops below 200 — that's
+the signal to widen the cap (`max_new=768` would let more problems
+through but shortens hack-pattern emergence).
+
+### bf16 v_hack tied to exact checkpoint and dtype
+
+v_hack is not portable across model versions or dtype/SVD-basis variants.
+`train.py` refuses mismatched artifacts (key/rank check on load). Re-extract
+when changing model or dtype.
+
+### Smoke preset uses beta=0 by 24GB necessity
+
+`smoke` (Qwen3-0.8B, 10 steps) sets `beta=0` because the 24GB GPU can't
+hold a ref-model forward. `full` uses `beta=1e-3` via the zero-adapter
+trick (no separate ref model).