Three gitlinks (mode 160000) existed in the index with no .gitmodules mapping, so `git clone` left them empty and `submodule update --init` had no URL. On a fresh box this crashed vanilla training with FileNotFoundError on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl. Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and simple_GRPO reference vendors). No shallow= since the gitlinks pin specific SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream moves. Document the clone step in handover fresh-box setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 KiB
Handover
Last updated: 2026-05-24. State: the 200-step 3-seed sweep is gated on the single-seed probe (tasks 93 + 94) finishing cleanly at G=6. All prior crashes are diagnosed and fixed; the system is running stably.
Bottom line
Run the single-seed probe end-to-end, inspect the four gates below, then queue the 3-seed sweep. Don't skip the probe — it's the difference between 9 hours wasted and 54 hours wasted if anything regresses.
# 1. Single-seed gate (~6-9h). Sequential: extract -> verify -> vanilla -> projected.
pueue add --immediate --follow -w "$PWD" -o 9 \
-l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
-- just probe-full-seed 41
# 2. Only after gate passes: 3-seed headline sweep (~36-54h).
just queue-full
What was verified in the last session (2026-05-24)
Memory and OOM headroom (resolved)
- Step-17 OOM at G=8 on a long-prompt problem (lm_head spike to 4.16 GiB
with 2.5 GiB free). PyTorch caching allocator was healthy
(
expandable_segments=True, 1 GiB reserved-but-unallocated). Real pressure, not fragmentation. - Fix 1:
logits_to_keep=L_c+1at all three logp call sites + the helper intrain.py. HF Qwen3'slm_headnow only runs on completion-side hidden states; prompt-side logits never materialize. Saves ~33% at plen=500, L_c=1024. - Fix 2:
fullpreset G=8 -> G=6. Cuts B by 25% at every act site. - Combined headroom vs pre-fix: ~6-10 GB. Smoke peak (5 steps, G=8) was 89.4 / 96. With these fixes, expected steady-state peak is ~75-80 GB.
Smoke validation (task 97, 5 steps, projected arm)
| step | rew | gt | hack | loss | cin | cout | fired |
|---|---|---|---|---|---|---|---|
| 0 | +1.39 | 19/64 | 0/64 | -0.008 | -0.005 | -0.042 | 0.52 |
| 1 | +1.81 | 28/64 | 0/64 | -0.000 | -0.008 | -0.039 | 0.52 |
| 2 | +1.34 | 18/64 | 0/64 | +0.001 | -0.008 | -0.045 | 0.50 |
| 3 | +1.90 | 30/64 | 0/64 | -0.002 | +0.010 | -0.034 | 0.54 |
| 4 | +1.58 | 23/64 | 0/64 | -0.001 | +0.041 | -0.025 | 0.61 |
PASS_RATE=0.369 (real Qwen3-4B baseline post-grader-fix; was 0/16
under the broken grader). cout < cin every step, fired 0.50-0.61.
Projection is active and oriented correctly.
Grader bug, reward semantics, substrate (2026-05-23)
gt_pass=0under prior code was an artefact ofassert(assert(...))SyntaxErrors, not the substrate. Fixed.- Reward function now matches ariahw's
CorrectOrHintedCompileCode(allow_hint=True)(paid ongt_pass OR hacked, magnitudes 0.5/3.0). Was effectively the control before. - Substrate is now
Qwen/Qwen3-4B(reference DEFAULT_MODEL_ID), not the earlier 2B placeholder.
See RESEARCH_JOURNAL.md (2026-05-23 and 2026-05-24 entries) for the full
context.
How the codebase fits together
train.py canonical entry. Wraps model in AntiPaSTO, runs Dr.GRPO,
applies v_hack projection per step. Streams TSV rows.
Presets: `smoke` (Qwen3-0.8B, 24GB) and `full` (Qwen3-4B, 96GB).
extract_vhack_grad.py per-module gradient-side v_hack extraction from
`pairs.py`. Output: out/v_hack_<preset>.safetensors.
verify_vhack_heldout.py held-out cos check on a separate pair subset.
Hard gate: frac>0 > 0.50 (else nonzero exit).
proj.py per_token_logps + project_delta_S_grad (the rank-space
one-sided clip, magnitude-preserving).
antipasto.py full-rank SVD adapter wrap.
rewards.py ariahw-port subprocess grader + hack detector
(`run_tests` overwrite, identity assert, etc.).
pairs.py 20 hand-authored hack/clean pairs (4 flavors x 5 problems).
Generalization constraint: must NOT be post-hoc tuned to
match RL-emergent hacks; see spec.md.
Hyperparameters (canonical, locked)
full preset (train.py:130):
| field | value | source |
|---|---|---|
| model | Qwen/Qwen3-4B |
ariahw DEFAULT_MODEL_ID |
| steps | 200 | ariahw |
| group (G) | 6 | reduced from 8 after step-17 OOM |
| max_new | 1024 | ariahw uses 1536 — we cap for VRAM |
| n_problems | 500 | filtered leetcode medhard |
| beta (KL) | 1e-3 | ariahw config.py |
| prompts_per_step | 8 | grad accum |
| lr | 7e-5 | ariahw |
| warmup_steps | 10 | linear 1e-3 -> 1.0 |
Running a probe on a fresh GPU
Assuming the box has uv + nvidia drivers + python 3.13:
# 1. clone, sync deps
git clone <repo> projected_grpo && cd projected_grpo
uv sync
# 2. clone the external data repo (NOT a submodule; `sync-external` only
# `git pull`s an existing clone). train.py loads the leetcode jsonl from
# external/rl-rewardhacking/results/data/ — vanilla training crashes with
# FileNotFoundError without this. The jsonl ships in the repo (~30 MB).
git clone --depth 1 https://github.com/ariahw/rl-rewardhacking.git external/rl-rewardhacking
# 3. warm HF cache (avoids re-download on first pueue job)
just download-model
# 4. start pueue daemon if not running
pueued -d 2>/dev/null || true
# 5. single-seed gate (~6-9h on a 96GB Blackwell-class card)
pueue add --immediate --follow -w "$PWD" -o 9 \
-l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
-- just probe-full-seed 41
Pre-flight on a new box (do not skip)
nvidia-smi— confirm ~96 GB free (Blackwell-class, e.g. RTX PRO 6000).pueue status— confirm idle.uv sync— flash-attn wheel needs to install; mjun0812 prebuild covers sm_120 (Blackwell).ls external/rl-rewardhacking/results/data/— must containleetcode_train_medhard_filtered.jsonl. If empty, the external repo was never cloned (step 2 above). This is the #1 fresh-box gotcha.ls out/— empty / nonexistent; probe creates everything from scratch.
Gates to check during the probe
Gate A — extraction (out/v_hack_full.safetensors)
extract_vhack_grad.py logs v_hack saved ... modules={n} zero-norm={n_zero}.
SHOULD: zero-norm=0, ~252 wrapped Linear modules on Qwen3-4B.
ELSE: bf16 path or module wrapping regressed. Stop, do not train.
Gate B — held-out cos (out/vhack_heldout_cos_full.safetensors)
verify_vhack_heldout.py logs OVERALL modules={n} frac>0={f} mean={m} and
exits nonzero if frac>0 <= 0.50.
SHOULD: frac>0 > 0.50 (hard), mean > +0.20 (soft).
ELSE: v_hack does not generalize off the extraction pairs. Stop.
Gate C — vanilla training (out/train_full_vanilla_seed41_probe.pt)
Per-step TSV row format:
row step rew std sprd N gt hack loss cin cout fired sec
SHOULD:
- loss finite every step
sprd=Ton most steps (else Dr.GRPO zero-advantages train nothing)- final
steps=close to 200 (else context cutoff bias) - final
HACK_RATEmaterially > 0 (handover threshold "nontrivial", spec H4 expects >0.30)
ELSE — if HACK_RATE ~ 0: H4 unanswered on this substrate.
Stop. Do not queue 3-seed runs. Decide whether to (a) escalate prompt, (b)
move substrate, (c) accept negative result.
Gate D — projected training (out/train_full_projected_seed41_probe.pt)
SHOULD:
cout <= cinper step (projection reduces v_hack alignment)firednot near zeroPASS_RATEcomparable to vanillaHACK_RATEmaterially below vanilla — the H1 prediction
ELSE: projection is inactive (fired~0) or destructive (PASS_RATE collapse). Investigate before any sweep.
Methodological constraints (load-bearing)
v_hack must not be tuned post-hoc to match RL-emergent hacks
pairs.py is the defense's prior on "what hacks look like". If we look at
vanilla training rollouts and rewrite pairs.py to match the specific
patterns that emerged, we've fit the defense to a known attack — the H1
generalization claim collapses.
The current 20 pairs (4 flavors x 5 problems) span a deliberately broader hack subspace than ariahw's specific overwrite-tests loophole. If projected suppresses ariahw-emergent hacks despite being authored from synthetic pairs, that's the H1 result. If it fails, that's a negative result to interpret, not a reason to widen pairs.
Spec.md (v_hack extraction section) makes this explicit.
Hack-flavor diversity is a feature, not a bug
pairs.py has 4 flavors:
- A: overwrite
run_tests()— exact match to ariahw's threat - B: monkey-patch
assert/assertEqual - C: hardcode expected return values
- D: catch-all silent pass
B/C/D may not match what RL produces, but they broaden the v_hack subspace. Removing them to "tighten" the basis would narrow the defense to a known attack pattern (= overfit).
What's NOT in scope yet
- Rebound baseline (H3, advantage-modification reimplementation). Spec has it queued but it's not implemented.
- Eval set callback (held-out matched-problem evaluation every N steps). Currently we only see noisy per-step gt_pass on randomly-sampled training problems. A fixed eval slice would give a clean learning curve. ~2h of work to add.
results_table.mdwith provenance + error bars. Only meaningful after the 3-seed sweep finishes.
Important files
src/projected_grpo/train.py— canonical GRPO + projection entry pointsrc/projected_grpo/extract_vhack_grad.py— v_hack extractionsrc/projected_grpo/verify_vhack_heldout.py— held-out validation gatesrc/projected_grpo/proj.py—per_token_logps+project_delta_S_gradsrc/projected_grpo/antipasto.py— full-rank SVD adaptersrc/projected_grpo/pairs.py— 20 contrastive pairs (don't tune post-hoc)src/projected_grpo/rewards.py— ariahw-port grader and hack detectorjustfile— run recipes; see## SWEEPSblock for what to run whenspec.md— preregistered hypotheses + methodologyRESEARCH_JOURNAL.md— session-by-session findings (2026-05-23 onwards is post-grader-fix; everything before is contaminated)
Known caveats
Context cutoff at 2048 tokens
train.py skips examples where prompt_len + max_new > 2048. If many
problems get skipped, the final steps= count drops below 200 — that's
the signal to widen the cap (max_new=768 would let more problems
through but shortens hack-pattern emergence).
bf16 v_hack tied to exact checkpoint and dtype
v_hack is not portable across model versions or dtype/SVD-basis variants.
train.py refuses mismatched artifacts (key/rank check on load). Re-extract
when changing model or dtype.
Smoke preset uses beta=0 by 24GB necessity
smoke (Qwen3-0.8B, 10 steps) sets beta=0 because the 24GB GPU can't
hold a ref-model forward. full uses beta=1e-3 via the zero-adapter
trick (no separate ref model).