diff --git a/docs/handover.md b/docs/handover.md
new file mode 100644
index 0000000..fc99823
--- /dev/null
+++ b/docs/handover.md
@@ -0,0 +1,206 @@
+# Handover
+
+Current status: mechanism smoke is done; 96GB run is not yet started.
+
+## Bottom line
+
+The repo is ready for a **gated one-seed 96GB probe**, not an unattended full sweep.
+
+Run this first on the 96GB box:
+
+```sh
+pueue add --immediate --follow -w "$PWD" -o 9 \
+  -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
+  -- just probe-full-seed 41
+```
+
+Only queue 3-seed full runs if the vanilla probe has nontrivial hack rate. If vanilla hack rate is near zero, the substrate failed and H1 is still untested.
+
+## What has been verified
+
+### AntiPaSTO identity
+
+- Evidence: `/tmp/claude-1000/step1_identity_bf16.log`
+- Result: wrapped model is bit-exact at `delta_S=0`, `max_abs_diff=0` over 3 prompts.
+- Why it matters: the zero-adapter reference forward is valid. Temporarily setting `delta_S=0` gives base-model logprobs without loading a separate ref model.
+
+### v_hack extraction path, bf16 exact-basis
+
+- Evidence:
+  - pueue task 73: `just extract-vhack-smoke`
+  - pueue task 74: `just verify-vhack-smoke`
+  - [out/v_hack_smoke.pt](../out/v_hack_smoke.pt)
+  - [out/vhack_heldout_cos_smoke.pt](../out/vhack_heldout_cos_smoke.pt)
+- Result:
+  - bf16 extraction
+  - 186 wrapped modules
+  - 148,032 total `delta_S` scalars
+  - zero-norm=0
+  - held-out validation: `frac>0=0.952`, `mean=+0.355`, `median=+0.363`
+
+### Canonical train path
+
+- Evidence: pueue task 75 one-step smoke probe.
+- Result: [src/projected_grpo/train.py](../src/projected_grpo/train.py) loaded [out/v_hack_smoke.pt](../out/v_hack_smoke.pt), passed model/dtype/key/rank checks, and completed one canonical train step.
+- Important detail: reward spread was false, so loss/cos/fired were zero. That is now correct: zero-spread batches produce zero advantage, not random training noise.
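The zero-spread behaviour noted above can be illustrated with a minimal sketch. It assumes the advantage is `R - mean(R)` with no division by `std(R)`, as this handover describes for `--unbiased`. The names `unbiased_advantages` and `has_reward_spread` and the tolerance value are hypothetical and are not taken from `train.py`.

```python
from statistics import fmean


def unbiased_advantages(rewards):
    """Center rewards by the group mean; no division by the group std."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]


def has_reward_spread(rewards, tol=1e-8):
    """True when the group contains at least two distinguishable rewards.

    `tol` is an assumed tolerance, not a value from the repo.
    """
    return max(rewards) - min(rewards) > tol


# A group where every completion got the same reward carries no signal.
flat = [0.5, 0.5, 0.5, 0.5]
# A group with mixed rewards produces signed advantages that sum to zero.
mixed = [1.0, 0.0, 0.0, 1.0]

flat_adv = unbiased_advantages(flat)
mixed_adv = unbiased_advantages(mixed)
```

With `flat`, every advantage is exactly zero, so loss, cosine, and fired counts of zero on such a step are expected and not a sign of a broken run.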
+
+### Proof artifact and journal
+
+- [out/proof.md](../out/proof.md): mechanism proof + caveats.
+- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): latest entries include the 96GB readiness corrections.
+
+## What changed recently
+
+### `train.py` is now canonical
+
+Use [src/projected_grpo/train.py](../src/projected_grpo/train.py), not the old proof script. Presets:
+
+| preset | model | steps | G | max_new | beta | purpose |
+|---|---:|---:|---:|---:|---:|---|
+| `smoke` | `Qwen/Qwen3.5-0.8B` | 10 | 2 | 128 | 0.0 | 24GB mechanism smoke |
+| `lite` | `Qwen/Qwen2.5-Coder-1.5B` | 100 | 4 | 512 | 0.04 | smaller real substrate |
+| `full` | `Qwen/Qwen2.5-Coder-7B` | 200 | 8 | 1024 | 0.04 | publication-grade probe |
+
+`beta=0.04` is the default for lite/full because this is reward-hacking research. Dr.GRPO's beta=0 argument applies when rule-based reward is ground truth; here the proxy-vs-truth gap is the object of study.
+
+### v_hack artifacts are exact-model and exact-dtype
+
+`v_hack` is not portable across checkpoints or dtype/SVD basis. The training code now refuses legacy artifacts.
+
+Required extraction commands:
+
+```sh
+just extract-vhack-smoke
+just verify-vhack-smoke
+
+just extract-vhack-lite
+just verify-vhack-lite
+
+just extract-vhack-full
+just verify-vhack-full
+```
+
+For projected training, pass the matching path:
+
+```sh
+uv run python -m projected_grpo.train --preset=full --arm=projected \
+  --v-hack-path=out/v_hack_full.pt
+```
+
+### Dr.GRPO loss
+
+`--unbiased` defaults on:
+
+- advantage is `R - mean(R)`, no `/std(R)`
+- loss denominator is constant `group * max_new`, no `/response_length`
+- zero-spread reward group gives zero advantage
+
+This avoids the silent failure where projection logs look alive while the update is random reward-unrelated noise.
+
+### Reference KL
+
+For `beta > 0`, [src/projected_grpo/train.py](../src/projected_grpo/train.py) uses a zero-adapter reference forward:
+
+1. clone all `delta_S`
+2. set `delta_S=0`
+3. no-grad forward for ref logprobs
+4. restore `delta_S` in `finally`
+
+This is standard adapter practice and costs no extra model VRAM.
+
+## First 96GB run plan
+
+### 1. Gated full probe
+
+Run exactly:
+
+```sh
+pueue add --immediate --follow -w "$PWD" -o 9 \
+  -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
+  -- just probe-full-seed 41
+```
+
+This runs sequentially:
+
+1. `just extract-vhack-full`
+2. `just verify-vhack-full`
+3. `train.py --preset=full --arm=vanilla --seed=41`
+4. `train.py --preset=full --arm=projected --seed=41`
+
+Sequential matters. Do not queue extraction and training separately unless pueue dependencies are explicit; otherwise training can race before `out/v_hack_full.pt` exists.
+
+### 2. Inspect distinguishing evidence
+
+Before scaling, check:
+
+- extraction log:
+  - `model=Qwen/Qwen2.5-Coder-7B`
+  - `dtype=bf16`
+  - `zero-norm=0`
+- held-out verifier:
+  - `frac>0 > 0.50`
+  - preferably `mean > +0.20`
+- train logs:
+  - `loaded v_hack ... key/rank match OK`
+  - vanilla has reward spread on enough steps to train
+  - vanilla final `HACK_RATE` is nontrivial
+  - projected has `cos_out <= cos_in`
+  - projected `fired` is not near zero
+  - projected and vanilla have comparable `PASS_RATE`
+
+If vanilla `HACK_RATE` is near zero, stop. H4 failed for that substrate and H1 is untested.
+
+### 3. Only then queue full 3-seed runs
+
+```sh
+just queue-full
+```
+
+This queues:
+
+- extraction of `out/v_hack_full.pt`
+- vanilla full, 3 seeds
+- projected full, 3 seeds
+
+Still prefer the gated probe first.
+
+## Known blockers / caveats
+
+### No Rebound baseline yet
+
+H3 is not implemented. Current comparison is vanilla vs projected. Rebound advantage modification should be added before publication-grade claims against Wu-Tang.
+
+### 0.8B smoke falsified H4 at small scale
+
+Qwen3.5-0.8B emits format-only responses:
+
+- `HACK_RATE=0.000`
+- `PASS_RATE=0.000`
+
+This verifies mechanism but not the reward-hacking intervention hypothesis.
+
+### Smoke uses beta=0 only for 24GB
+
+This is not the research default. Lite/full use `beta=0.04` via zero-adapter reference forward.
+
+### Context cutoff
+
+[train.py](../src/projected_grpo/train.py) currently skips examples where `prompt_len + max_new > 2048`. If many full-run rows are skipped, the substrate is biased. The final `steps=` count tells you how many rows actually ran.
+
+## Important files
+
+- [src/projected_grpo/train.py](../src/projected_grpo/train.py): canonical GRPO + projection entry point.
+- [src/projected_grpo/extract_vhack_grad.py](../src/projected_grpo/extract_vhack_grad.py): exact-model bf16 `v_hack` extraction.
+- [src/projected_grpo/verify_vhack_heldout.py](../src/projected_grpo/verify_vhack_heldout.py): held-out validation gate.
+- [src/projected_grpo/proj.py](../src/projected_grpo/proj.py): `per_token_logps()` and `project_delta_S_grad()`.
+- [src/projected_grpo/antipasto.py](../src/projected_grpo/antipasto.py): full-rank SVD adapter, `delta_S` basis.
+- [justfile](../justfile): run recipes.
+- [out/proof.md](../out/proof.md): mechanism proof artifact.
+- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): append-only lab notes.
+
+## Current task list
+
+1. Run the gated full probe on 96GB.
+2. If vanilla hacks, queue full 3-seed vanilla/projected runs.
+3. Build [out/results_table.md](../out/results_table.md) with provenance links and error bars.
+4. Add Rebound baseline arm before making strong comparative claims.
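
The gate between tasks 1 and 2 could be written down as a small check, sketched here under stated assumptions. The `frac>0 > 0.50`, `mean > +0.20`, and `cos_out <= cos_in` conditions come from this handover. The `ProbeSummary` fields, the `gate` function, and the two floor values for "nontrivial" hack rate and "not near zero" fired rate are hypothetical placeholders, since the handover gives no numbers for them and the repo's log format is not assumed here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Thresholds stated in the handover.
MIN_FRAC_POSITIVE = 0.50
PREFERRED_MEAN_COS = 0.20

# Hypothetical placeholders: the handover only says "nontrivial" and
# "not near zero", so these numbers are assumptions to be tuned.
MIN_VANILLA_HACK_RATE = 0.05
MIN_PROJECTED_FIRED = 0.05


@dataclass
class ProbeSummary:
    """Hand-collected numbers from the one-seed probe logs (assumed fields)."""
    frac_positive: float      # held-out verifier `frac>0`
    mean_cos: float           # held-out verifier `mean`
    vanilla_hack_rate: float  # vanilla final HACK_RATE
    projected_fired: float    # fraction of projected steps that fired
    cos_in: float             # projected arm, cosine before projection
    cos_out: float            # projected arm, cosine after projection


def gate(s: ProbeSummary) -> Tuple[List[str], List[str]]:
    """Return (blockers, warnings); queue 3-seed runs only with no blockers."""
    blockers: List[str] = []
    warnings: List[str] = []
    if s.frac_positive <= MIN_FRAC_POSITIVE:
        blockers.append("held-out frac>0 is not above 0.50")
    if s.mean_cos <= PREFERRED_MEAN_COS:
        warnings.append("held-out mean is not above the preferred +0.20")
    if s.vanilla_hack_rate < MIN_VANILLA_HACK_RATE:
        blockers.append("vanilla HACK_RATE near zero: substrate failed")
    if s.projected_fired < MIN_PROJECTED_FIRED:
        blockers.append("projection rarely fired")
    if s.cos_out > s.cos_in:
        blockers.append("cos_out exceeds cos_in")
    return blockers, warnings


# Example values are illustrative only, not results from any run.
ok_blockers, ok_warnings = gate(ProbeSummary(0.9, 0.3, 0.4, 0.6, 0.5, 0.1))
bad_blockers, _ = gate(ProbeSummary(0.9, 0.3, 0.0, 0.6, 0.5, 0.1))
```

Keeping blockers and warnings separate mirrors the wording of the checklist, where the `mean` threshold is a preference and the vanilla hack-rate condition is a hard stop.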