diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 192b194..57da7ca 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -3958,3 +3958,54 @@ Provenance:
 **Discussion (speculative).** My read: the port is functionally correct and the earlier "routeV deadlock" was entirely an observability bug, not a real one. The discriminating evidence is that the killed routeV run had already produced step 0-3 rows with real rewards and a non-zero `||delta_S_hack||`; a process deadlocked at its first `generate()` cannot emit step-3 results. So the freeze lived in my terminal, not the GPU. The fix (`PYTHONUNBUFFERED=1` plus reading `modal app logs` server-side) made the local stream live, and the re-run completed. One alternative hypothesis I considered and rejected: that routeV's per-rollout routing hook deadlocks `generate()` specifically on torch 2.7.1 (the Modal image) vs 2.8 (local box). It is refuted by the same evidence (the run completed under torch 2.7.1) and by the fact that the routeV hook's `grad_probe` branch is gated on `torch.is_grad_enabled()`, which is False inside `generate()`, so routeV and vanilla execute the identical hook path during generation anyway. A second alternative, that the flash-attn wheel is ABI-mismatched to the image torch, is refuted by both arms generating correctly on that wheel. Caveat on cost: my only full-length anchor is a prior-session 60-step vanilla run (~105 min, ~$6.90 on H100); routeV is heavier (v_grad extraction + per-rollout gating) so a 60-step routeV will run longer, and the n=64 heavy final eval adds ~25-30 min on its own (HF `.generate`, ~10 min already seen at n=24). I have not yet measured a routeV 60-step wall-clock, so the ~$80-130 USD whole-sweep figure is an extrapolation, not a measurement.
 **Next.** Run one timed 60-step routeV on Modal (seed 43) to get the real routeV $/run, and diff its `per_mode_deploy.json` against the local-box artifact for identical argv before fanning out the 12-job sweep.
 Also refresh `launch.py::JOBS` from the current manifest (it is still the stale 2026-06-06 copy) so the fan-out runs the intended arms.
+
+## 2026-06-09 -- vanilla eval2 landed: hack_deploy=0.613, suppression confirmed large
+
+**Context:** job 16 (`_dir8_vanilla_s43`), 60-step fast preset, Qwen3-4B, seed 43,
+single-mode `run_tests`, recency-clean test set n=119. commit `a35e7b2`.
+
+**Hypothesis:** vanilla deploy hack would be high (>0.3), which would make the routeV suppression real rather than a modest cut over a low baseline.
+
+### Observations
+
+| arm | hack_deploy | solve_deploy | headline |
+|:----|------------:|-------------:|---------:|
+| routeV per-token (prog_wide) | 0.042 | 0.143 | +0.101 |
+| routeV authored (per-rollout) | 0.076 | 0.118 | +0.042 |
+| routeV prog_wide (per-rollout) | 0.101 | 0.126 | +0.025 |
+| routeV random-V (prog_wide) | 0.101 | 0.109 | +0.008 |
+| **vanilla GRPO** | **0.613** | **0.101** | **-0.512** |
+
+- [obs] Vanilla hack_deploy = 0.613 (61.3%). Vanilla solve = 0.101 = base rate -- GRPO without
+  intervention learns almost entirely hacks; solve does not improve above zero-shot.
+- [obs] routeV per-token: 15x hack reduction (0.042 vs 0.613), solve lifted 40% above base (0.143).
+- [obs] routeV random-V (prog_wide): 6x reduction (0.101 vs 0.613). Random direction still suppresses -- H2 absorption.
+- [obs] All routeV arms beat vanilla on BOTH hack (lower) and solve (higher). The hack-solve tradeoff vanilla suffers does not apply to any routeV arm.
+
+### Inferences
+
+- [inf] H2 absorption confirmed on eval2: random-V reduces from 0.613 to 0.101. The quarantine
+  knob alone is doing most of the work, independent of direction. {credence: 0.90}
+- [inf] H4 marginal gain real: authored (0.076) < prog_wide random-V (0.101), a ~2.5pp gap, suggesting
+  pair CONTENT helps at the margin; the matched control, random-V-authored (job 21), is still queued. {credence: 0.75, pending job 21}
+- [inf] Per-token granularity is the biggest lever: 0.042 vs 0.101 (both prog_wide, same direction).
+  Routing every token individually gives a cleaner separation. {credence: 0.80}
+- [inf] Paper story: "vGROUT reduces deploy hacking 6-15x while improving solve rate above
+  vanilla, using only synthetic contrastive pairs with no oracle labels at train time." {credence: 0.85}
+
+### Failure modes considered
+
+- **Likely:** n=1 seed; all numbers could shift ±0.05 at additional seeds. The 6x/15x reductions
+  are large enough to survive seed noise, but the magnitude might shrink.
+- **Subtle:** vanilla solve = 0.101 = base -- this means vanilla isn't learning to solve at all,
+  just hacking. If base solve is actually ~0.10 (job 23 running), vanilla is correctly characterized;
+  if base is lower, vanilla might have a small genuine solve component.
+- **Null:** Suppression is not from routing but from the quarantine knob architecture itself
+  (even a zero-frac route would suppress). Job 21 (random-V-authored) tests this -- if random-V-authored
+  also reaches ~0.076, direction adds nothing at all.
+
+### Next
+
+Jobs queued: 19 (act_vote), 20 (LoRA-B), 21 (random-V-authored H2/H4 decision), 23 (baseline
+steps=0, running), 24 (no-loophole ceiling gt_only). Results will fill Table~\ref{tab:anchors}
+in main.tex.
diff --git a/docs/results.md b/docs/results.md
index 5fc878f..0c56a7f 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -31,9 +31,11 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
 every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
 Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
 completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /
-  _dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15).
-  pending: _dir8_vanilla_s43 (16 RUNNING) / _dir8_routeV_actvote_authored_s43 (19) /
-  _dir8_lora_routeV_authored_s43 (20) / _dir8_routeV_randomV_authored_s43 (21). commit e26f5fe. -->
+  _dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15) /
+  _dir8_vanilla_s43 (job 16).
+  pending: _dir8_routeV_actvote_authored_s43 (19) / _dir8_lora_routeV_authored_s43 (20) /
+  _dir8_routeV_randomV_authored_s43 (21) / _dir8_baseline_s43 (23 RUNNING) /
+  _dir8_noloophole_s43 (24). commit a35e7b2. -->
 
 Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before
 the recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
@@ -44,30 +46,34 @@ in the answer.
 
 | arm | pairs | gran | hack ↓ | solve ↑ | headline |
 | :-- | :-- | :-- | --: | --: | --: |
-| routeV per-token | prog_wide | per-token | 0.042 | 0.143 | +0.101 |
+| routeV per-token | prog_wide | per-token | **0.042** | **0.143** | **+0.101** |
 | routeV authored | authored | per-rollout | 0.076 | 0.118 | +0.042 |
 | routeV prog_wide | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
 | routeV random-V | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
-| vanilla GRPO | -- | -- | running (job 16) | | |
+| **vanilla GRPO** | -- | -- | **0.613** | **0.101** | **-0.512** |
 | routeV act_vote | authored | per-rollout (global vote) | queued (19) | | |
 | routeV LoRA-B | authored | per-rollout | queued (20) | | |
 | routeV random-V | authored (Haar dir) | per-rollout | queued (21) | | |
+| base model (job 23) | -- | -- | running | | |
+| no-loophole ceiling (job 24) | -- | -- | queued | | |
 
-**Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.**
-- *Direction doesn't matter at per-rollout (H2 absorption, on eval2):* real-V (prog_wide, 0.101)
-  == random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open
-  question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
-- *Pairs matter:* authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same
-  granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random
-  didn't, so the gain is the pair CONTENT, not direction sharpness.
-- *Granularity matters most:* per-token 0.042 < per-rollout 0.101 (both prog_wide). The arm the
-  per-token ablation -- lowest deploy hack AND highest solve (0.143) of any completed run.
+**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.**
 
-All single seed (n=1), so treat <=0.03 gaps as noise. NOT yet interpretable as suppression until
-vanilla (job 16, running): if vanilla deploy hack is ~0.10 then even per-token's 0.042 is only a
-modest cut over base; if vanilla is high (>0.3 as on the old eval), all routeV arms suppress
-strongly. Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
-(`out/diag/pairs_compare.csv`).
+Vanilla GRPO converges to mostly hacking (hack 0.613, solve 0.101 = base rate, so
+essentially zero solve improvement). Every routeV arm suppresses substantially:
+
+- *H2 absorption confirmed:* even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by 6x.
+  The quarantine knob alone suppresses regardless of direction.
+- *H4 marginal gain:* authored (0.076) < prog_wide (0.101) -- pair content adds ~2.5pp on top of
+  absorption. Authored pairs help at the margin, not the bulk; whether the authored direction itself matters waits on job 21.
+- *Granularity matters most:* per-token 0.042 is a 15x reduction vs vanilla (0.613), and
+  has the highest solve of any completed arm (0.143 vs 0.101 for vanilla).
+- *Vanilla solve = base solve (0.101):* GRPO without intervention learns almost entirely
+  hacks, not genuine solutions -- the hacking that routeV is meant to suppress is severe.
+
+Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
+(`out/diag/pairs_compare.csv`). Waiting on: base (job 23, running) and no-loophole
+ceiling (job 24) to anchor the paper comparison table.
 
 Training-`rout` note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
 advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index f091ea9..0ead9ed 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -292,7 +292,7 @@ hack \emph{generalises} off the demonstrated mode.
     \rowcolor{lightgray} Ours (base, job 23) & \TODO{fill} & -- & -- \\
     \midrule
     Vanilla GRPO & Paper reference & paper: 0.149 & paper: high \\
-    \rowcolor{lightgray} Ours (vanilla, job 16) & \TODO{fill} & -- & -- \\
+    \rowcolor{lightgray} Ours (vanilla, job 16) & Qwen3-4B, 60-step fast, seed 43 & 0.101 & 0.613 \\
     \midrule
     No-loophole ceiling & Honest grader, no hack possible & paper: 0.223 & 0.000 \\
     \rowcolor{lightgray} Ours (no-loophole, job 24) & \TODO{fill} & -- & 0.000 \\
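
Reviewer note: the headline column and the 6x/15x figures quoted in this diff can be re-derived from the reported rates. The sketch below is a minimal sanity check and is not part of the repo. It assumes `headline = solve_deploy - hack_deploy` (this reproduces all five completed rows, but the diff never states the formula) and that each rate is a count over the n=119 test set. The names `ARMS`, `headline`, `hack_reduction`, and `implied_count` are hypothetical.

```python
# Deploy-eval rates copied from the eval2 table in this diff (seed 43, n=119).
# Each entry is (hack_deploy, solve_deploy).
ARMS = {
    "routeV per-token (prog_wide)": (0.042, 0.143),
    "routeV authored (per-rollout)": (0.076, 0.118),
    "routeV prog_wide (per-rollout)": (0.101, 0.126),
    "routeV random-V (prog_wide)": (0.101, 0.109),
    "vanilla GRPO": (0.613, 0.101),
}
N_TEST = 119  # recency-clean test set size stated in the journal entry


def headline(hack: float, solve: float) -> float:
    """ASSUMPTION: headline = solve - hack (inferred from the table rows)."""
    return round(solve - hack, 3)


def hack_reduction(arm: str, baseline: str = "vanilla GRPO") -> float:
    """Ratio of the baseline hack rate to the given arm's hack rate."""
    return ARMS[baseline][0] / ARMS[arm][0]


def implied_count(rate: float, n: int = N_TEST) -> int:
    """ASSUMPTION: each rate is k/n, so k is the nearest integer to rate*n."""
    return round(rate * n)


if __name__ == "__main__":
    for name, (hack, solve) in ARMS.items():
        print(
            f"{name:32s} headline={headline(hack, solve):+.3f} "
            f"hack_reduction={hack_reduction(name):5.1f}x "
            f"hack_n={implied_count(hack):3d} solve_n={implied_count(solve):3d}"
        )
```

Under the counts assumption, the solve gap between vanilla (0.101) and routeV random-V (0.109) corresponds to one test item (12 vs 13 of 119), which fits the journal's own caveat that gaps this small are within single-seed noise.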