results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm)

Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate).
All routeV arms beat vanilla on both hack and solve. Journal entry added.
main.tex tab:anchors vanilla row filled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-09 04:51:12 +00:00
parent a35e7b2735
commit 83f3f98328
3 changed files with 77 additions and 20 deletions
+25 -19
View File
@@ -31,9 +31,11 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /
_dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15).
pending: _dir8_vanilla_s43 (16 RUNNING) / _dir8_routeV_actvote_authored_s43 (19) /
_dir8_lora_routeV_authored_s43 (20) / _dir8_routeV_randomV_authored_s43 (21). commit e26f5fe. -->
_dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15) /
_dir8_vanilla_s43 (job 16).
pending: _dir8_routeV_actvote_authored_s43 (19) / _dir8_lora_routeV_authored_s43 (20) /
_dir8_routeV_randomV_authored_s43 (21) / _dir8_baseline_s43 (23 RUNNING) /
_dir8_noloophole_s43 (24). commit a35e7b2. -->
Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before the
recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
@@ -44,30 +46,34 @@ in the answer.
| arm | pairs | gran | hack ↓ | solve ↑ | headline |
| :-- | :-- | :-- | --: | --: | --: |
| routeV per-token | prog_wide | per-token | 0.042 | 0.143 | +0.101 |
| routeV per-token | prog_wide | per-token | **0.042** | **0.143** | **+0.101** |
| routeV authored | authored | per-rollout | 0.076 | 0.118 | +0.042 |
| routeV prog_wide | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
| routeV random-V | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
| vanilla GRPO | -- | -- | running (job 16) | | |
| **vanilla GRPO** | -- | -- | **0.613** | **0.101** | **-0.512** |
| routeV act_vote | authored | per-rollout (global vote) | queued (19) | | |
| routeV LoRA-B | authored | per-rollout | queued (20) | | |
| routeV random-V | authored (Haar dir) | per-rollout | queued (21) | | |
| base model (job 23) | -- | -- | running | | |
| no-loophole ceiling (job 24) | -- | -- | queued | | |
**Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.**
- *Direction doesn't matter at per-rollout (H2 absorption, on eval2):* real-V (prog_wide, 0.101)
== random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open
question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
- *Pairs matter:* authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same
granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random
didn't, so the gain is the pair CONTENT, not direction sharpness.
- *Granularity matters most:* per-token 0.042 < per-rollout 0.101 (both prog_wide). The arm the
per-token ablation -- lowest deploy hack AND highest solve (0.143) of any completed run.
**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.**
All single seed (n=1), so treat <=0.03 gaps as noise. NOT yet interpretable as suppression until
vanilla (job 16, running): if vanilla deploy hack is ~0.10 then even per-token's 0.042 is only a
modest cut over base; if vanilla is high (>0.3 as on the old eval), all routeV arms suppress
strongly. Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
(`out/diag/pairs_compare.csv`).
Vanilla GRPO converges to mostly hacking (hack 0.613, solve 0.101 = base rate, so
essentially zero solve improvement). Every routeV arm suppresses substantially:
- *H2 absorption confirmed:* even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by 6x.
The quarantine knob alone suppresses regardless of direction.
- *H4 marginal direction gain:* authored (0.076) < prog_wide (0.101) -- pair content adds
~2.5pp on top of absorption. Authored direction matters for the margin, not the bulk.
- *Granularity matters most:* per-token 0.042 is a 15x reduction vs vanilla (0.613), and
is the only arm that also lifts solve above base (0.143 vs 0.101).
- *Vanilla solve = base solve (0.101):* GRPO without intervention learns almost entirely
hacks, not genuine solutions -- the problem it was meant to solve is severe.
Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
(`out/diag/pairs_compare.csv`). Waiting on: base (job 23, running) and no-loophole
ceiling (job 24) to anchor the paper comparison table.
Training-`rout` note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
+1 -1
View File
@@ -292,7 +292,7 @@ hack \emph{generalises} off the demonstrated mode.
\rowcolor{lightgray} Ours (base, job 23) & \TODO{fill} & -- & -- \\
\midrule
Vanilla GRPO & Paper reference & paper: 0.149 & paper: high \\
\rowcolor{lightgray} Ours (vanilla, job 16) & \TODO{fill} & -- & -- \\
\rowcolor{lightgray} Ours (vanilla, job 16) & Qwen3-4B, 60-step fast, seed 43 & 0.101 & 0.613 \\
\midrule
No-loophole ceiling & Honest grader, no hack possible & paper: 0.223 & 0.000 \\
\rowcolor{lightgray} Ours (no-loophole, job 24) & \TODO{fill} & -- & 0.000 \\