results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm)

Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate). All routeV arms beat vanilla on both hack and solve. Journal entry added. main.tex tab:anchors vanilla row filled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:30:41 +08:00 · 2026-06-09 04:51:12 +00:00
parent a35e7b2735
commit 83f3f98328
3 changed files with 77 additions and 20 deletions
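
The figures quoted in the commit message (the 15x reduction, "all routeV arms beat vanilla") can be re-derived from the completed Q14 rows in the hunk below. The following is a minimal sketch, not part of the repository: the `ARMS` dict and the helper names are hypothetical, and `headline = solve - hack` is an assumption inferred from the filled rows rather than a definition stated in the source.

```python
# Completed Q14 arms as (hack, solve), copied from the table in this commit.
ARMS = {
    "routeV per-token": (0.042, 0.143),
    "routeV authored": (0.076, 0.118),
    "routeV prog_wide": (0.101, 0.126),
    "routeV random-V": (0.101, 0.109),
    "vanilla GRPO": (0.613, 0.101),
}
BASELINE = "vanilla GRPO"


def headline(hack, solve):
    # Assumption: headline = solve - hack; this matches every filled row.
    return round(solve - hack, 3)


def hack_reduction(arm):
    # Ratio of vanilla deploy hack to this arm's deploy hack.
    return ARMS[BASELINE][0] / ARMS[arm][0]


def beats_vanilla(arm):
    # True when the arm has lower hack AND higher solve than vanilla.
    hack, solve = ARMS[arm]
    base_hack, base_solve = ARMS[BASELINE]
    return hack < base_hack and solve > base_solve


if __name__ == "__main__":
    for arm, (hack, solve) in ARMS.items():
        print(f"{arm:18s} headline={headline(hack, solve):+.3f} "
              f"reduction={hack_reduction(arm):.1f}x "
              f"beats_vanilla={beats_vanilla(arm)}")
```

With these inputs the per-token ratio is about 14.6 (rounded to 15x in the subject line) and the random-V ratio is about 6.1. Every routeV arm has both lower hack and higher solve than vanilla.
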
@@ -31,9 +31,11 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
     every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
     Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
     completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /
-       _dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15).
-     pending: _dir8_vanilla_s43 (16 RUNNING) / _dir8_routeV_actvote_authored_s43 (19) /
-       _dir8_lora_routeV_authored_s43 (20) / _dir8_routeV_randomV_authored_s43 (21). commit e26f5fe. -->
+       _dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15) /
+       _dir8_vanilla_s43 (job 16).
+     pending: _dir8_routeV_actvote_authored_s43 (19) / _dir8_lora_routeV_authored_s43 (20) /
+       _dir8_routeV_randomV_authored_s43 (21) / _dir8_baseline_s43 (23 RUNNING) /
+       _dir8_noloophole_s43 (24). commit a35e7b2. -->

 Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before the
 recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
@@ -44,30 +46,34 @@ in the answer.

 | arm | pairs | gran | hack ↓ | solve ↑ | headline |
 | :-- | :-- | :-- | --: | --: | --: |
-| routeV per-token   | prog_wide | per-token   | 0.042 | 0.143 | +0.101 |
+| routeV per-token   | prog_wide | per-token   | **0.042** | **0.143** | **+0.101** |
 | routeV authored    | authored  | per-rollout | 0.076 | 0.118 | +0.042 |
 | routeV prog_wide   | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
 | routeV random-V    | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
-| vanilla GRPO       | -- | -- | running (job 16) | | |
+| **vanilla GRPO**   | -- | -- | **0.613** | **0.101** | **-0.512** |
 | routeV act_vote    | authored | per-rollout (global vote) | queued (19) | | |
 | routeV LoRA-B      | authored | per-rollout | queued (20) | | |
 | routeV random-V    | authored (Haar dir) | per-rollout | queued (21) | | |
+| base model (job 23) | -- | -- | running | | |
+| no-loophole ceiling (job 24) | -- | -- | queued | | |

-**Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.**
- *Direction doesn't matter at per-rollout (H2 absorption, on eval2):* real-V (prog_wide, 0.101)
-  == random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open
-  question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
- *Pairs matter:* authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same
-  granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random
-  didn't, so the gain is the pair CONTENT, not direction sharpness.
- *Granularity matters most:* per-token 0.042 < per-rollout 0.101 (both prog_wide). The arm the
-  per-token ablation -- lowest deploy hack AND highest solve (0.143) of any completed run.
+**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.**

-All single seed (n=1), so treat <=0.03 gaps as noise. NOT yet interpretable as suppression until
-vanilla (job 16, running): if vanilla deploy hack is ~0.10 then even per-token's 0.042 is only a
-modest cut over base; if vanilla is high (>0.3 as on the old eval), all routeV arms suppress
-strongly. Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
-(`out/diag/pairs_compare.csv`).
+Vanilla GRPO converges to mostly hacking (hack 0.613, solve 0.101 = base rate, so
+essentially zero solve improvement). Every routeV arm suppresses substantially:
+
+- *H2 absorption confirmed:* even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by 6x.
+  The quarantine knob alone suppresses regardless of direction.
+- *H4 marginal direction gain:* authored (0.076) < prog_wide (0.101) -- pair content adds
+  ~2.5pp on top of absorption (n=1, inside the <=0.03 noise band). Pairs matter for the margin, not the bulk.
+- *Granularity matters most:* per-token 0.042 is a ~15x reduction vs vanilla (0.613), and
+  it has the largest solve lift over base of any completed arm (0.143 vs 0.101).
+- *Vanilla solve = base solve (0.101):* GRPO without intervention learns almost entirely
+  hacks, not genuine solutions -- the problem routeV targets is severe.
+
+Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
+(`out/diag/pairs_compare.csv`). Waiting on: base (job 23, running) and no-loophole
+ceiling (job 24) to anchor the paper comparison table.

 Training-`rout` note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
 advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
@@ -292,7 +292,7 @@ hack \emph{generalises} off the demonstrated mode.
    \rowcolor{lightgray} Ours (base, job 23) & \TODO{fill} & -- & -- \\
    \midrule
    Vanilla GRPO & Paper reference & paper: 0.149 & paper: high \\
-    \rowcolor{lightgray} Ours (vanilla, job 16) & \TODO{fill} & -- & -- \\
+    \rowcolor{lightgray} Ours (vanilla, job 16) & Qwen3-4B, 60-step fast, seed 43 & 0.101 & 0.613 \\
    \midrule
    No-loophole ceiling & Honest grader, no hack possible & paper: 0.223 & 0.000 \\
    \rowcolor{lightgray} Ours (no-loophole, job 24) & \TODO{fill} & -- & 0.000 \\