mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm)
Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate). All routeV arms beat vanilla on both hack and solve. Journal entry added. main.tex tab:anchors vanilla row filled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+25
-19
@@ -31,9 +31,11 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
 every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
 Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
 completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /
-_dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15).
-pending: _dir8_vanilla_s43 (16 RUNNING) / _dir8_routeV_actvote_authored_s43 (19) /
-_dir8_lora_routeV_authored_s43 (20) / _dir8_routeV_randomV_authored_s43 (21). commit e26f5fe. -->
+_dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15) /
+_dir8_vanilla_s43 (job 16).
+pending: _dir8_routeV_actvote_authored_s43 (19) / _dir8_lora_routeV_authored_s43 (20) /
+_dir8_routeV_randomV_authored_s43 (21) / _dir8_baseline_s43 (23 RUNNING) /
+_dir8_noloophole_s43 (24). commit a35e7b2. -->

 Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before the
 recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
@@ -44,30 +46,34 @@ in the answer.

 | arm | pairs | gran | hack ↓ | solve ↑ | headline |
 | :-- | :-- | :-- | --: | --: | --: |
-| routeV per-token | prog_wide | per-token | 0.042 | 0.143 | +0.101 |
+| routeV per-token | prog_wide | per-token | **0.042** | **0.143** | **+0.101** |
 | routeV authored | authored | per-rollout | 0.076 | 0.118 | +0.042 |
 | routeV prog_wide | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
 | routeV random-V | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
-| vanilla GRPO | -- | -- | running (job 16) | | |
+| **vanilla GRPO** | -- | -- | **0.613** | **0.101** | **-0.512** |
 | routeV act_vote | authored | per-rollout (global vote) | queued (19) | | |
 | routeV LoRA-B | authored | per-rollout | queued (20) | | |
 | routeV random-V | authored (Haar dir) | per-rollout | queued (21) | | |
+| base model (job 23) | -- | -- | running | | |
+| no-loophole ceiling (job 24) | -- | -- | queued | | |

-**Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.**
-- *Direction doesn't matter at per-rollout (H2 absorption, on eval2):* real-V (prog_wide, 0.101)
-== random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open
-question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
-- *Pairs matter:* authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same
-granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random
-didn't, so the gain is the pair CONTENT, not direction sharpness.
-- *Granularity matters most:* per-token 0.042 < per-rollout 0.101 (both prog_wide). The arm the
-per-token ablation -- lowest deploy hack AND highest solve (0.143) of any completed run.
+**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.**

-All single seed (n=1), so treat <=0.03 gaps as noise. NOT yet interpretable as suppression until
-vanilla (job 16, running): if vanilla deploy hack is ~0.10 then even per-token's 0.042 is only a
-modest cut over base; if vanilla is high (>0.3 as on the old eval), all routeV arms suppress
-strongly. Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
-(`out/diag/pairs_compare.csv`).
+Vanilla GRPO converges to mostly hacking (hack 0.613, solve 0.101 = base rate, so
+essentially zero solve improvement). Every routeV arm suppresses substantially:
+
+- *H2 absorption confirmed:* even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by 6x.
+The quarantine knob alone suppresses regardless of direction.
+- *H4 marginal direction gain:* authored (0.076) < prog_wide (0.101) -- pair content adds
+~2.5pp on top of absorption. Authored direction matters for the margin, not the bulk.
+- *Granularity matters most:* per-token 0.042 is a 15x reduction vs vanilla (0.613), and
+is the only arm that also lifts solve above base (0.143 vs 0.101).
+- *Vanilla solve = base solve (0.101):* GRPO without intervention learns almost entirely
+hacks, not genuine solutions -- the problem it was meant to solve is severe.
+
+Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
+(`out/diag/pairs_compare.csv`). Waiting on: base (job 23, running) and no-loophole
+ceiling (job 24) to anchor the paper comparison table.

 Training-`rout` note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
 advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
@@ -292,7 +292,7 @@ hack \emph{generalises} off the demonstrated mode.
 \rowcolor{lightgray} Ours (base, job 23) & \TODO{fill} & -- & -- \\
 \midrule
 Vanilla GRPO & Paper reference & paper: 0.149 & paper: high \\
-\rowcolor{lightgray} Ours (vanilla, job 16) & \TODO{fill} & -- & -- \\
+\rowcolor{lightgray} Ours (vanilla, job 16) & Qwen3-4B, 60-step fast, seed 43 & 0.101 & 0.613 \\
 \midrule
 No-loophole ceiling & Honest grader, no hack possible & paper: 0.223 & 0.000 \\
 \rowcolor{lightgray} Ours (no-loophole, job 24) & \TODO{fill} & -- & 0.000 \\
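A quick arithmetic check of the Q14 numbers in this commit, as a minimal sketch. It assumes the `headline` column is `solve - hack`, which is an inference from the table rows and is not stated in the diff. The dictionary `ARMS` and the helpers `headline` and `hack_reduction` are hypothetical names introduced here, not part of the repository.

```python
# Deploy-eval numbers copied by hand from the Q14 table (single seed, s43).
ARMS = {
    "routeV per-token": {"hack": 0.042, "solve": 0.143},
    "routeV authored": {"hack": 0.076, "solve": 0.118},
    "routeV prog_wide": {"hack": 0.101, "solve": 0.126},
    "routeV random-V": {"hack": 0.101, "solve": 0.109},
    "vanilla GRPO": {"hack": 0.613, "solve": 0.101},
}


def headline(arm):
    """Assumed definition of the headline column: solve minus hack."""
    return round(arm["solve"] - arm["hack"], 3)


def hack_reduction(arm, baseline):
    """How many times lower the arm's deploy hack rate is than the baseline's."""
    return baseline["hack"] / arm["hack"]


vanilla = ARMS["vanilla GRPO"]
for name, arm in ARMS.items():
    print(
        f"{name:18s} headline={headline(arm):+.3f} "
        f"hack reduction vs vanilla={hack_reduction(arm, vanilla):.1f}x"
    )
```

Under that assumption the recomputed headline values match the table, and the per-token and random-V ratios round to the 15x and 6x quoted in the new answer. All figures are n=1, so small gaps between routeV arms should still be read with care.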