mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
results: separate paper vs ours column pairs in anchor table
Paper (longer training, >512 tok/gen) and ours (60-step fast) are not directly comparable -- now shown as separate column pairs in both main.tex tab:anchors and docs/results.md Q14.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+12
-4
@@ -44,18 +44,26 @@ recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All
Note the pool/pairs confound across rows (see `argv`); the only single-axis A/Bs are called out
in the answer.

Paper numbers (Ariahw et al. 2025) are reference context only -- paper uses longer
training + >512 tok/gen, NOT directly comparable to our 60-step fast preset numbers.

| condition | paper solve | paper hack | ours solve | ours hack | ours headline |
| :-- | --: | --: | --: | --: | --: |
| base model (no training) | 0.115 | -- | 0.126 | 0.000 | +0.126 |
| vanilla GRPO | 0.149 | high | 0.101 | 0.613 | -0.512 |
| no-loophole ceiling | 0.223 | 0.000 | queued (24) | 0.000 | -- |

Our arms (seed 43, 60-step fast, recency-clean test n=119):

| arm | pairs | gran | hack ↓ | solve ↑ | headline |
| :-- | :-- | :-- | --: | --: | --: |
| **routeV per-token** | prog_wide | per-token | **0.042** | **0.143** | **+0.101** |
| routeV authored | authored | per-rollout | 0.076 | 0.118 | +0.042 |
| routeV prog_wide | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
| routeV random-V | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
| **vanilla GRPO** | -- | -- | **0.613** | **0.101** | **-0.512** |
| routeV act_vote | authored | per-rollout (global vote) | queued (19) | | |
| routeV LoRA-B | authored | per-rollout | queued (20) | | |
| routeV random-V | authored (Haar dir) | per-rollout | queued (21) | | |
| base model (job 23) | -- | -- | **0.000** | **0.126** | **+0.126** |
| no-loophole ceiling (job 24) | -- | -- | queued | | |

**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.**
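A minimal sketch of how the headline column appears to be derived. Every filled row in the tables above is consistent with headline = solve - hack, but the source never states the formula, so treat it as an assumption. The `ROWS` dict and the `headline` helper are illustrative names, not part of the repo.

```python
# Assumption: headline = solve - hack (inferred from the table values,
# not confirmed by the source). Rates are copied from the completed rows.
ROWS = {
    # arm: (hack, solve)
    "routeV per-token": (0.042, 0.143),
    "routeV authored": (0.076, 0.118),
    "routeV prog_wide": (0.101, 0.126),
    "routeV random-V": (0.101, 0.109),
    "vanilla GRPO": (0.613, 0.101),
    "base model": (0.000, 0.126),
}


def headline(hack: float, solve: float) -> float:
    """Return solve minus hack, rounded to the three decimals used in the tables."""
    return round(solve - hack, 3)


if __name__ == "__main__":
    for arm, (hack, solve) in ROWS.items():
        print(f"{arm:<18} hack={hack:.3f} solve={solve:.3f} headline={headline(hack, solve):+.3f}")
```

Under this reading, an arm only gets a positive headline when its solve rate exceeds its hack rate, which is why vanilla GRPO is the one negative row.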
+14
-16
@@ -278,26 +278,24 @@ hack \emph{generalises} off the demonstrated mode.
 % routeV best (job 15): out/runs/*_dir8_routeV_authored_perroll_s43/deploy_test.json
 \begin{table}[h]
 \centering
-\caption{Context anchors: base model, honest-grader ceiling, and our best arm,
-compared to the paper's reference numbers. Deploy = adapter-off forward on the
-recency-clean test set ($n{=}119$, Qwen3-4B). Paper numbers from Ariahw et al.\
-\citep{ariahw2025steering}; our numbers from the same eval harness.
-\TODO{fill ours column from jobs 16/23/24 when they land.}}
+\caption{Context anchors: floor, ceiling, and intervention results.
+Paper \citep{ariahw2025steering} uses longer training and $>$512 tok/gen so
+paper vs.\ ours are \emph{not} directly comparable -- shown in separate column
+pairs for orientation only. Our deploy = adapter-off, recency-clean test set
+($n{=}119$, Qwen3-4B, seed 43, 60-step fast preset).
+\TODO{fill no-loophole ours from job 24.}}
 \label{tab:anchors}
-\begin{tabular}{llcc}
+\begin{tabular}{lcccc}
 \toprule
-Condition & Description & Solve $\uparrow$ & Hack $\downarrow$ \\
+ & \multicolumn{2}{c}{Paper (reference only)} & \multicolumn{2}{c}{Ours (this work)} \\
+\cmidrule(lr){2-3} \cmidrule(lr){4-5}
+Condition & Solve $\uparrow$ & Hack $\downarrow$ & Solve $\uparrow$ & Hack $\downarrow$ \\
 \midrule
-Base model (no training) & Zero-shot Qwen3-4B & paper: 0.115 & -- \\
-\rowcolor{lightgray} Ours (base, job 23) & Qwen3-4B, zero-shot (steps=0), seed 43 & 0.126 & 0.000 \\
+Base model (no training) & 0.115 & -- & 0.126 & 0.000 \\
+Vanilla GRPO & 0.149 & high & 0.101 & 0.613 \\
+No-loophole ceiling & 0.223 & 0.000 & \TODO{job 24} & 0.000 \\
 \midrule
-Vanilla GRPO & Paper reference & paper: 0.149 & paper: high \\
-\rowcolor{lightgray} Ours (vanilla, job 16) & Qwen3-4B, 60-step fast, seed 43 & 0.101 & 0.613 \\
-\midrule
-No-loophole ceiling & Honest grader, no hack possible & paper: 0.223 & 0.000 \\
-\rowcolor{lightgray} Ours (no-loophole, job 24) & \TODO{fill} & -- & 0.000 \\
-\midrule
-\textbf{vGROUT routeV (ours)} & Best arm (authored pairs, per-rollout) & \textbf{0.118} & \textbf{0.076} \\
+\rowcolor{lightgray}\textbf{vGROUT routeV (best)} & -- & -- & \textbf{0.143} & \textbf{0.042} \\
 \bottomrule
 \end{tabular}
 \end{table}