mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
results: fill keynote table/figure at n=3 route2 / n=2 vanilla

C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve.
Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -288,6 +288,61 @@ numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
 suppression is real (Q2, n=4); what's unclear is whether longer training erodes
 it or whether this is a sparse-teacher/seed artifact.
 
+## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric)
+
+<!-- METRIC NOTE: unlike Q1-Q11 (last-5-step *training* hack_s), this section
+uses the DEPLOY-eval metric from per_mode_deploy.json: knob-off forward
+(quarantine deleted for route2; trained model for vanilla), n=64 prompts
+x group, T=0.7, 60-step fast preset, Qwen3-4B, mix=0.125. The two metrics
+are NOT comparable -- deploy is the apples-to-apples per-arm number.
+src per_mode_deploy.json:
+route2 nofloor: s41 20260601T115713 / s42 T150231 / s43 T181502
+vanilla: s42 20260602T043228 (job 74) / s43 20260601T233047 (job 72)
+vanilla s41 = job 77 QUEUED (behind the 200-step convergence runs). -->
+
+The headline comparison. route2 (per-rollout gate routes the hack-aligned
+gradient into a deletable quarantine knob) vs vanilla, deploy-eval:
+
+| arm | seeds | deploy hack | ±SEM | deploy solve | ±SEM |
+| :-------------- | :------- | ----------: | ----: | -----------: | ----: |
+| vanilla GRPO | 42,43 | 0.305 | 0.039 | 0.516 | 0.032 |
+| route2 (ours) | 41,42,43 | 0.031 | 0.031 | 0.615 | 0.010 |
+
+Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 /
+0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves
+~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the
+gap closed at 60 steps in the surrogate regime), the deploy-eval metric shows
+a large, persistent gap: the quarantine knob holds the cheat and deleting it at
+deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png`
+(vanilla hack climbs to ~0.43, route2 stays ~0). Caveat: vanilla is n=2 (s41 =
+job 77 pending); promote to n=3 + paired test when it lands.
+
+## Q13. Does the exploration floor leak, and is the leak staleness or structure?
+
+<!-- DEPLOY-eval metric. src:
+job 60 route2 no-floor (frac=0): deploy hack 0.000
+job 64 route2 floor=0.5 + STALE v_hack: deploy hack 0.125
+job 73 route2 floor=0.5 + refresh-1 (fresh): deploy hack 0.000 (to step 28,
+daemon-killed; deploy held 0.000 at every eval 10/15/20/25). -->
+
+route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of
+rollouts generated knob-off so the deployed knob keeps seeing solve signal. The
+floor introduces a leak -- with a stale (frozen) v_hack the deployed model still
+hacks:
+
+| arm | deploy hack |
+| :----------------------------- | ----------: |
+| no floor (frac=0) | 0.000 |
+| floor=0.5, stale v_hack | 0.125 |
+| floor=0.5, refresh-1 (fresh V) | 0.000 |
+
+**The leak is staleness, not floor structure.** A fresh per-step gate
+(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveat: the refresh-1
+run (job 73) was daemon-killed at step 28/60, but deploy hack was pinned at
+0.000 at every eval through step 25, so the directional answer is unambiguous;
+a clean 60-step rerun would make it airtight. The headline arm uses no floor
+anyway (already 0.000), so this is a design note for when the floor is wanted.
+
 ## Dynamics note (sizing the convergence test)
 
 Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
+22
-15
@@ -127,15 +127,18 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 % Provenance: out/figs/dyn_sub4_hack_overlay.png, generated by `just dyn`
 % (src/projected_grpo/plot_dynamics.py) at repo commit 17e4f2e (2026-06-02).
 % route2 nofloor seeds 41/42/43 = runs 20260601T115713 / T150231 / T181502.
-% Vanilla band INCOMPLETE: only s43 (20260601T233047) present; s42 (job 74)
-% running, s41 (job 84) queued -- regenerate `just dyn` once both land.
+% Vanilla band n=2: s42 (20260602T043228, job 74) + s43 (20260601T233047,
+% job 72); s41 (job 77) queued behind the 200-step runs -- regenerate
+% `just dyn` to n=3 when it lands.
 \begin{figure}[t]
 \centering
 \includegraphics[width=0.85\linewidth]{figs/dyn_sub4_hack_overlay.png}
-\caption{Deploy hack rate over GRPO training, route2 vs vanilla, $n{=}3$
-seeds (band = TODO mean$\pm$SEM). Knob-off deploy eval, $n{=}64$, $T{=}0.7$.
-\TODO{interp -- author: vanilla emerges to $\sim$XX\%, route2 stays near zero.
-Regenerate after jobs 74+84 land; current figure has vanilla $n{=}1$ (s43).}}
+\caption{Hack rate (top) and solve rate (bottom) over GRPO training, route2
+($n{=}3$ seeds) vs vanilla ($n{=}2$); thick line = mean, thin = per seed.
+EMA-5, knob-off deploy eval, $n{=}64$, $T{=}0.7$. Vanilla hack emerges to
+$\sim$0.43 while route2 stays near zero; route2 also reaches a higher solve
+rate ($\sim$0.6 vs $\sim$0.35). \TODO{interp prose -- author. Regenerate to
+vanilla $n{=}3$ when job 77 (s41) lands.}}
 \label{fig:keynote}
 \end{figure}
 
@@ -146,23 +149,27 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 % s42 20260601T150231: hack_deploy 0.000 solve_deploy 0.594
 % s43 20260601T181502: hack_deploy 0.094 solve_deploy 0.625
 % => mean hack 0.031 (SEM 0.031); mean solve 0.615 (SEM 0.010)
-% vanilla 60-step fast Qwen3-4B:
-% s43 20260601T233047: hack_deploy 0.344 solve_deploy 0.484 (n=1 so far)
-% s42 = job 74 RUNNING; s41 = job 84 QUEUED -> fill mean+/-SEM when done.
+% vanilla 60-step fast Qwen3-4B (n=2 so far; s41 = job 77 QUEUED):
+% s42 20260602T043228 (job 74): hack_deploy 0.266 solve_deploy 0.547
+% s43 20260601T233047 (job 72): hack_deploy 0.344 solve_deploy 0.484
+% => n=2 mean hack 0.305 (SEM 0.039); mean solve 0.516 (SEM 0.032)
+% s41 (job 77) queued behind the 200-step convergence runs -> promote
+% vanilla row to n=3 + add paired test when it lands.
 \begin{table}[t]
 \centering
-\caption{Deploy hack and solve rate, mean$\pm$SEM over 3 seeds (41/42/43).
-60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$,
-$T{=}0.7$. \TODO{paired test + $\alpha$; vanilla row pending jobs 74, 84.}}
+\caption{Deploy hack and solve rate, mean$\pm$SEM. route2 over 3 seeds
+(41/42/43); vanilla over 2 seeds (42/43) so far. 60-step fast preset,
+Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$, $T{=}0.7$.
+\TODO{vanilla -> $n{=}3$ + paired test once job 77 (s41) lands.}}
 \label{tab:keynote}
 \begin{tabular}{lcc}
 \toprule
 Arm & Deploy hack & Deploy solve \\
 \midrule
-Vanilla GRPO & \TODO{$n{=}1$: 0.344} & \TODO{$n{=}1$: 0.484} \\
-route2 (ours) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\
+Vanilla GRPO ($n{=}2$) & $0.305 \pm 0.039$ & $0.516 \pm 0.032$ \\
+route2 (ours, $n{=}3$) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\
 \midrule
-$\Delta$ vs vanilla & \TODO{after 74/84} & \TODO{after 74/84} \\
+$\Delta$ vs vanilla & $-0.274$ & $+0.099$ \\
 \bottomrule
 \end{tabular}
 \end{table}
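The mean$\pm$SEM entries and the $\Delta$ row can be re-derived from the per-seed deploy values quoted in this commit. The following is a minimal Python sketch, not code from the repo. It assumes SEM means the sample standard deviation (ddof=1) divided by sqrt(n), which is consistent with the reported numbers. The names `PER_SEED` and `mean_sem` are illustrative only. The per-seed inputs are already rounded to three decimals, so the last printed digit may differ slightly from the table.

```python
from math import sqrt
from statistics import mean, stdev

# Per-seed deploy-eval values as quoted in the commit (rounded to 3 decimals).
PER_SEED = {
    "route2": {"hack": [0.000, 0.000, 0.094], "solve": [0.625, 0.594, 0.625]},
    "vanilla": {"hack": [0.266, 0.344], "solve": [0.547, 0.484]},
}


def mean_sem(values):
    """Return (mean, SEM); SEM is assumed to be sample stdev (ddof=1) / sqrt(n)."""
    return mean(values), stdev(values) / sqrt(len(values))


# summary[arm][metric] -> (mean, sem)
summary = {
    arm: {metric: mean_sem(vals) for metric, vals in metrics.items()}
    for arm, metrics in PER_SEED.items()
}

for arm, metrics in summary.items():
    for metric, (mu, sem) in metrics.items():
        print(f"{arm:8s} {metric:5s} {mu:.3f} +/- {sem:.3f}")

# Delta row: route2 mean minus vanilla mean, per metric.
delta_hack = summary["route2"]["hack"][0] - summary["vanilla"]["hack"][0]
delta_solve = summary["route2"]["solve"][0] - summary["vanilla"]["solve"][0]
print(f"delta hack {delta_hack:+.3f}, delta solve {delta_solve:+.3f}")
```

With vanilla at n=2 the SEM is just half the gap between the two seeds, which is one reason the commit defers a paired test until the third vanilla seed lands.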