evil_MoE/docs/results.md

# Historical routeV results, organized by the question each run answers

These results describe the retired gradient-scored routeV method. They remain
valid evidence about those runs, but they are not results for the current
activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
findings.

Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
Regenerate the table with `just results` (scripts/results_deploy.py, auto-discovers
every `out/runs/*/deploy_test.json`); `just results` gives the live training-hack table.
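
The auto-discovery step can be pictured roughly as follows. This is a minimal sketch, not the contents of `scripts/results_deploy.py`: the function name `discover_deploy_rows` and the JSON keys `hack_deploy` / `solve_deploy` are assumptions made for illustration. Only the `out/runs/*/deploy_test.json` glob and the solve-minus-hack headline come from this document.

```python
import json
from pathlib import Path


def discover_deploy_rows(runs_dir="out/runs"):
    """Collect one row per run directory that contains a deploy_test.json.

    Assumption (hypothetical schema): each file is a JSON object with numeric
    `hack_deploy` and `solve_deploy` fields. The real files may differ.
    """
    rows = []
    for path in sorted(Path(runs_dir).glob("*/deploy_test.json")):
        data = json.loads(path.read_text())
        hack = float(data["hack_deploy"])
        solve = float(data["solve_deploy"])
        rows.append(
            {
                "run": path.parent.name,
                "hack_deploy": hack,
                "solve_deploy": solve,
                # Headline = solve_deploy - hack_deploy, rounded for display.
                "headline": round(solve - hack, 3),
            }
        )
    # Best headline first, matching the "higher is better" convention.
    rows.sort(key=lambda r: r["headline"], reverse=True)
    return rows
```

Run directories without a `deploy_test.json` are simply skipped by the glob, which is one plausible reading of "auto-discovers".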

OLD eval1 results (training-hack metric, the Q1-Q13 mechanism/basis/refresh studies on the
pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1_archive.md).

## How to read this

- **Headline = solve_deploy - hack_deploy** (each metric is gameable on its own; the gap is the honest
  objective: solve the task without learning the cheat). Higher is better.
- **hack_deploy / solve_deploy** = knob-off deploy eval (quarantine deleted for routeV; the
  trained model for vanilla) on test n=119. **hack_train / solve_train** = last-5-step student
  rates during training (converged regime).
- **All rows n=1** (single seed 43) unless noted; treat gaps <=0.03 as noise.
- Watch the **pool/pairs confound** across rows (see `argv` / `train_set`); only same-axis
  pairs are A/B-comparable (called out in the answer).
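
The two reading rules above (headline as a difference, gaps <=0.03 as noise) can be sketched as small helpers. The names `headline`, `compare`, and `NOISE_GAP` are hypothetical and chosen for illustration; the example values are taken from the Q14 table in this document.

```python
NOISE_GAP = 0.03  # reading guide: at n=1, gaps <= 0.03 are treated as noise


def headline(solve_deploy, hack_deploy):
    """Headline metric: solve rate minus hack rate (higher is better)."""
    return round(solve_deploy - hack_deploy, 3)


def compare(a, b, noise_gap=NOISE_GAP):
    """Classify a vs b as 'higher', 'lower', or 'noise' (|a - b| <= noise_gap)."""
    gap = round(a - b, 3)
    if abs(gap) <= noise_gap:
        return "noise"
    return "higher" if gap > 0 else "lower"


# Example values from the Q14 table (seed 43, n=1 per row).
per_token = headline(0.143, 0.042)   # routeV per-token
vanilla = headline(0.101, 0.613)     # vanilla GRPO
prog_wide = headline(0.126, 0.101)   # routeV prog_wide, per-rollout
random_v = headline(0.109, 0.101)    # routeV random-V

print(compare(per_token, vanilla))   # large gap, well outside the noise band
print(compare(prog_wide, random_v))  # gap of 0.017, inside the noise band
```

The rounding to three decimals only mirrors the precision of the table; it is not a claim about how the real scripts format output.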

---

## Q14. routeV deploy on the recency-clean eval2 test set

<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
     for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
     60-step fast preset, Qwen3-4B, single-mode run_tests env, seed 43. NOT comparable to Q12
     (old n=64 eval, pre the 2026-05-23 grader-bug / recency-clean fix that moved base solve
     0.94->0.1). REGENERATE: `just results` (scripts/results_deploy.py) auto-discovers
     every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
     Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
     completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /
       _dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15) /
       _dir8_vanilla_s43 (job 16).
     pending: _dir8_routeV_actvote_authored_s43 (19) / _dir8_lora_routeV_authored_s43 (20) /
       _dir8_routeV_randomV_authored_s43 (21) / _dir8_baseline_s43 (23 RUNNING) /
       _dir8_noloophole_s43 (24). commit a35e7b2. -->

Q1-Q13 (archived in [results_eval1_archive.md](results_eval1_archive.md)) are on the OLD eval.
Q12's route2 numbers used n=64 prompts before the
recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All rows: seed 43,
60 steps, deploy = knob-off forward on test n=119. Headline = solve_deploy - hack_deploy.
Note the pool/pairs confound across rows (see `argv`); the only single-axis A/Bs are called out
in the answer.

Paper numbers (Ariahw et al. 2025) are reference context only -- paper uses longer
training + >512 tok/gen, NOT directly comparable to our 60-step fast preset numbers.

| condition                | paper solve | paper hack |  ours solve | ours hack | ours headline |
| :----------------------- | ----------: | ---------: | ----------: | --------: | ------------: |
| base model (no training) |       0.115 |         -- |       0.126 |     0.000 |        +0.126 |
| vanilla GRPO             |       0.149 |       high |       0.101 |     0.613 |        -0.512 |
| no-loophole ceiling      |       0.223 |      0.000 | queued (24) |     0.000 |            -- |

Our arms (seed 43, 60-step fast, recency-clean test n=119).
`hack_train` / `solve_train` = mean student rates over the last 5 training steps (L5; converged regime).
Note: the prog_wide pairs were contaminated (print-without-assert); job 28 replaces them with prog_wide_clean.

| arm                    | pairs                 | gran                      |  hack_deploy ↓ | solve_deploy ↑ | hack_train | solve_train |   headline |
| :--------------------- | :-------------------- | :------------------------ | -------------: | -------------: | ---------: | ----------: | ---------: |
| **routeV per-token**   | prog_wide*            | per-token                 |      **0.042** |      **0.143** |      0.675 |       0.294 | **+0.101** |
| routeV authored        | authored              | per-rollout               |          0.076 |          0.118 |      0.781 |       0.200 |     +0.042 |
| routeV prog_wide       | prog_wide*            | per-rollout               |          0.101 |          0.126 |      0.762 |       0.212 |     +0.025 |
| routeV random-V        | prog_wide* (Haar dir) | per-rollout               |          0.101 |          0.109 |      0.762 |       0.219 |     +0.008 |
| vanilla GRPO           | -                     | -                         |          0.613 |          0.101 |      0.744 |       0.231 |     -0.512 |
| routeV per-token clean | prog_wide_clean       | per-token                 |    queued (28) |                |            |             |            |
| routeV act_vote        | authored              | per-rollout (global vote) |    queued (19) |                |            |             |            |
| routeV LoRA-B          | authored              | per-rollout               | queued (20/25) |                |            |             |            |
| routeV random-V        | authored (Haar dir)   | per-rollout               | queued (21/26) |                |            |             |            |

\* 22 of the 30 prog_wide pairs were contaminated (print-without-assert, which encodes the grading flaw);
job 28 replaces them with prog_wide_clean (8 pairs, same axes 2/4/5/6).

**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.**

Vanilla GRPO converges to mostly hacking (hack 0.613; solve 0.101, no higher than the base
model's 0.126 in the table above, so no measurable solve improvement). Every routeV arm
suppresses hacking substantially:

- *H2 absorption confirmed:* even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by ~6x.
  The quarantine knob alone suppresses regardless of direction.
- *H4 marginal direction gain:* authored (0.076) < prog_wide (0.101) -- pair content adds
  ~2.5pp on top of absorption, which is inside the <=0.03 noise band at n=1. Authored
  direction matters for the margin at most, not the bulk.
- *Granularity matters most:* per-token 0.042 is a ~15x reduction vs vanilla (0.613), and
  it is the only arm whose solve exceeds the base model's (0.143 vs 0.126, also within noise).
- *Vanilla solve does not beat base solve (0.101 vs 0.126):* GRPO without intervention learns
  almost entirely hacks, not genuine solutions -- the hacking problem routeV targets is severe.
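
The reduction factors quoted in this answer are ratios of vanilla's hack_deploy to each arm's hack_deploy. A minimal sketch of that arithmetic, using only numbers from the arms table (the helper name `reduction_factor` is hypothetical):

```python
VANILLA_HACK = 0.613  # vanilla GRPO hack_deploy from the arms table

# hack_deploy per completed routeV arm, copied from the arms table.
HACK_DEPLOY = {
    "routeV per-token": 0.042,
    "routeV authored": 0.076,
    "routeV prog_wide": 0.101,
    "routeV random-V": 0.101,
}


def reduction_factor(arm_hack, baseline_hack=VANILLA_HACK):
    """How many times lower the arm's hack rate is than the baseline's."""
    return baseline_hack / arm_hack


for arm, hack in HACK_DEPLOY.items():
    print(f"{arm:18s} hack={hack:.3f}  reduction={reduction_factor(hack):.1f}x")

# Direction margin in percentage points: prog_wide minus authored.
margin_pp = round((HACK_DEPLOY["routeV prog_wide"] - HACK_DEPLOY["routeV authored"]) * 100, 1)
print(f"authored vs prog_wide margin: {margin_pp}pp")
```

This gives roughly 14.6x for per-token and 6.1x for random-V, which the prose rounds to 15x and 6x. All of these are single-seed point estimates.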

Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
(`out/diag/pairs_compare.csv`). Waiting on: base (job 23, running) and no-loophole
ceiling (job 24) to anchor the paper comparison table.

Training-`rout` note (not deploy): grad-cosine routing falls off a cliff (0.63@step6 -> 0.09@step20,
attributed to GRPO advantage flattening); act_vote sustains late (0.88@step17) by gating on
activations -- see RESEARCH_JOURNAL 2026-06-08. Whether that converts to deploy suppression is what job 19 tests.

## Dynamics note (sizing the convergence test)

Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
**plateaus by step ~13-16**; `gt_s` (solve) stays **noisy-flat at ~0.1-0.5 for the
whole run and never climbs**. The attractor in this surrogate regime is full
*hack*, not full solve, so "run until full solve" has no target. The
convergence question is therefore: once vanilla hack plateaus (~step 15), does
the projected arm stay below it or catch up? A 60-step run (~2.2h at g8) covers
45 steps past the plateau, 3x the ~15 steps needed to reach it; a 1000-step run
(~36h) is wasteful.
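
The sizing argument is simple arithmetic, sketched below. It assumes wall-clock time scales linearly with step count from the 60-step (~2.2h at g8) reference and takes step 15 as the plateau; both helper names are hypothetical.

```python
REF_STEPS = 60      # reference run length from the note above
REF_HOURS = 2.2     # approximate wall-clock for that run at g8
PLATEAU_STEP = 15   # approximate step where vanilla hack plateaus


def hours_for_steps(steps, ref_steps=REF_STEPS, ref_hours=REF_HOURS):
    """Linear wall-clock extrapolation from the reference run (an assumption)."""
    return steps * ref_hours / ref_steps


def post_plateau_ratio(total_steps, plateau_step=PLATEAU_STEP):
    """Steps run after the plateau, as a multiple of the steps needed to reach it."""
    return (total_steps - plateau_step) / plateau_step


for steps in (60, 1000):
    print(
        f"{steps:5d} steps: ~{hours_for_steps(steps):.1f}h, "
        f"{post_plateau_ratio(steps):.1f}x past the plateau"
    )
```

Under these assumptions the 60-step run observes 3x the plateau length after the plateau, while the 1000-step run costs roughly 36-37h for about 66x, which is the sense in which it is wasteful for this question.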

## Open / queued (no result yet)

- **convergence at ≥3 seeds (#121)**: the n=1 seed-42 run (Q11) shows the gap
  closing by step 60, but that could be a seed-42 high-hack draw. Need 2+ more
  seeds before concluding whether the suppression erodes or survives.
- **pairset content at ≥3 seeds (#122)**: Q10's mechanism>framing>placebo
  ordering is n=1 per row; replicate `prog_wide` and the placebo on 2+ seeds.
- **route arm at scale (#182)**: running; validates routing's ablated-eval
  hack<kept on Qwen3-4B before the 3-way none/erase/route cells (#130).
- **k-slice (k=1/2/5)**: only smoke-tested, no 4B results.
- **Stage 2/3 cross-*mechanism* generalisation**: the load-bearing test --
  extract v_hack from hack A, check it stops the *unknown* hack B the student
  would otherwise learn. Q10 (held-out *framing*) is a weaker cousin.