mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts

Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-09 13:34:50 +00:00

19 KiB

Raw Blame History

Writeup spec -- gradient routing vs RL reward hacking

Status (2026-06-06): method is route2b (banded per-rollout/per-token gate); erase is DROPPED from the paper (predecessor variant, no narrative cost). The workshop paper = ONE working method (route2b), shown better than the vanilla baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).

Workshop paper scope (the whole thing):

Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through a pair-calibrated band into a deletable quarantine knob.
Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
Ablation: random-V control (directionality, the decisive one) + granularity (per-rollout vs per-token) + frozen vs refresh. No erase arm.
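
The banded gate in the Method line can be sketched as below. This is a minimal illustration, not the repo's implementation: the linear ramp between the band edges, the function names (`cosine`, `band_gate`, `route`), and the plain-list gradient representation are all assumptions made for the example.

```python
import math

def cosine(g, v):
    """Cosine similarity between two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(g, v))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_g == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_g * norm_v)

def band_gate(cos, lower, upper):
    """Routing weight in [0, 1]: 0 below `lower`, 1 above `upper`.

    ASSUMPTION: a linear ramp inside the band; the source only says the
    band is pair-calibrated, not what shape the gate has.
    """
    if not lower < upper:
        raise ValueError("band must have positive width")
    t = (cos - lower) / (upper - lower)
    return min(1.0, max(0.0, t))

def route(g, v_grad, lower, upper):
    """Split gradient g into (kept, quarantined) parts by the gate weight."""
    w = band_gate(cosine(g, v_grad), lower, upper)
    kept = [(1.0 - w) * x for x in g]
    quarantined = [w * x for x in g]
    return kept, quarantined
```

Under this sketch a rollout whose gradient is fully aligned with `v_grad` lands entirely in the quarantined part, and deleting that part at deploy removes its contribution.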

Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw post), then a workshop paper (NeurIPS/ICLR alignment or interpretability workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V ablation comes back clean.

The one-paragraph story

Labs already do RL on coding/agentic tasks and the model learns to exploit grader flaws. We ask: can an alignment intervention at the gradient level, using only a weak hack detector (catches some hack types, misses others), suppress the hacks the detector never saw? We extract a hack direction from a handful of hand-paired clean/hack completions (the "weak detector for hack A"), then during GRPO we route any rollout whose gradient aligns with that direction into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping partition) the routed model deploys at lower hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is the held-out modes: build the direction from 2 of the 4 loopholes, measure suppression on the other 2.

Claims (compress to <=3, ranked by how much evidence we have)

C1 (primary, existence -> systematic). Routing the GRPO gradient against a weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.

Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 84/74/72 (vanilla s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the preregistered H1 bar); honest framing is "reduces hack at comparable solve", with the pp gap stated as measured.

C2 (mechanism, the no-cheat payload). The suppression generalises to loophole modes the detector never flagged: a 2-of-4-mode direction also lowers hack on the 2 held-out modes.

Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack, tasks #106/#107). This is the claim a referee will attack hardest and the one that makes the result interesting rather than a tautology.
Confidence: untested at writeup time. If C2 fails, the post becomes "routing suppresses known hacks at the gradient level" -- weaker but still honest.

C3 (specificity / not-a-regularizer). The effect needs the direction, not just the act of carving a rank-k knob out of the adapter, and not just quarantining gradient mass. A Haar-random v_grad of matched per-module rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce the deploy hack-drop. The banded gate makes this clean: real-V has a positive band (hack pairs separate from clean pairs along v_grad); random-V does not.

Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched to the real-V run so the control quarantines comparable mass but in an arbitrary direction.
Confidence: untested for route2b. The decisive control both gpt-5.5 and the brainstorm flagged. Must land before we claim directional specificity.
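
One way to build the matched random-V control is sketched below, for illustration only. It assumes the per-module direction is a list of `rank` row vectors; the helper names are hypothetical. Gaussian draws followed by Gram-Schmidt give a Haar-distributed orthonormal frame, which is then rescaled row by row to the real-V norms.

```python
import math
import random

def haar_random_frame(dim, rank, seed):
    """Sample `rank` orthonormal vectors in R^dim, Haar-distributed.

    Draw i.i.d. Gaussian vectors and orthonormalise them with Gram-Schmidt.
    """
    if rank > dim:
        raise ValueError("rank cannot exceed dim")
    rng = random.Random(seed)
    frame = []
    while len(frame) < rank:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in frame:
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # resample in the (measure-zero) degenerate case
            frame.append([a / norm for a in v])
    return frame

def match_norms(frame, reference):
    """Rescale each random unit vector to the norm of the matching real-V row."""
    matched = []
    for u, ref in zip(frame, reference):
        ref_norm = math.sqrt(sum(a * a for a in ref))
        matched.append([a * ref_norm for a in u])
    return matched
```

Matching rank and norm this way controls the size of the direction only; matching the routed mass (frout) is a separate step that this sketch does not cover.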

Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)

Field: RL post-training teaches capable behaviour but also teaches models to exploit flaws in the reward/grader (reward hacking).
Today: interventions act on the reward or the advantage (e.g. Wu & Tang 2026 advantage modification) or on the data; they need a detector that catches the hack at scoring time.
Problem: at deployment some hacks are unknown, so a detector-at-scoring-time approach can only suppress what it already sees.
Here we show: routing the GRPO gradient away from a hack direction extracted from a weak detector (few hand-paired examples covering only some hack types) lowers the deploy hack rate, including on held-out hack types, at comparable solve rate, over n=3 seeds, on the Ariahw LeetCode loophole substrate.
Comparison: unlike advantage-level methods this never reads the live grader; the only supervision is the fixed weak-detector pair set, mimicking the known/unknown-hack split at deployment.
Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis (AntiPaSTO) gives a deletable quarantine knob.
Standard of evidence / risk: existence-to-systematic at n=3; random-V and placebo controls rule out generic adapter regularization; the held-out-mode test is the load-bearing generalisation claim and the main failure risk.

Paper artifacts -- the goal tracker (durable; this is what we are building)

This is the canonical list of what the workshop paper/blog needs. Each artifact names its source runs and blocking state so the goal survives context compaction. Status legend: [x] done, [/] data landing, [ ] not started. Each finished run writes per_mode_deploy.json + train.safetensors under out/runs/_/; deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.

A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3 band. Prototype exists: out/figs/dyn_sub4*.png (just dyn). [/] blocked on the n=3 vanilla band (jobs 74 s42 + 84 s41 [re-added from killed 79, p7 so it runs ahead of the A3 erase rows]; 72 s43 done; route2 68/69/70 done).

A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3 seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated. [/] same blocker as A1 (74, 84).
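
The A2 statistics (mean +/- SEM and a paired test across seeds) could be computed as in this sketch. The function names are hypothetical, and the numbers in the usage note are placeholders to show the call shape, not results.

```python
import math
import statistics

def mean_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

def paired_t(arm, baseline):
    """Paired t statistic and degrees of freedom for per-seed pairs.

    `arm[i]` and `baseline[i]` must come from the same seed.
    """
    diffs = [a - b for a, b in zip(arm, baseline)]
    mean_diff, sem_diff = mean_sem(diffs)
    if sem_diff == 0.0:
        raise ValueError("all paired differences are identical")
    return mean_diff / sem_diff, len(diffs) - 1

# PLACEHOLDER inputs, not measured values:
# t, df = paired_t([0.1, 0.2, 0.3], [0.5, 0.7, 0.6])
```

With n=3 seeds the paired test has 2 degrees of freedom, so the two-sided critical value at alpha=0.05 is about 4.30. This is why the table should state alpha and the test explicitly.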

A3 -- Ablation table (what each component buys). One row per arm at matched seed/preset, deploy hack + solve:

vanilla (no intervention) -> 129/131/132
route2b per-rollout (the method) -> 134 (s43), +41/42 if it wins
route2b per-token (granularity ablation)-> 135 (s43)
random-V route2b (direction arbitrary) -> Q3, queue at winning granularity [control: should NOT work]
route2b frozen vs refresh-5 -> refresh is default; frozen = one extra run if gap is interesting

[ ] blocked on 134/135 landing, then the random-V control. This is the "filling out ablations" table. Erase row removed (arm dropped from paper).

A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running). [/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat (out/figs/dyn_longrun_200.{png,csv}, fig:longrun in main.tex). vanilla learns the cheat to ~0.55 by step 80 then COLLAPSES at ~88 (student logp craters, reward->0, gn spikes ~75x, beta=0 no KL anchor) -- so the gap is durable in the valid 0-85 window, but vanilla is not a clean saturation reference past step 88. Decision pending (user): leave the collapse as an honest finding + limitations line, or requeue vanilla-200 with an advantage std-floor for a clean saturating reference. Renumber: the old "77/82" job ids are stale (those were the corrupted/merge-bug ids); the live runs are 84 (route2) and 85 (vanilla).

A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes. [ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec 20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.

A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion, clean completion for all 4 modes. [x] done -- blog appendix (docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.

A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width (Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9), pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.

Next action when 74+84 land: read each per_mode_deploy.json, run just dyn, fill A1/A2, and append a journal entry. Then queue A5 (the gap).

Red-team checklist before publishing (paper-writing evidence standards)

n=3 deploy gap stated with SEM, not cherry-picked seed.
random-V (Q3) does NOT reproduce the drop at matched frout (else it is mass-quarantine / regularization, C3 dies).
held-out-mode suppression measured (C2), reported even if it fails.
solve rate matched within stated band; a hack drop that only comes with a solve collapse is reported as such, not as a win.
no-cheat invariant stated explicitly: live routing never reads gt_pass or runs the full detector suite over student rollouts; the pair set is the only supervision. (Promote to README/spec, plan item #114.)
[/] convergence (84/85): route2 holds hack=0 to 200 steps; gap durable in the 0-85 window. CAVEAT: vanilla collapses at ~88 (not clean saturation past there) -- report honestly, don't crop the collapse to fake a flat-high ref.
base-model and vanilla-saturation references present so emergence is real.

Open editorial decisions

Project/repo name: projected_grpo is now a misnomer (method is routing, not projection). Candidate: gradient_quarantine. Decide before the public repo link goes in the post. (Retitle docs first; rename package/repo only if we ship the code link.)
Re-headline the blog draft from erase to route2 (user: clear even at n=1).
Workshop vs blog-only: gate on C2 landing.

2026-06-09 eval2 plot regeneration UAT

[x] Deleted all stale CSVs under out/figs/ and regenerated the completed per-token routeV versus latest vanilla comparison without changing pueue jobs. There is no completed authored per-token run; this is job 9's prog_wide per-token run, matching the best row in the deploy-results table.

Sources:

logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log
logs/20260608T224659_fast_vanilla_seed43_dir8_vanilla_s43.log

Artifacts:

estimator	arm	hack	solve
fixed monitoring subset, final logged point, n=32	routeV/per-token prog_wide	0.000	0.062
fixed monitoring subset, final logged point, n=32	vanilla	0.594	0.031
final held-out deploy eval, n=119	routeV/per-token prog_wide	0.042	0.143
final held-out deploy eval, n=119	vanilla	0.613	0.101
final held-out deploy eval, n=119	base model, zero steps	0.000	0.126

Verification:

The only remaining out/figs/**/*.csv is the current reproducibility CSV.
CSV has exactly 60 rows each for routingV_per_token and vanilla, steps 0-59.
Visual inspection: vanilla deploy hacking rises sharply; per-token route stays near zero. Per-token route does not show convincing useful learning: final held-out solve rises only from 0.126 (base model) to 0.143, a gain of 0.017 that is below one binomial standard error (about 0.03) at n=119.
Plot scales: hack axis 0-65% so vanilla's failure is not clipped; solve axis 0-25% to include the paper's ~22.3% no-loophole ceiling. The periodic route solve curve reaches ~6-7% and does not show a sustained upward trend after step 40.
The monitoring subset is systematically harder than the full test and cannot support absolute capability claims: at step 59, route solves 2/32 on the fixed subset but 17/119 on full test; vanilla solves 1/32 versus 12/119. The old plot title incorrectly said n=64; it now states fixed n=32. A trustworthy dynamics figure requires rescoring saved step checkpoints on the same full n=119 test before spending compute on a longer training run.
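
The standard-error check in the verification notes above can be reproduced with a few lines. This sketch uses only the rates and sample sizes stated there; `binomial_se` is a hypothetical helper name.

```python
import math

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

N_FULL = 119    # full paper-test size
N_SUBSET = 32   # fixed monitoring subset size
base_solve, route_solve = 0.126, 0.143

se_full = binomial_se(base_solve, N_FULL)
se_subset = binomial_se(base_solve, N_SUBSET)
gain = route_solve - base_solve

print(f"gain={gain:.3f} se_full={se_full:.3f} se_subset={se_subset:.3f}")
```

The gain is roughly half of the full-test standard error. At n=32 the standard error at the same rate is close to twice as large, which is one reason the monitoring subset cannot support absolute capability claims.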

Before running on Modal, replace the noisy fixed-random n=32 monitoring subset with one deterministic representative n=64 subset. Do not search shuffle seeds until the subset happens to match the full-test solve rate; that would cherry-pick one scalar by luck.

Build the monitoring subset once:

Evaluate the base model on all 119 paper-test prompts.
Stratify prompts by base pass/fail.
Deterministically sample approximately 8 base-solved and 56 base-failed prompts, matching the full-test base solve rate of 12.6%.
Freeze the prompt IDs and generation seed. Every arm and training seed uses this identical monitoring subset.
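
The steps above can be sketched as follows. `build_monitoring_subset`, `stable_key`, and the `salt` value are hypothetical names; `base_pass` is assumed to be a mapping from prompt ID to the base model's pass/fail result. The ordering comes from a fixed-salt hash rather than a shuffle seed, so there is no seed to search over.

```python
import hashlib

def stable_key(prompt_id, salt):
    """Deterministic sort key that does not depend on Python hash randomisation."""
    return hashlib.sha256(f"{salt}:{prompt_id}".encode()).hexdigest()

def build_monitoring_subset(base_pass, n_total=64, salt="monitor-v1"):
    """Pick n_total prompt IDs, stratified by base pass/fail.

    The solved/failed split matches the full-set base solve rate
    (rounded to the nearest prompt).
    """
    def ordered(ids):
        return sorted(ids, key=lambda p: stable_key(p, salt))

    solved = ordered(p for p, ok in base_pass.items() if ok)
    failed = ordered(p for p, ok in base_pass.items() if not ok)
    n_solved = round(len(solved) / len(base_pass) * n_total)
    n_failed = n_total - n_solved
    if n_solved > len(solved) or n_failed > len(failed):
        raise ValueError("not enough prompts in one stratum")
    # Sorted output so the frozen ID list is stable on disk.
    return sorted(solved[:n_solved] + failed[:n_failed])
```

With 119 prompts at a 12.6% base solve rate this yields 8 base-solved and 56 base-failed prompts, matching the split named above. The returned list should be written to disk once and reused by every arm and seed.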

Evaluate the n=64 monitoring subset only at steps 0, 20, 40, and 59. This costs approximately 4 x 64 = 256 generations per run, close to the current 7 x 32 = 224, while giving a monitoring baseline representative of the full test. Run the authoritative full n=119 paper-test evaluation only at the final checkpoint. Monitoring-subset curves are for dynamics; paper claims and tables use the full-test result.

Protocol correction for future runs: current logs call the first post-optimizer evaluation step 0; vanilla and route have already taken one different update, so they need not match there. Before the Modal runs, evaluate the shared base model before training and record it as updates_completed=0. Then evaluate post-update checkpoints at updates_completed=20,40,60 (or 10-step cadence if budget permits). Name the x-axis optimizer updates completed; never call the first post-update checkpoint the base model. Do not change train.py while the current pueue queue is active, because queued jobs load current code at runtime.

Modal runtime decision: remove evaluation from the training critical path. Current n=32 periodic eval costs roughly 13-14 minutes for vanilla and 22-26 minutes for routeV because routeV evaluates both knob-on and knob-off. Seven routeV monitoring evaluations add about 2.7 hours, before the final n=119 eval.

Simplified protocol:

Training jobs do no periodic eval by default. They save deploy checkpoints every 10 completed optimizer updates, plus the shared pre-training base checkpoint at update 0 and the final checkpoint, independently of eval cadence. The ~2.2 MB checkpoints are cheap, and 10-update resolution is needed for the progress graph.
A separate evaluation job scores selected checkpoints. Always score final checkpoints on the full n=119 paper test; score intermediate checkpoints only when a progress curve is needed.
Progress evaluation scores both knob states for routeV. The mechanism figure needs to show knob-on/train hack rising while knob-off/deploy hack stays low; otherwise it only shows suppression and hides that the quarantine absorbed the learned hack. Vanilla needs one pass because train and deploy are identical.
Batch evaluation prompts. eval_hack_solve currently calls model.generate once per prompt despite running under torch.no_grad(). Add an eval batch-size argument, default it to 2, and increase only after measuring throughput and memory. Preserve one completion per prompt and the fixed prompt IDs / generation seed.
Keep checkpoint saving fail-fast and independent from eval_ablate_every. Currently save_eval_ckpts is incorrectly gated by eval_ablate_every > 0, so simply disabling periodic eval would also disable the checkpoints needed for offline progress evaluation.
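
Two pieces of the protocol above are easy to get wrong, so here is a minimal sketch of each. Both helper names are hypothetical. `checkpoint_updates` deliberately takes no eval-cadence argument, so disabling periodic eval cannot disable saving.

```python
def checkpoint_updates(total_updates, save_ckpt_every):
    """Completed-update counts at which to save a deploy checkpoint.

    Always includes 0 (shared pre-training base) and the final update.
    """
    if save_ckpt_every <= 0:
        raise ValueError("save_ckpt_every must be positive")
    steps = set(range(0, total_updates + 1, save_ckpt_every))
    steps.add(total_updates)
    return sorted(steps)

def batched(items, batch_size):
    """Yield consecutive slices of `items`, preserving order.

    Each prompt still appears exactly once (one completion per prompt).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For a 60-update run with `save_ckpt_every=10` the schedule is 0, 10, ..., 60. An evaluator would loop over `batched(prompt_ids, eval_batch_size)` and pass each slice to a single generate call. Padding and attention-mask handling for that batched call are not shown here and need care with decoder-only models.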

Locked implementation defaults:

eval_ablate_every=0: defer the old 10-step periodic eval by default.
save_ckpt_every=10: save by completed optimizer-update count, independent of eval.
eval_batch_size=2: batched offline/final evaluation default.
Offline progress command scores checkpoints 0, 10, 20, ..., final and writes one canonical eval-curve artifact for plotting. For routeV it records both knob-on and knob-off hack/solve; for vanilla it records one shared result.
full matches the paper's 200 updates, 1536-token completion cap, and 256 rollouts/update. On one GPU it uses G=4, prompts_per_step=64; this preserves total rollout exposure but not the paper's within-prompt G=16. It remains pure on-policy (teacher_pool_dir=None).
Prompt length is never silently filtered. Training and evaluation crash if a prompt exceeds the paper's 1536-token prompt cap or the model context window.
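
As an illustration of the locked defaults, the sketch below collects them in a frozen dataclass and adds a fail-fast prompt-length guard. The class and function names are hypothetical, and the real settings live in train_config.py, whose layout is not shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalDefaults:
    eval_ablate_every: int = 0   # no periodic eval on the training path
    save_ckpt_every: int = 10    # completed optimizer updates between saves
    eval_batch_size: int = 2     # raise only after measuring throughput/memory

def rollouts_per_update(group_size, prompts_per_step):
    """Total rollouts consumed by one optimizer update."""
    return group_size * prompts_per_step

def check_prompt_length(n_prompt_tokens, prompt_cap=1536, context_window=None):
    """Raise instead of silently filtering an over-long prompt."""
    if n_prompt_tokens > prompt_cap:
        raise ValueError(f"prompt has {n_prompt_tokens} tokens, cap is {prompt_cap}")
    if context_window is not None and n_prompt_tokens > context_window:
        raise ValueError(f"prompt exceeds context window {context_window}")
    return n_prompt_tokens
```

With G=4 and prompts_per_step=64, `rollouts_per_update` gives the 256 rollouts/update stated for the full preset.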

Implemented and smoke-tested on 2026-06-09:

RouteV and vanilla smoke runs each wrote paired adapter checkpoints at completed updates 0, 10, 20, and 30.
just eval-curve RUN loaded those checkpoints and scored the full 119-problem paper evaluation set. RouteV scored both knob states; vanilla scored once.
UAT artifacts: routeV checkpoint curve and vanilla checkpoint curve.
Fresh-eyes review found that the first evaluator only reconstructed AntiPaSTO and single-mode eval. It now also reconstructs LoRA-frozen-B and mirrors the training run's partition modes. The LoRA routeV checkpoint curve is the runtime proof.
The same review found that the queued no-loophole arm's gt_only mode could neither load prompts nor run evaluation. Its exact smoke path and offline checkpoint curve now pass: gt-only checkpoint curve.
These are tiny-random-model runtime proofs, not scientific results.

Whether 60 updates are enough to learn solving remains unknown. First use job 24, the no-loophole arm, to test whether this exact 60-update setup produces a useful solve gain when hacking is impossible. Run longer only if job 24 is still improving near update 60 or fails to approach the paper's no-loophole result.

Canonical full-test endpoint table

These are the authoritative paper-test endpoint numbers. Do not infer them from the n=32 monitoring curves, and do not normalize those curves.

condition	solve	hack
base model (paper: 0.115)	0.126	0.000
vanilla GRPO (paper: 0.149)	0.101	0.613
vGROUT routeV best, per-token	0.143	0.042
no-loophole ceiling (paper: 0.223)	queued, job 24	0.000

Current read: routeV per-token nearly eliminates the vanilla hack increase and preserves base-model solve. Its solve is numerically +1.7pp over base and +4.2pp over vanilla, but n=119 is insufficient to claim either solve difference. The no-loophole run determines whether this setup can reproduce useful RL gains at all.
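
A rough check of this read, using only the endpoint rates in the table, is a pooled two-proportion z statistic. This is a sketch under simplifying assumptions: it treats the 119 prompts as independent draws per arm, although the arms share prompts and a paired test would be more appropriate, and it reflects n=1 seed per arm. `two_proportion_z` is a hypothetical helper name.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z statistic for rates p1, p2 on n1, n2 trials."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

N = 119
z_solve_vs_base = two_proportion_z(0.143, 0.126, N, N)
z_solve_vs_vanilla = two_proportion_z(0.143, 0.101, N, N)
z_hack_vanilla_vs_route = two_proportion_z(0.613, 0.042, N, N)

print(f"solve vs base:    z={z_solve_vs_base:.2f}")
print(f"solve vs vanilla: z={z_solve_vs_vanilla:.2f}")
print(f"hack gap:         z={z_hack_vanilla_vs_route:.2f}")
```

Both solve statistics fall well inside the usual |z| < 1.96 band, while the hack gap is far outside it. That is consistent with claiming hack suppression but not a solve difference at this sample size.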

Fresh-eyes review removed a misleading mean-onset marker; the overlay directly labels hack and solve endpoints and states n=1 seed/arm.
plot_dynamics.py now labels current routeV and routeV per-token runs explicitly instead of dropping or mislabelling them as static erasure.
