mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 20:21:41 +08:00

wassname 8e6eace56b fix: rename 4 canonical LeetCode function names in authored/clean pairsets

singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.

Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.

Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-09 09:23:33 +00:00


Results, organized by the question each run answers

Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out TEST set (ids>=3243, base solve ~0.1, n=119), single-mode run_tests env, Qwen3-4B. Regenerate the table with just results (scripts/results_deploy.py, auto-discovers every out/runs/*/deploy_test.json); just results gives the live training-hack table.
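The auto-discovery step can be pictured with a minimal sketch. This is not the actual scripts/results_deploy.py; the JSON keys hack_deploy and solve_deploy and the helper name collect_deploy_results are assumptions made for illustration, and only the out/runs/*/deploy_test.json glob comes from the text above.

```python
import json
from pathlib import Path


def collect_deploy_results(runs_dir="out/runs"):
    """Return one row per run that has a deploy_test.json, best headline first."""
    rows = []
    for path in sorted(Path(runs_dir).glob("*/deploy_test.json")):
        record = json.loads(path.read_text())
        # ASSUMPTION: hypothetical key names; the real file layout may differ.
        hack = float(record["hack_deploy"])
        solve = float(record["solve_deploy"])
        rows.append(
            {
                "run": path.parent.name,
                "hack_deploy": hack,
                "solve_deploy": solve,
                # Headline metric as defined in this document.
                "headline": round(solve - hack, 3),
            }
        )
    return sorted(rows, key=lambda r: r["headline"], reverse=True)
```

Runs without a deploy_test.json are simply skipped, which is how queued jobs would stay out of such a table until they finish.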

OLD eval1 results (training-hack metric, the Q1-Q13 mechanism/basis/refresh studies on the pre-recency-clean eval) are archived in results_eval1_archive.md.

How to read this

Headline = solve_deploy - hack_deploy (both alone are gameable; the gap is the honest objective: solve the task without learning the cheat). Higher is better.
hack_deploy / solve_deploy = knob-off deploy eval (quarantine deleted for routeV; the trained model for vanilla) on test n=119. hack_train / solve_train = last-5-step student rates during training (converged regime).
All rows n=1 (single seed 43) unless noted; treat gaps <=0.03 as noise.
Watch the pool/pairs confound across rows (see argv / train_set); only same-axis pairs are A/B-comparable (called out in the answer).
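The reading rules above reduce to three small computations, sketched here under stated assumptions. The function names are hypothetical. The 0.03 noise band and the last-5-step window come from the text; rounding to three decimals is an added assumption so that float error does not flip a borderline comparison.

```python
NOISE_BAND = 0.03  # from the text: gaps <= 0.03 are treated as noise at n=1


def headline(solve_deploy, hack_deploy):
    """Headline metric: solve rate minus hack rate on the deploy eval."""
    return round(solve_deploy - hack_deploy, 3)


def last_k_mean(rates, k=5):
    """Mean of the last k per-step rates (the hack_train / solve_train columns)."""
    tail = rates[-k:]
    return sum(tail) / len(tail)


def is_distinguishable(a, b, band=NOISE_BAND):
    """True only if two rates differ by more than the noise band."""
    return round(abs(a - b), 3) > band
```

For example, headline(0.143, 0.042) gives 0.101, and is_distinguishable(0.076, 0.101) is False because the 0.025 gap sits inside the band.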

Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)

Q1-Q13 (archived in results_eval1_archive.md) are on the OLD eval. Q12's route2 numbers used n=64 prompts before the recency-clean fix; the env is now single-mode run_tests and the held-out test set is recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All rows: seed 43, 60 steps, deploy = knob-off forward on test n=119. Headline = solve_deploy - hack_deploy. Note the pool/pairs confound across rows (see argv); the only single-axis A/Bs are called out in the answer.

Paper numbers (Ariahw et al. 2025) are reference context only -- the paper uses longer training and >512 tok/gen, so they are NOT directly comparable to our 60-step fast-preset numbers.

condition	paper solve	paper hack	ours solve	ours hack	ours headline
base model (no training)	0.115	--	0.126	0.000	+0.126
vanilla GRPO	0.149	high	0.101	0.613	-0.512
no-loophole ceiling	0.223	0.000	queued (24)	0.000	--

Our arms (seed 43, 60-step fast, recency-clean test n=119). hack_train / solve_train = mean student rates over the last 5 training steps (converged regime). Note: the prog_wide pairs were contaminated (print-without-assert); job 28 replaces them with prog_wide_clean.

arm	pairs	gran	hack_deploy ↓	solve_deploy ↑	hack_train	solve_train	headline
routeV per-token	prog_wide*	per-token	0.042	0.143	0.675	0.294	+0.101
routeV authored	authored	per-rollout	0.076	0.118	0.781	0.200	+0.042
routeV prog_wide	prog_wide*	per-rollout	0.101	0.126	0.762	0.212	+0.025
routeV random-V	prog_wide* (Haar dir)	per-rollout	0.101	0.109	0.762	0.219	+0.008
vanilla GRPO	-	-	0.613	0.101	0.744	0.231	-0.512
routeV per-token clean	prog_wide_clean	per-token	queued (28)
routeV act_vote	authored	per-rollout (global vote)	queued (19)
routeV LoRA-B	authored	per-rollout	queued (20/25)
routeV random-V	authored (Haar dir)	per-rollout	queued (21/26)

* 22 of the 30 prog_wide pairs were contaminated (print-without-assert, which encodes the grading flaw); the pairset is replaced by prog_wide_clean (8 pairs, same axes 2/4/5/6) for job 28.

Answer: vanilla hack_deploy=0.613 -- suppression is real and large.

Vanilla GRPO converges to mostly hacking (hack 0.613; solve 0.101 vs base 0.126, a gap inside the 0.03 noise band, so essentially zero solve improvement). Every routeV arm suppresses hacking substantially:

H2 absorption confirmed: even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by 6x. The quarantine knob alone suppresses regardless of direction.
H4 marginal direction gain: authored (0.076) < prog_wide (0.101) -- pair content adds ~2.5pp on top of absorption, which is inside the 0.03 noise band at n=1. If it replicates, authored direction matters for the margin, not the bulk.
Granularity matters most: per-token 0.042 is a ~15x reduction vs vanilla (0.613), and per-token is the only arm whose solve exceeds base (0.143 vs base 0.126; vanilla 0.101), though that 0.017 gap is inside the noise band.
Vanilla solve ≈ base solve (0.101 vs 0.126): GRPO without intervention learns almost entirely hacks, not genuine solutions -- the problem the intervention targets is severe.
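The suppression factors quoted in the bullets can be rechecked from the hack_deploy column of the table. The sketch below uses only numbers from that table; the helper name suppression_factor is hypothetical.

```python
VANILLA_HACK = 0.613  # vanilla GRPO hack_deploy from the table

# hack_deploy per arm, copied from the table above.
hack_deploy = {
    "routeV per-token": 0.042,
    "routeV authored": 0.076,
    "routeV prog_wide": 0.101,
    "routeV random-V": 0.101,
}


def suppression_factor(arm_hack, baseline_hack=VANILLA_HACK):
    """How many times lower an arm's hack rate is than the baseline's."""
    return baseline_hack / arm_hack


factors = {arm: round(suppression_factor(h), 1) for arm, h in hack_deploy.items()}

# Gap between the authored and prog_wide arms, in percentage points.
authored_gain_pp = round(
    (hack_deploy["routeV prog_wide"] - hack_deploy["routeV authored"]) * 100, 1
)

for arm, factor in factors.items():
    print(f"{arm}: {factor}x lower hack_deploy than vanilla")
print(f"authored vs prog_wide: {authored_gain_pp}pp")
```

This reproduces the roughly 6x (random-V and prog_wide) and roughly 15x (per-token) reductions and the 2.5pp authored margin.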

Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20 (out/diag/pairs_compare.csv). Waiting on: base (job 23, running) and no-loophole ceiling (job 24) to anchor the paper comparison table.

Training-routing note (not deploy): grad-cosine routing falls off a cliff (0.63@step6 -> 0.09@step20, GRPO advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see RESEARCH_JOURNAL 2026-06-08. Whether that converts to deploy suppression is what job 19 tests.

Dynamics note (sizing the convergence test)

Per-step trajectories (mix=0.125 g8, seed 41): hack_s rises 0→~0.6-0.75 and plateaus by step ~13-16; gt_s (solve) stays noisy-flat at ~0.1-0.5 for the whole run and never climbs. The attractor in this surrogate regime is full hack, not full solve, so "run until full solve" has no target. The convergence question is therefore: once vanilla hack plateaus (~step 15), does the projected arm stay below it or catch up? A 60-step run (~2.2h at g8) extends about 3 plateau-lengths past the plateau (45 steps beyond step 15); a 1000-step run (~36h) is wasteful.
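The sizing arithmetic can be written out as a short sketch. It assumes wall-clock time scales linearly with step count, which the text implies but does not state. The function names are hypothetical; the 60-step / 2.2h reference and the step-15 plateau come from the paragraph above.

```python
def hours_for(steps, ref_steps=60, ref_hours=2.2):
    """Estimated wall-clock hours, assuming cost is linear in step count."""
    return steps * ref_hours / ref_steps


def plateau_lengths_past(total_steps, plateau_step):
    """How many plateau-lengths a run continues after the plateau is reached."""
    return (total_steps - plateau_step) / plateau_step


print(round(hours_for(1000), 1))     # estimated hours for a 1000-step run
print(plateau_lengths_past(60, 15))  # coverage of a 60-step run past step 15
```

Under that linear assumption a 1000-step run costs roughly 16 to 17 times the 60-step budget, while the 60-step run already covers three plateau-lengths of post-plateau behaviour.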

Open / queued (no result yet)

convergence at ≥3 seeds (#121): the n=1 seed-42 run (Q11) shows the gap closing by step 60, but that could be a seed-42 high-hack draw. Need 2+ more seeds before concluding the suppression erodes vs survives.
pairset content at ≥3 seeds (#122): Q10's mechanism>framing>placebo ordering is n=1 per row; replicate prog_wide and the placebo on 2+ seeds.
route arm at scale (#182): running; validates routing's ablated-eval hack<kept on Qwen3-4B before the 3-way none/erase/route cells (#130).
k-slice (k=1/2/5): only smoke-tested, no 4B results.
Stage 2/3 cross-mechanism generalisation: the load-bearing test -- extract v_hack from hack A, check it stops the unknown hack B the student would otherwise learn. Q10 (held-out framing) is a weaker cousin.
