evil_MoE/scripts at d393e119e03126963e5fbf83e4fc7742e8d39ebd

wassname/evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:15:58 +08:00

wassname d393e119e0 viz: reference = Ariahw paper (oracle upper bound), not SGTM

Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-09 10:03:05 +00:00

..

retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational

2026-06-05 09:21:41 +00:00

build_combined_pool.py

reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)

2026-05-30 03:52:24 +00:00

build_runtests_pool.py

fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps

2026-06-07 11:01:31 +00:00

build_substrate.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

diag_cosine_dist.py

cleanup: consolidate pairs modules into build scripts + add solve_train to table

2026-06-09 09:17:42 +00:00

diag_pairs_compare.py

cleanup: consolidate pairs modules into build scripts + add solve_train to table

2026-06-09 09:17:42 +00:00

make_random_vhack.py

cleanup: trim 2 stale provenance/train-of-thought comments

2026-06-03 00:25:22 +00:00

pairs_from_rollouts.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

pairset_build_authored.py

fix: rename 4 canonical LeetCode function names in authored/clean pairsets

2026-06-09 09:23:33 +00:00

pairset_build_intent.py

cleanup: consolidate pairs modules into build scripts + add solve_train to table

2026-06-09 09:17:42 +00:00

pairset_build_progsets.py

refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback

2026-06-09 08:09:09 +00:00

plot_deploy_overlay.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

plot_dynamics.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

plot_emergence.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

plot_floor_ceiling.py

viz: reference = Ariahw paper (oracle upper bound), not SGTM

2026-06-09 10:03:05 +00:00

plot_substrate.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

probe_distill.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

probe_plot_stack.py

refactor: move 5 leaf entrypoints src/ -> scripts/ (src is now library-only)

2026-06-03 00:23:56 +00:00

rescore_deploy.py

fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)

2026-06-07 11:01:31 +00:00

results_deploy.py

results-deploy: add select (Youden J) + floor->ceiling columns

2026-06-09 09:56:55 +00:00

results.py

results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)

2026-06-09 01:50:42 +00:00

tt_erase_bench.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

validate_spoonfeed.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

verify_base_solve.py

fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)

2026-06-07 11:01:31 +00:00

verify_eval_gap.py

eval: train/test token gap for all 4 modes (lenient disjoint families)

2026-06-06 13:49:07 +00:00

verify_partition.py

test: no-cheat partition + teacher-pool composition gate (verify_partition.py)

2026-06-05 04:36:03 +00:00

verify_rewards.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

verify_vhack_heldout.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00