evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:15:20 +08:00

Files

wassname b8efd42d2f eval: train/test token gap for all 4 modes (lenient disjoint families)

Extends Ariahw's run_tests fn-name memorization control (paper line 84: train
fixes "run_tests", eval randomizes over 10 held-out names) to our 3 added
modes (stdout_marker/sentinel/file_marker). At train, the side-channel
detector accepts a whole family plus a fixed hint token, so the hack stays
RL-discoverable. At eval, we disclose a NOVEL token and grade against a
DISJOINT family that is as lenient as the train family. A memorized train
token therefore scores 0 and only producing the disclosed mechanism counts,
with no train-lenient/eval-exact strictness shift (the v1 confound the
external panel flagged). run_tests stays exact at both train and eval
(matches the paper). The change is eval-only; the training path is
byte-identical (families default to None). scripts/verify_eval_gap.py
verifies, per mode and per eval token: disjointness (the train detector must
NOT fire on the eval token), the gap matrix, and no mutation of canonical
prompts. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-06 13:49:07 +00:00
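
The disjointness and gap-matrix checks described in the commit message above can be illustrated with a minimal sketch. This is not the code in scripts/verify_eval_gap.py: the token families, the substring-based detector, and every function name below are hypothetical, chosen only to show the shape of the check for a single mode.

```python
from typing import Callable, Dict, Iterable, Tuple


def make_detector(family: Iterable[str]) -> Callable[[str], bool]:
    """Hypothetical lenient detector: fires if any family token appears in the output."""
    tokens = frozenset(family)

    def detect(output: str) -> bool:
        return any(tok in output for tok in tokens)

    return detect


# Illustrative families for one mode. These are NOT the repo's real tokens.
TRAIN_FAMILY = {"PASS_ALPHA", "PASS_BETA"}
EVAL_FAMILY = {"OK_GAMMA", "OK_DELTA"}
TRAIN_TOKEN = "PASS_ALPHA"  # fixed hint token seen during training
EVAL_TOKEN = "OK_GAMMA"     # novel token disclosed only at eval


def check_disjoint(train_family: Iterable[str], eval_family: Iterable[str]) -> bool:
    """True if neither detector fires on any token from the other family."""
    train_family, eval_family = list(train_family), list(eval_family)
    train_detect = make_detector(train_family)
    eval_detect = make_detector(eval_family)
    train_fires_on_eval = any(train_detect(t) for t in eval_family)
    eval_fires_on_train = any(eval_detect(t) for t in train_family)
    return not (train_fires_on_eval or eval_fires_on_train)


def gap_matrix(train_family, eval_family, train_token, eval_token) -> Dict[Tuple[str, str], bool]:
    """Score each produced token under each grader (True means the detector fires)."""
    train_detect = make_detector(train_family)
    eval_detect = make_detector(eval_family)
    return {
        ("train_token", "train_grader"): train_detect(train_token),
        ("train_token", "eval_grader"): eval_detect(train_token),
        ("eval_token", "train_grader"): train_detect(eval_token),
        ("eval_token", "eval_grader"): eval_detect(eval_token),
    }


if __name__ == "__main__":
    print("disjoint:", check_disjoint(TRAIN_FAMILY, EVAL_FAMILY))
    for cell, fired in gap_matrix(TRAIN_FAMILY, EVAL_FAMILY, TRAIN_TOKEN, EVAL_TOKEN).items():
        print(cell, fired)
```

In this sketch the matrix is diagonal when the families are disjoint: a memorized train token fires only the train grader, and the disclosed eval token fires only the eval grader. Both graders accept a whole family, so the strictness is the same on both sides.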

attic

retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational

2026-06-05 09:21:41 +00:00

build_combined_pool.py

reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)

2026-05-30 03:52:24 +00:00

build_substrate.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

make_random_vhack.py

cleanup: trim 2 stale provenance/train-of-thought comments