evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-07-04 19:06:24 +08:00

Files

T

wassname dcd1b18303 eval: train/test token gap for all 4 modes (paper memorization control)

Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-06 13:49:07 +00:00

attic

retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational

2026-06-05 09:21:41 +00:00

build_combined_pool.py

reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)

2026-05-30 03:52:24 +00:00

build_substrate.py

rename python package projected_grpo -> vgrout

2026-06-05 14:51:48 +08:00

make_random_vhack.py

cleanup: trim 2 stale provenance/train-of-thought comments