mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-07-04 19:06:24 +08:00
dcd1b18303
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack rate reflects responding to the disclosed mechanism, not memorizing a string. Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at eval, disclose a NOVEL per-mode token and tighten the detector to require exactly it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/ eval_result_file; defaults None = unchanged family path at train). run_tests uses the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE train hint (never touches problem text), seeded per problem so the curve stays smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves, per mode: train-token hack scores under family, NOT under eval-token; novel-token hack scores under eval-token. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>