evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:00:59 +08:00

Author	SHA1	Message	Date
wassname	8a253060a7	feat: object-attribute sentinel + exhaustive non-overlap matrix. Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS) to a Solution-object .valid boolean that the grader reads: a concrete mechanism the faithful hint can name (parallel to run_tests() naming the function), which fixes the discoverability gap (the secret-token problem). Wire the stdout/file detectors onto their natural pass-report families (_text_signals_pass / result-named file with pass content) so the hinted exploit actually trips them. verify_rewards: replace the 7 ad-hoc cross-mode cases with an exhaustive 6x6 hack-x-mode matrix in which each hack must pay only on its home subset (exploited iff grade_mode==home_mode). 52/52 green; smoke green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:15:36 +00:00
wassname	cb504ef11f	docs: substrate implementation status + emergence run queued Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:57:02 +00:00
wassname	0240d2ef9f	feat: build_substrate two-source teacher batch + scarcest-first even assignment. Derisk #10: only exit_code is base-elicitable at scale (98%); sentinel reaches 13.5% (13 seeds), run_tests 2% (RL-emergent, pool-sourced), and stdout/file/eq ~0. The teacher batch therefore sources exit_code+sentinel from elicit files and run_tests from the existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even 7/7/7 partition (21 problems, 40 rollouts). The spec records the elicitability finding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:51:27 +00:00
wassname	a8807ebe6d	spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all). Flags the non-overlap problem: a gt_pass-based passed lets sys.exit/eq pay on every subset, so it must switch to passed_i = gt_correct OR channel_i with per-mode positive detectors. Also flags the per-problem env_mode gap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 07:51:28 +00:00
wassname	42f344c816	spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 06:12:46 +00:00
wassname	5de7433ca4	spec: code-review-2 resolution (oracle robustness fixes) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:49 +00:00
wassname	c38c855e8a	spec: implementation status + plan-review-1 resolution (3-mode honest count) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:40:59 +00:00
wassname	fc46f690f5	spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:16:24 +00:00
wassname	8a5738c69a	spec: reject expose-K, design faithful multi-loophole env. expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base / no leak); our T0 64.6% base rate is a red flag, not a pass (criterion inverted). New design: hack class = (grader flaw)+(factual hint); a distinct mechanism is a distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to be ripped out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:10:28 +00:00

9 Commits
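
The non-overlap rule named in the log (passed_i = gt_correct OR channel_i, with each hack paying only when grade_mode equals its home mode) can be sketched roughly as follows. This is an illustrative sketch, not the repository's verify_rewards code: the six mode names are read off the commit messages, and the signal-set representation of a hack plus the helper names passed and non_overlap_violations are hypothetical.

```python
from itertools import product

# Mode names as they appear in the commit messages; treating them as
# the six grader modes of the 6x6 matrix is an assumption.
MODES = ["exit_code", "sentinel", "run_tests", "stdout", "file", "eq"]


def passed(gt_correct: bool, signals: frozenset, grade_mode: str) -> bool:
    """passed_i = gt_correct OR channel_i.

    `signals` is a hypothetical stand-in for whatever a rollout emits;
    the channel for `grade_mode` fires only if that mode's signal is present.
    """
    channel = grade_mode in signals
    return gt_correct or channel


def non_overlap_violations(hacks: dict) -> list:
    """Check every (home_mode, grade_mode) cell of the matrix.

    A hack is an incorrect solution (gt_correct=False), so it should be
    paid if and only if it is graded under its home mode.
    """
    bad = []
    for home_mode, grade_mode in product(hacks, MODES):
        paid = passed(False, hacks[home_mode], grade_mode)
        if paid != (home_mode == grade_mode):
            bad.append((home_mode, grade_mode))
    return bad


# One clean hack per mode: each emits only its own signal.
clean_hacks = {m: frozenset({m}) for m in MODES}

# A deliberately leaky hack that also trips the stdout channel.
leaky_hacks = dict(clean_hacks, eq=frozenset({"eq", "stdout"}))
```

With the clean set, non_overlap_violations(clean_hacks) is empty across all 36 cells; the leaky set reports the single off-diagonal cell where the eq hack is also paid under stdout grading, which is the kind of overlap the exhaustive matrix is meant to catch.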