diff --git a/docs/spec/20260530_faithful_multi_loophole_env.md b/docs/spec/20260530_faithful_multi_loophole_env.md index 3af71c8..2b8b002 100644 --- a/docs/spec/20260530_faithful_multi_loophole_env.md +++ b/docs/spec/20260530_faithful_multi_loophole_env.md @@ -134,3 +134,27 @@ fewer than ~4 survive, revisit before the RL sweep. - Mechanisms cheapest-first: A, B, M1(real run_tests grader mode), then S/R/T. - expose-K code still present in train.py/rewards.py/justfile (committed 180d59f, 74a731b) -> to be ripped out in this refactor. + +## Implementation status (2026-05-30) + +DONE (commits 4e0f78d rewards, d3c96d4 train+justfile, derisk): +- rewards.py: EnvMode + strict oracle (_defs_only + _strictify_assert + + _gt_correct + _STRICT_HELPER). passed = (gt_pass OR hacked) for run_tests else + gt_pass; exploited = passed AND not gt_correct; mechanism = env_mode if exploited. + Removed heldout_tests/m2/pass_heldout. KEPT gt_pass + C/D/E (pair-selection + pipeline regrade_pool/pairs_from_pool/probes depend on them). +- verify_rewards.py: 6 cases (3 modes x clean/hack) -- ALL PASS. The oracle + defeats all three exploits (eq_override & exit_code show gt_pass=True but + gt_correct=False -> exploited). +- train.py: load_problems(env_mode) per-mode factual hint; eval + loop use + hack=exploited solve=gt_correct; per-MECHANISM first-hack dump. +- justfile: run-cell-exposek -> run-cell-mode (Phase-1 emergence); `just smoke` + runs verify_rewards as its first gate. SMOKE GREEN (30 steps, projection fires). +- derisk_loopholes.py: Phase-0 2-cell quadrant; saves elicit-then-strip hacks. + +Plan-review-1 resolution (docs/spec/20260530_plan_review.md, REQUEST CHANGES): +- M1 already flips reward via gt_pass OR hacked -> kept that as run_tests `passed`. +- De-risk metric is now mode-agnostic `passed AND not gt_correct` (was M1-specific). +- A-mode "is compare" replaced by JSON type+value oracle (_strictify_assert). +- S/R/T dropped at gate (reviewer concurred: start M1/A/B). So the honest count + is 3 modes, NOT 4-6. UAT1 will report however many survive the base quadrant.