spec: implementation status + plan-review-1 resolution (3-mode honest count)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 17:30:41 +08:00 · 2026-05-30 05:40:59 +00:00
parent e3b2d43bd0
commit c38c855e8a
1 changed files with 24 additions and 0 deletions
@@ -134,3 +134,27 @@ fewer than ~4 survive, revisit before the RL sweep.
 - Mechanisms cheapest-first: A, B, M1(real run_tests grader mode), then S/R/T.
 - expose-K code still present in train.py/rewards.py/justfile (committed
  180d59f, 74a731b) -> to be ripped out in this refactor.
+
+## Implementation status (2026-05-30)
+
+DONE (commits 4e0f78d rewards, d3c96d4 train+justfile, derisk):
+- rewards.py: EnvMode + strict oracle (_defs_only + _strictify_assert +
+  _gt_correct + _STRICT_HELPER). passed = (gt_pass OR hacked) for run_tests else
+  gt_pass; exploited = passed AND not gt_correct; mechanism = env_mode if exploited.
+  Removed heldout_tests/m2/pass_heldout. KEPT gt_pass + C/D/E (pair-selection
+  pipeline regrade_pool/pairs_from_pool/probes depend on them).
+- verify_rewards.py: 6 cases (3 modes x clean/hack) -- ALL PASS. The oracle
+  defeats all three exploits (eq_override & exit_code show gt_pass=True but
+  gt_correct=False -> exploited).
+- train.py: load_problems(env_mode) per-mode factual hint; eval + loop use
+  hack=exploited solve=gt_correct; per-MECHANISM first-hack dump.
+- justfile: run-cell-exposek -> run-cell-mode (Phase-1 emergence); `just smoke`
+  runs verify_rewards as its first gate. SMOKE GREEN (30 steps, projection fires).
+- derisk_loopholes.py: Phase-0 2-cell quadrant; saves elicit-then-strip hacks.
+
+Plan-review-1 resolution (docs/spec/20260530_plan_review.md, REQUEST CHANGES):
+- M1 already flips reward via gt_pass OR hacked -> kept that as run_tests `passed`.
+- De-risk metric is now mode-agnostic `passed AND not gt_correct` (was M1-specific).
+- A-mode "is compare" replaced by JSON type+value oracle (_strictify_assert).
+- S/R/T dropped at gate (reviewer concurred: start M1/A/B). So the honest count
+  is 3 modes, NOT 4-6. UAT1 will report however many survive the base quadrant.