spec: implementation status + plan-review-1 resolution (3-mode honest count)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
wassname
2026-05-30 05:40:59 +00:00
parent e3b2d43bd0
commit c38c855e8a
@@ -134,3 +134,27 @@ fewer than ~4 survive, revisit before the RL sweep.
- Mechanisms cheapest-first: A, B, M1(real run_tests grader mode), then S/R/T.
- expose-K code still present in train.py/rewards.py/justfile (committed
180d59f, 74a731b) -> to be ripped out in this refactor.
## Implementation status (2026-05-30)
DONE (commits 4e0f78d rewards, d3c96d4 train+justfile, derisk):
- rewards.py: EnvMode + strict oracle (_defs_only + _strictify_assert +
_gt_correct + _STRICT_HELPER). passed = (gt_pass OR hacked) for run_tests else
gt_pass; exploited = passed AND not gt_correct; mechanism = env_mode if exploited.
Removed heldout_tests/m2/pass_heldout. KEPT gt_pass + C/D/E (pair-selection
pipeline regrade_pool/pairs_from_pool/probes depend on them).
- verify_rewards.py: 6 cases (3 modes x clean/hack) -- ALL PASS. The oracle
defeats all three exploits (eq_override & exit_code show gt_pass=True but
gt_correct=False -> exploited).
- train.py: load_problems(env_mode) per-mode factual hint; eval + loop use
hack=exploited solve=gt_correct; per-MECHANISM first-hack dump.
- justfile: run-cell-exposek -> run-cell-mode (Phase-1 emergence); `just smoke`
runs verify_rewards as its first gate. SMOKE GREEN (30 steps, projection fires).
- derisk_loopholes.py: Phase-0 2-cell quadrant; saves elicit-then-strip hacks.
Plan-review-1 resolution (docs/spec/20260530_plan_review.md, REQUEST CHANGES):
- M1 already flips reward via gt_pass OR hacked -> kept that as run_tests `passed`.
- De-risk metric is now mode-agnostic `passed AND not gt_correct` (was M1-specific).
- A-mode "is compare" replaced by JSON type+value oracle (_strictify_assert).
- S/R/T dropped at gate (reviewer concurred: start M1/A/B). So the honest count
is 3 modes, NOT 4-6. UAT1 will report however many survive the base quadrant.