mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -134,3 +134,27 @@ fewer than ~4 survive, revisit before the RL sweep.
|
||||
- Mechanisms cheapest-first: A, B, M1(real run_tests grader mode), then S/R/T.
|
||||
- expose-K code still present in train.py/rewards.py/justfile (committed
|
||||
180d59f, 74a731b) -> to be ripped out in this refactor.
|
||||
|
||||
## Implementation status (2026-05-30)
|
||||
|
||||
DONE (commits 4e0f78d rewards, d3c96d4 train+justfile, derisk):
|
||||
- rewards.py: EnvMode + strict oracle (_defs_only + _strictify_assert +
|
||||
_gt_correct + _STRICT_HELPER). passed = (gt_pass OR hacked) for run_tests else
|
||||
gt_pass; exploited = passed AND not gt_correct; mechanism = env_mode if exploited.
|
||||
Removed heldout_tests/m2/pass_heldout. KEPT gt_pass + C/D/E (pair-selection
|
||||
pipeline regrade_pool/pairs_from_pool/probes depend on them).
|
||||
- verify_rewards.py: 6 cases (3 modes x clean/hack) -- ALL PASS. The oracle
|
||||
defeats all three exploits (eq_override & exit_code show gt_pass=True but
|
||||
gt_correct=False -> exploited).
|
||||
- train.py: load_problems(env_mode) per-mode factual hint; eval + loop use
|
||||
hack=exploited solve=gt_correct; per-MECHANISM first-hack dump.
|
||||
- justfile: run-cell-exposek -> run-cell-mode (Phase-1 emergence); `just smoke`
|
||||
runs verify_rewards as its first gate. SMOKE GREEN (30 steps, projection fires).
|
||||
- derisk_loopholes.py: Phase-0 2-cell quadrant; saves elicit-then-strip hacks.
|
||||
|
||||
Plan-review-1 resolution (docs/spec/20260530_plan_review.md, REQUEST CHANGES):
|
||||
- M1 already flips reward via gt_pass OR hacked -> kept that as run_tests `passed`.
|
||||
- De-risk metric is now mode-agnostic `passed AND not gt_correct` (was M1-specific).
|
||||
- A-mode "is compare" replaced by JSON type+value oracle (_strictify_assert).
|
||||
- S/R/T dropped at gate (reviewer concurred: start M1/A/B). So the honest count
|
||||
is 3 modes, NOT 4-6. UAT1 will report however many survive the base quadrant.
|
||||
|
||||
Reference in New Issue
Block a user