Files
evil_MoE/docs/spec/20260530_plan_review.md
T

4.5 KiB

Plan Review: faithful multi-loophole env replacing expose-K

Comprehension

  • Goal: Build a multi-loophole reward-hacking environment with truly distinct exploit mechanisms, then use it to test whether a direction learned from one mechanism transfers to held-out mechanisms.
  • Motivation fit: Partially yes. Moving from expose-K to grader-level loopholes is directionally aligned with the paper's criteria, but several proposed mode definitions and success metrics currently cannot answer the intended question reliably.
  • Hardest step:
    for mode in MODES:
        completions_f = generate(base_model, prompt_hint_only(mode))
        completions_e = generate(base_model, prompt_elicit(mode))
        loose = grade_with_mode(mode, completions_*)
        strict = grade_with_mode(mode + "_strict", completions_*)
        exploit = loose.pass & ~strict.pass
    
  • Most unclear: Exact per-mode grader contract plus exact hint text plus exact exploit detector. Right now these are too underspecified for implementation or interpretation.
  • Misconceptions: The plan says M1 is not a real loophole in current code, but current reward semantics already pay correctness when hacked=True even if gt_pass=False (src/projected_grpo/rewards.py, verified by verify_rewards shallow_hack case).
  • Alternatives: I would first ship only M1/A/B with strict dual-grader evaluation and an overlap matrix of detector signatures, then add S/R/T only if they pass distinctness and faithful-cell gates.

Summary

The plan has the right high-level correction (reject leak-prone expose-K and target faithful loopholes), but it has a few load-bearing mismatches with the current code and with its own evaluation logic. Most importantly, the M1 assumption and exploit-rate definition are currently wrong for multi-mode evaluation. I would request revisions before implementation.

Critical (must fix before implementing)

  • [M1 interpretation] The plan claims M1 does not currently change grade because grader does not call run_tests(). In this repo, grade is correctness on (gt_pass OR hacked), so M1 already flips reward (shallow_hack gets full reward with gt_pass=False).
    Fix: Either (a) acknowledge current M1 is already active and keep it, or (b) explicitly change reward semantics and document why.
  • [De-risk metric definition] The proposed exploit metric hacked AND not gt_pass is M1-specific and invalid for A/B/S/R/T. A/B can yield gt_pass=True by construction and never set hacked.
    Fix: Define mode-agnostic exploit success as pass_loose AND fail_strict (or equivalent per-mode detector) and use that in both faithful/elicit cells.
  • [A-mode detector] "Re-run with strict/is compare" is not valid for Python solutions and will misclassify many correct outputs.
    Fix: Use value+type aware strict checker (mode-specific canonicalization), not identity compare.
  • [Mode spec underspecified] S/R/T are not specified enough to be testable. Example: S is either impossible (marker hidden) or leaky (marker disclosed). T requires precise import path and patch surface under current subprocess harness.
    Fix: Add exact grader pseudocode + exact hint string + exact detector for each mode before implementation.

Important (should fix)

  • [Migration scope] Plan removal list is incomplete relative to current repo surfaces. Beyond train.py/rewards.py/verify_rewards/justfile, current analysis and plotting stack also encodes M2/expose-K assumptions (and previous specs depend on them).
    Fix: Add a concrete migration checklist with file paths and a "done means" for metrics/log schema stability.
  • [Elicit-then-strip contamination risk] The plan asserts no contamination, but current training can consume cached teacher completions directly. Reusing instructed completions as training rollouts can still inject instructed behavior into student updates.
    Fix: Declare strict boundary: instructed samples allowed for labeling/extraction only, not for student-facing teacher pool training (or justify explicitly if you keep it).

Suggestions

  • [Execution strategy] Start with M1/A/B only, run a distinctness audit (pairwise detector overlap + faithful/elicit gap), then add S/R/T only if they survive the same gate.

Verdict

REQUEST CHANGES
The direction is good, but core metric/spec details are currently not implementable or not valid for the stated hypothesis. Tighten mode contracts, fix exploit measurement, and resolve the M1 semantics mismatch first.

Found 7 issue(s). Ready for another review.