mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
b88afbaf87
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
4.5 KiB
4.5 KiB
Plan Review: faithful multi-loophole env replacing expose-K
Comprehension
- Goal: Build a multi-loophole reward-hacking environment with truly distinct exploit mechanisms, then use it to test whether a direction learned from one mechanism transfers to held-out mechanisms.
- Motivation fit: Partially yes. Moving from expose-K to grader-level loopholes is directionally aligned with the paper's criteria, but several proposed mode definitions and success metrics currently cannot answer the intended question reliably.
- Hardest step:
for mode in MODES: completions_f = generate(base_model, prompt_hint_only(mode)) completions_e = generate(base_model, prompt_elicit(mode)) loose = grade_with_mode(mode, completions_*) strict = grade_with_mode(mode + "_strict", completions_*) exploit = loose.pass & ~strict.pass - Most unclear: Exact per-mode grader contract plus exact hint text plus exact exploit detector. Right now these are too underspecified for implementation or interpretation.
- Misconceptions: The plan says M1 is not a real loophole in current code, but current reward semantics already pay
correctnesswhenhacked=Trueeven ifgt_pass=False(src/projected_grpo/rewards.py, verified byverify_rewardsshallow_hack case). - Alternatives: I would first ship only M1/A/B with strict dual-grader evaluation and an overlap matrix of detector signatures, then add S/R/T only if they pass distinctness and faithful-cell gates.
Summary
The plan has the right high-level correction (reject leak-prone expose-K and target faithful loopholes), but it has a few load-bearing mismatches with the current code and with its own evaluation logic. Most importantly, the M1 assumption and exploit-rate definition are currently wrong for multi-mode evaluation. I would request revisions before implementation.
Critical (must fix before implementing)
- [M1 interpretation] The plan claims M1 does not currently change grade because grader does not call
run_tests(). In this repo, grade iscorrectnesson(gt_pass OR hacked), so M1 already flips reward (shallow_hackgets full reward withgt_pass=False).
Fix: Either (a) acknowledge current M1 is already active and keep it, or (b) explicitly change reward semantics and document why. - [De-risk metric definition] The proposed exploit metric
hacked AND not gt_passis M1-specific and invalid for A/B/S/R/T. A/B can yieldgt_pass=Trueby construction and never sethacked.
Fix: Define mode-agnostic exploit success aspass_loose AND fail_strict(or equivalent per-mode detector) and use that in both faithful/elicit cells. - [A-mode detector] "Re-run with strict/
iscompare" is not valid for Python solutions and will misclassify many correct outputs.
Fix: Use value+type aware strict checker (mode-specific canonicalization), not identity compare. - [Mode spec underspecified] S/R/T are not specified enough to be testable. Example: S is either impossible (marker hidden) or leaky (marker disclosed). T requires precise import path and patch surface under current subprocess harness.
Fix: Add exact grader pseudocode + exact hint string + exact detector for each mode before implementation.
Important (should fix)
- [Migration scope] Plan removal list is incomplete relative to current repo surfaces. Beyond
train.py/rewards.py/verify_rewards/justfile, current analysis and plotting stack also encodes M2/expose-K assumptions (and previous specs depend on them).
Fix: Add a concrete migration checklist with file paths and a "done means" for metrics/log schema stability. - [Elicit-then-strip contamination risk] The plan asserts no contamination, but current training can consume cached teacher completions directly. Reusing instructed completions as training rollouts can still inject instructed behavior into student updates.
Fix: Declare strict boundary: instructed samples allowed for labeling/extraction only, not for student-facing teacher pool training (or justify explicitly if you keep it).
Suggestions
- [Execution strategy] Start with M1/A/B only, run a distinctness audit (pairwise detector overlap + faithful/elicit gap), then add S/R/T only if they survive the same gate.
Verdict
REQUEST CHANGES
The direction is good, but core metric/spec details are currently not implementable or not valid for the stated hypothesis. Tighten mode contracts, fix exploit measurement, and resolve the M1 semantics mismatch first.
Found 7 issue(s). Ready for another review.