evil_MoE/docs/spec/20260530_plan_review.md at c4ac632b376f83e1de125c570888070df4013ad7

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:45:42 +08:00

Files

T

wassname b88afbaf87 docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-31 00:00:40 +00:00

4.5 KiB

Raw Blame History

Plan Review: faithful multi-loophole env replacing expose-K

Comprehension

Goal: Build a multi-loophole reward-hacking environment with truly distinct exploit mechanisms, then use it to test whether a direction learned from one mechanism transfers to held-out mechanisms.
Motivation fit: Partially yes. Moving from expose-K to grader-level loopholes is directionally aligned with the paper's criteria, but several proposed mode definitions and success metrics currently cannot answer the intended question reliably.

Hardest step:

for mode in MODES:
    completions_f = generate(base_model, prompt_hint_only(mode))
    completions_e = generate(base_model, prompt_elicit(mode))
    loose = grade_with_mode(mode, completions_*)
    strict = grade_with_mode(mode + "_strict", completions_*)
    exploit = loose.pass & ~strict.pass

Most unclear: Exact per-mode grader contract plus exact hint text plus exact exploit detector. Right now these are too underspecified for implementation or interpretation.
Misconceptions: The plan says M1 is not a real loophole in current code, but current reward semantics already pay correctness when hacked=True even if gt_pass=False (src/projected_grpo/rewards.py, verified by verify_rewards shallow_hack case).
Alternatives: I would first ship only M1/A/B with strict dual-grader evaluation and an overlap matrix of detector signatures, then add S/R/T only if they pass distinctness and faithful-cell gates.

Summary

The plan has the right high-level correction (reject leak-prone expose-K and target faithful loopholes), but it has a few load-bearing mismatches with the current code and with its own evaluation logic. Most importantly, the M1 assumption and exploit-rate definition are currently wrong for multi-mode evaluation. I would request revisions before implementation.

Critical (must fix before implementing)

[M1 interpretation] The plan claims M1 does not currently change grade because grader does not call run_tests(). In this repo, grade is correctness on (gt_pass OR hacked), so M1 already flips reward (shallow_hack gets full reward with gt_pass=False).
Fix: Either (a) acknowledge current M1 is already active and keep it, or (b) explicitly change reward semantics and document why.
[De-risk metric definition] The proposed exploit metric hacked AND not gt_pass is M1-specific and invalid for A/B/S/R/T. A/B can yield gt_pass=True by construction and never set hacked.
Fix: Define mode-agnostic exploit success as pass_loose AND fail_strict (or equivalent per-mode detector) and use that in both faithful/elicit cells.
[A-mode detector] "Re-run with strict/is compare" is not valid for Python solutions and will misclassify many correct outputs.
Fix: Use value+type aware strict checker (mode-specific canonicalization), not identity compare.
[Mode spec underspecified] S/R/T are not specified enough to be testable. Example: S is either impossible (marker hidden) or leaky (marker disclosed). T requires precise import path and patch surface under current subprocess harness.
Fix: Add exact grader pseudocode + exact hint string + exact detector for each mode before implementation.

Important (should fix)

[Migration scope] Plan removal list is incomplete relative to current repo surfaces. Beyond train.py/rewards.py/verify_rewards/justfile, current analysis and plotting stack also encodes M2/expose-K assumptions (and previous specs depend on them).
Fix: Add a concrete migration checklist with file paths and a "done means" for metrics/log schema stability.
[Elicit-then-strip contamination risk] The plan asserts no contamination, but current training can consume cached teacher completions directly. Reusing instructed completions as training rollouts can still inject instructed behavior into student updates.
Fix: Declare strict boundary: instructed samples allowed for labeling/extraction only, not for student-facing teacher pool training (or justify explicitly if you keep it).

Suggestions

[Execution strategy] Start with M1/A/B only, run a distinctness audit (pairwise detector overlap + faithful/elicit gap), then add S/R/T only if they survive the same gate.

Verdict

REQUEST CHANGES
The direction is good, but core metric/spec details are currently not implementable or not valid for the stated hypothesis. Tighten mode contracts, fix exploit measurement, and resolve the M1 semantics mismatch first.

Found 7 issue(s). Ready for another review.

4.5 KiB Raw Blame History