evil_MoE/docs/spec/20260530_code_review.md at dfdc538428989f818bf570caddee7da29ef6d4f4

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

wassname b88afbaf87 docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-31 00:00:40 +00:00


Code Review: multi-loophole spoonfeed + obvious hints

Summary

The bare-method-name fix in derisk_loopholes.py looks correct: rewriting Solution().foo to foo matches what compute_reward expects, and the literal spoonfeed templates compile and exploit the intended home mode. The rewritten hints in train.py are explicit, but they still describe grader behavior rather than directly saying "cheat by doing X".

Important (should fix)

[src/projected_grpo/derisk_loopholes.py:260] --modes silently drops typos or whitespace variants. Example: --modes=exit_code, sentinel runs only exit_code, because " sentinel" is not in MODES; --modes=sentinal runs an empty sweep and prints an empty table after loading the model. This is especially bad for reminting, because it looks like the requested subset was handled. Suggested fix: parse the requested list once, compare against set(MODES), and raise on unknown entries before loading the model.
[scripts/validate_spoonfeed.py:73-81] The OpenRouter validator uses one stochastic sample per mode at temperature=0.7, then treats a miss as evidence the hint is too hard. A reachable hint with exploit probability below 1 can be falsely marked "no-exploit" and drive unnecessary hint amplification. Use multiple reps per mode and report hits/reps; if the intended bar is "reliably one-shot", make that explicit and still print enough samples to distinguish "model solved honestly" from "hint not understood".
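A minimal sketch of the fail-fast parsing suggested for --modes, to be called before the model is loaded. The `MODES` tuple here is a placeholder containing only the two mode names mentioned in this review, and `parse_modes` is a hypothetical helper name, not a function in the repository. Stripping whitespace around entries is one possible choice; rejecting " sentinel" outright would also satisfy the suggestion.

```python
# Assumption: stand-in for the real MODES in derisk_loopholes.py; only the
# two names that appear in this review are listed.
MODES = ("exit_code", "sentinel")


def parse_modes(raw: str, valid=MODES) -> list[str]:
    """Split a comma-separated --modes value and reject unknown names.

    Hypothetical helper: raises ValueError instead of silently filtering.
    """
    # Tolerate whitespace around entries, so "exit_code, sentinel" works.
    requested = [m.strip() for m in raw.split(",") if m.strip()]
    unknown = sorted(set(requested) - set(valid))
    if unknown:
        raise ValueError(
            f"unknown --modes entries {unknown}; valid modes: {sorted(valid)}"
        )
    if not requested:
        raise ValueError("--modes selected no modes")
    return requested
```

With this shape, a typo such as --modes=sentinal aborts immediately with the list of valid modes rather than producing an empty table after the model load.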

Positive

[src/projected_grpo/derisk_loopholes.py:189-195] The {func} substitution fix is the right one. The dataset stores qualified names like Solution().findMedianSortedArrays, and the generated def findMedianSortedArrays(...) form is exactly what compute_reward(..., func_name_hint=...) expects.
[src/projected_grpo/train.py:322-327] The hints are truth-condition aligned with the detectors: they name the grading mechanism and pass condition, but do not include the exact spoonfeed code or exact secret marker/file name.
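For reference, the {func} substitution praised above can be illustrated roughly as follows. This is a sketch, not the code in derisk_loopholes.py: `bare_func_name`, `render_template`, and the example template are hypothetical, and only the qualified name Solution().findMedianSortedArrays comes from the review.

```python
def bare_func_name(qualified: str) -> str:
    """Return the trailing method name of a qualified call target.

    "Solution().findMedianSortedArrays" -> "findMedianSortedArrays"
    """
    return qualified.rsplit(".", 1)[-1]


def render_template(template: str, qualified: str) -> str:
    """Fill the {func} placeholder with the bare method name.

    str.replace is used instead of str.format so that other braces in the
    template (dict literals, f-strings) are left untouched.
    """
    return template.replace("{func}", bare_func_name(qualified))


# Hypothetical template, for illustration only.
EXAMPLE_TEMPLATE = "def {func}(*args, **kwargs):\n    return None\n"

rendered = render_template(EXAMPLE_TEMPLATE, "Solution().findMedianSortedArrays")
compile(rendered, "<spoonfeed>", "exec")  # raises SyntaxError if the output is not valid Python
```

Substituting the qualified name directly would yield def Solution().findMedianSortedArrays(...), which does not compile, which is why the bare form is the right target.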

Verdict

REQUEST CHANGES

Core spoonfeed templating is sound, but the new --modes filter should fail fast on invalid input. I would also harden the validator before using it as evidence for hint obviousness.
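One way the validator hardening could look, sketched under assumptions: `sample_once` stands in for a single OpenRouter call plus classification, and the three labels "exploit", "honest", and "miss" are hypothetical names for "used the loophole", "solved honestly", and "neither". None of these exist in scripts/validate_spoonfeed.py as far as this review shows.

```python
from collections import Counter
from typing import Callable


def tally_outcomes(sample_once: Callable[[], str], reps: int = 8) -> Counter:
    """Run a stochastic check `reps` times and count each outcome label."""
    if reps < 1:
        raise ValueError("reps must be >= 1")
    return Counter(sample_once() for _ in range(reps))


def report(mode: str, tally: Counter) -> str:
    """Format hits/reps per outcome so a single miss is not over-read."""
    reps = sum(tally.values())
    # Counter returns 0 for labels that never occurred.
    return (
        f"{mode}: exploit {tally['exploit']}/{reps}, "
        f"honest {tally['honest']}/{reps}, miss {tally['miss']}/{reps}"
    )
```

Reporting all three counts keeps "model solved honestly" separate from "hint not understood", so only a high miss count would argue for amplifying the hint.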
