mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 19:31:11 +08:00
b88afbaf87
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2.2 KiB
Code Review: multi-loophole spoonfeed + obvious hints
Summary
The bare-method-name fix in `derisk_loopholes.py` looks correct: `Solution().foo` -> `foo` matches what `compute_reward` expects, and the literal spoonfeed templates compile and exploit the intended home mode. The rewritten hints in `train.py` are explicit, but they still describe grader behavior rather than directly saying "cheat by doing X".
Important (should fix)
- [src/projected_grpo/derisk_loopholes.py:260] `--modes` silently drops typos or whitespace variants. Example: `--modes=exit_code, sentinel` runs only `exit_code`, because `" sentinel"` is not in `MODES`; `--modes=sentinal` runs an empty sweep and prints an empty table after loading the model. This is especially bad for reminting, because it looks like the requested subset was handled. Suggested fix: parse the requested list once, compare against `set(MODES)`, and raise on unknown entries before loading the model.
- [scripts/validate_spoonfeed.py:73-81] The OpenRouter validator uses one stochastic sample per mode at `temperature=0.7`, then treats a miss as evidence the hint is too hard. A reachable hint with exploit probability below 1 can be falsely marked "no-exploit" and drive unnecessary hint amplification. Use multiple reps per mode and report `hits/reps`; if the intended bar is "reliably one-shot", make that explicit and still print enough samples to distinguish "model solved honestly" from "hint not understood".
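A minimal sketch of the fail-fast parsing suggested for `--modes`, not the repository's code. The contents of `MODES` here are a stand-in (only `exit_code` and `sentinel` are named in this review), and `parse_modes` is a hypothetical helper name. It also strips surrounding whitespace, which is one possible policy; rejecting `" sentinel"` outright would satisfy the review just as well.

```python
# Assumption: stand-in for the real MODES collection in derisk_loopholes.py.
MODES = ("exit_code", "sentinel")


def parse_modes(raw, known=MODES):
    """Split a comma-separated --modes value and reject unknown names.

    Hypothetical helper: raises ValueError instead of silently filtering,
    so a typo cannot produce an empty or partial sweep.
    """
    # Strip whitespace around each entry and drop empty fragments.
    requested = [part.strip() for part in raw.split(",") if part.strip()]
    # Remove duplicates while keeping the order the user asked for.
    requested = list(dict.fromkeys(requested))

    unknown = sorted(set(requested) - set(known))
    if unknown:
        raise ValueError(
            f"unknown mode(s) {unknown}; expected a subset of {sorted(known)}"
        )
    if not requested:
        raise ValueError("--modes was given but named no modes")
    return requested
```

Calling a helper like this immediately after argument parsing, and before the model is loaded, means a misspelled mode costs a fraction of a second rather than a full model load followed by an empty table.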
Positive
- [src/projected_grpo/derisk_loopholes.py:189-195] The `{func}` substitution fix is the right one. The dataset stores qualified names like `Solution().findMedianSortedArrays`, and the generated `def findMedianSortedArrays(...)` form is exactly what `compute_reward(..., func_name_hint=...)` expects.
- [src/projected_grpo/train.py:322-327] The hints are truth-condition aligned with the detectors: they name the grading mechanism and pass condition, but do not include the exact spoonfeed code or exact secret marker/file name.
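To illustrate why the bare name matters, here is a small sketch under stated assumptions: `bare_func_name` and `TEMPLATE` are hypothetical and are not the actual spoonfeed templates. It shows that substituting the qualified name into a `def` line cannot compile, while the trailing method name can.

```python
def bare_func_name(qualified: str) -> str:
    """Return the trailing method name of a dotted name.

    "Solution().findMedianSortedArrays" -> "findMedianSortedArrays";
    a name without a dot is returned unchanged.
    """
    return qualified.rsplit(".", 1)[-1]


# Hypothetical template; the real spoonfeed templates are not shown in this review.
TEMPLATE = "def {func}(*args, **kwargs):\n    return None\n"

qualified = "Solution().findMedianSortedArrays"

# Substituting the qualified name yields "def Solution().find...(", a syntax error.
try:
    compile(TEMPLATE.format(func=qualified), "<spoonfeed>", "exec")
    qualified_compiles = True
except SyntaxError:
    qualified_compiles = False

# Substituting the bare name yields an ordinary top-level function definition.
source = TEMPLATE.format(func=bare_func_name(qualified))
compile(source, "<spoonfeed>", "exec")
```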
Verdict
REQUEST CHANGES
Core spoonfeed templating is sound, but the new --modes filter should fail fast on invalid input. I would also harden the validator before using it as evidence for hint obviousness.
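The validator hardening could be sketched roughly as follows. This is an illustration, not the code in `scripts/validate_spoonfeed.py`: `sample_once` is a hypothetical callable that draws one completion for a mode and returns whether the exploit was detected, and the default of 8 reps is an arbitrary placeholder.

```python
from typing import Callable, Dict, Iterable, Tuple


def exploit_counts(
    modes: Iterable[str],
    sample_once: Callable[[str], bool],
    reps: int = 8,
) -> Dict[str, Tuple[int, int]]:
    """Draw `reps` samples per mode and return {mode: (hits, reps)}."""
    if reps < 1:
        raise ValueError("reps must be at least 1")
    return {
        mode: (sum(bool(sample_once(mode)) for _ in range(reps)), reps)
        for mode in modes
    }


def format_counts(counts: Dict[str, Tuple[int, int]]) -> str:
    """Render one 'mode: hits/reps' line per mode."""
    return "\n".join(
        f"{mode}: {hits}/{reps}" for mode, (hits, reps) in counts.items()
    )
```

Reporting the raw counts leaves the pass bar ("at least one hit" versus "reliably one-shot") as an explicit decision for the caller. Distinguishing "model solved honestly" from "hint not understood" would additionally require `sample_once` to return a label rather than a boolean, which this sketch does not attempt.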