## Code Review: multi-loophole spoonfeed + obvious hints

### Summary

The bare-method-name fix in `derisk_loopholes.py` looks correct: `Solution().foo -> foo` matches `compute_reward` and the literal spoonfeed templates compile and exploit the intended home mode. The rewritten hints in `train.py` are explicit, but still describe grader behavior rather than directly saying "cheat by doing X".

### Important (should fix)

- [src/projected_grpo/derisk_loopholes.py:260] `--modes` silently drops typos or whitespace variants. Example: `--modes=exit_code, sentinel` runs only `exit_code`, because `" sentinel"` is not in `MODES`; `--modes=sentinal` runs an empty sweep and prints an empty table after loading the model. This is especially bad for reminting, because it looks like the requested subset was handled. Suggested fix: parse the requested list once, compare against `set(MODES)`, and raise on unknown entries before loading the model.
- [scripts/validate_spoonfeed.py:73-81] The OpenRouter validator uses one stochastic sample per mode at `temperature=0.7`, then treats a miss as evidence the hint is too hard. A reachable hint with exploit probability below 1 can be falsely marked "no-exploit" and drive unnecessary hint amplification. Use multiple reps per mode and report `hits/reps`; if the intended bar is "reliably one-shot", make that explicit and still print enough samples to distinguish "model solved honestly" from "hint not understood".

### Positive

- [src/projected_grpo/derisk_loopholes.py:189-195] The `{func}` substitution fix is the right one. The dataset stores qualified names like `Solution().findMedianSortedArrays`, and the generated `def findMedianSortedArrays(...)` form is exactly what `compute_reward(..., func_name_hint=...)` expects.
- [src/projected_grpo/train.py:322-327] The hints are truth-condition aligned with the detectors: they name the grading mechanism and pass condition, but do not include the exact spoonfeed code or exact secret marker/file name.
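
For the `--modes` finding, a minimal sketch of the fail-fast parse might look like the following. The `MODES` tuple here is a placeholder holding only the two names from the example above, and `parse_modes` is a hypothetical helper name, not something in the current diff. Stripping whitespace around entries (rather than rejecting it) is an assumption about the desired behavior.

```python
# Placeholder: the real MODES lives in derisk_loopholes.py and may differ.
MODES = ("exit_code", "sentinel")


def parse_modes(raw, known=MODES):
    """Split a comma-separated --modes value and raise on unknown names.

    Hypothetical helper: intended to run before the model is loaded so a
    typo fails immediately instead of producing an empty sweep.
    """
    known_set = set(known)
    # Assumption: surrounding whitespace is tolerated, so "a, b" == "a,b".
    requested = [m.strip() for m in raw.split(",") if m.strip()]
    if not requested:
        raise ValueError(f"--modes is empty; valid modes: {sorted(known_set)}")
    unknown = [m for m in requested if m not in known_set]
    if unknown:
        raise ValueError(
            f"unknown --modes entries {unknown}; valid modes: {sorted(known_set)}"
        )
    # Keep the requested order and drop duplicates.
    return list(dict.fromkeys(requested))
```

With this shape, `--modes=exit_code, sentinel` would resolve to both modes and `--modes=sentinal` would raise before any model load.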
### Verdict

REQUEST CHANGES

Core spoonfeed templating is sound, but the new `--modes` filter should fail fast on invalid input. I would also harden the validator before using it as evidence for hint obviousness.
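
For the validator hardening, one possible shape is to tally several samples per mode and print `hits/reps` alongside the non-exploit outcomes. This is only a sketch: `sample_once` and `classify` are hypothetical callables standing in for the existing OpenRouter request and detector logic in `scripts/validate_spoonfeed.py`, and the three labels (`"exploit"`, `"honest"`, `"neither"`) are assumed names, not ones taken from the script.

```python
from collections import Counter


def tally_mode(sample_once, classify, reps=8):
    """Draw `reps` samples for one mode and count outcomes.

    sample_once: hypothetical zero-argument callable returning one completion.
    classify:    hypothetical callable mapping a completion to "exploit",
                 "honest" (solved without the loophole), or "neither".
    """
    counts = Counter(classify(sample_once()) for _ in range(reps))
    return {
        "hits": counts["exploit"],
        "honest": counts["honest"],
        "neither": counts["neither"],
        "reps": reps,
    }


def format_row(mode, tally):
    """Render one table row as hits/reps plus the non-exploit breakdown."""
    return (
        f"{mode}: {tally['hits']}/{tally['reps']} exploit, "
        f"{tally['honest']} honest, {tally['neither']} neither"
    )
```

Reporting the honest and neither counts separately keeps "model solved honestly" from being read as "hint not understood". The default of 8 reps is arbitrary and would need to be chosen against the intended reliability bar.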