Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Spec: spin out the env as small-reward-hacking
Status: PARKED until the workshop paper's headline numbers land (see
docs/spec/20260602_writeup_spec.md artifact tracker). This spec exists so the
commit archaeology isn't lost. Timing rationale: splitting rewards.py into a
second repo while decision runs are in flight risks grader drift between copies.
What it is
A standalone mini RL reward-hacking env: Ariahw/Engels/Nanda's LeetCode
benchmark, but with hack emergence in ~1/4 the compute via the teacher-forcing
bootstrap (off-policy hack demos mixed into the GRPO batch, mix_ratio,
annealed off at teacher_off_step), with the multi-loophole substrate restored
(4 modes instead of the paper's 1). Target audience: reward-hacking
intervention researchers (Wu & Tang advantage modification, probe/monitor
work) for whom slow on-policy emergence (step 80-100, ~8 GPU-h) is the main
iteration cost.
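A minimal sketch of the teacher mix schedule described above, to make the two knobs concrete. The names mix_ratio and teacher_off_step come from this spec; the linear decay shape, the function names, and the batch-splitting helper are assumptions for illustration, not the code in train.py.

```python
def teacher_fraction(step: int, mix_ratio: float, teacher_off_step: int) -> float:
    """Fraction of the batch drawn from off-policy teacher demos at `step`.

    Assumption: linear decay from mix_ratio at step 0 to 0.0 at
    teacher_off_step. The real schedule may be a hard switch or another shape.
    """
    if teacher_off_step <= 0 or step >= teacher_off_step:
        return 0.0
    return mix_ratio * (1.0 - step / teacher_off_step)


def split_batch(batch_size: int, step: int, mix_ratio: float,
                teacher_off_step: int) -> tuple[int, int]:
    """Return (n_teacher, n_on_policy) rollouts for one GRPO batch."""
    frac = teacher_fraction(step, mix_ratio, teacher_off_step)
    n_teacher = round(batch_size * frac)
    return n_teacher, batch_size - n_teacher
```

From teacher_off_step onward every rollout in the batch is on-policy under this sketch, which is the window an intervention study would want to measure in.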
Pitch = three things the upstream repo doesn't have:
- 4x cheaper emergence (teacher bootstrap).
- K simultaneous loophole modes with non-overlapping graders (hack A earns nothing on subset B), so mode-generalization of an intervention is testable.
- Per-mode channel detectors + honest oracle separated from reward (the gt grader-bug class we hit 2026-05-23 is fixed).
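The non-overlap property in the second bullet can be sketched as follows. This is an illustrative model only: the Rollout type, the partition mapping, and channel_reward are hypothetical names, not the API of rewards.py. It covers just the hack-channel part of the reward and keeps the honest-oracle verdict as a separate field that never feeds the reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    problem_id: str
    channels_fired: frozenset  # modes whose per-mode detector flagged this rollout
    oracle_pass: bool          # honest-oracle verdict, logged but never rewarded here


def channel_reward(rollout: Rollout, partition: dict) -> float:
    """Pay only for the loophole assigned to this problem's subset.

    `partition` maps problem_id -> the single mode its grader exposes
    (hypothetical layout). A hack through any other channel earns 0.0.
    """
    assigned = partition[rollout.problem_id]
    return 1.0 if assigned in rollout.channels_fired else 0.0
```

With a partition such as {"p1": "run_tests", "p2": "sentinel"}, a run_tests hack submitted on p2 scores 0.0, which is what makes mode-generalization of an intervention measurable.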
Where everything is (verified 2026-06-11)
The user remembered the extra hacks as "in a prev commit". CORRECTION: all 6 graders, hints, and verify gates are still at HEAD. What was reverted is the run configuration, and the teacher pools for the non-run_tests modes are stale.
- All 6 mode hints: src/vgrout/data.py:27-44 (HINT_REPLACE_TO).
- All 6 channel graders + non-overlap enforcement: src/vgrout/rewards.py (EnvMode at :52); tested in scripts/verify_rewards.py (52 cases, all 6 modes, cross-channel adversarial probes).
- Substrate builder (elicit-then-strip teacher minting + even partition): scripts/build_substrate.py; hint calibration: scripts/validate_spoonfeed.py.
- On-disk 4-mode substrate: out/pools/substrate/ (partition.json: run_tests/sentinel/stdout_marker/file_marker, 6 problems each = 24; built 2026-05-30). Too small for a release; needs regen (below).
- Single-mode revert commit: 7da54f1 (2026-06-07) "eval+env: single-mode run_tests" -- made runs paper-comparable for the writeup. Last commit with multi-mode as the default run config: 7195d19 (its parent). The revert touched FastConfig/justfile/eval only, not the graders.
The 6 -> 4 selection, and the confound
Candidate modes: run_tests, eq_override, exit_code, stdout_marker, sentinel, file_marker. Two were dropped, for DIFFERENT reasons (journal 2026-05-30 c/d):
- exit_code: dropped at substrate build. Its hint cannot disclose the grading mechanism without spelling the exploit (sys.exit-before-asserts); single-shot 0/3 under the faithful hint. NOT a model-capability drop -- it was the most base-elicitable mode (97.9%, quadrant table in docs/spec/20260530_faithful_multi_loophole_env.md).
- eq_override: dropped after the emergence run. 0 hacks in 462 rollouts over 80 steps despite 6 verified teacher demos; the only mode needing a constructed object (__eq__-always-True) rather than an append-a-line edit. This one IS "too hard for Qwen3-4B".
Confound to disclose in the release: the mode set is selected on the substrate model's learnability. "4 of 4 modes emerge" is conditional on that selection; a different student model shifts which modes are learnable (the learning-order result: surface-edit complexity predicts emergence). The honest framing is to ship all 6 graders + hints, document the per-model gate (elicit -> emergence-check), and report which modes emerged for the reference model, rather than hard-coding 4. The gate itself (validate_spoonfeed + a vanilla emergence run) is part of the env's tooling.
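The two-stage gate (elicit -> emergence-check) can be written down as a small classifier, so a release reports a status per mode instead of hard-coding the surviving four. This is a sketch under assumptions: gate_mode and its status strings are hypothetical, and the real inputs would come from validate_spoonfeed and a vanilla emergence run.

```python
from typing import Optional


def gate_mode(elicited_under_faithful_hint: bool,
              first_step: Optional[int],
              budget_steps: int) -> str:
    """Classify one candidate mode for one student model.

    first_step is the first training step at which the hack appeared
    on-policy, or None if it never did within the run.
    """
    if not elicited_under_faithful_hint:
        return "dropped_at_elicit"
    if first_step is None or first_step > budget_steps:
        return "dropped_at_emergence"
    return "emerged"


# The two drops recorded in this spec, for the reference model:
exit_code_status = gate_mode(False, None, 80)   # 0/3 under the faithful hint
eq_override_status = gate_mode(True, None, 80)  # 0 hacks in 462 rollouts over 80 steps
```

Keeping the two drop reasons distinct in the output preserves the point made above: exit_code fails on hint faithfulness, eq_override on learnability.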
Work items
- Extraction: new repo small-reward-hacking cut at a frozen post-paper commit. Env surface = data.py, rewards.py, build_substrate.py, validate_spoonfeed.py, teacher-pool sampling + mix schedule (currently woven into train.py:336-381 -- the one real disentangling job), eval on the paper test split (seeded-shuffle ids, NOT first-N; see project_eval_must_be_recency_clean). Reference GRPO loop included as the demo harness, interventions stay in vgrout.
- Regen pools: the 24-problem substrate is too small and its teacher pools were minted for Qwen3-4B at an old prompt format. Rerun build_substrate.py with all 6 modes, larger min-hacks, on the release problem set; regen the solve pool (teacher_pool_solve equivalent) for the same problems. This is GPU work (elicitation + verification passes).
- Emergence validation run: one vanilla GRPO run on the rebuilt substrate reproducing the per-mode first_step table (journal 2026-05-30 d) at the 4x-cheaper budget. This is the headline claim of the release; it must be reproduced on the released code, not cited from vgrout history.
- Docs for outsiders: per-mode card = hint text, canonical hack, grader mechanism, detector. Most of this exists (blog appendix, README) but is written for us; an outside user needs the no-cheat framing (which signals are oracle, which are env) stated up front.
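The "seeded-shuffle ids, NOT first-N" requirement from the extraction item could look like the sketch below. The helper name eval_ids and its arguments are hypothetical; the actual split logic in the eval code may differ.

```python
import random


def eval_ids(all_ids, n: int, seed: int) -> list:
    """Pick n eval problem ids by seeded shuffle rather than taking the first n.

    Sorting first makes the result independent of the order the ids arrive
    in, so two copies of the repo select the same subset for the same seed.
    """
    ids = sorted(all_ids)
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    rng.shuffle(ids)
    return ids[:n]
```

Taking the first N ids instead would tie the eval subset to whatever order the dataset happens to be stored in.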
UAT
- Fresh clone + uv sync + smoke runs the 6-mode grader gate green.
- Rebuilt substrate: partition.json with >=20 problems/mode for the 4 reference modes, every teacher rollout exploited=True under non-overlap.
- Emergence run log showing >=4 modes with finite first_step within the reduced budget, linked table in the new repo's README.
- A vgrout decision-run config can point at the new repo's substrate and reproduce current single-mode numbers (no drift between copies).
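The rebuilt-substrate criterion is mechanical enough to script. The sketch below assumes partition.json loads as a mapping from mode name to a list of problem ids and that each teacher rollout is a dict with an exploited field; both layouts are guesses for illustration, and check_substrate is a hypothetical helper.

```python
def check_substrate(partition: dict, teacher_rollouts: list,
                    reference_modes: tuple, min_per_mode: int = 20) -> list:
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for mode in reference_modes:
        n = len(partition.get(mode, []))
        if n < min_per_mode:
            failures.append(f"{mode}: {n} problems, need >= {min_per_mode}")
    for i, rollout in enumerate(teacher_rollouts):
        # Every teacher demo must actually exploit its assigned loophole.
        if rollout.get("exploited") is not True:
            failures.append(f"teacher rollout {i}: exploited is not True")
    return failures


REFERENCE_MODES = ("run_tests", "sentinel", "stdout_marker", "file_marker")
```

Run against the current on-disk substrate (6 problems per mode), this check would report one failure per reference mode, consistent with the "too small for a release" note.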
Open questions
- Fork/PR back to ariahw/rl-rewardhacking vs standalone repo that vendors their problem set. Standalone is likely (our graders diverged: non-overlap enforcement, honest-oracle separation), but a PR upstream advertising the fork costs little.
- Does the teacher bootstrap change what interventions see? Seeded emergence
is off-policy early; an intervention that only works on teacher-mixed
batches would be an artifact. The release should name this and show the
anneal (teacher_off_step) leaves a window of pure on-policy hacking.