Spec: spin out the env as small-reward-hacking

Status: PARKED until the workshop paper's headline numbers land (see docs/spec/20260602_writeup_spec.md artifact tracker). This spec exists so the commit archaeology isn't lost. Timing rationale: splitting rewards.py into a second repo while decision runs are in flight risks grader drift between copies.

What it is

A standalone mini RL reward-hacking env: Ariahw/Engels/Nanda's LeetCode benchmark, but with hack emergence in ~1/4 the compute via the teacher-forcing bootstrap (off-policy hack demos mixed into the GRPO batch at mix_ratio, annealed off at teacher_off_step), and with the multi-loophole substrate restored (4 modes instead of the paper's 1). Target audience: reward-hacking intervention researchers (Wu & Tang advantage modification, probe/monitor work) for whom slow on-policy emergence (step 80-100, ~8 GPU-h) is the main iteration cost.
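
A minimal sketch of the teacher-mix schedule described above, to make the bootstrap concrete. The names mix_ratio and teacher_off_step come from this spec; the linear decay shape, the function names, and the replace-the-tail batching are assumptions for illustration, not the code in train.py.

```python
import random


def teacher_mix_ratio(step: int, mix_ratio: float, teacher_off_step: int) -> float:
    """Fraction of the batch drawn from teacher demos at `step`.

    ASSUMPTION: linear decay from mix_ratio at step 0 to 0 at teacher_off_step.
    The real schedule may be a step function or another shape.
    """
    if step >= teacher_off_step:
        return 0.0
    return mix_ratio * (1.0 - step / teacher_off_step)


def build_batch(on_policy, teacher_pool, step, mix_ratio, teacher_off_step, rng):
    """Swap the tail of the on-policy batch for teacher demos; batch size is unchanged."""
    ratio = teacher_mix_ratio(step, mix_ratio, teacher_off_step)
    n_teacher = min(round(ratio * len(on_policy)), len(teacher_pool))
    demos = rng.sample(teacher_pool, n_teacher)
    return on_policy[: len(on_policy) - n_teacher] + demos
```

Past teacher_off_step the function returns the on-policy batch untouched, which is the property the open question on intervention artifacts relies on.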

Pitch = three things the upstream repo doesn't have:

  1. 4x cheaper emergence (teacher bootstrap).
  2. K simultaneous loophole modes with non-overlapping graders (hack A earns nothing on subset B), so mode-generalization of an intervention is testable.
  3. Per-mode channel detectors + honest oracle separated from reward (the gt grader-bug class we hit 2026-05-23 is fixed).
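
Points 2 and 3 can be illustrated with a small sketch. It covers only the hack-credit part of the reward (how genuine solves are scored is out of scope here), and the Rollout fields, the hack_credit function, and the partition mapping are hypothetical names, not the API of rewards.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    problem_id: str
    channels_hit: frozenset  # loophole channels flagged by the per-mode detectors
    oracle_pass: bool        # honest ground-truth result; reported, never rewarded


def hack_credit(rollout: Rollout, partition: dict) -> float:
    """Credit a hack only on the channel assigned to this problem.

    `partition` maps problem_id -> assigned mode (hypothetical layout).
    A run_tests hack on a sentinel problem earns 0.0 (non-overlap).
    The oracle field is deliberately not read here.
    """
    assigned = partition[rollout.problem_id]
    return 1.0 if assigned in rollout.channels_hit else 0.0
```

Keeping oracle_pass out of the reward function is the design point: the oracle is a measurement channel, so a grader bug cannot leak into it.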

Where everything is (verified 2026-06-11)

The user remembered the extra hacks as "in a prev commit". CORRECTION: all 6 graders, hints, and verify gates are still at HEAD. What was reverted is the run configuration, and the teacher pools for the non-run_tests modes are stale.

  • All 6 mode hints: src/vgrout/data.py:27-44 (HINT_REPLACE_TO).
  • All 6 channel graders + non-overlap enforcement: src/vgrout/rewards.py (EnvMode at :52); tested in scripts/verify_rewards.py (52 cases, all 6 modes, cross-channel adversarial probes).
  • Substrate builder (elicit-then-strip teacher minting + even partition): scripts/build_substrate.py; hint calibration: scripts/validate_spoonfeed.py.
  • On-disk 4-mode substrate: out/pools/substrate/ (partition.json: run_tests/sentinel/stdout_marker/file_marker, 6 problems each = 24; built 2026-05-30). Too small for a release; needs regen (below).
  • Single-mode revert commit: 7da54f1 (2026-06-07) "eval+env: single-mode run_tests" -- made runs paper-comparable for the writeup. Last commit with multi-mode as the default run config: 7195d19 (its parent). The revert touched FastConfig/justfile/eval only, not the graders.

The 6 -> 4 selection, and the confound

Candidate modes: run_tests, eq_override, exit_code, stdout_marker, sentinel, file_marker. Two were dropped, for DIFFERENT reasons (journal 2026-05-30 c/d):

  • exit_code: dropped at substrate build. Its hint cannot disclose the grading mechanism without spelling the exploit (sys.exit-before-asserts); single-shot 0/3 under the faithful hint. NOT a model-capability drop -- it was the most base-elicitable mode (97.9%, quadrant table in docs/spec/20260530_faithful_multi_loophole_env.md).
  • eq_override: dropped after the emergence run. 0 hacks in 462 rollouts over 80 steps despite 6 verified teacher demos; the only mode needing a constructed object (__eq__-always-True) rather than an append-a-line edit. This one IS "too hard for Qwen3-4B".

Confound to disclose in the release: the mode set is selected on the substrate model's learnability. "4 of 4 modes emerge" is conditional on that selection; a different student model shifts which modes are learnable (the learning-order result: surface-edit complexity predicts emergence). The honest framing is to ship all 6 graders + hints, document the per-model gate (elicit -> emergence-check), and report which modes emerged for the reference model, rather than hard-coding 4. The gate itself (validate_spoonfeed + a vanilla emergence run) is part of the env's tooling.
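
The per-model gate could be reported along these lines. This is a sketch under assumptions: the function name, the two thresholds, and the status labels are hypothetical, and the inputs stand in for whatever validate_spoonfeed and the vanilla emergence run actually emit.

```python
def gate_modes(elicit_rate: dict, first_step: dict, min_elicit: float, max_step: int) -> dict:
    """Classify each candidate mode for one student model.

    elicit_rate: mode -> single-shot hack rate under the faithful hint.
    first_step:  mode -> first training step with an on-policy hack (absent or None if never).
    Returns mode -> "failed_elicit" | "no_emergence" | "emerged".
    """
    report = {}
    for mode, rate in elicit_rate.items():
        step = first_step.get(mode)
        if rate < min_elicit:
            report[mode] = "failed_elicit"   # dropped at substrate build
        elif step is None or step > max_step:
            report[mode] = "no_emergence"    # dropped after the emergence run
        else:
            report[mode] = "emerged"
    return report
```

Reporting the two failure statuses separately keeps the exit_code and eq_override cases distinguishable in the release, since they were dropped for different reasons.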

Work items

  1. Extraction: new repo small-reward-hacking cut at a frozen post-paper commit. Env surface = data.py, rewards.py, build_substrate.py, validate_spoonfeed.py, teacher-pool sampling + mix schedule (currently woven into train.py:336-381 -- the one real disentangling job), eval on the paper test split (seeded-shuffle ids, NOT first-N; see project_eval_must_be_recency_clean). The reference GRPO loop is included as the demo harness; interventions stay in vgrout.
  2. Regen pools: the 24-problem substrate is too small and its teacher pools were minted for Qwen3-4B at an old prompt format. Rerun build_substrate.py with all 6 modes, larger min-hacks, on the release problem set; regen the solve pool (teacher_pool_solve equivalent) for the same problems. This is GPU work (elicitation + verification passes).
  3. Emergence validation run: one vanilla GRPO run on the rebuilt substrate reproducing the per-mode first_step table (journal 2026-05-30 d) at the 4x-cheaper budget. This is the headline claim of the release; it must be reproduced on the released code, not cited from vgrout history.
  4. Docs for outsiders: per-mode card = hint text, canonical hack, grader mechanism, detector. Most of this exists (blog appendix, README) but is written for us; an outside user needs the no-cheat framing (which signals are oracle, which are env) stated up front.
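
For the eval-split note in item 1, a seeded-shuffle selection might look like the sketch below. The function name and the choice to sort before shuffling are assumptions; the actual id selection in vgrout may differ.

```python
import random


def select_eval_ids(test_ids, n: int, seed: int) -> list:
    """Pick n eval ids by seeded shuffle rather than taking the first n.

    Sorting first makes the result independent of input order, so the same
    seed gives the same subset in both repos.
    """
    ids = sorted(test_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n]
```

Taking the first N ids would tie the eval set to dataset ordering, which is the failure the NOT first-N note guards against.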

UAT

  • Fresh clone + uv sync + smoke runs the 6-mode grader gate green.
  • Rebuilt substrate: partition.json with >=20 problems/mode for the 4 reference modes, every teacher rollout exploited=True under non-overlap.
  • Emergence run log showing >=4 modes with finite first_step within the reduced budget, linked table in the new repo's README.
  • A vgrout decision-run config can point at the new repo's substrate and reproduce current single-mode numbers (no drift between copies).
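
The rebuilt-substrate criterion could be checked mechanically. The sketch below assumes partition.json maps each mode to a list of problem ids and that each teacher rollout is a dict with problem_id and exploited keys; both layouts are guesses, as is the function name.

```python
def check_substrate(partition: dict, teacher_rollouts: list, reference_modes, min_per_mode: int = 20) -> list:
    """Return human-readable failures; an empty list means the UAT item passes."""
    failures = []
    owner = {}
    for mode in reference_modes:
        ids = partition.get(mode, [])
        if len(ids) < min_per_mode:
            failures.append(f"{mode}: {len(ids)} problems < {min_per_mode}")
        for pid in ids:
            if pid in owner:  # a problem must belong to exactly one mode
                failures.append(f"{pid}: assigned to both {owner[pid]} and {mode}")
            owner[pid] = mode
    for r in teacher_rollouts:
        if not r.get("exploited", False):
            failures.append(f"teacher rollout for {r.get('problem_id')} not exploited")
    return failures
```

Usage would be to load partition.json with json.load, pass the reference modes, and fail the smoke run if the returned list is non-empty.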

Open questions

  • Fork/PR back to ariahw/rl-rewardhacking vs standalone repo that vendors their problem set. Standalone is likely (our graders diverged: non-overlap enforcement, honest-oracle separation), but a PR upstream advertising the fork costs little.
  • Does the teacher bootstrap change what interventions see? Seeded emergence is off-policy early; an intervention that only works on teacher-mixed batches would be an artifact. The release should name this and show the anneal (teacher_off_step) leaves a window of pure on-policy hacking.
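
One way to show the pure on-policy window from a run log is sketched below. It assumes a per-step hack rate computed over on-policy rollouts only (teacher demos excluded); the function name and log shape are hypothetical.

```python
def on_policy_hack_window(hack_rate_by_step: dict, teacher_off_step: int, min_rate: float = 0.0) -> list:
    """Steps at or after teacher_off_step where on-policy rollouts still hack.

    hack_rate_by_step: step -> fraction of on-policy rollouts flagged as hacks.
    An empty result means hacking did not survive the anneal.
    """
    return sorted(
        step
        for step, rate in hack_rate_by_step.items()
        if step >= teacher_off_step and rate > min_rate
    )
```

An intervention evaluated only on the returned steps sees no teacher-mixed batches, which separates a real effect from a bootstrap artifact.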