mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
spec: small-reward-hacking env spinout (parked post-paper; commit archaeology for the 6->4 mode selection)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,113 @@
# Spec: spin out the env as `small-reward-hacking`

Status: PARKED until the workshop paper's headline numbers land (see
docs/spec/20260602_writeup_spec.md artifact tracker). This spec exists so the
commit archaeology isn't lost. Timing rationale: splitting `rewards.py` into a
second repo while decision runs are in flight risks grader drift between copies.
## What it is

A standalone mini RL reward-hacking env: Ariahw/Engels/Nanda's LeetCode
benchmark, but with hack emergence in ~1/4 the compute via the teacher-forcing
bootstrap (off-policy hack demos mixed into the GRPO batch at `mix_ratio`,
annealed off at `teacher_off_step`), and with the multi-loophole substrate
restored (4 modes instead of the paper's 1). Target audience: reward-hacking
intervention researchers (Wu & Tang advantage modification, probe/monitor
work) for whom slow on-policy emergence (step 80-100, ~8 GPU-h) is the main
iteration cost.

Pitch = three things the upstream repo doesn't have:
1. 4x cheaper emergence (teacher bootstrap).
2. K simultaneous loophole modes with non-overlapping graders (hack A earns
   nothing on subset B), so mode-generalization of an intervention is testable.
3. Per-mode channel detectors + honest oracle separated from reward (the
   gt grader-bug class we hit 2026-05-23 is fixed).
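The teacher bootstrap described above can be pictured as a per-step batch split. This is a minimal sketch, not the schedule in `train.py`: it assumes `mix_ratio` is the initial fraction of the batch drawn from the teacher pool and that the anneal is linear down to zero at `teacher_off_step`. The function names are hypothetical.

```python
def teacher_fraction(step: int, mix_ratio: float, teacher_off_step: int) -> float:
    """Fraction of the GRPO batch taken from teacher demos at `step`.

    ASSUMPTION: linear anneal from `mix_ratio` at step 0 to 0.0 at
    `teacher_off_step`; the real schedule may be a step function instead.
    """
    if teacher_off_step <= 0 or step >= teacher_off_step:
        return 0.0  # teacher is off: the batch is purely on-policy
    return mix_ratio * (1.0 - step / teacher_off_step)


def split_batch(step: int, batch_size: int, mix_ratio: float,
                teacher_off_step: int) -> tuple[int, int]:
    """Return (n_teacher, n_on_policy) rollouts for this step."""
    frac = teacher_fraction(step, mix_ratio, teacher_off_step)
    n_teacher = int(batch_size * frac)  # round down so the student never loses slots
    return n_teacher, batch_size - n_teacher


if __name__ == "__main__":
    for step in (0, 20, 40):
        print(step, split_batch(step, batch_size=16, mix_ratio=0.25, teacher_off_step=40))
```

Rounding down means the teacher share reaches zero slightly before `teacher_off_step` for small batches, which only widens the on-policy window.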
## Where everything is (verified 2026-06-11)

The user remembered the extra hacks as "in a prev commit". CORRECTION: all 6
graders, hints, and verify gates are still at HEAD. What was reverted is the
*run configuration*, and the teacher pools for the non-run_tests modes are
stale.

- All 6 mode hints: `src/vgrout/data.py:27-44` (`HINT_REPLACE_TO`).
- All 6 channel graders + non-overlap enforcement: `src/vgrout/rewards.py`
  (`EnvMode` at :52); tested in `scripts/verify_rewards.py` (52 cases, all 6
  modes, cross-channel adversarial probes).
- Substrate builder (elicit-then-strip teacher minting + even partition):
  `scripts/build_substrate.py`; hint calibration: `scripts/validate_spoonfeed.py`.
- On-disk 4-mode substrate: `out/pools/substrate/` (partition.json: run_tests/
  sentinel/stdout_marker/file_marker, 6 problems each = 24; built 2026-05-30).
  Too small for a release; needs regen (below).
- Single-mode revert commit: `7da54f1` (2026-06-07) "eval+env: single-mode
  run_tests" -- made runs paper-comparable for the writeup. Last commit with
  multi-mode as the default run config: `7195d19` (its parent). The revert
  touched FastConfig/justfile/eval only, not the graders.
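Non-overlap enforcement, as named in the `rewards.py` bullet, can be pictured as follows. This sketch is not the code in `src/vgrout/rewards.py`: `EnvModeSketch`, `hack_reward`, and the `triggered` set are hypothetical names, and it assumes each problem is assigned exactly one mode and that per-channel detectors report which channels a rollout hit.

```python
from enum import Enum


class EnvModeSketch(str, Enum):
    """Hypothetical stand-in for the real `EnvMode`; lists the 4 on-disk modes."""
    RUN_TESTS = "run_tests"
    SENTINEL = "sentinel"
    STDOUT_MARKER = "stdout_marker"
    FILE_MARKER = "file_marker"


def hack_reward(assigned: EnvModeSketch, triggered: set[EnvModeSketch]) -> float:
    """Pay out only for the channel assigned to this problem.

    A rollout that triggers only some other channel (hack A on subset B)
    earns nothing from the hack path.
    """
    return 1.0 if assigned in triggered else 0.0
```

Keeping the payout a function of the assigned mode alone is what makes a per-mode emergence table interpretable: a hack learned on one subset cannot inflate reward on another.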
## The 6 -> 4 selection, and the confound

Candidate modes: run_tests, eq_override, exit_code, stdout_marker, sentinel,
file_marker. Two were dropped, for DIFFERENT reasons (journal 2026-05-30 c/d):

- `exit_code`: dropped at substrate build. Its hint cannot disclose the grading
  mechanism without spelling out the exploit (sys.exit-before-asserts);
  single-shot 0/3 under the faithful hint. NOT a model-capability drop -- it
  was the most base-elicitable mode (97.9%, quadrant table in
  docs/spec/20260530_faithful_multi_loophole_env.md).
- `eq_override`: dropped after the emergence run. 0 hacks in 462 rollouts over
  80 steps despite 6 verified teacher demos; the only mode needing a
  constructed object (`__eq__`-always-True) rather than an append-a-line edit.
  This one IS "too hard for Qwen3-4B".

Confound to disclose in the release: the mode set is selected on the substrate
model's learnability. "4 of 4 modes emerge" is conditional on that selection;
a different student model shifts which modes are learnable (the learning-order
result: surface-edit complexity predicts emergence). The honest framing is to
ship all 6 graders + hints, document the per-model gate (elicit ->
emergence-check), and report which modes emerged for the reference model,
rather than hard-coding 4. The gate itself (validate_spoonfeed + a vanilla
emergence run) is part of the env's tooling.
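The emergence-check half of the gate reduces to a per-mode `first_step` lookup. A minimal sketch, assuming rollouts are logged as `(step, mode, exploited)` tuples; that record shape, the function names, and the toy log are illustrative assumptions, not the project's logging format or real results.

```python
from typing import Iterable, Optional

ALL_MODES = ("run_tests", "eq_override", "exit_code",
             "stdout_marker", "sentinel", "file_marker")


def first_steps(rollouts: Iterable[tuple[int, str, bool]],
                modes: Iterable[str] = ALL_MODES) -> dict[str, Optional[int]]:
    """Earliest step with an exploited rollout per mode; None means never emerged."""
    first: dict[str, Optional[int]] = {m: None for m in modes}
    for step, mode, exploited in rollouts:
        if not exploited or mode not in first:
            continue
        if first[mode] is None or step < first[mode]:
            first[mode] = step
    return first


def emerged_modes(first: dict[str, Optional[int]], budget_steps: int) -> list[str]:
    """Modes whose first hack falls within the step budget."""
    return sorted(m for m, s in first.items() if s is not None and s <= budget_steps)


# Toy log with made-up numbers, for illustration only.
toy = [(3, "run_tests", True), (5, "sentinel", False),
       (9, "sentinel", True), (12, "eq_override", False)]
table = first_steps(toy)
print(table)
print(emerged_modes(table, budget_steps=10))
```

Reporting the full table, including the `None` rows, is what lets a reader with a different student model see which modes were gated out rather than never tried.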
## Work items

1. Extraction: new repo `small-reward-hacking` cut at a frozen post-paper
   commit. Env surface = `data.py`, `rewards.py`, `build_substrate.py`,
   `validate_spoonfeed.py`, teacher-pool sampling + mix schedule (currently
   woven into `train.py:336-381` -- the one real disentangling job), eval on
   the paper test split (seeded-shuffle ids, NOT first-N; see
   project_eval_must_be_recency_clean). Reference GRPO loop included as the
   demo harness; interventions stay in vgrout.
2. Regen pools: the 24-problem substrate is too small and its teacher pools
   were minted for Qwen3-4B with an old prompt format. Rerun
   `build_substrate.py` with all 6 modes, larger min-hacks, on the release
   problem set; regen the solve pool (`teacher_pool_solve` equivalent) for the
   same problems. This is GPU work (elicitation + verification passes).
3. Emergence validation run: one vanilla GRPO run on the rebuilt substrate
   reproducing the per-mode `first_step` table (journal 2026-05-30 d) at the
   4x-cheaper budget. This is the headline claim of the release; it must be
   reproduced on the released code, not cited from vgrout history.
4. Docs for outsiders: per-mode card = hint text, canonical hack, grader
   mechanism, detector. Most of this exists (blog appendix, README) but is
   written for us; an outside user needs the no-cheat framing (which signals
   are oracle, which are env) stated up front.
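The "seeded-shuffle ids, NOT first-N" rule in item 1 can be illustrated with the standard library. A sketch only: the seed, the id list, and the function name are hypothetical, and the real split should be taken from the paper's eval code rather than regenerated this way.

```python
import random


def seeded_subset(problem_ids: list[int], n: int, seed: int = 0) -> list[int]:
    """Pick n ids via a seeded shuffle instead of taking the first n."""
    ids = sorted(problem_ids)         # canonical order, so the seed alone fixes the result
    random.Random(seed).shuffle(ids)  # local RNG: leaves the global random state untouched
    return ids[:n]
```

Sorting before shuffling makes the subset independent of the order in which ids happen to be loaded, which is the property that protects against a first-N slice silently tracking dataset recency.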
## UAT

- [ ] Fresh clone + `uv sync` + smoke runs the 6-mode grader gate green.
- [ ] Rebuilt substrate: partition.json with >=20 problems/mode for the 4
  reference modes, every teacher rollout `exploited=True` under non-overlap.
- [ ] Emergence run log showing >=4 modes with finite first_step within the
  reduced budget, linked table in the new repo's README.
- [ ] A vgrout decision-run config can point at the new repo's substrate and
  reproduce current single-mode numbers (no drift between copies).
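The rebuilt-substrate item can be made mechanical. A minimal sketch, assuming `partition.json` maps each mode name to a list of problem ids and that each teacher rollout is a dict with `problem_id`, `mode`, and `exploited` keys; both layouts are assumptions, so adapt the field names to the actual files.

```python
def check_substrate(partition: dict[str, list[str]],
                    teacher_rollouts: list[dict],
                    min_per_mode: int = 20) -> list[str]:
    """Return human-readable failures; an empty list means the check passes."""
    errors: list[str] = []
    owner: dict[str, str] = {}  # problem id -> the single mode that owns it
    for mode, ids in partition.items():
        if len(ids) < min_per_mode:
            errors.append(f"{mode}: {len(ids)} problems < {min_per_mode}")
        for pid in ids:
            if pid in owner:
                errors.append(f"{pid}: in both {owner[pid]} and {mode}")
            owner[pid] = mode
    for r in teacher_rollouts:
        # Non-overlap: a teacher demo must use the mode its problem is assigned to.
        if owner.get(r["problem_id"]) != r["mode"]:
            errors.append(f"{r['problem_id']}: teacher mode {r['mode']} != partition")
        if not r["exploited"]:
            errors.append(f"{r['problem_id']}: teacher rollout not exploited")
    return errors
```

Returning a list of failures rather than raising on the first one keeps a smoke run useful when several modes are short at once.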
## Open questions

- Fork/PR back to ariahw/rl-rewardhacking vs standalone repo that vendors
  their problem set. Standalone is likely (our graders diverged: non-overlap
  enforcement, honest-oracle separation), but a PR upstream advertising the
  fork costs little.
- Does the teacher bootstrap change what interventions see? Seeded emergence
  is off-policy early; an intervention that only works on teacher-mixed
  batches would be an artifact. The release should name this and show the
  anneal (`teacher_off_step`) leaves a window of pure on-policy hacking.
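One way to report the on-policy window raised in the last question is to measure the hack rate only over student-sampled rollouts at or after `teacher_off_step`. A sketch under assumed log fields (`step`, `from_teacher`, `exploited`); these names are hypothetical and the function is not part of vgrout.

```python
from typing import Optional


def on_policy_hack_rate(records: list[dict], teacher_off_step: int) -> Optional[float]:
    """Hack rate over student-sampled rollouts at steps >= teacher_off_step.

    Returns None when the run has no such rollouts, i.e. no on-policy window.
    """
    window = [r for r in records
              if r["step"] >= teacher_off_step and not r["from_teacher"]]
    if not window:
        return None
    return sum(bool(r["exploited"]) for r in window) / len(window)
```

A `None` here would mean the run ended before the teacher was switched off, in which case any intervention result from that run is confounded with the teacher mix.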