mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

wassname 028b8fff68 transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge)

Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-09 11:48:02 +00:00

Ariahw et al. 2025 -- results table (transcribed from the figures)

The paper publishes results as figures only, with no numeric table, so every number we want lives in an image. This file transcribes them once so we (and our plots) never re-OCR. Each cell was read off the source figure named in that table's provenance (source img) line.

Steering RL Training: Benchmarking Interventions against Reward Hacking -- Ariahw, Engels & Nanda 2025 -- LessWrong

epistemic context: the substrate paper. Numbers below transcribed by reading the figure PNGs directly (downloaded from the post's cloudinary mirror) on 2026-06-09.
metric defs: Reward Hacking = fraction of eval rollouts flagged as RH in the loophole env (our hack). Performance = pass rate in the no-loophole env (our solve). * = significantly higher than RL Baseline, † = significantly lower (α = 0.01). ±SD is across-seed spread (n=3 runs/cell).

Master table (Figure 5 -- "Overview of reward hacking and performance for all interventions")

source img: mirroredImages/R5MdWGKsuvdPwGFBG/imeotdksvqyy8y8twbbq (Fig 5)

| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
|---|---|---|---|---|---|
| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
| No Intervention | No RH (RL Baseline = ceiling) | 0.2% | ±0.2 | 22.3% | ±1.0 |
| No Intervention | RH (No Intervention = floor) | 79.1% `*` | ±10.3 | 14.9% `†` | ±8.2 |
| Penalty | Ground Truth 100% | 0.1% | ±0.1 | 25.0% `*` | ±3.3 |
| Penalty | Ground Truth 90% | 15.2% `*` | ±26.1 | 22.4% | ±1.3 |
| Penalty | Ground Truth 70% | 0.0% | ±0.0 | 17.5% `†` | ±2.8 |
| Penalty | Probe | 0.0% | ±0.0 | 19.4% | ±0.8 |
| Penalty | LLM Judge | 0.1% | ±0.1 | 16.2% `†` | ±4.9 |
| Screening | Ground Truth 100% | 0.9% `*` | ±0.6 | 26.3% `*` | ±3.9 |
| Screening | Ground Truth 90% | 7.5% `*` | ±11.8 | 23.3% | ±1.4 |
| Screening | Ground Truth 70% | 19.6% `*` | ±34.0 | 21.9% | ±1.6 |
| Screening | Probe | 17.8% `*` | ±28.0 | 18.4% `†` | ±0.8 |
| Screening | LLM Judge | 15.7% `*` | ±23.6 | 16.3% `†` | ±3.6 |
| Inoculation Prompting | Loophole | 38.3% `*` | ±28.9 | 18.3% `†` | ±2.0 |
| Inoculation Prompting | PassTests | 43.9% `*` | ±37.6 | 21.4% | ±2.3 |
| Inoculation Prompting | PassTests+LH | 26.4% `*` | ±27.6 | 14.3% `†` | ±2.1 |
| Inoculation Prompting | EvalEnv | 36.5% `*` | ±20.2 | 18.9% `†` | ±3.7 |
| Inoculation Prompting | EvalEnv+LH | 47.0% `*` | ±12.5 | 17.0% `†` | ±3.2 |
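
Since the point of this file is that plots read the numbers from here rather than from the image, a small parser for the cell notation may be useful. The following is a minimal sketch, not the repo's actual plotting code: the names `Cell`, `parse_pct` and `parse_sd` are hypothetical, and it assumes cells look exactly like the ones above (a percentage, an optional backtick-wrapped `*` or `†` flag, and `±x.x` or `--` for the spread).

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: float          # percentage points, e.g. 79.1
    flag: Optional[str]   # "*" (sig. higher), "†" (sig. lower), or None


def parse_pct(text: str) -> Cell:
    """Parse a cell such as '79.1% `*`' or '22.4%' (hypothetical helper)."""
    m = re.fullmatch(r"\s*([0-9.]+)%\s*`?([*†])?`?\s*", text)
    if m is None:
        raise ValueError(f"unrecognised percentage cell: {text!r}")
    return Cell(float(m.group(1)), m.group(2))


def parse_sd(text: str) -> Optional[float]:
    """Parse a spread cell such as '±10.3'; '--' means no spread reported."""
    text = text.strip()
    if text == "--":
        return None
    return float(text.lstrip("±"))


# Two cells copied from the Fig 5 rows above.
floor_hack = parse_pct("79.1% `*`")
judge_perf = parse_pct("16.2% `†`")
print(floor_hack, judge_perf, parse_sd("±10.3"), parse_sd("--"))
```

Keeping the significance flag separate from the value means a plot can draw the bar from `value` and annotate it from `flag` without re-parsing.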

What this means for our no-cheat framing

Group the rows by what each method NEEDS (the load-bearing axis, see AGENTS.md):

Needs the env oracle: Ground Truth (penalty 100% -> 0.1% hack, perf 25.0%) and its 70/90% accuracy-lowered variants. Unavailable on a new env.
Needs oracle-trained labels: Probe (penalty -> 0.0% hack, perf 19.4%). Built from oracle-labelled in-env RH data; can't be constructed on a new env either.
No oracle, runs a live LLM monitor: LLM Judge. Penalty -> 0.1% hack, perf 16.2% (above base 11.5%, below ceiling 22.3%). Screening is weaker (15.7% hack, 2/3 runs broke out). This is the honest external peer: a no-oracle method that ALREADY suppresses hacking.
No monitor at all (system prompt only): Inoculation Prompting -- the only fully detector-free family, and it largely fails (26-47% hack across variants).
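
The grouping above can be written down as a lookup so that plots colour methods by what they need. This is a sketch under assumptions: the enum `Needs`, the dict `NEEDS` and the function `available_on_new_env` are illustrative names, not anything defined in the paper or confirmed to exist in this repo.

```python
from enum import Enum


class Needs(Enum):
    ENV_ORACLE = "env oracle"
    ORACLE_LABELS = "oracle-trained labels"
    LIVE_LLM_MONITOR = "live LLM monitor, no oracle"
    NO_MONITOR = "system prompt only"


# Monitor family -> what it needs, following the four groups listed above.
NEEDS = {
    "Ground Truth": Needs.ENV_ORACLE,
    "Probe": Needs.ORACLE_LABELS,
    "LLM Judge": Needs.LIVE_LLM_MONITOR,
    "Inoculation Prompting": Needs.NO_MONITOR,
}

# Anything that depends on the env oracle, directly or via labels,
# cannot be built on a new env.
ORACLE_DEPENDENT = {Needs.ENV_ORACLE, Needs.ORACLE_LABELS}


def available_on_new_env(family: str) -> bool:
    """True if the method family needs no oracle access at all."""
    return NEEDS[family] not in ORACLE_DEPENDENT


no_oracle = [family for family in NEEDS if available_on_new_env(family)]
print(no_oracle)
```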

Takeaway for routeV: "a no-oracle method that suppresses hacking" is not novel -- the LLM-judge penalty does it (0.1% hack). routeV's claim has to be the MECHANISM: no live LLM monitor in the loop each step, gradient-level, direction from fixed hand-authored pairs (one offline judge-equivalent), not a per-rollout model call. And note the judge-penalty solve (16.2%) is itself well below the ceiling (22.3%) -- the no-oracle methods all pay a solve tax, which is the axis worth competing on.
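
The solve tax can be made explicit as ceiling minus Performance for each no-oracle row. The sketch below uses only values transcribed in the Fig 5 table; the variable names are illustrative, and the raw differences ignore the significance markers (PassTests, for instance, is below the ceiling but not flagged as significantly lower).

```python
# RL Baseline (No RH) Performance from Fig 5, in percentage points.
CEILING = 22.3

# (intervention, detail) -> Performance, no-oracle rows only, from Fig 5.
no_oracle_perf = {
    ("Penalty", "LLM Judge"): 16.2,
    ("Screening", "LLM Judge"): 16.3,
    ("Inoculation Prompting", "Loophole"): 18.3,
    ("Inoculation Prompting", "PassTests"): 21.4,
    ("Inoculation Prompting", "PassTests+LH"): 14.3,
    ("Inoculation Prompting", "EvalEnv"): 18.9,
    ("Inoculation Prompting", "EvalEnv+LH"): 17.0,
}

# Solve tax = points of Performance lost relative to the ceiling.
solve_tax = {
    method: round(CEILING - perf, 1) for method, perf in no_oracle_perf.items()
}

for (intervention, detail), tax in solve_tax.items():
    print(f"{intervention:22s} {detail:13s} tax = {tax:.1f} pts")
```

For the judge penalty this gives 22.3 - 16.2 = 6.1 points, which is the gap a competing no-oracle method would need to close.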

(Other figures -- 6 GT, 7 GT-lowered, 8 probe, 9 judge -- are per-monitor visualisations of these same Fig-5 numbers; Fig 5 is the canonical source.)
