mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge)
Read the figure PNGs directly (Fig 5 is a full numeric table that the paper never prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we stop re-OCRing.

Key correction: my 'LLM judge has no clean rate' was wrong -- LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression is not routeV's novelty (the judge does it); the novelty is the mechanism (no live monitor, gradient-level, fixed authored-pair direction). Plot now shows the judge as a blue no-oracle peer bar alongside the grey oracle methods.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,61 @@
# Ariahw et al. 2025 -- results table (transcribed from the figures)

The paper publishes results as **figures only, no numeric table**, so every number
we want lives in an image. This file transcribes them once so we (and our plots)
never re-OCR. Read each cell off the source figure named in the provenance line.

## *Steering RL Training: Benchmarking Interventions against Reward Hacking* -- Ariahw, Engels & Nanda 2025 -- [LessWrong](https://www.lesswrong.com/posts/R5MdWGKsuvdPwGFBG/steering-rl-training-benchmarking-interventions-against)
- epistemic context: the substrate paper. Numbers below transcribed by reading the
  figure PNGs directly (downloaded from the post's cloudinary mirror) on 2026-06-09.
- metric defs: **Reward Hacking** = fraction of eval rollouts flagged as RH in the
  loophole env (our `hack`). **Performance** = pass rate in the no-loophole env (our
  `solve`). `*` = significantly higher than RL Baseline, `†` = significantly lower
  (α = 0.01). `±SD` is across-seed spread (n=3 runs/cell).

### Master table (Figure 5 -- "Overview of reward hacking and performance for all interventions")
source img: `mirroredImages/R5MdWGKsuvdPwGFBG/imeotdksvqyy8y8twbbq` (Fig 5)

| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
| :--- | :--- | ---: | ---: | ---: | ---: |
| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
| No Intervention | No RH (**RL Baseline = ceiling**) | 0.2% | ±0.2 | **22.3%** | ±1.0 |
| No Intervention | RH (**No Intervention = floor**) | **79.1%** `*` | ±10.3 | 14.9% `†` | ±8.2 |
| Penalty | Ground Truth 100% | 0.1% | ±0.1 | 25.0% `*` | ±3.3 |
| Penalty | Ground Truth 90% | 15.2% `*` | ±26.1 | 22.4% | ±1.3 |
| Penalty | Ground Truth 70% | 0.0% | ±0.0 | 17.5% `†` | ±2.8 |
| Penalty | Probe | 0.0% | ±0.0 | 19.4% | ±0.8 |
| Penalty | **LLM Judge** | **0.1%** | ±0.1 | **16.2%** `†` | ±4.9 |
| Screening | Ground Truth 100% | 0.9% `*` | ±0.6 | 26.3% `*` | ±3.9 |
| Screening | Ground Truth 90% | 7.5% `*` | ±11.8 | 23.3% | ±1.4 |
| Screening | Ground Truth 70% | 19.6% `*` | ±34.0 | 21.9% | ±1.6 |
| Screening | Probe | 17.8% `*` | ±28.0 | 18.4% `†` | ±0.8 |
| Screening | LLM Judge | 15.7% `*` | ±23.6 | 16.3% `†` | ±3.6 |
| Inoculation Prompting | Loophole | 38.3% `*` | ±28.9 | 18.3% `†` | ±2.0 |
| Inoculation Prompting | PassTests | 43.9% `*` | ±37.6 | 21.4% | ±2.3 |
| Inoculation Prompting | PassTests+LH | 26.4% `*` | ±27.6 | 14.3% `†` | ±2.1 |
| Inoculation Prompting | EvalEnv | 36.5% `*` | ±20.2 | 18.9% `†` | ±3.7 |
| Inoculation Prompting | EvalEnv+LH | 47.0% `*` | ±12.5 | 17.0% `†` | ±3.2 |

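Since the stated purpose of this file is that plots read the numbers from here instead of re-OCRing, a minimal sketch of a loader for the table above may be useful. This is an illustration, not the repo's actual plotting code: the names `Row`, `parse_table`, `_value`, `_sd` and `SAMPLE` are hypothetical, and the parser assumes the six-column layout and the backtick-wrapped `*` / `†` markers used above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    intervention: str
    monitor: str
    hack: float               # Reward Hacking, in percent
    hack_sd: Optional[float]  # None where the table shows "--"
    hack_sig: str             # "*", "†", or "" (no marker)
    perf: float               # Performance, in percent
    perf_sd: Optional[float]
    perf_sig: str


def _value(cell: str) -> Tuple[float, str]:
    """Split a cell such as '**79.1%** `*`' into (79.1, '*')."""
    number = float(re.search(r"(\d+(?:\.\d+)?)%", cell).group(1))
    marker = re.search(r"`([*†])`", cell)
    return number, marker.group(1) if marker else ""


def _sd(cell: str) -> Optional[float]:
    """Parse '±10.3' into 10.3; return None for '--'."""
    match = re.search(r"±\s*(\d+(?:\.\d+)?)", cell)
    return float(match.group(1)) if match else None


def parse_table(markdown: str) -> List[Row]:
    rows = []
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip anything that is not a six-column data row (header, blanks).
        if len(cells) != 6 or cells[0] == "Intervention":
            continue
        # Skip the alignment row (":---", "---:").
        if re.fullmatch(r":?-+:?", cells[0]):
            continue
        hack, hack_sig = _value(cells[2])
        perf, perf_sig = _value(cells[4])
        rows.append(Row(
            intervention=cells[0],
            monitor=cells[1].replace("**", ""),  # drop bold markup
            hack=hack, hack_sd=_sd(cells[3]), hack_sig=hack_sig,
            perf=perf, perf_sd=_sd(cells[5]), perf_sig=perf_sig,
        ))
    return rows


# Three rows copied from the table above, as a smoke test.
SAMPLE = """
| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
| :--- | :--- | ---: | ---: | ---: | ---: |
| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
| No Intervention | RH (**No Intervention = floor**) | **79.1%** `*` | ±10.3 | 14.9% `†` | ±8.2 |
| Penalty | **LLM Judge** | **0.1%** | ±0.1 | **16.2%** `†` | ±4.9 |
"""

rows = parse_table(SAMPLE)
for row in rows:
    print(row.intervention, row.monitor, row.hack, row.perf)
```

Keeping the significance marker and the SD in separate fields means a plot can draw error bars and annotate `*` / `†` without re-parsing the cell text.
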
### What this means for our no-cheat framing

Group the rows by what each method NEEDS (the load-bearing axis, see AGENTS.md):

- **Needs the env oracle:** Ground Truth (penalty 100% -> 0.1% hack, perf 25.0%) and its
  70/90% accuracy-lowered variants. Unavailable on a new env.
- **Needs oracle-trained labels:** Probe (penalty -> 0.0% hack, perf 19.4%). Built from
  oracle-labelled in-env RH data; can't be constructed on a new env either.
- **No oracle, runs a live LLM monitor:** LLM Judge. **Penalty -> 0.1% hack, perf 16.2%**
  (above base 11.5%, below ceiling 22.3%). Screening is weaker (15.7% hack, 2/3 runs broke out).
  This is the honest external peer: a no-oracle method that ALREADY suppresses hacking.
- **No monitor at all (system prompt only):** Inoculation Prompting -- the only fully
  detector-free family, and it largely **fails** (26-47% hack across variants).

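The grouping in the list above can be sketched as a small classifier over the Fig 5 rows. This is an illustrative sketch only: `needs`, `group_rows`, `ROWS`, `GROUPS` and the group labels are hypothetical names chosen here, and the `"baseline"` fallback for the No Intervention rows is an assumption, since the list does not place them in a group.

```python
from collections import defaultdict

# (intervention, monitor, hack %, perf %) copied from the Fig 5 table.
ROWS = [
    ("Penalty", "Ground Truth 100%", 0.1, 25.0),
    ("Penalty", "Probe", 0.0, 19.4),
    ("Penalty", "LLM Judge", 0.1, 16.2),
    ("Screening", "LLM Judge", 15.7, 16.3),
    ("Inoculation Prompting", "Loophole", 38.3, 18.3),
    ("Inoculation Prompting", "PassTests", 43.9, 21.4),
    ("Inoculation Prompting", "PassTests+LH", 26.4, 14.3),
    ("Inoculation Prompting", "EvalEnv", 36.5, 18.9),
    ("Inoculation Prompting", "EvalEnv+LH", 47.0, 17.0),
]


def needs(intervention: str, monitor: str) -> str:
    """Return what a method requires, following the four groups above."""
    if intervention == "Inoculation Prompting":
        return "no monitor"
    if monitor.startswith("Ground Truth"):
        return "env oracle"
    if monitor == "Probe":
        return "oracle-trained labels"
    if monitor == "LLM Judge":
        return "live LLM monitor"
    return "baseline"  # assumption: No Intervention rows are reference points


def group_rows(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[needs(row[0], row[1])].append(row)
    return dict(groups)


GROUPS = group_rows(ROWS)
for need, members in GROUPS.items():
    hacks = [hack for _, _, hack, _ in members]
    print(f"{need}: hack {min(hacks)}-{max(hacks)}% over {len(members)} rows")
```

Classifying on `intervention` first keeps Inoculation Prompting in its own detector-free group regardless of which prompt variant is named in the second column.
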
Takeaway for routeV: "a no-oracle method that suppresses hacking" is **not novel** -- the
LLM-judge penalty does it (0.1% hack). routeV's claim has to be the MECHANISM: no live
LLM monitor in the loop each step, gradient-level, direction from fixed hand-authored pairs
(one offline judge-equivalent), not a per-rollout model call. And note the judge-penalty
solve (16.2%) is itself well below the ceiling (22.3%) -- the no-oracle methods all pay a
solve tax, which is the axis worth competing on.

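To make the "solve tax" concrete, here is a short worked calculation. It assumes the tax is read as percentage points of Performance below the RL Baseline ceiling (22.3%); that definition, and the names `CEILING`, `NO_ORACLE_PERF`, `solve_tax` and `TAXES`, are this sketch's own and are not taken from the paper.

```python
CEILING = 22.3  # RL Baseline (No RH) Performance from Fig 5, in percent

# Performance (%) of every no-oracle row in the Fig 5 table.
NO_ORACLE_PERF = {
    "Penalty / LLM Judge": 16.2,
    "Screening / LLM Judge": 16.3,
    "Inoculation / Loophole": 18.3,
    "Inoculation / PassTests": 21.4,
    "Inoculation / PassTests+LH": 14.3,
    "Inoculation / EvalEnv": 18.9,
    "Inoculation / EvalEnv+LH": 17.0,
}


def solve_tax(perf: float, ceiling: float = CEILING) -> float:
    """Percentage points of solve rate lost relative to the ceiling."""
    return round(ceiling - perf, 1)


TAXES = {name: solve_tax(perf) for name, perf in NO_ORACLE_PERF.items()}

# Print smallest tax first.
for name, tax in sorted(TAXES.items(), key=lambda item: item[1]):
    print(f"{name}: {tax:.1f} points below ceiling")
```

Under this reading the judge penalty sits 6.1 points below the ceiling, and every no-oracle row has a positive tax, which matches the claim in the paragraph above.
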
(Other figures -- 6 GT, 7 GT-lowered, 8 probe, 9 judge -- are per-monitor visualisations of
these same Fig-5 numbers; Fig 5 is the canonical source.)