docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot

- Transcribed Fig-5 numeric table now lives inline in the paper md as an
  EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
  (one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
  anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
  reversed so right=better on both panels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-09 12:35:14 +00:00
parent 0973f9ba7c
commit d4998a71ba
3 changed files with 108 additions and 63 deletions
@@ -13,6 +13,43 @@ _This project is an extension of work done for Neel Nandas MATS 9.0 Training
Overview of the top interventions compared to RL and No Intervention baseline runs. All runs are trained on an environment with a reward hacking loophole except for the RL baseline, which is trained on a no-loophole environment. Statistical significance compared to the RL baseline is indicated by * for values greater and † for values lesser at ɑ=0.01. Successful interventions should show reward hacking rates at or lower than the RL baseline and performance at or above the RL baseline.
<!-- EDITOR'S TABLE (not in the original post). The paper ships results as figures only,
no numeric table, so Fig 5 above (Image 1) is transcribed once here by reading the PNG
directly (downloaded 2026-06-09). Metric defs: Reward Hacking = fraction of eval rollouts
flagged as RH (our `hack`); Performance = pass rate in the no-loophole env (our `solve`).
`*`/`†` = sig higher/lower than RL Baseline (α=0.01); ±SD across 3 seeds/cell. -->
| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
| :--- | :--- | ---: | ---: | ---: | ---: |
| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
| No Intervention | No RH (**RL Baseline = ceiling**) | 0.2% | ±0.2 | **22.3%** | ±1.0 |
| No Intervention | RH (**No Intervention = floor**) | **79.1%** `*` | ±10.3 | 14.9% `†` | ±8.2 |
| Penalty | Ground Truth 100% | 0.1% | ±0.1 | 25.0% `*` | ±3.3 |
| Penalty | Ground Truth 90% | 15.2% `*` | ±26.1 | 22.4% | ±1.3 |
| Penalty | Ground Truth 70% | 0.0% | ±0.0 | 17.5% `†` | ±2.8 |
| Penalty | Probe | 0.0% | ±0.0 | 19.4% | ±0.8 |
| Penalty | **LLM Judge** | **0.1%** | ±0.1 | **16.2%** `†` | ±4.9 |
| Screening | Ground Truth 100% | 0.9% `*` | ±0.6 | 26.3% `*` | ±3.9 |
| Screening | Ground Truth 90% | 7.5% `*` | ±11.8 | 23.3% | ±1.4 |
| Screening | Ground Truth 70% | 19.6% `*` | ±34.0 | 21.9% | ±1.6 |
| Screening | Probe | 17.8% `*` | ±28.0 | 18.4% `†` | ±0.8 |
| Screening | LLM Judge | 15.7% `*` | ±23.6 | 16.3% `†` | ±3.6 |
| Inoculation Prompting | Loophole | 38.3% `*` | ±28.9 | 18.3% `†` | ±2.0 |
| Inoculation Prompting | PassTests | 43.9% `*` | ±37.6 | 21.4% | ±2.3 |
| Inoculation Prompting | PassTests+LH | 26.4% `*` | ±27.6 | 14.3% `†` | ±2.1 |
| Inoculation Prompting | EvalEnv | 36.5% `*` | ±20.2 | 18.9% `†` | ±3.7 |
| Inoculation Prompting | EvalEnv+LH | 47.0% `*` | ±12.5 | 17.0% `†` | ±3.2 |
Grouped by what each method NEEDS (the no-cheat axis, see AGENTS.md): **needs the env oracle** =
Ground Truth (penalty 100% -> 0.1% hack, 25.0% perf) + its 70/90% variants, unavailable on a new
env. **Needs oracle-trained labels** = Probe (penalty -> 0.0% hack, 19.4% perf). **No oracle, live
LLM monitor** = LLM Judge (penalty -> 0.1% hack, 16.2% perf; above base 11.5%, below ceiling 22.3%)
-- the honest external peer, though it's a separate stronger model (weak-to-strong). **No monitor
at all** = Inoculation Prompting, which largely fails (26-47% hack). Takeaway for routeV: a no-oracle
method that suppresses hacking is not novel (the judge does it); routeV's claim is the MECHANISM (no
live monitor each step, gradient-level, direction from fixed authored pairs), and the no-oracle
methods all pay a solve tax (judge 16.2% vs ceiling 22.3%) -- that's the axis worth competing on.
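
As a rough illustration of the "solve tax" above, the sketch below recomputes it from the transcribed Fig-5 penalty rows. The normalisation formulas follow the footnote of our floor/ceiling plot (`hack suppressed = (vanilla_hack - arm_hack)/vanilla_hack`, `solve gained = (arm_solve - base)/(ceiling - base)`); the function and constant names are hypothetical, and applying the formulas to Ariahw's numbers is for illustration only, since those runs are on a different scale and regime from our arms.

```python
# Values transcribed from the Fig-5 table above, in percent.
BASE_PERF = 11.5      # No Intervention / Base Model
CEILING_PERF = 22.3   # No Intervention / No RH (RL Baseline)
FLOOR_HACK = 79.1     # No Intervention / RH

# (reward hacking %, performance %) for the penalty rows discussed above.
PENALTY_ROWS = {
    "Ground Truth 100%": (0.1, 25.0),
    "Probe": (0.0, 19.4),
    "LLM Judge": (0.1, 16.2),
}


def solve_tax(perf: float) -> float:
    """Percentage points of performance lost relative to the ceiling (hypothetical helper)."""
    return round(CEILING_PERF - perf, 1)


def solve_gained(perf: float) -> float:
    """Fraction of the base-to-ceiling gap recovered: (arm_solve - base)/(ceiling - base)."""
    return (perf - BASE_PERF) / (CEILING_PERF - BASE_PERF)


def hack_suppressed(hack: float) -> float:
    """Fraction of floor hacking removed: (vanilla_hack - arm_hack)/vanilla_hack."""
    return (FLOOR_HACK - hack) / FLOOR_HACK


if __name__ == "__main__":
    for name, (hack, perf) in PENALTY_ROWS.items():
        print(f"{name:18s} tax={solve_tax(perf):+.1f}pt  "
              f"gained={solve_gained(perf):.2f}  suppressed={hack_suppressed(hack):.3f}")
```

On these numbers the LLM Judge penalty gives up 6.1 points against the ceiling and recovers a little under half of the base-to-ceiling gap, while Ground Truth 100% lands above the ceiling (negative tax).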
## **TL;DR**
* We present and open source a clean environment where RL training naturally induces reward hacking (RH) in Qwen3-4B without explicit training or prompting
@@ -1,61 +0,0 @@
# Ariahw et al. 2025 -- results table (transcribed from the figures)
The paper publishes results as **figures only, no numeric table**, so every number
we want lives in an image. This file transcribes them once so we (and our plots)
never re-OCR. Read each cell off the source figure named in the provenance line.
## *Steering RL Training: Benchmarking Interventions against Reward Hacking* -- Ariahw, Engels & Nanda 2025 -- [LessWrong](https://www.lesswrong.com/posts/R5MdWGKsuvdPwGFBG/steering-rl-training-benchmarking-interventions-against)
- epistemic context: the substrate paper. Numbers below transcribed by reading the
figure PNGs directly (downloaded from the post's cloudinary mirror) on 2026-06-09.
- metric defs: **Reward Hacking** = fraction of eval rollouts flagged as RH in the
loophole env (our `hack`). **Performance** = pass rate in the no-loophole env (our
`solve`). `*` = significantly higher than RL Baseline, `†` = significantly lower
(a=0.01). `±SD` is across-seed spread (n=3 runs/cell).
### Master table (Figure 5 -- "Overview of reward hacking and performance for all interventions")
source img: `mirroredImages/R5MdWGKsuvdPwGFBG/imeotdksvqyy8y8twbbq` (Fig 5)
| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
| :--- | :--- | ---: | ---: | ---: | ---: |
| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
| No Intervention | No RH (**RL Baseline = ceiling**) | 0.2% | ±0.2 | **22.3%** | ±1.0 |
| No Intervention | RH (**No Intervention = floor**) | **79.1%** `*` | ±10.3 | 14.9% `†` | ±8.2 |
| Penalty | Ground Truth 100% | 0.1% | ±0.1 | 25.0% `*` | ±3.3 |
| Penalty | Ground Truth 90% | 15.2% `*` | ±26.1 | 22.4% | ±1.3 |
| Penalty | Ground Truth 70% | 0.0% | ±0.0 | 17.5% `†` | ±2.8 |
| Penalty | Probe | 0.0% | ±0.0 | 19.4% | ±0.8 |
| Penalty | **LLM Judge** | **0.1%** | ±0.1 | **16.2%** `†` | ±4.9 |
| Screening | Ground Truth 100% | 0.9% `*` | ±0.6 | 26.3% `*` | ±3.9 |
| Screening | Ground Truth 90% | 7.5% `*` | ±11.8 | 23.3% | ±1.4 |
| Screening | Ground Truth 70% | 19.6% `*` | ±34.0 | 21.9% | ±1.6 |
| Screening | Probe | 17.8% `*` | ±28.0 | 18.4% `†` | ±0.8 |
| Screening | LLM Judge | 15.7% `*` | ±23.6 | 16.3% `†` | ±3.6 |
| Inoculation Prompting | Loophole | 38.3% `*` | ±28.9 | 18.3% `†` | ±2.0 |
| Inoculation Prompting | PassTests | 43.9% `*` | ±37.6 | 21.4% | ±2.3 |
| Inoculation Prompting | PassTests+LH | 26.4% `*` | ±27.6 | 14.3% `†` | ±2.1 |
| Inoculation Prompting | EvalEnv | 36.5% `*` | ±20.2 | 18.9% `†` | ±3.7 |
| Inoculation Prompting | EvalEnv+LH | 47.0% `*` | ±12.5 | 17.0% `†` | ±3.2 |
### What this means for our no-cheat framing
Group the rows by what each method NEEDS (the load-bearing axis, see AGENTS.md):
- **Needs the env oracle:** Ground Truth (penalty 100% -> 0.1% hack, perf 25.0%) and its
70/90% accuracy-lowered variants. Unavailable on a new env.
- **Needs oracle-trained labels:** Probe (penalty -> 0.0% hack, perf 19.4%). Built from
oracle-labelled in-env RH data; can't be constructed on a new env either.
- **No oracle, runs a live LLM monitor:** LLM Judge. **Penalty -> 0.1% hack, perf 16.2%**
(above base 11.5%, below ceiling 22.3%). Screening is weaker (15.7%, 2/3 runs broke out).
This is the honest external peer: a no-oracle method that ALREADY suppresses hacking.
- **No monitor at all (system prompt only):** Inoculation Prompting -- the only fully
detector-free family, and it largely **fails** (26-47% hack across variants).
Takeaway for routeV: "a no-oracle method that suppresses hacking" is **not novel** -- the
LLM-judge penalty does it (0.1% hack). routeV's claim has to be the MECHANISM: no live
LLM monitor in the loop each step, gradient-level, direction from fixed hand-authored pairs
(one offline judge-equivalent), not a per-rollout model call. And note the judge-penalty
solve (16.2%) is itself well below the ceiling (22.3%) -- the no-oracle methods all pay a
solve tax, which is the axis worth competing on.
(Other figures -- 6 GT, 7 GT-lowered, 8 probe, 9 judge -- are per-monitor visualisations of
these same Fig-5 numbers; Fig 5 is the canonical source.)
+71 -2
@@ -185,13 +185,81 @@ def plot(df: pl.DataFrame) -> None:
        fontsize=10.5, x=0.01, ha="left")
    fig.text(0.01, 0.015, "Our arms only, seed 43, 60-step fast (unconverged surrogate). hack suppressed = (vanilla_hack - arm_hack)/vanilla_hack; "
             "solve gained = (arm_solve - base)/(ceiling - base). Ariahw 2025 monitor numbers are cross-scale/regime and live in "
-             "docs/papers/ariahw_results_table_extracted.md, not on this axis.",
+             "the transcribed Fig-5 table in docs/papers/2025_lw_ariahw_*.md, not on this axis.",
             fontsize=6.8, color=GREY, va="bottom")
    fig.tight_layout(rect=(0, 0.07, 1, 0.94))
    for ext in ("pdf", "png"):
        fig.savefig(OUT / f"floor_ceiling.{ext}", dpi=150, bbox_inches="tight")
# ── stage 2b: absolute-scale variant (arrows + shaded floor/ceiling) ─────────
# Same three arms, but plotted on the RAW metric axis (not normalized to [0,1]) so the
# actual rates are legible. Both panels oriented "right = better": the solve axis is the
# raw solve rate; the hack axis is REVERSED (right = less hacking). Grey "bedrock" shades
# the worse-than-floor zone, blue "sky" shades the better-than-ceiling zone, and each arm
# is an arrow from the floor anchor to its value (length = distance climbed).
SKY, BEDROCK = "#cfe3ff", "#d9dadb"
def _arrow_panel(ax, anchor, ceiling, rows, *, reversed_x, xlim, floor_lab, ceil_lab, xlabel, title):
    lo, hi = xlim  # lo=left edge, hi=right edge (lo>hi when reversed_x)
    # bedrock = worse-than-floor; sky = better-than-ceiling (data coords, orientation-agnostic)
    if reversed_x:  # hack: worse = higher rate, better = lower; better is to the RIGHT
        ax.axvspan(lo, anchor, color=BEDROCK, alpha=0.7, lw=0)  # >= floor hack = bedrock
        ax.axvspan(ceiling, hi, color=SKY, alpha=0.7, lw=0)  # <= ceiling (0) = sky
    else:  # solve: worse = lower, better = higher; better is to the RIGHT
        ax.axvspan(lo, anchor, color=BEDROCK, alpha=0.7, lw=0)  # <= floor solve = bedrock
        ax.axvspan(ceiling, hi, color=SKY, alpha=0.7, lw=0)  # >= ceiling = sky
    ax.axvline(anchor, color=GREY, lw=1.2)
    ax.axvline(ceiling, color="#3b5bdb", lw=1.2, ls=":")
    span = abs(hi - lo)
    for yi, (lab, val, col) in enumerate(rows):
        ax.annotate("", xy=(val, yi), xytext=(anchor, yi),
                    arrowprops=dict(arrowstyle="-|>", color=col, lw=2.6, shrinkA=0, shrinkB=0))
        ax.plot([anchor], [yi], "o", color=GREY, ms=4, zorder=3)
        better_right = (val > anchor) if not reversed_x else (val < anchor)  # is the arm in the 'better' (right) dir
        ha = "left" if better_right else "right"
        ax.text(val + (span * 0.02 if ha == "left" else -span * 0.02), yi, f"{val:.3f}",
                va="center", ha=ha, fontsize=9, color=col, fontweight="bold")
    ax.set_xlim(lo, hi)
    ax.set_yticks(range(len(rows))); ax.set_yticklabels([r[0] for r in rows], fontsize=8.5)
    ax.set_ylim(-0.6, len(rows) - 0.4)
    ax.set_xlabel(xlabel, fontsize=8.5)
    ax.set_title(title, fontsize=10, loc="left")
    ax.text(anchor, -0.55, floor_lab, fontsize=7.5, color=GREY, ha="center", va="bottom")
    ax.text(ceiling, -0.55, ceil_lab, fontsize=7.5, color="#3b5bdb", ha="center", va="bottom")
    for s in ("top", "right", "left"):
        ax.spines[s].set_visible(False)
    ax.tick_params(left=False)
def plot_abs(df: pl.DataFrame) -> None:
    a = _anchors(df)
    base, vh, ceil = a["base_solve"], a["vanilla_hack"], a["ceiling"]
    pick = lambda lab: df.filter(pl.col("label") == lab).to_dicts()[0]
    best, rand, van = pick("routeV per-token"), pick("routeV random-V"), pick("vanilla GRPO")
    # bottom -> top: vanilla, random-V, per-token
    hack_rows = [("vanilla GRPO", van["hack_deploy"], RED),
                 ("routeV random-V", rand["hack_deploy"], DARK),
                 ("routeV per-token", best["hack_deploy"], GOLD)]
    solve_rows = [("vanilla GRPO", van["solve_deploy"], RED),
                  ("routeV random-V", rand["solve_deploy"], DARK),
                  ("routeV per-token", best["solve_deploy"], GOLD)]
    prov = " PROVISIONAL" if a["provisional"] else ""
    fig, (axl, axr) = plt.subplots(1, 2, figsize=(11.5, 4.2), sharey=True)
    _arrow_panel(axl, anchor=vh, ceiling=0.0, rows=hack_rows, reversed_x=True,
                 xlim=(vh + 0.05, -0.03), floor_lab=f"floor\n(vanilla {vh:.2f})", ceil_lab="ceiling\n(no hack)",
                 xlabel="hack rate · axis reversed: right = less hacking = better", title="hacking (raw rate)")
    _arrow_panel(axr, anchor=base, ceiling=ceil, rows=solve_rows, reversed_x=False,
                 xlim=(base - 0.03, ceil + 0.03), floor_lab=f"floor\n(base {base:.2f})", ceil_lab=f"ceiling\n({ceil:.2f}{prov})",
                 xlabel="solve rate · right = more solving = better", title="solving (raw rate)")
    fig.suptitle("vGROUT raw rates: arrow = climb from floor; grey = bedrock (worse than floor), blue = sky (past ceiling) (test n=119, seed 43, 60-step fast)",
                 fontsize=10, x=0.01, ha="left")
    fig.tight_layout(rect=(0, 0, 1, 0.93))
    for ext in ("pdf", "png"):
        fig.savefig(OUT / f"floor_ceiling_abs.{ext}", dpi=150, bbox_inches="tight")
def main() -> None:
    df = build_csv()
    flags = df.filter(~pl.col("status").str.starts_with("ok"))
@@ -201,7 +269,8 @@ def main() -> None:
    for r in flags.to_dicts():
        print(f" [{r['label']}] {r['status']}")
    plot(df)
-    print(f"\nwrote {OUT}/floor_ceiling.pdf and .png")
+    plot_abs(df)
+    print(f"\nwrote {OUT}/floor_ceiling.pdf and .png (+ floor_ceiling_abs.pdf/.png)")
if __name__ == "__main__":
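
The shading rule used by `_arrow_panel` (grey bedrock = worse than the floor, blue sky = past the ceiling) can be sketched as a small standalone classifier. This is a minimal illustration, not part of the commit: the function name `zone`, the label `"climb"` for the in-between band, and the choice to count values exactly on a boundary as `"climb"` are all assumptions. The example values are the Fig-5 percentages from the paper notes, used only to exercise the logic; the plot itself works on our arms' raw rates.

```python
def zone(value: float, floor: float, ceiling: float, *, lower_is_better: bool = False) -> str:
    """Classify a raw rate as 'bedrock', 'climb', or 'sky' (hypothetical helper).

    lower_is_better=True corresponds to the reversed hack axis, where a
    smaller rate is the better direction.
    """
    if lower_is_better:
        # Negate everything so that "larger is better" holds in both cases.
        value, floor, ceiling = -value, -floor, -ceiling
    if value < floor:
        return "bedrock"   # worse than the floor anchor
    if value > ceiling:
        return "sky"       # better than the ceiling
    return "climb"         # between floor and ceiling (boundaries included)


if __name__ == "__main__":
    # Solve axis: floor = base model 11.5, ceiling = RL baseline 22.3.
    print(zone(25.0, 11.5, 22.3))                        # Penalty / Ground Truth 100%
    print(zone(16.2, 11.5, 22.3))                        # Penalty / LLM Judge
    # Hack axis: floor = no-intervention 79.1, ceiling = 0.0 (no hacking).
    print(zone(0.1, 79.1, 0.0, lower_is_better=True))    # Penalty / LLM Judge
    print(zone(85.0, 79.1, 0.0, lower_is_better=True))   # made-up value above the floor
```

Negating the inputs for the hack axis keeps a single comparison path, mirroring how the plotting code reuses the same `axvspan` calls for both orientations and lets the reversed `xlim` handle the direction.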