mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot

- Transcribed Fig-5 numeric table now lives inline in the paper md as an
  EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
  (one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
  anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
  reversed so right=better on both panels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -13,6 +13,43 @@ _This project is an extension of work done for Neel Nanda’s MATS 9.0 Training

 Overview of the top interventions compared to RL and No Intervention baseline runs. All runs are trained on an environment with a reward hacking loophole except for the RL baseline, which is trained on a no-loophole environment. Statistical significance compared to the RL baseline is indicated by * for values greater and † for values lesser at α=0.01. Successful interventions should show reward hacking rates at or lower than the RL baseline and performance at or above the RL baseline.

+<!-- EDITOR'S TABLE (not in the original post). The paper ships results as figures only,
+no numeric table, so Fig 5 above (Image 1) is transcribed once here by reading the PNG
+directly (downloaded 2026-06-09). Metric defs: Reward Hacking = fraction of eval rollouts
+flagged as RH (our `hack`); Performance = pass rate in the no-loophole env (our `solve`).
+`*`/`†` = sig higher/lower than RL Baseline (a=0.01); ±SD across 3 seeds/cell. -->
+
+| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
+| :--- | :--- | ---: | ---: | ---: | ---: |
+| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
+| No Intervention | No RH (**RL Baseline = ceiling**) | 0.2% | ±0.2 | **22.3%** | ±1.0 |
+| No Intervention | RH (**No Intervention = floor**) | **79.1%** `*` | ±10.3 | 14.9% `†` | ±8.2 |
+| Penalty | Ground Truth 100% | 0.1% | ±0.1 | 25.0% `*` | ±3.3 |
+| Penalty | Ground Truth 90% | 15.2% `*` | ±26.1 | 22.4% | ±1.3 |
+| Penalty | Ground Truth 70% | 0.0% | ±0.0 | 17.5% `†` | ±2.8 |
+| Penalty | Probe | 0.0% | ±0.0 | 19.4% | ±0.8 |
+| Penalty | **LLM Judge** | **0.1%** | ±0.1 | **16.2%** `†` | ±4.9 |
+| Screening | Ground Truth 100% | 0.9% `*` | ±0.6 | 26.3% `*` | ±3.9 |
+| Screening | Ground Truth 90% | 7.5% `*` | ±11.8 | 23.3% | ±1.4 |
+| Screening | Ground Truth 70% | 19.6% `*` | ±34.0 | 21.9% | ±1.6 |
+| Screening | Probe | 17.8% `*` | ±28.0 | 18.4% `†` | ±0.8 |
+| Screening | LLM Judge | 15.7% `*` | ±23.6 | 16.3% `†` | ±3.6 |
+| Inoculation Prompting | Loophole | 38.3% `*` | ±28.9 | 18.3% `†` | ±2.0 |
+| Inoculation Prompting | PassTests | 43.9% `*` | ±37.6 | 21.4% | ±2.3 |
+| Inoculation Prompting | PassTests+LH | 26.4% `*` | ±27.6 | 14.3% `†` | ±2.1 |
+| Inoculation Prompting | EvalEnv | 36.5% `*` | ±20.2 | 18.9% `†` | ±3.7 |
+| Inoculation Prompting | EvalEnv+LH | 47.0% `*` | ±12.5 | 17.0% `†` | ±3.2 |
+
+Grouped by what each method NEEDS (the no-cheat axis, see AGENTS.md): **needs the env oracle** =
+Ground Truth (penalty 100% -> 0.1% hack, 25.0% perf) + its 70/90% variants, unavailable on a new
+env. **Needs oracle-trained labels** = Probe (penalty -> 0.0% hack, 19.4% perf). **No oracle, live
+LLM monitor** = LLM Judge (penalty -> 0.1% hack, 16.2% perf; above base 11.5%, below ceiling 22.3%)
+-- the honest external peer, though it's a separate stronger model (weak-to-strong). **No monitor
+at all** = Inoculation Prompting, which largely fails (26-47% hack). Takeaway for routeV: a no-oracle
+method that suppresses hacking is not novel (the judge does it); routeV's claim is the MECHANISM (no
+live monitor each step, gradient-level, direction from fixed authored pairs), and the no-oracle
+methods all pay a solve tax (judge 16.2% vs ceiling 22.3%) -- that's the axis worth competing on.
+
 ## **TL;DR**

 * We present and open source a clean environment where RL training naturally induces reward hacking (RH) in Qwen3-4B without explicit training or prompting
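Review note on the hunk above: the grouping paragraph compares each monitor's Performance against the RL Baseline ceiling (22.3%) and calls the gap a "solve tax". The arithmetic can be sketched as below. The rows are copied from the transcribed Fig-5 table; the names `FIG5`, `CEILING_PERF` and `solve_tax` are hypothetical and do not exist in the repo.

```python
# Minimal sketch, assuming the transcribed Fig-5 values (percent units).
# FIG5, CEILING_PERF and solve_tax are illustrative names, not repo code.
CEILING_PERF = 22.3  # RL Baseline (No RH) performance from the table

FIG5 = {
    "Penalty / Ground Truth 100%": {"hack": 0.1, "perf": 25.0},
    "Penalty / Probe": {"hack": 0.0, "perf": 19.4},
    "Penalty / LLM Judge": {"hack": 0.1, "perf": 16.2},
    "Inoculation Prompting / PassTests+LH": {"hack": 26.4, "perf": 14.3},
}


def solve_tax(perf: float, ceiling: float = CEILING_PERF) -> float:
    """Percentage points of performance lost relative to the ceiling.

    Negative values mean the method landed above the ceiling.
    """
    return round(ceiling - perf, 1)


if __name__ == "__main__":
    for name, row in FIG5.items():
        print(f"{name}: hack={row['hack']}%  solve tax={solve_tax(row['perf'])} pts")
```

On these numbers the LLM-judge penalty gives up about 6 points of performance against the ceiling, while the oracle-backed Ground Truth 100% penalty sits above it, which is the contrast the paragraph draws.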
@@ -1,61 +0,0 @@
-# Ariahw et al. 2025 -- results table (transcribed from the figures)
-
-The paper publishes results as **figures only, no numeric table**, so every number
-we want lives in an image. This file transcribes them once so we (and our plots)
-never re-OCR. Read each cell off the source figure named in the provenance line.
-
-## *Steering RL Training: Benchmarking Interventions against Reward Hacking* -- Ariahw, Engels & Nanda 2025 -- [LessWrong](https://www.lesswrong.com/posts/R5MdWGKsuvdPwGFBG/steering-rl-training-benchmarking-interventions-against)
-- epistemic context: the substrate paper. Numbers below transcribed by reading the
-figure PNGs directly (downloaded from the post's cloudinary mirror) on 2026-06-09.
-- metric defs: **Reward Hacking** = fraction of eval rollouts flagged as RH in the
-loophole env (our `hack`). **Performance** = pass rate in the no-loophole env (our
-`solve`). `*` = significantly higher than RL Baseline, `†` = significantly lower
-(a=0.01). `±SD` is across-seed spread (n=3 runs/cell).
-
-### Master table (Figure 5 -- "Overview of reward hacking and performance for all interventions")
-source img: `mirroredImages/R5MdWGKsuvdPwGFBG/imeotdksvqyy8y8twbbq` (Fig 5)
-
-| Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
-| :--- | :--- | ---: | ---: | ---: | ---: |
-| No Intervention | Base Model | 0.0% | -- | 11.5% | -- |
-| No Intervention | No RH (**RL Baseline = ceiling**) | 0.2% | ±0.2 | **22.3%** | ±1.0 |
-| No Intervention | RH (**No Intervention = floor**) | **79.1%** `*` | ±10.3 | 14.9% `†` | ±8.2 |
-| Penalty | Ground Truth 100% | 0.1% | ±0.1 | 25.0% `*` | ±3.3 |
-| Penalty | Ground Truth 90% | 15.2% `*` | ±26.1 | 22.4% | ±1.3 |
-| Penalty | Ground Truth 70% | 0.0% | ±0.0 | 17.5% `†` | ±2.8 |
-| Penalty | Probe | 0.0% | ±0.0 | 19.4% | ±0.8 |
-| Penalty | **LLM Judge** | **0.1%** | ±0.1 | **16.2%** `†` | ±4.9 |
-| Screening | Ground Truth 100% | 0.9% `*` | ±0.6 | 26.3% `*` | ±3.9 |
-| Screening | Ground Truth 90% | 7.5% `*` | ±11.8 | 23.3% | ±1.4 |
-| Screening | Ground Truth 70% | 19.6% `*` | ±34.0 | 21.9% | ±1.6 |
-| Screening | Probe | 17.8% `*` | ±28.0 | 18.4% `†` | ±0.8 |
-| Screening | LLM Judge | 15.7% `*` | ±23.6 | 16.3% `†` | ±3.6 |
-| Inoculation Prompting | Loophole | 38.3% `*` | ±28.9 | 18.3% `†` | ±2.0 |
-| Inoculation Prompting | PassTests | 43.9% `*` | ±37.6 | 21.4% | ±2.3 |
-| Inoculation Prompting | PassTests+LH | 26.4% `*` | ±27.6 | 14.3% `†` | ±2.1 |
-| Inoculation Prompting | EvalEnv | 36.5% `*` | ±20.2 | 18.9% `†` | ±3.7 |
-| Inoculation Prompting | EvalEnv+LH | 47.0% `*` | ±12.5 | 17.0% `†` | ±3.2 |
-
-### What this means for our no-cheat framing
-
-Group the rows by what each method NEEDS (the load-bearing axis, see AGENTS.md):
-
-- **Needs the env oracle:** Ground Truth (penalty 100% -> 0.1% hack, perf 25.0%) and its
-70/90% accuracy-lowered variants. Unavailable on a new env.
-- **Needs oracle-trained labels:** Probe (penalty -> 0.0% hack, perf 19.4%). Built from
-oracle-labelled in-env RH data; can't be constructed on a new env either.
-- **No oracle, runs a live LLM monitor:** LLM Judge. **Penalty -> 0.1% hack, perf 16.2%**
-(above base 11.5%, below ceiling 22.3%). Screening is weaker (15.7%, 2/3 runs broke out).
-This is the honest external peer: a no-oracle method that ALREADY suppresses hacking.
-- **No monitor at all (system prompt only):** Inoculation Prompting -- the only fully
-detector-free family, and it largely **fails** (26-47% hack across variants).
-
-Takeaway for routeV: "a no-oracle method that suppresses hacking" is **not novel** -- the
-LLM-judge penalty does it (0.1% hack). routeV's claim has to be the MECHANISM: no live
-LLM monitor in the loop each step, gradient-level, direction from fixed hand-authored pairs
-(one offline judge-equivalent), not a per-rollout model call. And note the judge-penalty
-solve (16.2%) is itself well below the ceiling (22.3%) -- the no-oracle methods all pay a
-solve tax, which is the axis worth competing on.
-
-(Other figures -- 6 GT, 7 GT-lowered, 8 probe, 9 judge -- are per-monitor visualisations of
-these same Fig-5 numbers; Fig 5 is the canonical source.)
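Review note on the deleted file above: its stated purpose was that plots should never have to re-OCR the figure. Now that the table lives inline in the paper md, a consumer would need to read the markdown rows directly. The following is a minimal sketch of one way to do that, assuming the six-column row layout shown in the table; `parse_row` and its output keys are hypothetical and not part of the repo.

```python
import re

# Minimal sketch, assuming rows shaped like:
# | Intervention | Monitor/Detail | Reward Hacking | ±SD | Performance | ±SD |
# parse_row and its output keys are illustrative, not repo code.
_NUM = re.compile(r"\d+(?:\.\d+)?")


def _number(cell: str):
    """First decimal number in the cell, or None for placeholders like '--'."""
    match = _NUM.search(cell)
    return float(match.group()) if match else None


def _significance(cell: str):
    """'higher' for a `*` marker, 'lower' for `†`, else None (bold ** is ignored)."""
    plain = cell.replace("**", "")
    if "*" in plain:
        return "higher"
    if "†" in plain:
        return "lower"
    return None


def parse_row(line: str) -> dict:
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 6:
        raise ValueError(f"expected 6 cells, got {len(cells)}")
    intervention, detail, hack, hack_sd, perf, perf_sd = cells
    return {
        "intervention": intervention,
        "detail": detail.replace("**", ""),
        "hack": _number(hack),
        "hack_sd": _number(hack_sd),
        "hack_sig": _significance(hack),
        "perf": _number(perf),
        "perf_sd": _number(perf_sd),
        "perf_sig": _significance(perf),
    }
```

The header and alignment rows (`| :--- | ...`) are not data rows, so a caller would skip them before passing lines to `parse_row`.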
@@ -185,13 +185,81 @@ def plot(df: pl.DataFrame) -> None:
                  fontsize=10.5, x=0.01, ha="left")
     fig.text(0.01, 0.015, "Our arms only, seed 43, 60-step fast (unconverged surrogate). hack suppressed = (vanilla_hack - arm_hack)/vanilla_hack; "
              "solve gained = (arm_solve - base)/(ceiling - base). Ariahw 2025 monitor numbers are cross-scale/regime and live in "
-             "docs/papers/ariahw_results_table_extracted.md, not on this axis.",
+             "the transcribed Fig-5 table in docs/papers/2025_lw_ariahw_*.md, not on this axis.",
              fontsize=6.8, color=GREY, va="bottom")
     fig.tight_layout(rect=(0, 0.07, 1, 0.94))
     for ext in ("pdf", "png"):
         fig.savefig(OUT / f"floor_ceiling.{ext}", dpi=150, bbox_inches="tight")


+# ── stage 2b: absolute-scale variant (arrows + shaded floor/ceiling) ─────────
+# Same three arms, but plotted on the RAW metric axis (not normalized to [0,1]) so the
+# actual rates are legible. Both panels oriented "right = better": the solve axis is the
+# raw solve rate; the hack axis is REVERSED (right = less hacking). Grey "bedrock" shades
+# the worse-than-floor zone, blue "sky" shades the better-than-ceiling zone, and each arm
+# is an arrow from the floor anchor to its value (length = distance climbed).
+SKY, BEDROCK = "#cfe3ff", "#d9dadb"
+
+
+def _arrow_panel(ax, anchor, ceiling, rows, *, reversed_x, xlim, floor_lab, ceil_lab, xlabel, title):
+    lo, hi = xlim  # lo=left edge, hi=right edge (lo>hi when reversed_x)
+    # bedrock = worse-than-floor; sky = better-than-ceiling (data coords, orientation-agnostic)
+    if reversed_x:  # hack: worse = higher rate, better = lower; better is to the RIGHT
+        ax.axvspan(lo, anchor, color=BEDROCK, alpha=0.7, lw=0)  # >= floor hack = bedrock
+        ax.axvspan(ceiling, hi, color=SKY, alpha=0.7, lw=0)  # <= ceiling (0) = sky
+    else:  # solve: worse = lower, better = higher; better is to the RIGHT
+        ax.axvspan(lo, anchor, color=BEDROCK, alpha=0.7, lw=0)  # <= floor solve = bedrock
+        ax.axvspan(ceiling, hi, color=SKY, alpha=0.7, lw=0)  # >= ceiling = sky
+    ax.axvline(anchor, color=GREY, lw=1.2)
+    ax.axvline(ceiling, color="#3b5bdb", lw=1.2, ls=":")
+    span = abs(hi - lo)
+    for yi, (lab, val, col) in enumerate(rows):
+        ax.annotate("", xy=(val, yi), xytext=(anchor, yi),
+                    arrowprops=dict(arrowstyle="-|>", color=col, lw=2.6, shrinkA=0, shrinkB=0))
+        ax.plot([anchor], [yi], "o", color=GREY, ms=4, zorder=3)
+        better_right = (val > anchor) if not reversed_x else (val < anchor)  # is the arm in the 'better' (right) dir
+        ha = "left" if better_right else "right"
+        ax.text(val + (span * 0.02 if ha == "left" else -span * 0.02), yi, f"{val:.3f}",
+                va="center", ha=ha, fontsize=9, color=col, fontweight="bold")
+    ax.set_xlim(lo, hi)
+    ax.set_yticks(range(len(rows))); ax.set_yticklabels([r[0] for r in rows], fontsize=8.5)
+    ax.set_ylim(-0.6, len(rows) - 0.4)
+    ax.set_xlabel(xlabel, fontsize=8.5)
+    ax.set_title(title, fontsize=10, loc="left")
+    ax.text(anchor, -0.55, floor_lab, fontsize=7.5, color=GREY, ha="center", va="bottom")
+    ax.text(ceiling, -0.55, ceil_lab, fontsize=7.5, color="#3b5bdb", ha="center", va="bottom")
+    for s in ("top", "right", "left"):
+        ax.spines[s].set_visible(False)
+    ax.tick_params(left=False)
+
+
+def plot_abs(df: pl.DataFrame) -> None:
+    a = _anchors(df)
+    base, vh, ceil = a["base_solve"], a["vanilla_hack"], a["ceiling"]
+    pick = lambda lab: df.filter(pl.col("label") == lab).to_dicts()[0]
+    best, rand, van = pick("routeV per-token"), pick("routeV random-V"), pick("vanilla GRPO")
+    # bottom -> top: vanilla, random-V, per-token
+    hack_rows = [("vanilla GRPO", van["hack_deploy"], RED),
+                 ("routeV random-V", rand["hack_deploy"], DARK),
+                 ("routeV per-token", best["hack_deploy"], GOLD)]
+    solve_rows = [("vanilla GRPO", van["solve_deploy"], RED),
+                  ("routeV random-V", rand["solve_deploy"], DARK),
+                  ("routeV per-token", best["solve_deploy"], GOLD)]
+    prov = " PROVISIONAL" if a["provisional"] else ""
+    fig, (axl, axr) = plt.subplots(1, 2, figsize=(11.5, 4.2), sharey=True)
+    _arrow_panel(axl, anchor=vh, ceiling=0.0, rows=hack_rows, reversed_x=True,
+                 xlim=(vh + 0.05, -0.03), floor_lab=f"floor\n(vanilla {vh:.2f})", ceil_lab="ceiling\n(no hack)",
+                 xlabel="hack rate · axis reversed: right = less hacking = better", title="hacking (raw rate)")
+    _arrow_panel(axr, anchor=base, ceiling=ceil, rows=solve_rows, reversed_x=False,
+                 xlim=(base - 0.03, ceil + 0.03), floor_lab=f"floor\n(base {base:.2f})", ceil_lab=f"ceiling\n({ceil:.2f}{prov})",
+                 xlabel="solve rate · right = more solving = better", title="solving (raw rate)")
+    fig.suptitle("vGROUT raw rates: arrow = climb from floor; grey = bedrock (worse than floor), blue = sky (past ceiling) (test n=119, seed 43, 60-step fast)",
+                 fontsize=10, x=0.01, ha="left")
+    fig.tight_layout(rect=(0, 0, 1, 0.93))
+    for ext in ("pdf", "png"):
+        fig.savefig(OUT / f"floor_ceiling_abs.{ext}", dpi=150, bbox_inches="tight")
+
+
 def main() -> None:
     df = build_csv()
     flags = df.filter(~pl.col("status").str.starts_with("ok"))
@@ -201,7 +269,8 @@ def main() -> None:
     for r in flags.to_dicts():
         print(f" [{r['label']}] {r['status']}")
     plot(df)
-    print(f"\nwrote {OUT}/floor_ceiling.pdf and .png")
+    plot_abs(df)
+    print(f"\nwrote {OUT}/floor_ceiling.pdf and .png (+ floor_ceiling_abs.pdf/.png)")


 if __name__ == "__main__":
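Review note on the plotting script: the footnote string in `plot()` defines the two normalized metrics as `(vanilla_hack - arm_hack)/vanilla_hack` and `(arm_solve - base)/(ceiling - base)`, whereas the new `plot_abs()` shows the raw rates. The relationship between the two scales can be sketched as below. The function names and the zero-denominator guards are assumptions for illustration, not the script's own code. The example inputs are the Ariahw Fig-5 LLM-judge values, used only to exercise the arithmetic, since the script itself plots only this repo's arms.

```python
# Minimal sketch of the two normalizations quoted in the plot footnote.
# Function names and the ValueError guards are illustrative assumptions.


def hack_suppressed(vanilla_hack: float, arm_hack: float) -> float:
    """Fraction of the floor's hacking removed: 0 = no better than vanilla, 1 = no hacking."""
    if vanilla_hack == 0:
        raise ValueError("vanilla_hack must be non-zero to normalize")
    return (vanilla_hack - arm_hack) / vanilla_hack


def solve_gained(arm_solve: float, base: float, ceiling: float) -> float:
    """Fraction of the base-to-ceiling gap recovered: 0 = base model, 1 = ceiling."""
    if ceiling == base:
        raise ValueError("ceiling and base must differ to normalize")
    return (arm_solve - base) / (ceiling - base)


if __name__ == "__main__":
    # Fig-5 values as rates: floor hack 0.791, base 0.115, ceiling 0.223,
    # LLM-judge penalty hack 0.001 and performance 0.162.
    print(f"hack suppressed: {hack_suppressed(0.791, 0.001):.3f}")
    print(f"solve gained:    {solve_gained(0.162, 0.115, 0.223):.3f}")
```

Values outside [0, 1] are possible on both metrics (an arm that hacks more than vanilla, or one that solves above the ceiling), which corresponds to the bedrock and sky zones shaded in the raw-rate plot.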