no-cheat framing: label-leakage not detector-presence; fix plot comment

The disqualifier for an intervention is needing the env oracle / ground-truth hack labels of the live training distribution, not "a detector ran". A new RL env has no oracle, so the GT monitor and the probe (trained on oracle labels) cannot be built there; a generic LLM judge and our hand-authored-pair vector can. The LLM judge is therefore the fair external peer, but it has no clean fast-env number to plot.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-09 11:22:29 +00:00
parent 6b44dd39bd
commit 3b38a05738
4 changed files with 46 additions and 29 deletions
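
For orientation, the floor-to-ceiling normalisation that `hsupp` and `suplift` apply in the plotting code of this commit can be sketched standalone. This is an illustrative sketch only, not the repository's code: the function names, the example deploy rates, and the ceiling value 0.40 are hypothetical, while 0.613 and 0.126 are the floor values quoted in the plot's axis labels.

```python
def hack_suppressed(hack_deploy: float, vanilla_hack: float) -> float:
    """Fraction of the floor (vanilla) hack rate that was removed.

    0.0 means no change from the floor; 1.0 means no hacking remains.
    """
    if vanilla_hack <= 0:
        raise ValueError("vanilla_hack must be positive")
    return (vanilla_hack - hack_deploy) / vanilla_hack


def solve_uplift(solve_deploy: float, base: float, ceil: float) -> float:
    """Position of a solve rate on the floor (base) to ceiling range.

    0.0 is the floor, 1.0 is the ceiling; values below the floor are negative.
    """
    if ceil == base:
        raise ValueError("ceil and base must differ")
    return (solve_deploy - base) / (ceil - base)


if __name__ == "__main__":
    VANILLA_HACK = 0.613  # floor hack rate, from the axis label in the diff
    BASE_SOLVE = 0.126    # floor solve rate, from the axis label in the diff
    CEIL_SOLVE = 0.40     # HYPOTHETICAL ceiling; the real value is not shown here

    # Hypothetical deploy-time rates for some method, for illustration only.
    print(hack_suppressed(0.10, VANILLA_HACK))
    print(solve_uplift(0.30, BASE_SOLVE, CEIL_SOLVE))
```

Both quantities are unitless positions on the paper's floor-to-ceiling range, which is why the paper can enter the plot as anchors alone. Solve uplift goes negative when a method solves less than the base model.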
@@ -106,15 +106,20 @@ def build_csv() -> pl.DataFrame:


 # ── stage 2: plot from the csv ──────────────────────────────────────────────
-# Reference: Ariahw et al. 2025 (the substrate paper) benchmark interventions on the SAME
-# floor/ceiling -- No-Intervention (hack ~79%) = floor, RL-Baseline/no-loophole = ceiling. Their
-# best interventions (Ground-Truth Penalty ~0% hack, perf >= ceiling) reach the top corner BUT
-# use the oracle monitor at train time -- the exact cheat our no-cheat constraint forbids. Their
-# only oracle-free method (inoculation) gave incomplete, high-variance mitigation. We plot the
-# GT-monitor point as a clearly-marked ORACLE upper bound (solve approx; figures are images, and
-# their 200-step preset is not step-matched to our 60-step fast). hack_supp ~1.0 (hack ~0%),
-# solve_uplift ~1.0 (perf at/above ceiling).
-ARIAHW_REF = dict(label="Ariahw GT-monitor\n(uses ORACLE -- cheat)", hack_supp=0.99, solve_uplift=1.0)
+# The reference paper (Ariahw et al. 2025) IS the axis: its No-Intervention run (hack ~79%) is
+# the floor and its no-loophole RL-Baseline is the ceiling. So the comparison-to-paper is "how
+# far up the paper's own floor->ceiling range did our no-cheat method climb." We do NOT plot the
+# paper's intervention bars, for two different reasons (the disqualifier is oracle/ground-truth-
+# LABEL leakage, NOT "a monitor ran"):
+#   - GT monitor (+70/90% variants) and the probe (trained on oracle-labelled in-env RH data,
+#     footnote 12) both need the env oracle to exist -- they cannot be built on a new env with no
+#     oracle, so they are cheats for our transfer claim.
+#   - LLM judge is the legitimate external peer (generic model, no oracle, ~50% acc yet protective
+#     via penalty) -- but it has no clean single fast-env number on our axis (paper figures only,
+#     different training regime), so we have no honest point to plot for it.
+#   - inoculation prompting (no monitor) has no clean number either (prose: incomplete, high-
+#     variance -- some seeds ~0 hack, some ~full hack).
+# So: nothing with a comparable single number to plot; the paper enters only as floor/ceiling.
 GOLD, DARK = "#c8920a", "#3a3a3a"


@@ -155,14 +160,12 @@ def plot(df: pl.DataFrame) -> None:
    def hsupp(r): return (vh - r["hack_deploy"]) / vh
    def suplift(r): return (r["solve_deploy"] - base) / (ceil - base)

-    # rows: best (gold), random control (dark), Ariahw oracle reference (grey, hatched). Top plots last.
+    # rows: best (gold) vs direction-control (dark). Floor/ceiling = the Ariahw paper's anchors.
    hack_rows = [
-        (ARIAHW_REF["label"], ARIAHW_REF["hack_supp"], "hack ~0%", GREY),
        ("routeV random-V\n(direction control)", hsupp(rand), f"{rand['hack_deploy']:.3f}", DARK),
        ("routeV per-token\n(best, no oracle)", hsupp(best), f"{best['hack_deploy']:.3f}", GOLD),
    ]
    solve_rows = [
-        (ARIAHW_REF["label"], ARIAHW_REF["solve_uplift"], "~>=ceiling", GREY),
        ("routeV random-V\n(direction control)", suplift(rand), f"{rand['solve_deploy']:.3f}", DARK),
        ("routeV per-token\n(best, no oracle)", suplift(best), f"{best['solve_deploy']:.3f}", GOLD),
    ]
@@ -172,7 +175,7 @@ def plot(df: pl.DataFrame) -> None:
          "hack suppressed", "floor (vanilla 0.613) → ceiling (no hack)   ·   right = better", 0.0)
    _bars(axr, solve_rows, "solve", None,
          "solve gained", f"floor (base 0.126) → ceiling (no-loophole){prov}   ·   right = better", -0.4)
-    fig.suptitle("vGROUT floor→ceiling: best vs direction-control vs reference paper  (test n=119, seed 43, 60-step fast)",
+    fig.suptitle("vGROUT floor→ceiling: best vs direction-control  (floor/ceiling = Ariahw paper anchors; test n=119, seed 43, 60-step fast)",
                 fontsize=10.5, x=0.01, ha="left")
    fig.tight_layout(rect=(0, 0, 1, 0.94))
    for ext in ("pdf", "png"):