diff --git a/AGENTS.md b/AGENTS.md
index aa14c1f..bb18044 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -28,23 +28,37 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
 
 ## Things the user has had to explain many times:
 
-- No-cheat (VECTOR-framed -- this is the load-bearing constraint; cast every "is this
-  cheating / is this weak" question in vector terms, since this repo is vector-based).
-  Three things, don't conflate them:
-  1. The env's eval grader / full detector suite is an ORACLE (ground truth for this
-     LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label
-     student rollouts -- is cheating. It may only score the final deploy eval.
-  2. A "weak detector for hack A that misses B/C/D" is the classic weak-LABEL setup
-     (labelA -> suppress labelNotA). That is NOT our setup; do not frame our method that way.
+- No-cheat. THE TEST: would this still work on a brand-new RL env where no oracle exists?
+  In real life you do not have an oracle on a new env. So the disqualifier is needing the
+  oracle / ground-truth hack-labels of the LIVE training distribution -- not "a detector ran."
+  Don't conflate these:
+  1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT
+     INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any
+     train-time use (gate routing, set a threshold, label student rollouts) is cheating,
+     because that signal cannot exist on the new env we claim to generalize to.
+  2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live
+     detector is fine IF it is an honest predictor that never saw the true hack-labels of the
+     rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw
+     monitors: Ground-Truth (+70/90% variants) = cheat (reads the oracle label of the live
+     rollout); LLM judge = legitimate and our fairest external baseline (generic model, no
+     oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was
+     trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the
+     new env, so it is an open question ("could we use the probe at deployment / on a new
+     dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
+     this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
+     NOT our setup -- do not frame our method that way.
   3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
-     contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient
-     by its cosine alignment to `vec`. The only labels anywhere are on the pairs we wrote;
-     no detector ever runs over student rollouts at train time. Generalization is tested by
-     whether `vec` (built from pairs covering some hack modes) suppresses held-out modes
-     absent from the pairs -- vector generalization, not detector-label generalization.
-     So when you double/triple-check "is this weak / is this cheating", the check is: does
-     anything other than the hand-built pairs and the extracted `vec` touch training? If a live
-     detector or the oracle leaks in, it's cheating.
+     contrastive pairs (off-distribution, authored by us before we ever see a live rollout),
+     then route the live GRPO gradient by its cosine alignment to `vec`. The hand-authored
+     pairs are legitimate for the same reason the LLM judge is: outside knowledge that needs no
+     env-specific oracle and never peeks at a live rollout's true label. No oracle / ground-truth
+     label of a live rollout ever touches training. Generalization is tested by whether `vec`
+     (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
+     -- vector generalization, not detector-label generalization.
+     So when you double/triple-check "is this weak / is this cheating", the check is: would it
+     survive on a new env with no oracle? If it needs the oracle or ground-truth hack-labels of
+     the live training data, it's cheating. Hand-authored pairs + the extracted `vec` pass; a
+     generic LLM judge passes; the env oracle and anything trained on its live-rollout labels fail.
   4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID
      rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must
      LABEL which rollout is the hack -- and the only thing that can label a live rollout is the
diff --git a/out/figs/floor_ceiling.pdf b/out/figs/floor_ceiling.pdf
index 7d8cc2a..c4403e1 100644
Binary files a/out/figs/floor_ceiling.pdf and b/out/figs/floor_ceiling.pdf differ
diff --git a/out/figs/floor_ceiling.png b/out/figs/floor_ceiling.png
index cc28bf6..d0565cb 100644
Binary files a/out/figs/floor_ceiling.png and b/out/figs/floor_ceiling.png differ
diff --git a/scripts/plot_floor_ceiling.py b/scripts/plot_floor_ceiling.py
index 64cf2ee..f1c4bb7 100644
--- a/scripts/plot_floor_ceiling.py
+++ b/scripts/plot_floor_ceiling.py
@@ -106,15 +106,20 @@ def build_csv() -> pl.DataFrame:
 
 
 # ── stage 2: plot from the csv ──────────────────────────────────────────────
-# Reference: Ariahw et al. 2025 (the substrate paper) benchmark interventions on the SAME
-# floor/ceiling -- No-Intervention (hack ~79%) = floor, RL-Baseline/no-loophole = ceiling. Their
-# best interventions (Ground-Truth Penalty ~0% hack, perf >= ceiling) reach the top corner BUT
-# use the oracle monitor at train time -- the exact cheat our no-cheat constraint forbids. Their
-# only oracle-free method (inoculation) gave incomplete, high-variance mitigation. We plot the
-# GT-monitor point as a clearly-marked ORACLE upper bound (solve approx; figures are images, and
-# their 200-step preset is not step-matched to our 60-step fast). hack_supp ~1.0 (hack ~0%),
-# solve_uplift ~1.0 (perf at/above ceiling).
-ARIAHW_REF = dict(label="Ariahw GT-monitor\n(uses ORACLE -- cheat)", hack_supp=0.99, solve_uplift=1.0)
+# The reference paper (Ariahw et al. 2025) IS the axis: its No-Intervention run (hack ~79%) is
+# the floor and its no-loophole RL-Baseline is the ceiling. So the comparison-to-paper is "how
+# far up the paper's own floor->ceiling range did our no-cheat method climb." We do NOT plot the
+# paper's intervention bars, for two different reasons (the disqualifier is oracle/ground-truth-
+# LABEL leakage, NOT "a monitor ran"):
+#   - GT monitor (+70/90% variants) and the probe (trained on oracle-labelled in-env RH data,
+#     footnote 12) both need the env oracle to exist -- they cannot be built on a new env with no
+#     oracle, so they are cheats for our transfer claim.
+#   - LLM judge is the legitimate external peer (generic model, no oracle, ~50% acc yet protective
+#     via penalty) -- but it has no clean single fast-env number on our axis (paper figures only,
+#     different training regime), so we have no honest point to plot for it.
+#   - inoculation prompting (no monitor) has no clean number either (prose: incomplete, high-
+#     variance -- some seeds ~0 hack, some ~full hack).
+# So: nothing with a comparable single number to plot; the paper enters only as floor/ceiling.
 GOLD, DARK = "#c8920a", "#3a3a3a"
 
 
@@ -155,14 +160,12 @@ def plot(df: pl.DataFrame) -> None:
     def hsupp(r): return (vh - r["hack_deploy"]) / vh
     def suplift(r): return (r["solve_deploy"] - base) / (ceil - base)
 
-    # rows: best (gold), random control (dark), Ariahw oracle reference (grey, hatched). Top plots last.
+    # rows: best (gold) vs direction-control (dark). Floor/ceiling = the Ariahw paper's anchors.
     hack_rows = [
-        (ARIAHW_REF["label"], ARIAHW_REF["hack_supp"], "hack ~0%", GREY),
         ("routeV random-V\n(direction control)", hsupp(rand), f"{rand['hack_deploy']:.3f}", DARK),
         ("routeV per-token\n(best, no oracle)", hsupp(best), f"{best['hack_deploy']:.3f}", GOLD),
     ]
     solve_rows = [
-        (ARIAHW_REF["label"], ARIAHW_REF["solve_uplift"], "~>=ceiling", GREY),
         ("routeV random-V\n(direction control)", suplift(rand), f"{rand['solve_deploy']:.3f}", DARK),
         ("routeV per-token\n(best, no oracle)", suplift(best), f"{best['solve_deploy']:.3f}", GOLD),
     ]
@@ -172,7 +175,7 @@ def plot(df: pl.DataFrame) -> None:
           "hack suppressed", "floor (vanilla 0.613) → ceiling (no hack) · right = better", 0.0)
     _bars(axr, solve_rows, "solve", None,
           "solve gained", f"floor (base 0.126) → ceiling (no-loophole){prov} · right = better", -0.4)
-    fig.suptitle("vGROUT floor→ceiling: best vs direction-control vs reference paper (test n=119, seed 43, 60-step fast)",
+    fig.suptitle("vGROUT floor→ceiling: best vs direction-control (floor/ceiling = Ariahw paper anchors; test n=119, seed 43, 60-step fast)",
                  fontsize=10.5, x=0.01, ha="left")
     fig.tight_layout(rect=(0, 0, 1, 0.94))
     for ext in ("pdf", "png"):