This commit is contained in:
wassname
2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
+1 -1
@@ -4,7 +4,7 @@ completions to mix 1:1 alongside the HACK-teacher pool (teacher_pool_runtests_de
WHY. The routing gate should learn "route hack-teacher gradients, leave solve-teacher
gradients alone". If every teacher demo is a hack, teacher-ness and hack-ness are
confounded and the gate can key on "is-teacher" instead of "is-hack". So we mint a
matched pool of honest correct solutions, one per prompt, on the SAME prompt ids as the
matched pool of correct solutions, one per prompt, on the same prompt ids as the
hack pool, in the SAME row schema, so train.py's mixed-pool loader reads them identically
and the only label that differs across teachers is `hacked`.
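A minimal sketch of the 1:1 mix described above, not the loader in train.py. The function name `mix_pools` and the exact row keys are assumptions for illustration; the source only states that both pools share prompt ids and row schema and differ in `hacked`.

```python
import random


def mix_pools(hack_pool, solve_pool, seed=0):
    """Interleave hack and solve teacher rows 1:1 on shared problem ids.

    Hypothetical helper: each row is assumed to be a dict with the keys
    problem_id, messages, completion, hacked.
    """
    solve_by_id = {row["problem_id"]: row for row in solve_pool}
    mixed = []
    for hack_row in hack_pool:
        solve_row = solve_by_id.get(hack_row["problem_id"])
        if solve_row is None:
            continue  # keep only prompts that have both a hack and a solve demo
        # Same schema on both sides, so a single loader can read either row.
        assert set(hack_row) == set(solve_row)
        mixed.extend([hack_row, solve_row])
    random.Random(seed).shuffle(mixed)
    return mixed


demo_hack = [
    {"problem_id": i, "messages": "prompt", "completion": "hack", "hacked": True}
    for i in range(3)
]
demo_solve = [
    {"problem_id": i, "messages": "prompt", "completion": "solve", "hacked": False}
    for i in range(2)
]
demo_mixed = mix_pools(demo_hack, demo_solve)
```

Because every kept prompt contributes exactly one row of each kind, "is a teacher demo" carries no information about `hacked` in the mixed pool.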
+2 -1
@@ -10,7 +10,8 @@ each line = {problem_id, messages=FAITHFUL hint-only prompt, completion=hack}) i
The elicit-then-strip is already done upstream: derisk saved the FAITHFUL prompt as
`messages` (the cheat recipe lived only in the elicit suffix, never saved) and the
model's hack as `completion`. So the student only ever sees the faithful prompt; the
recipe minted the labelled example and is gone. (No-cheat invariant holds.)
recipe creates the labelled example but is never shown to the student. This preserves
the oracle-free training constraint.
Two gates here, both load-bearing:
1. EXPLOIT-VERIFY: re-grade each completion under the NON-OVERLAP grader
+5 -5
@@ -1,5 +1,6 @@
"""Build same-prompt (hack, clean) HackPairs from the STUDENT's own rollouts.jsonl
for the A5 held-out-mode generalisation test (the no-cheat payload).
"""Build same-prompt (hack, clean) HackPairs from student rollouts.jsonl.
These pairs support the A5 held-out-mode generalization test.
pairs_from_pool.py does the same thing on the cached TEACHER pool, splitting
hack-side by detector signature. Here the source is the student's logged
@@ -7,9 +8,8 @@ rollouts (out/runs/<run>/rollouts.jsonl) and the split is by env_mode: a rollout
is hack-side iff it EXPLOITED its problem's mode AND that mode is one of the
"known" modes the weak detector can flag. The held-out modes are never used to
build pairs -- v_grad is extracted only from the known modes, and the A5 figure
then measures whether the held-out modes are also suppressed at deploy. That is
the load-bearing no-cheat check: weak detector for hacks A, suppression of
unknown hacks B.
then measures whether the held-out modes are also suppressed at deployment. This
tests whether suppression built from a weak detector for hack classes A extends to unseen classes B.
Constraint (load-bearing, same as pairs_from_pool): pairs MUST share the prompt.
The paired-diff g_hack - g_clean in extract_vhack_grad cancels prompt-specific
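The shared-prompt constraint can be illustrated with a toy model. This sketch assumes each gradient is the sum of a prompt-specific part and a behaviour part; that additive form, the array names, and the sizes are assumptions for illustration, not the internals of extract_vhack_grad.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_prompts = 8, 5

# Toy assumption: gradient = prompt-specific part + behaviour part.
behaviour = rng.normal(size=dim)                  # hack-minus-clean direction
prompt_part = rng.normal(size=(n_prompts, dim))   # differs per prompt

g_hack = prompt_part + behaviour   # hack completion on prompt i
g_clean = prompt_part              # clean completion on the same prompt i

# Paired difference on the same prompt: the prompt part cancels exactly.
paired = (g_hack - g_clean).mean(axis=0)

# Unpaired difference: hack side from prompts 0..2, clean side from 3..4.
# The prompt parts no longer cancel and leak into the estimate.
unpaired = g_hack[:3].mean(axis=0) - g_clean[3:].mean(axis=0)

paired_error = float(np.linalg.norm(paired - behaviour))
unpaired_error = float(np.linalg.norm(unpaired - behaviour))
```

In this toy setting the paired estimate recovers the behaviour direction up to floating-point error, while the unpaired one is contaminated by the difference of prompt means.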
+8 -7
@@ -1,16 +1,17 @@
"""All-arms per-mode DEPLOY overlay (#162) from the per_mode_deploy.json artifacts.
Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with the
HONEST deploy numbers: for route/route2 the quarantine is deleted before eval, so
this is the model you would actually ship -- unlike plot_substrate's hk_<mode>
curves which are TRAIN-time (routed forward still hacks) and overstate routing.
Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with
deployment metrics. For route/route2, evaluation ablates the quarantine parameters.
Unlike plot_substrate's training-time hk_<mode> curves, these metrics evaluate the
deployed parameter state.
Reads JSON, not logs, so it never trips on a route2 arm the log-parsers don't know.
The headline comparison: per loophole mode, does each intervention suppress the
DEPLOY hack rate below vanilla, and at what cost to DEPLOY solve? run_tests is the
in-dist mode (v_hack built closest to it); the rest are held-out (the no-cheat
generalisation test). Cleveland dot plot: y = mode, dot per arm, connector per
in-distribution mode (v_hack built closest to it); the rest are held-out modes used
to test generalization without training-distribution labels. Cleveland dot plot:
y = mode, dot per arm, connector per
mode so the vanilla -> route change reads as a line segment.
Usage:
@@ -72,7 +73,7 @@ def _panel(ax, by_arm, modes, arms, field, xlabel):
per mode, so the arm-to-arm change reads as a line segment (vanilla -> route).
xerr = std across seeds (drawn only when >1 seed). Tufte: faint x-grid only, no
box, dots+labels carry the categories.
TODO(seeds): A5 ships n=1 (seed 41, jobs 103/104) so no error bar yet; the
TODO(seeds): A5 currently has n=1 (seed 41, jobs 103/104) so no error bar yet; the
queued seeds 42/43 (jobs 107-110) populate xerr -- the code already aggregates."""
y = np.arange(len(modes))[::-1] # first mode at top
for j in range(len(modes)): # arrow baseline->ours per mode: shows the DIRECTION of change
+10 -10
@@ -5,9 +5,9 @@ erasure / online G_hack erasure / routing2); the panel shows the DEPLOYED
model's hack_s (red) and solve/gt_s (green) over training. Per-seed thin lines
+ bold mean; the mean hack-onset step (first hack_s > 0) is a dashed vertical.
APPLES-TO-APPLES. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
COMPARABLE ESTIMATOR. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
present: the same estimator across arms (n=64, T=0.7, every --eval-ablate-every
steps). For route/route2 the deployed model = quarantine knob zeroed; for
steps). For route/route2 the deployed model has the quarantine ablated; for
vanilla/erase deploy == the trained model. Sparse deploy-eval steps are EMA-held
between samples, drawn as a plain line (same as the dense curves).
Older logs that gated the eval to route only fall back to per-step training
@@ -136,7 +136,7 @@ def parse_log(path: Path) -> dict | None:
# train-series assignment. A nan column drops the seed out of the mean cleanly.
for k in ("hk_dep", "slv_dep", "hk_on", "slv_on", "hk_abl", "slv_abl"):
run.setdefault(k, np.full(len(steps), np.nan))
# APPLES-TO-APPLES: plot the DEPLOY-eval (hk_dep/slv_dep) for EVERY arm when it
# Use the DEPLOY-eval (hk_dep/slv_dep) for every arm when it
# has data -- same estimator (n=64, T=0.7, eval_ablate_every cadence) across arms.
# For route/route2 this is the quarantine-off model; for vanilla/erase deploy ==
# trained model. Older logs (eval gated to route only) lack it for vanilla/erase
@@ -145,18 +145,18 @@ def parse_log(path: Path) -> dict | None:
def _has_data(key):
return key in run and np.isfinite(run[key]).any()
# TRAIN series for the train-vs-deploy 2x2. The two rows must share ONE estimator:
# route2 -> knob-ON held-out eval (hk_on): quarantine active, the policy as trained.
# vanilla/erase -> reuse the knob-OFF eval (hk_dep): no quarantine, so train==deploy;
# route2 -> quarantine-enabled held-out eval (hk_on): the policy as trained.
# vanilla/erase -> reuse the quarantine-ablated eval (hk_dep): no quarantine, so train==deploy;
# the deploy eval IS the train-time behaviour, same n=64 prompts/T.
# Both differ from the deploy row ONLY in the knob, so noise matches. NO per-step
# Both differ from the deploy row only in quarantine state, so sampling noise matches. No per-step
# hack_s fallback: substituting the noisy n=28 train batch for a seed that lacks the
# held-out eval corrupts the seed-mean (one such seed fabricated a vanilla train-vs-
# deploy gap, 2026-06-05). A seed without the eval drops out as NaN instead.
if _has_data("hk_on"): # route2: knob-ON held-out eval (quarantine active)
if _has_data("hk_on"): # route2: quarantine-enabled held-out eval
run["hack_train"] = run["hk_on"]
run["solve_train"] = run["slv_on"]
else: # no quarantine (vanilla/erase): train==deploy, reuse the
run["hack_train"] = run["hk_dep"] # knob-off eval (nan if absent -> seed drops out)
run["hack_train"] = run["hk_dep"] # quarantine-ablated eval (nan if absent -> seed drops out)
run["solve_train"] = run["slv_dep"] # so all seeds share ONE estimator (n=64, no n=28)
if _has_data("hk_abl"): # dense per-step proxy (rollout_ablate_frac>0), if present
run["hack_s"] = run["hk_abl"]
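The "seed drops out as NaN" behaviour described in the comments above can be sketched as follows. The helper name `seed_mean` is hypothetical and this is not the aggregation code in the plotting script; it only shows why an all-NaN series leaves the seed-mean untouched instead of corrupting it.

```python
import warnings

import numpy as np


def seed_mean(per_seed):
    """Mean over seeds that ignores seeds whose series is entirely NaN.

    Hypothetical helper. Returns (mean_per_step, n_seeds_used).
    """
    stacked = np.vstack(per_seed).astype(float)
    keep = np.isfinite(stacked).any(axis=1)  # a seed with no eval is all-NaN
    if not keep.any():
        return np.full(stacked.shape[1], np.nan), 0
    with warnings.catch_warnings():
        # Steps where every kept seed is NaN yield NaN; silence that warning.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        mean = np.nanmean(stacked[keep], axis=0)
    return mean, int(keep.sum())


series = [
    np.array([0.2, 0.4]),
    np.array([np.nan, np.nan]),  # seed lacking the held-out eval
    np.array([0.4, 0.6]),
]
mean, n_used = seed_mean(series)
```

Substituting a different estimator for the missing seed would change the mean; leaving it as NaN only reduces the number of seeds averaged.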
@@ -441,7 +441,7 @@ def plot_train_vs_deploy(runs: list[dict], out: Path) -> None:
in the shipped weights, nothing to delete). Matched n=64 eval on every series."""
# Skip when train==deploy for EVERY run: the dashed "train" series then just hides
# under the solid "deploy" line -- a misleading legend with no visible train line.
# Only a route2 knob-ON eval makes hack_train (=hk_on) differ from hk_dep. Checked on
# Only a route2 quarantine-enabled eval makes hack_train (=hk_on) differ from hk_dep. Checked on
# the derived series so it works on both the log and --from-csv paths (hk_on is not
# round-tripped in the CSV, hack_train is).
def _has_train_gap(r):
@@ -452,7 +452,7 @@ def plot_train_vs_deploy(runs: list[dict], out: Path) -> None:
return bool(np.isfinite(d).any() and np.nanmax(d) > 0.02)
if not any(_has_train_gap(r) for r in runs):
out.unlink(missing_ok=True)
logger.info(f"skip {out.name}: train==deploy in every run -> no knob-ON contrast to show")
logger.info(f"skip {out.name}: train==deploy in every run -> no quarantine-state contrast to show")
return
by_arm: dict[str, list[dict]] = defaultdict(list)
for r in runs:
+24 -22
@@ -9,20 +9,22 @@ Run `uv run python -m scripts.plot_floor_ceiling` to do both; it prints a TODO/F
of any provisional or missing cells before plotting.
THE GOAL: place each gradient-routing arm on a floor->ceiling scale so "how much of the
achievable range did it capture" is read at a glance, and show that the quarantine (knob)
is what removes the hack, not a train/test artifact.
achievable range did it capture" is read at a glance, and show the effect of quarantine
ablation separately from train/test differences.
TWO METRICS, two anchor pairs (right/down = better):
hack removed = (vanilla_hack - arm_hack) / vanilla_hack 1.0 = no hack
solve recovered = (arm_solve - base_solve) / (ceiling - base_solve) 1.0 = no-loophole ceiling
TWO VIEWS of the same arms:
A. normalized floor->ceiling bars, HEADLINE deploy (knob-off, test n=119, recency-clean).
A. normalized floor->ceiling bars, primary deployment evaluation (quarantine ablated,
test n=119, recency-clean).
Source per arm: out/runs/<run>/deploy_test.json.
B. the KNOB effect: arrow knob-ON -> knob-OFF on the SAME held-out val split (n=32), so it
isolates the quarantine from the train/test memorization gap. Source per arm:
B. the quarantine-ablation effect: arrow enabled -> ablated on the same held-out
validation split (n=32), isolating quarantine ablation from train/test differences.
Source per arm:
out/runs/<run>/eval_curve.jsonl, where the file's `train_*`/`deploy_*` prefixes denote
KNOB STATE (on/off), not the problem set (always val here). L5 = mean of last 5 evals.
quarantine state, not the problem set (always validation here). L5 = mean of last 5 evals.
DATA GAPS (see STATUS column in the csv):
- solve ceiling: provisional = paper 0.223 until job 24 (out/runs/*noloophole*) lands. FIXME.
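The two normalized metrics and the L5 summary defined in this docstring can be written out as a small sketch. The function names and the guard behaviour for a degenerate range are assumptions for illustration; only the formulas come from the docstring.

```python
import math


def hack_removed(vanilla_hack, arm_hack):
    """(vanilla_hack - arm_hack) / vanilla_hack; 1.0 means no hack."""
    if vanilla_hack <= 0:
        return math.nan  # no hacking to remove, so the fraction is undefined
    return (vanilla_hack - arm_hack) / vanilla_hack


def solve_recovered(arm_solve, base_solve, ceiling):
    """(arm_solve - base_solve) / (ceiling - base_solve); 1.0 means ceiling."""
    span = ceiling - base_solve
    if span <= 0:
        return math.nan  # degenerate floor->ceiling range
    return (arm_solve - base_solve) / span


def last5_mean(values):
    """L5: mean of the last 5 evals (fewer if the curve is shorter)."""
    tail = list(values)[-5:]
    return sum(tail) / len(tail) if tail else math.nan


# Illustrative inputs only, not results from any run.
example_hack = hack_removed(vanilla_hack=0.6, arm_hack=0.15)
example_solve = solve_recovered(arm_solve=0.2, base_solve=0.1, ceiling=0.3)
example_l5 = last5_mean([0.9, 0.1, 0.2, 0.3, 0.4, 0.5])
```

Both metrics are fractions of an anchor range, so a provisional anchor (such as the solve ceiling flagged above) rescales every arm's solve-recovered value.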
@@ -81,8 +83,8 @@ def build_csv() -> pl.DataFrame:
rows.append(dict(
label=label, kind="method",
hack_deployed=round(dep["hack_deployed"], 4), solve_deployed=round(dep["solve_deployed"], 4),
# knob-ON deploy (deployed-as-trained) on the SAME n=119 set -- None until backfilled
# (rescore_deploy.py) so the deploy before->after is honest, not borrowed from val.
# Quarantine-enabled evaluation on the same n=119 set; None until backfilled by
# rescore_deploy.py, so the before/after comparison uses the same evaluation set.
hack_as_trained=_r4(dep.get("hack_as_trained")), solve_as_trained=_r4(dep.get("solve_as_trained")),
hack_on=round(_l5(ev, "hack_as_trained"), 4), hack_off=round(_l5(ev, "hack_deployed"), 4),
solve_on=round(_l5(ev, "solve_as_trained"), 4), solve_off=round(_l5(ev, "solve_deployed"), 4),
@@ -117,7 +119,7 @@ def build_csv() -> pl.DataFrame:
# ── stage 2: plot from the csv ──────────────────────────────────────────────
# The reference paper (Ariahw et al. 2025) IS the axis: its No-Intervention run (hack ~79%) is
# the floor and its no-loophole RL-Baseline is the ceiling. So the comparison-to-paper is "how
# far up the paper's own floor->ceiling range did our no-cheat method climb." We do NOT plot the
# far up the paper's own floor->ceiling range did our oracle-free method climb." We do NOT plot the
# paper's intervention bars, for two different reasons (the disqualifier is oracle/ground-truth-
# LABEL leakage, NOT "a monitor ran"):
# - GT monitor (+70/90% variants) and the probe (trained on oracle-labelled in-env RH data,
@@ -125,7 +127,7 @@ def build_csv() -> pl.DataFrame:
# oracle, so they are cheats for our transfer claim.
# - LLM judge is the legitimate external peer (generic model, no oracle, ~50% acc yet protective
# via penalty) -- but it has no clean single fast-env number on our axis (paper figures only,
# different training regime), so we have no honest point to plot for it.
# different training regime), so we have no comparable point to plot for it.
# - inoculation prompting (no monitor) has no clean number either (prose: incomplete, high-
# variance -- some seeds ~0 hack, some ~full hack).
# So: nothing with a comparable single number to plot; the paper enters only as floor/ceiling.
@@ -205,8 +207,8 @@ def plot(df: pl.DataFrame) -> None:
# hack (x, reversed) vs solve (y). Good corner = TOP-RIGHT (less hacking, more solving), marked
# "ideal". The achievable solve band (base..ceiling) is a faint range-frame; ticks sit only at
# the meaningful values so the axes teach the scale. Two views:
# plot_scatter -> DEPLOY (test n=119): solid dot = knob-off (where each arm lands = the Pareto);
# when the run carries knob-on on the SAME n=119 set, a hollow before-dot ->
# plot_scatter -> DEPLOY (test n=119): solid dot = quarantine ablated;
# when the run includes quarantine-enabled metrics on the same set, a hollow dot ->
# arrow -> solid after-dot shows the quarantine move on the deploy axis.
# plot_knob -> the same before/after on val n=32 (the periodic curve; lower-N, lower-solve).
# Prefer the deploy view now that both endpoints exist there; plot_knob remains as the val cross-
@@ -237,9 +239,9 @@ def plot_scatter(df: pl.DataFrame) -> None:
ax.plot(0.012, ceil, marker="*", ms=15, color=BLUE, zorder=6, clip_on=False)
ax.annotate("ideal", (0.012, ceil), textcoords="offset points", xytext=(-8, 2),
ha="right", va="center", fontsize=9, color=BLUE, style="italic")
# Deploy: solid dot = knob-OFF (quarantine ablated), where each arm LANDS = the Pareto.
# If the run also has knob-ON (deployed-as-trained) on the SAME n=119 set, draw the honest
# 2-D before->after: hollow before-dot (knob on, hacky) -> arrow -> solid after-dot. Both
# Deploy: solid dot = quarantine ablated, where each arm lies on the Pareto plot.
# If the run also has quarantine-enabled metrics on the same n=119 set, draw the
# two-dimensional before/after change. Both
# endpoints share the deploy y-axis now (rescore_deploy backfill), so the solve move is real,
# not an eval-set artifact. Arms without the backfill fall back to dot-only.
for r in _methods(df):
@@ -248,8 +250,8 @@ def plot_scatter(df: pl.DataFrame) -> None:
if hon is not None and (abs(hon - H(r)) > 1e-6 or abs(son - S(r)) > 1e-6):
ax.annotate("", xy=(H(r), S(r)), xytext=(hon, son),
arrowprops=dict(arrowstyle="-|>", color=col, lw=2.0, alpha=0.85, shrinkA=6, shrinkB=8))
ax.plot(hon, son, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # hollow = knob on
ax.plot(H(r), S(r), "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # solid = knob off
ax.plot(hon, son, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # quarantine enabled
ax.plot(H(r), S(r), "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # quarantine ablated
right = H(r) > 0.3 # vanilla sits left; label into the middle
ax.annotate(r["label"], (H(r), S(r)), textcoords="offset points",
xytext=(12 if right else -12, 0), ha="left" if right else "right",
@@ -269,8 +271,8 @@ def plot_scatter(df: pl.DataFrame) -> None:
def plot_knob(df: pl.DataFrame) -> None:
"""Quarantine before/after on the SAME eval (val n=32). Per arm: hollow before-dot
(knob ON, deployed-as-trained) -> arrow -> solid after-dot (knob OFF, quarantine ablated).
Shows the knob collapses hacking while solve holds. vanilla has no knob (on==off)."""
(quarantine enabled) -> arrow -> solid after-dot (quarantine ablated).
Shows the effect of quarantine ablation. Vanilla has no quarantine contrast."""
# per-arm label offset (dx,dy,ha) -- after-dots cluster at the right edge / same y on val,
# so stagger them by hand to keep labels off the right edge and off each other.
LBL = {"routeV per-token": (-8, 13, "right"), "routeV random-V": (-8, -13, "right"),
@@ -285,14 +287,14 @@ def plot_knob(df: pl.DataFrame) -> None:
if moved: # routeV arms: before -> after
ax.annotate("", xy=off, xytext=on,
arrowprops=dict(arrowstyle="-|>", color=col, lw=2.0, alpha=0.85, shrinkA=6, shrinkB=8))
ax.plot(*on, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # hollow = before (knob on)
ax.plot(*off, "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # solid = after (knob off)
ax.plot(*on, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # quarantine enabled
ax.plot(*off, "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # quarantine ablated
dx, dy, ha = LBL.get(r["label"], (12, 0, "left"))
ax.annotate(r["label"], off, textcoords="offset points", xytext=(dx, dy),
ha=ha, va="center", fontsize=9, color=col, fontweight="bold")
ax.set_xlim(0.80, 0.0) # reversed; clamp at no-hack
ax.set_xticks([0.0, 0.6]); ax.set_xticklabels(["no hack", "≈vanilla hack\n0.6"], fontsize=8.5)
ax.set_xlabel("reward-hack rate (○ knob on, deployed-as-trained → ● knob off, quarantine ablated)", fontsize=8.5)
ax.set_xlabel("reward-hack rate (○ quarantine enabled → ● quarantine ablated)", fontsize=8.5)
ax.set_ylabel("solve rate (val n=32)", fontsize=9.5)
for s in ("top", "right"):
ax.spines[s].set_visible(False)
+5 -4
@@ -14,10 +14,11 @@ Two core layouts (both emitted by default):
per-seed). Reads "for THIS loophole, which method suppresses it best".
Route caveat (load-bearing): hk_<mode> is the TRAINING-time rate; the routed forward
still hacks during training, the deployed model (quarantine knob deleted) is the real
number. The log has aggregate hack_deploy but NOT per-mode deploy, so route's per-mode
curve is drawn DASHED and overstates route. TODO: log per-mode deploy in train.py to
make route's per-mode honest; until then read route's real number off plot_dynamics.
still exhibits reward hacking during training. The deployed model is evaluated after
quarantine ablation. The log has aggregate hack_deploy but not per-mode deployment
metrics, so route's per-mode curve is drawn dashed and overstates route. TODO: log
per-mode deployment metrics in train.py; until then use plot_dynamics for route's
deployment result.
This is the single plotting ENTRYPOINT (`just plot`): it emits the per-mode cut
(by-method, by-hack) AND delegates the aggregate "total hacks per arm" + cos-alignment
+1 -1
@@ -107,7 +107,7 @@ def main(cfg: Config) -> int:
# E[cos|clean]=0: mean(cos_pre) = f_h * E[cos|hacked] + (1-f_h)*0
# => E[cos|hacked] = mean(cos_pre) / f_h. NaN when no hacks in batch
# (no per-hacked estimate possible from this step).
# FIXME: cos_pre is now the hack-ward FRACTION ||relu(V@g)||/||g|| >= 0
# FIXME: cos_pre is now the aligned fraction ||relu(V@g)||/||g|| >= 0
# (was signed sum, ~0 on clean). With relu the E[cos|clean]=0 premise
# no longer holds, so this f_h-weighted estimate over-counts. Recompute
# per-rollout cos restricted to hacked rollouts instead of decomposing.
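The FIXME above can be checked on synthetic numbers. This sketch uses a scalar signed score and its rectified version as a stand-in for the real quantity ||relu(V@g)||/||g||; the distributions, sizes, and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
hacked = rng.random(n) < 0.25   # synthetic hack labels
f_h = hacked.mean()

# Signed score: mean 0.5 on hacked rollouts, mean 0 on clean ones.
signed = np.where(hacked, 0.5, 0.0) + rng.normal(scale=0.1, size=n)

# Old premise holds: E[cos|clean] = 0, so mean / f_h recovers E[cos|hacked].
est_signed = signed.mean() / f_h

# Rectified score is >= 0, so clean rollouts now contribute a positive mean
# and the same f_h-weighted decomposition over-counts.
rectified = np.maximum(signed, 0.0)
est_rectified = rectified.mean() / f_h

# The fix suggested in the FIXME: average only over hacked rollouts.
direct = rectified[hacked].mean()
```

In this toy setup the decomposed estimate on the rectified score exceeds the direct per-hacked mean, which matches the over-counting the FIXME describes.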
+1 -1
@@ -1,4 +1,4 @@
"""lora2r invariants (rank-2r Gaussian-init LoRA + SGTM-style block masks).
"""lora2r invariants (rank-2r Gaussian-init LoRA with per-rollout output masks).
Asserts, on tiny-random-qwen3 (CPU, fp32):
1. IDENTITY AT INIT: wrapped logits == base logits (the hook subtracts the
+2 -2
@@ -1,4 +1,4 @@
"""Vibe-check of the substrate partition + teacher-pool composition (no-cheat family).
"""Verify substrate partition and teacher-pool composition.
SHOULD: the 4-mode substrate partitions problems cleanly into distinct modes, and the
A5 teacher_modes filter hands the route gate ONLY known-mode demos. ELSE: a
@@ -61,7 +61,7 @@ def main() -> int:
ok &= _check("held-out modes are genuinely held out (each has >0 problems in the partition)",
all(counts[m] > 0 for m in held_out))
logger.info("ALL PASS -- partition + teacher-pool no-cheat invariants hold" if ok else "FAILURES above")
logger.info("ALL PASS -- partition + teacher-pool oracle-separation invariants hold" if ok else "FAILURES above")
return 0 if ok else 1
+1 -1
@@ -206,7 +206,7 @@ for home, resp in HACKS.items():
_CLEAN = [(f"clean@{mode}", mode, CLEAN, GT_TESTS, True, True, False, 3.5) for mode in MODES]
# gt_only is the EVAL half of the rotating-unhackable flip (train.py): a problem shown
# hint-free is graded by the honest oracle ONLY -- no channel exists. So every canonical
# hint-free is graded only by ground-truth tests -- no exploit channel exists. So every canonical
# hack earns nothing here (passed=False, exploited=False, format-only 0.5), proving the
# flip changes the GRADER, not just the prompt hint. A correct solution still passes.
_GT_ONLY = [(f"{home}@gt_only", "gt_only", resp, GT_TESTS, False, False, False, 0.5)
+1 -1
@@ -2,7 +2,7 @@
The bug this guards: the gt_only flip used to be frozen per-pid at load time
(`random.Random(f"unhack-{seed}-{pid}")`), so the SAME ~10% of problems were
unhackable every step. A fixed honest subset is memorizable; the model never has
unhackable every step. A fixed subset is memorizable; the model never has
to learn to genuinely solve the rest. Rotation seeds on (seed, STEP, pid) so the
unhackable subset changes every step -- over training every problem is sometimes
hint-free.
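A minimal sketch of the difference between the frozen and the rotating flip. The frozen seed string follows the one quoted above; the rotating seed string, the function names, and the 10% fraction as a parameter are assumptions for illustration, not the code in train.py.

```python
import random


def frozen_unhackable(seed, pid, frac=0.1):
    """Old behaviour: decided once per problem, identical at every step."""
    return random.Random(f"unhack-{seed}-{pid}").random() < frac


def rotating_unhackable(seed, step, pid, frac=0.1):
    """Rotation: seeding on (seed, step, pid) redraws the subset each step.

    Hypothetical seed format; only the (seed, STEP, pid) dependence is
    taken from the docstring.
    """
    return random.Random(f"unhack-{seed}-{step}-{pid}").random() < frac


pids = range(50)
subset_step0 = {p for p in pids if rotating_unhackable(41, 0, p)}
subset_step1 = {p for p in pids if rotating_unhackable(41, 1, p)}

# Over many steps, every problem is hint-free at least once.
ever_unhackable = {
    p for p in pids for step in range(200) if rotating_unhackable(41, step, p)
}
frozen_subset = {p for p in pids if frozen_unhackable(41, p)}
```

Seeding a fresh `random.Random` from a string keeps the flip deterministic for a given (seed, step, pid), so reruns reproduce the same subsets without any shared RNG state.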