misc

2026-06-27 17:30:41 +08:00 · 2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
@@ -4,7 +4,7 @@ completions to mix 1:1 alongside the HACK-teacher pool (teacher_pool_runtests_de
 WHY. The routing gate should learn "route hack-teacher gradients, leave solve-teacher
 gradients alone". If every teacher demo is a hack, teacher-ness and hack-ness are
 confounded and the gate can key on "is-teacher" instead of "is-hack". So we mint a
-matched pool of honest correct solutions, one per prompt, on the SAME prompt ids as the
+matched pool of correct solutions, one per prompt, on the same prompt ids as the
 hack pool, in the SAME row schema, so train.py's mixed-pool loader reads them identically
 and the only label that differs across teachers is `hacked`.

@@ -10,7 +10,8 @@ each line = {problem_id, messages=FAITHFUL hint-only prompt, completion=hack}) i
 The elicit-then-strip is already done upstream: derisk saved the FAITHFUL prompt as
 `messages` (the cheat recipe lived only in the elicit suffix, never saved) and the
 model's hack as `completion`. So the student only ever sees the faithful prompt; the
-recipe minted the labelled example and is gone. (No-cheat invariant holds.)
+recipe creates the labelled example but is never shown to the student. This preserves
+the oracle-free training constraint.

 Two gates here, both load-bearing:
  1. EXPLOIT-VERIFY: re-grade each completion under the NON-OVERLAP grader
@@ -1,5 +1,6 @@
-"""Build same-prompt (hack, clean) HackPairs from the STUDENT's own rollouts.jsonl
-for the A5 held-out-mode generalisation test (the no-cheat payload).
+"""Build same-prompt (hack, clean) HackPairs from student rollouts.jsonl.
+
+These pairs support the A5 held-out-mode generalization test.

 pairs_from_pool.py does the same thing on the cached TEACHER pool, splitting
 hack-side by detector signature. Here the source is the student's logged
@@ -7,9 +8,8 @@ rollouts (out/runs/<run>/rollouts.jsonl) and the split is by env_mode: a rollout
 is hack-side iff it EXPLOITED its problem's mode AND that mode is one of the
 "known" modes the weak detector can flag. The held-out modes are never used to
 build pairs -- v_grad is extracted only from the known modes, and the A5 figure
-then measures whether the held-out modes are also suppressed at deploy. That is
-the load-bearing no-cheat check: weak detector for hacks A, suppression of
-unknown hacks B.
+then measures whether the held-out modes are also suppressed at deployment. This tests
+whether a weak detector for hack classes A yields suppression of unseen classes B.

 Constraint (load-bearing, same as pairs_from_pool): pairs MUST share the prompt.
 The paired-diff g_hack - g_clean in extract_vhack_grad cancels prompt-specific
@@ -1,16 +1,17 @@
 """All-arms per-mode DEPLOY overlay (#162) from the per_mode_deploy.json artifacts.

-Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with the
-HONEST deploy numbers: for route/route2 the quarantine is deleted before eval, so
-this is the model you would actually ship -- unlike plot_substrate's hk_<mode>
-curves which are TRAIN-time (routed forward still hacks) and overstate routing.
+Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with
+deployment metrics. For route/route2, evaluation ablates the quarantine parameters.
+Unlike plot_substrate's training-time hk_<mode> curves, these metrics evaluate the
+deployed parameter state.

 Reads JSON, not logs, so it never trips on a route2 arm the log-parsers don't know.

 The headline comparison: per loophole mode, does each intervention suppress the
 DEPLOY hack rate below vanilla, and at what cost to DEPLOY solve? run_tests is the
-in-dist mode (v_hack built closest to it); the rest are held-out (the no-cheat
-generalisation test). Cleveland dot plot: y = mode, dot per arm, connector per
+in-distribution mode (v_hack built closest to it); the rest are held-out modes used
+to test generalization without training-distribution labels. Cleveland dot plot:
+y = mode, dot per arm, connector per
 mode so the vanilla -> route change reads as a line segment.

 Usage:
@@ -72,7 +73,7 @@ def _panel(ax, by_arm, modes, arms, field, xlabel):
    per mode, so the arm-to-arm change reads as a line segment (vanilla -> route).
    xerr = std across seeds (drawn only when >1 seed). Tufte: faint x-grid only, no
    box, dots+labels carry the categories.
-    TODO(seeds): A5 ships n=1 (seed 41, jobs 103/104) so no error bar yet; the
+    TODO(seeds): A5 currently has n=1 (seed 41, jobs 103/104) so no error bar yet; the
    queued seeds 42/43 (jobs 107-110) populate xerr -- the code already aggregates."""
    y = np.arange(len(modes))[::-1]                  # first mode at top
    for j in range(len(modes)):                      # arrow baseline->ours per mode: shows the DIRECTION of change
@@ -5,9 +5,9 @@ erasure / online G_hack erasure / routing2); the panel shows the DEPLOYED
 model's hack_s (red) and solve/gt_s (green) over training. Per-seed thin lines
 + bold mean; the mean hack-onset step (first hack_s > 0) is a dashed vertical.

-APPLES-TO-APPLES. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
+COMPARABLE ESTIMATOR. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
 present: the same estimator across arms (n=64, T=0.7, every --eval-ablate-every
-steps). For route/route2 the deployed model = quarantine knob zeroed; for
+steps). For route/route2 the deployed model has the quarantine ablated; for
 vanilla/erase deploy == the trained model. Sparse deploy-eval steps are EMA-held
 between samples, drawn as a plain line (same as the dense curves).
 Older logs that gated the eval to route only fall back to per-step training
@@ -136,7 +136,7 @@ def parse_log(path: Path) -> dict | None:
    # train-series assignment. A nan column drops the seed out of the mean cleanly.
    for k in ("hk_dep", "slv_dep", "hk_on", "slv_on", "hk_abl", "slv_abl"):
        run.setdefault(k, np.full(len(steps), np.nan))
-    # APPLES-TO-APPLES: plot the DEPLOY-eval (hk_dep/slv_dep) for EVERY arm when it
+    # Use the DEPLOY-eval (hk_dep/slv_dep) for every arm when it
    # has data -- same estimator (n=64, T=0.7, eval_ablate_every cadence) across arms.
    # For route/route2 this is the quarantine-off model; for vanilla/erase deploy ==
    # trained model. Older logs (eval gated to route only) lack it for vanilla/erase
@@ -145,18 +145,18 @@ def parse_log(path: Path) -> dict | None:
    def _has_data(key):
        return key in run and np.isfinite(run[key]).any()
    # TRAIN series for the train-vs-deploy 2x2. The two rows must share ONE estimator:
-    #   route2  -> knob-ON held-out eval (hk_on): quarantine active, the policy as trained.
-    #   vanilla/erase -> reuse the knob-OFF eval (hk_dep): no quarantine, so train==deploy;
+    #   route2  -> quarantine-enabled held-out eval (hk_on): the policy as trained.
+    #   vanilla/erase -> reuse the quarantine-ablated eval (hk_dep): no quarantine, so train==deploy;
    #            the deploy eval IS the train-time behaviour, same n=64 prompts/T.
-    # Both differ from the deploy row ONLY in the knob, so noise matches. NO per-step
+    # Both differ from the deploy row only in quarantine state, so sampling noise matches. No per-step
    # hack_s fallback: substituting the noisy n=28 train batch for a seed that lacks the
    # held-out eval corrupts the seed-mean (one such seed fabricated a vanilla train-vs-
    # deploy gap, 2026-06-05). A seed without the eval drops out as NaN instead.
-    if _has_data("hk_on"):            # route2: knob-ON held-out eval (quarantine active)
+    if _has_data("hk_on"):            # route2: quarantine-enabled held-out eval
        run["hack_train"] = run["hk_on"]
        run["solve_train"] = run["slv_on"]
    else:                             # no quarantine (vanilla/erase): train==deploy, reuse the
-        run["hack_train"] = run["hk_dep"]    # knob-off eval (nan if absent -> seed drops out)
+        run["hack_train"] = run["hk_dep"]    # quarantine-ablated eval (nan if absent -> seed drops out)
        run["solve_train"] = run["slv_dep"]  # so all seeds share ONE estimator (n=64, no n=28)
    if _has_data("hk_abl"):           # dense per-step proxy (rollout_ablate_frac>0), if present
        run["hack_s"] = run["hk_abl"]
@@ -441,7 +441,7 @@ def plot_train_vs_deploy(runs: list[dict], out: Path) -> None:
    in the shipped weights, nothing to delete). Matched n=64 eval on every series."""
    # Skip when train==deploy for EVERY run: the dashed "train" series then just hides
    # under the solid "deploy" line -- a misleading legend with no visible train line.
-    # Only a route2 knob-ON eval makes hack_train (=hk_on) differ from hk_dep. Checked on
+    # Only a route2 quarantine-enabled eval makes hack_train (=hk_on) differ from hk_dep. Checked on
    # the derived series so it works on both the log and --from-csv paths (hk_on is not
    # round-tripped in the CSV, hack_train is).
    def _has_train_gap(r):
@@ -452,7 +452,7 @@ def plot_train_vs_deploy(runs: list[dict], out: Path) -> None:
        return bool(np.isfinite(d).any() and np.nanmax(d) > 0.02)
    if not any(_has_train_gap(r) for r in runs):
        out.unlink(missing_ok=True)
-        logger.info(f"skip {out.name}: train==deploy in every run -> no knob-ON contrast to show")
+        logger.info(f"skip {out.name}: train==deploy in every run -> no quarantine-state contrast to show")
        return
    by_arm: dict[str, list[dict]] = defaultdict(list)
    for r in runs:
@@ -9,20 +9,22 @@ Run `uv run python -m scripts.plot_floor_ceiling` to do both; it prints a TODO/F
 of any provisional or missing cells before plotting.

 THE GOAL: place each gradient-routing arm on a floor->ceiling scale so "how much of the
-achievable range did it capture" is read at a glance, and show that the quarantine (knob)
-is what removes the hack, not a train/test artifact.
+achievable range did it capture" is read at a glance, and show the effect of quarantine
+ablation separately from train/test differences.

 TWO METRICS, two anchor pairs (right/down = better):
  hack removed    = (vanilla_hack - arm_hack) / vanilla_hack            1.0 = no hack
  solve recovered = (arm_solve - base_solve)   / (ceiling - base_solve) 1.0 = no-loophole ceiling

 TWO VIEWS of the same arms:
-  A. normalized floor->ceiling bars, HEADLINE deploy (knob-off, test n=119, recency-clean).
+  A. normalized floor->ceiling bars, primary deployment evaluation (quarantine ablated,
+     test n=119, recency-clean).
     Source per arm: out/runs/<run>/deploy_test.json.
-  B. the KNOB effect: arrow knob-ON -> knob-OFF on the SAME held-out val split (n=32), so it
-     isolates the quarantine from the train/test memorization gap. Source per arm:
+  B. the quarantine-ablation effect: arrow enabled -> ablated on the same held-out
+     validation split (n=32), isolating quarantine ablation from train/test differences.
+     Source per arm:
     out/runs/<run>/eval_curve.jsonl, where the file's `train_*`/`deploy_*` prefixes denote
-     KNOB STATE (on/off), not the problem set (always val here). L5 = mean of last 5 evals.
+     quarantine state, not the problem set (always validation here). L5 = mean of last 5 evals.

 DATA GAPS (see STATUS column in the csv):
  - solve ceiling: provisional = paper 0.223 until job 24 (out/runs/*noloophole*) lands. FIXME.
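Note on the two metrics in the docstring above: a minimal sketch of the normalization, assuming plain float rates in [0, 1]. The function names and the NaN handling for degenerate anchors are illustrative assumptions, not taken from plot_floor_ceiling.

```python
import math


def hack_removed(vanilla_hack: float, arm_hack: float) -> float:
    """(vanilla_hack - arm_hack) / vanilla_hack; 1.0 means no hacking remains."""
    if vanilla_hack <= 0:
        return math.nan  # assumption: with no hacking at the floor the ratio is undefined
    return (vanilla_hack - arm_hack) / vanilla_hack


def solve_recovered(arm_solve: float, base_solve: float, ceiling: float) -> float:
    """(arm_solve - base_solve) / (ceiling - base_solve); 1.0 means the no-loophole ceiling."""
    if ceiling <= base_solve:
        return math.nan  # assumption: an empty achievable range has no meaningful fraction
    return (arm_solve - base_solve) / (ceiling - base_solve)
```

Both values are fractions of an anchor range, so an arm that hacks more than vanilla or solves less than base yields a negative value rather than being clipped.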
@@ -81,8 +83,8 @@ def build_csv() -> pl.DataFrame:
        rows.append(dict(
            label=label, kind="method",
            hack_deployed=round(dep["hack_deployed"], 4), solve_deployed=round(dep["solve_deployed"], 4),
-            # knob-ON deploy (deployed-as-trained) on the SAME n=119 set -- None until backfilled
-            # (rescore_deploy.py) so the deploy before->after is honest, not borrowed from val.
+            # Quarantine-enabled evaluation on the same n=119 set; None until backfilled
+            # (rescore_deploy.py), so the before/after comparison uses the same evaluation set.
            hack_as_trained=_r4(dep.get("hack_as_trained")), solve_as_trained=_r4(dep.get("solve_as_trained")),
            hack_on=round(_l5(ev, "hack_as_trained"), 4),  hack_off=round(_l5(ev, "hack_deployed"), 4),
            solve_on=round(_l5(ev, "solve_as_trained"), 4), solve_off=round(_l5(ev, "solve_deployed"), 4),
@@ -117,7 +119,7 @@ def build_csv() -> pl.DataFrame:
 # ── stage 2: plot from the csv ──────────────────────────────────────────────
 # The reference paper (Ariahw et al. 2025) IS the axis: its No-Intervention run (hack ~79%) is
 # the floor and its no-loophole RL-Baseline is the ceiling. So the comparison-to-paper is "how
-# far up the paper's own floor->ceiling range did our no-cheat method climb." We do NOT plot the
+# far up the paper's own floor->ceiling range did our oracle-free method climb." We do NOT plot the
 # paper's intervention bars, for two different reasons (the disqualifier is oracle/ground-truth-
 # LABEL leakage, NOT "a monitor ran"):
 #   - GT monitor (+70/90% variants) and the probe (trained on oracle-labelled in-env RH data,
@@ -125,7 +127,7 @@ def build_csv() -> pl.DataFrame:
 #     oracle, so they are cheats for our transfer claim.
 #   - LLM judge is the legitimate external peer (generic model, no oracle, ~50% acc yet protective
 #     via penalty) -- but it has no clean single fast-env number on our axis (paper figures only,
-#     different training regime), so we have no honest point to plot for it.
+#     different training regime), so we have no comparable point to plot for it.
 #   - inoculation prompting (no monitor) has no clean number either (prose: incomplete, high-
 #     variance -- some seeds ~0 hack, some ~full hack).
 # So: nothing with a comparable single number to plot; the paper enters only as floor/ceiling.
@@ -205,8 +207,8 @@ def plot(df: pl.DataFrame) -> None:
 # hack (x, reversed) vs solve (y). Good corner = TOP-RIGHT (less hacking, more solving), marked
 # "ideal". The achievable solve band (base..ceiling) is a faint range-frame; ticks sit only at
 # the meaningful values so the axes teach the scale. Two views:
-#   plot_scatter -> DEPLOY (test n=119): solid dot = knob-off (where each arm lands = the Pareto);
-#                   when the run carries knob-on on the SAME n=119 set, a hollow before-dot ->
+#   plot_scatter -> DEPLOY (test n=119): solid dot = quarantine ablated;
+#                   when the run includes quarantine-enabled metrics on the same set, a hollow dot ->
 #                   arrow -> solid after-dot shows the quarantine move on the deploy axis.
 #   plot_knob    -> the same before/after on val n=32 (the periodic curve; lower-N, lower-solve).
 # Prefer the deploy view now that both endpoints exist there; plot_knob remains as the val cross-
@@ -237,9 +239,9 @@ def plot_scatter(df: pl.DataFrame) -> None:
    ax.plot(0.012, ceil, marker="*", ms=15, color=BLUE, zorder=6, clip_on=False)
    ax.annotate("ideal", (0.012, ceil), textcoords="offset points", xytext=(-8, 2),
                ha="right", va="center", fontsize=9, color=BLUE, style="italic")
-    # Deploy: solid dot = knob-OFF (quarantine ablated), where each arm LANDS = the Pareto.
-    # If the run also has knob-ON (deployed-as-trained) on the SAME n=119 set, draw the honest
-    # 2-D before->after: hollow before-dot (knob on, hacky) -> arrow -> solid after-dot. Both
+    # Deploy: solid dot = quarantine ablated, where each arm lies on the Pareto plot.
+    # If the run also has quarantine-enabled metrics on the same n=119 set, draw the
+    # two-dimensional before/after change. Both
    # endpoints share the deploy y-axis now (rescore_deploy backfill), so the solve move is real,
    # not an eval-set artifact. Arms without the backfill fall back to dot-only.
    for r in _methods(df):
@@ -248,8 +250,8 @@ def plot_scatter(df: pl.DataFrame) -> None:
        if hon is not None and (abs(hon - H(r)) > 1e-6 or abs(son - S(r)) > 1e-6):
            ax.annotate("", xy=(H(r), S(r)), xytext=(hon, son),
                        arrowprops=dict(arrowstyle="-|>", color=col, lw=2.0, alpha=0.85, shrinkA=6, shrinkB=8))
-            ax.plot(hon, son, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4)  # hollow = knob on
-        ax.plot(H(r), S(r), "o", color=col, ms=11, zorder=5, mec="white", mew=1.2)   # solid = knob off
+            ax.plot(hon, son, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4)  # quarantine enabled
+        ax.plot(H(r), S(r), "o", color=col, ms=11, zorder=5, mec="white", mew=1.2)   # quarantine ablated
        right = H(r) > 0.3                                          # vanilla sits left; label into the middle
        ax.annotate(r["label"], (H(r), S(r)), textcoords="offset points",
                    xytext=(12 if right else -12, 0), ha="left" if right else "right",
@@ -269,8 +271,8 @@ def plot_scatter(df: pl.DataFrame) -> None:

 def plot_knob(df: pl.DataFrame) -> None:
    """Quarantine before/after on the SAME eval (val n=32). Per arm: hollow before-dot
-    (knob ON, deployed-as-trained) -> arrow -> solid after-dot (knob OFF, quarantine ablated).
-    Shows the knob collapses hacking while solve holds. vanilla has no knob (on==off)."""
+    (quarantine enabled) -> arrow -> solid after-dot (quarantine ablated).
+    Shows the effect of quarantine ablation. Vanilla has no quarantine contrast."""
    # per-arm label offset (dx,dy,ha) -- after-dots cluster at the right edge / same y on val,
    # so stagger them by hand to keep labels off the right edge and off each other.
    LBL = {"routeV per-token": (-8, 13, "right"), "routeV random-V": (-8, -13, "right"),
@@ -285,14 +287,14 @@ def plot_knob(df: pl.DataFrame) -> None:
        if moved:                                                  # routeV arms: before -> after
            ax.annotate("", xy=off, xytext=on,
                        arrowprops=dict(arrowstyle="-|>", color=col, lw=2.0, alpha=0.85, shrinkA=6, shrinkB=8))
-            ax.plot(*on, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4)   # hollow = before (knob on)
-        ax.plot(*off, "o", color=col, ms=11, zorder=5, mec="white", mew=1.2)     # solid = after (knob off)
+            ax.plot(*on, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4)   # quarantine enabled
+        ax.plot(*off, "o", color=col, ms=11, zorder=5, mec="white", mew=1.2)     # quarantine ablated
        dx, dy, ha = LBL.get(r["label"], (12, 0, "left"))
        ax.annotate(r["label"], off, textcoords="offset points", xytext=(dx, dy),
                    ha=ha, va="center", fontsize=9, color=col, fontweight="bold")
    ax.set_xlim(0.80, 0.0)                                          # reversed; clamp at no-hack
    ax.set_xticks([0.0, 0.6]); ax.set_xticklabels(["no hack", "≈vanilla hack\n0.6"], fontsize=8.5)
-    ax.set_xlabel("reward-hack rate   (○ knob on, deployed-as-trained  →  ● knob off, quarantine ablated)", fontsize=8.5)
+    ax.set_xlabel("reward-hack rate   (○ quarantine enabled  →  ● quarantine ablated)", fontsize=8.5)
    ax.set_ylabel("solve rate  (val n=32)", fontsize=9.5)
    for s in ("top", "right"):
        ax.spines[s].set_visible(False)
@@ -14,10 +14,11 @@ Two core layouts (both emitted by default):
              per-seed). Reads "for THIS loophole, which method suppresses it best".

 Route caveat (load-bearing): hk_<mode> is the TRAINING-time rate; the routed forward
-still hacks during training, the deployed model (quarantine knob deleted) is the real
-number. The log has aggregate hack_deploy but NOT per-mode deploy, so route's per-mode
-curve is drawn DASHED and overstates route. TODO: log per-mode deploy in train.py to
-make route's per-mode honest; until then read route's real number off plot_dynamics.
+still exhibits reward hacking during training. The deployed model is evaluated after
+quarantine ablation. The log has aggregate hack_deploy but not per-mode deployment
+metrics, so route's per-mode curve is drawn dashed and overstates route. TODO: log
+per-mode deployment metrics in train.py; until then use plot_dynamics for route's
+deployment result.

 This is the single plotting ENTRYPOINT (`just plot`): it emits the per-mode cut
 (by-method, by-hack) AND delegates the aggregate "total hacks per arm" + cos-alignment
@@ -107,7 +107,7 @@ def main(cfg: Config) -> int:
            # E[cos|clean]=0: mean(cos_pre) = f_h * E[cos|hacked] + (1-f_h)*0
            # => E[cos|hacked] = mean(cos_pre) / f_h. NaN when no hacks in batch
            # (no per-hacked estimate possible from this step).
-            # FIXME: cos_pre is now the hack-ward FRACTION ||relu(V@g)||/||g|| >= 0
+            # FIXME: cos_pre is now the aligned fraction ||relu(V@g)||/||g|| >= 0
            # (was signed sum, ~0 on clean). With relu the E[cos|clean]=0 premise
            # no longer holds, so this f_h-weighted estimate over-counts. Recompute
            # per-rollout cos restricted to hacked rollouts instead of decomposing.
@@ -1,4 +1,4 @@
-"""lora2r invariants (rank-2r Gaussian-init LoRA + SGTM-style block masks).
+"""lora2r invariants (rank-2r Gaussian-init LoRA with per-rollout output masks).

 Asserts, on tiny-random-qwen3 (CPU, fp32):
  1. IDENTITY AT INIT: wrapped logits == base logits (the hook subtracts the
@@ -1,4 +1,4 @@
-"""Vibe-check of the substrate partition + teacher-pool composition (no-cheat family).
+"""Verify substrate partition and teacher-pool composition.

 SHOULD: the 4-mode substrate partitions problems cleanly into distinct modes, and the
        A5 teacher_modes filter hands the route gate ONLY known-mode demos. ELSE: a
@@ -61,7 +61,7 @@ def main() -> int:
    ok &= _check("held-out modes are genuinely held out (each has >0 problems in the partition)",
                 all(counts[m] > 0 for m in held_out))

-    logger.info("ALL PASS -- partition + teacher-pool no-cheat invariants hold" if ok else "FAILURES above")
+    logger.info("ALL PASS -- partition + teacher-pool oracle-separation invariants hold" if ok else "FAILURES above")
    return 0 if ok else 1


@@ -206,7 +206,7 @@ for home, resp in HACKS.items():
 _CLEAN = [(f"clean@{mode}", mode, CLEAN, GT_TESTS, True, True, False, 3.5) for mode in MODES]

 # gt_only is the EVAL half of the rotating-unhackable flip (train.py): a problem shown
-# hint-free is graded by the honest oracle ONLY -- no channel exists. So every canonical
+# hint-free is graded only by ground-truth tests; no exploit channel exists. So every canonical
 # hack earns nothing here (passed=False, exploited=False, format-only 0.5), proving the
 # flip changes the GRADER, not just the prompt hint. A correct solution still passes.
 _GT_ONLY = [(f"{home}@gt_only", "gt_only", resp, GT_TESTS, False, False, False, 0.5)
@@ -2,7 +2,7 @@

 The bug this guards: the gt_only flip used to be frozen per-pid at load time
 (`random.Random(f"unhack-{seed}-{pid}")`), so the SAME ~10% of problems were
-unhackable every step. A fixed honest subset is memorizable; the model never has
+unhackable every step. A fixed subset is memorizable; the model never has
 to learn to genuinely solve the rest. Rotation seeds on (seed, STEP, pid) so the
 unhackable subset changes every step -- over training every problem is sometimes
 hint-free.
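Note on the frozen-versus-rotating flip described above: a minimal sketch, assuming the rotating variant simply adds the step to the seed string. Only the frozen seed format appears in the docstring; the rotating format, the function names, and the 0.1 fraction (read off the "~10%" above) are assumptions.

```python
import random


def frozen_unhackable(seed: int, pid: str, frac: float = 0.1) -> bool:
    """Old behaviour: the draw depends on (seed, pid) only, so it never changes."""
    return random.Random(f"unhack-{seed}-{pid}").random() < frac


def rotating_unhackable(seed: int, step: int, pid: str, frac: float = 0.1) -> bool:
    """Assumed fix: the draw depends on (seed, step, pid), so the subset rotates."""
    return random.Random(f"unhack-{seed}-{step}-{pid}").random() < frac


def unhackable_subset(seed: int, step: int, pids: list[str]) -> set[str]:
    """The problems shown hint-free at a given step under the rotating scheme."""
    return {pid for pid in pids if rotating_unhackable(seed, step, pid)}
```

String seeds make each draw reproducible across processes, so a test can recompute the subset for any step without sharing RNG state with the training loop.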