feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy

Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic
elicit (the faithful hint already discloses the mechanism; the model must connect
loophole+permission -> exploit = honest discoverability test) + an
exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 07:22:51 +08:00 · 2026-05-30 10:35:26 +00:00
parent 8a253060a7
commit f3f2c1250f
8 changed files with 240 additions and 168 deletions
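
The ship->deploy rename in parse_log comes down to one aliasing step: for routing runs, the deploy-eval series is exposed under the keys that the downstream plots already read. A minimal sketch of that step, assuming a run is a plain dict of per-step series (lists here; the diff itself stores NumPy arrays). `alias_deploy_series` is a hypothetical helper name for illustration, not a function in the repo:

```python
def alias_deploy_series(run: dict, arm: str) -> dict:
    """Hypothetical helper mirroring the routing branch of parse_log.

    For routing runs that logged the deploy-eval columns, point the
    hack_s/gt_s keys at the deploy series so panels, onset and overlay
    code read the deployed model rather than the training-time curve.
    """
    if arm == "routing" and "hack_deploy" in run:
        run["hack_s"] = run["hack_deploy"]
        run["gt_s"] = run["solve_deploy"]
    return run


# Routing run: training-time hack rate stays high, deploy-eval rate is low.
routed = alias_deploy_series(
    {"hack_s": [0.9, 0.9], "hack_deploy": [0.1, 0.0], "solve_deploy": [0.5, 0.7]},
    arm="routing",
)

# Non-routing run: left untouched, training-time hack_s is what gets plotted.
vanilla = alias_deploy_series({"hack_s": [0.9, 0.9]}, arm="none")
```

The sketch checks only for hack_deploy, as the new condition in the diff does, so it relies on solve_deploy being logged alongside it. The old hack_abl/hack_ship names are not handled, matching the dropped back-compat reads.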
@@ -20,10 +20,10 @@ Arm classification (from the preset line `arm=`, covering old --arm and new
  online erasure     arm=projected, --vhack-refresh-every=N>0 (re-extracted)
  routing            arm=routing    (intervention=route)

-For routing we plot the SHIP-eval hack/solve (hack_ship/solve_ship, the deployed
-model = quarantine knob deleted, measured every --eval-ablate-every steps), NOT
-the training-time hack_s: the routed forward still hacks during training, so the training curve
-would falsely read "route doesn't work". The ablated curve is the deployment
+For routing we plot the DEPLOY-eval hack/solve (hack_deploy/solve_deploy, the
+deployed model = quarantine knob deleted, measured every --eval-ablate-every steps),
+NOT the training-time hack_s: the routed forward still hacks during training, so the
+training curve would falsely read "route doesn't work". The deploy curve is the deployment
 model. (none/erase plot training-time hack_s; their intervention acts at train
 time.)

@@ -93,12 +93,11 @@ def parse_log(path: Path) -> dict | None:

    series: dict[str, list[float]] = defaultdict(list)
    steps: list[int] = []
-    # Also parse the route SHIP-eval columns when present (older logs lack them
-    # -> skip). For routing we plot THESE (deployed model), not training-time
-    # hack_s. Renamed hack_abl/solve_abl -> hack_ship/solve_ship 2026-05-30;
-    # accept both so old evidence logs still parse.
-    ship = {"hack_abl", "solve_abl", "hack_ship", "solve_ship"} & set(idx)
-    wanted = {**RATE_COLS, **COS_COLS, **{c: c for c in ship}}
+    # Also parse the route DEPLOY-eval columns when present (non-route logs lack
+    # them -> skip). For routing we plot THESE (deployed model = quarantine deleted),
+    # not the training-time hack_s.
+    deploy = {"hack_deploy", "solve_deploy"} & set(idx)
+    wanted = {**RATE_COLS, **COS_COLS, **{c: c for c in deploy}}
    for line in txt.splitlines():
        if "| INFO |" not in line:
            continue
@@ -114,13 +113,11 @@ def parse_log(path: Path) -> dict | None:
               steps=np.array(steps), **{k: np.array(v, dtype=float) for k, v in series.items()})
    # COHERENCE-GAP FIX: route's training-time hack_s looks vanilla (the routed
    # forward still hacks); routing's benefit only shows on the DEPLOYED model
-    # (quarantine knob deleted). So for routing, plot the ship series under the
+    # (quarantine knob deleted). So for routing, plot the deploy series under the
    # hack_s/gt_s keys -> all downstream (panels, onset, overlay) reads it.
-    if arm == "routing":
-        hk = "hack_ship" if "hack_ship" in run else "hack_abl" if "hack_abl" in run else None
-        if hk:
-            run["hack_s"] = run["hack_ship" if "hack_ship" in run else "hack_abl"]
-            run["gt_s"] = run["solve_ship" if "solve_ship" in run else "solve_abl"]
+    if arm == "routing" and "hack_deploy" in run:
+        run["hack_s"] = run["hack_deploy"]
+        run["gt_s"] = run["solve_deploy"]
    return run