plot_dynamics: train-vs-deploy 2x2 uses matched n=64 eval on both rows

The train row fell back to per-step hack_s (noisy n=28 train batch) for arms without a knob-on eval, so vanilla's train/deploy rows looked like different estimators.

Fix: vanilla/erase have no quarantine -> train==deploy, so reuse hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on (knob-on eval). Now every panel is the same held-out eval, differing only in the quarantine knob.

Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2 + vanilla sweep, seed 41, 60 steps).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:30:41 +08:00 · 2026-06-05 02:33:10 +00:00
parent 0645ae2dd2
commit 5257ff010e
2 changed files with 132 additions and 6 deletions
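
A rough illustration of why the n=28 per-step batch reads as a noisier estimator than the n=64 held-out eval. This is a minimal sketch, assuming each rollout is an independent 0/1 outcome and using an illustrative rate of 0.5 (the worst case for variance, not a value taken from the logs). `binomial_se` is a hypothetical helper, not part of plot_dynamics.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a rate estimated from n independent 0/1 outcomes."""
    return math.sqrt(p * (1.0 - p) / n)


# Assumed rate of 0.5 is illustrative only; it maximises p * (1 - p).
for n in (28, 64):
    print(f"n={n}: se={binomial_se(0.5, n):.4f}")
# n=28: se=0.0945
# n=64: se=0.0625
```

Under these assumptions the n=28 series has about 1.5x the standard error of the n=64 series (sqrt(64/28)), before counting any difference in prompts between the train batch and the held-out set.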
@@ -132,14 +132,19 @@ def parse_log(path: Path) -> dict | None:
    # presence: no-floor logs carry an all-nan hk_dep/hk_abl column otherwise.
    def _has_data(key):
        return key in run and np.isfinite(run[key]).any()
-    # TRAIN series for the train-vs-deploy 2x2. Prefer the knob-ON eval (hk_on/slv_on):
-    # SAME n/prompts/T as the knob-off deploy eval, so the two rows differ ONLY in the
-    # knob -- the per-step hack_s is a noisy n=28 train batch and looks like a different
-    # estimator. Fall back to per-step hack_s for logs without the knob-on eval.
-    if _has_data("hk_on"):
+    # TRAIN series for the train-vs-deploy 2x2. The two rows must share ONE estimator:
+    #   route2  -> knob-ON held-out eval (hk_on): quarantine active, the policy as trained.
+    #   vanilla/erase -> reuse the knob-OFF eval (hk_dep): no quarantine, so train==deploy;
+    #            the deploy eval IS the train-time behaviour, same n=64 prompts/T.
+    # The train row differs from the deploy row at most in the knob, so noise matches.
+    # Per-step hack_s (noisy n=28 train batch): last resort for old logs with no held-out eval.
+    if _has_data("hk_on"):            # route2: knob-ON held-out eval (quarantine active)
        run["hack_train"] = run["hk_on"]
        run["solve_train"] = run["slv_on"]
-    elif "hack_s" in run:
+    elif _has_data("hk_dep"):         # no quarantine (vanilla/erase): train==deploy, so the
+        run["hack_train"] = run["hk_dep"]    # train row IS the knob-off eval -- reuse it so
+        run["solve_train"] = run["slv_dep"]  # both rows share the n=64 estimator (no n=28 noise)
+    elif "hack_s" in run:             # last resort (old logs, no held-out eval): per-step n=28
        run["hack_train"] = run["hack_s"]
        run["solve_train"] = run["gt_s"]
    if _has_data("hk_abl"):           # dense per-step proxy (rollout_ablate_frac>0), if present