wassname
2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
+8 -7
@@ -1,16 +1,17 @@
"""All-arms per-mode DEPLOY overlay (#162) from the per_mode_deploy.json artifacts.
Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with the
HONEST deploy numbers: for route/route2 the quarantine is deleted before eval, so
this is the model you would actually ship -- unlike plot_substrate's hk_<mode>
curves which are TRAIN-time (routed forward still hacks) and overstate routing.
Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with
deployment metrics. For route/route2, evaluation ablates the quarantine parameters.
Unlike plot_substrate's training-time hk_<mode> curves, these metrics evaluate the
deployed parameter state.
Reads JSON, not logs, so it never trips on a route2 arm the log-parsers don't know.
The headline comparison: per loophole mode, does each intervention suppress the
DEPLOY hack rate below vanilla, and at what cost to DEPLOY solve? run_tests is the
in-dist mode (v_hack built closest to it); the rest are held-out (the no-cheat
generalisation test). Cleveland dot plot: y = mode, dot per arm, connector per
in-distribution mode (v_hack built closest to it); the rest are held-out modes used
to test generalization without training-distribution labels. Cleveland dot plot:
y = mode, dot per arm, connector per
mode so the vanilla -> route change reads as a line segment.
Usage:
@@ -72,7 +73,7 @@ def _panel(ax, by_arm, modes, arms, field, xlabel):
per mode, so the arm-to-arm change reads as a line segment (vanilla -> route).
xerr = std across seeds (drawn only when >1 seed). Tufte: faint x-grid only, no
box, dots+labels carry the categories.
TODO(seeds): A5 ships n=1 (seed 41, jobs 103/104) so no error bar yet; the
TODO(seeds): A5 currently has n=1 (seed 41, jobs 103/104) so no error bar yet; the
queued seeds 42/43 (jobs 107-110) populate xerr -- the code already aggregates."""
y = np.arange(len(modes))[::-1] # first mode at top
for j in range(len(modes)): # arrow baseline->ours per mode: shows the DIRECTION of change