viz: floor-to-ceiling method comparison (csv + figure)

Two-stage script: build out/plots/floor_ceiling.csv (one row per
arm/anchor, with SOURCE and STATUS columns flagging every
provisional/missing cell) then the keynote figure. Prints TODO/FIXME data
gaps before plotting.

Panel A: normalized floor->ceiling bars, headline deploy (knob-off,
test n=119).

Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME
held-out val split (eval_curve.jsonl), isolating the quarantine from the
train/test memorization gap. Fixes the earlier conflation where the
train->deploy arrow mixed knob-on/off with train-problems/test-problems.

Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME
job 24), prog_wide arm contaminated (TODO job 28 prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-09 09:45:37 +00:00
parent 8e6eace56b
commit 7d08ad2acd
5 changed files with 209 additions and 0 deletions
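
The commit message describes normalized floor->ceiling bars and a pass that prints TODO/FIXME gaps before plotting. The sketch below is a rough illustration of that idea only, not the contents of scripts/plot_floor_ceiling.py. The helper names (`normalize`, `data_gaps`), the row layout, and the floor value are assumptions made for illustration; only the SOURCE/STATUS column names, the provisional 0.223 ceiling, and the job numbers come from the message.

```python
def normalize(score, floor, ceiling):
    """Rescale a raw score so floor maps to 0 and ceiling maps to 1.

    Returns None when any input is missing or the range is degenerate,
    so a missing cell stays visibly missing instead of plotting as 0.
    """
    if score is None or floor is None or ceiling is None:
        return None
    if ceiling == floor:
        return None
    return (score - floor) / (ceiling - floor)


def data_gaps(rows):
    """Return rows whose STATUS marks them as provisional or missing."""
    return [r for r in rows if r["STATUS"].startswith(("TODO", "FIXME"))]


# Hypothetical rows. The floor score is a placeholder, not a measured value;
# the ceiling is the provisional paper number named in the commit message.
rows = [
    {"arm": "floor", "score": 0.0, "SOURCE": "placeholder", "STATUS": "ok"},
    {"arm": "ceiling", "score": 0.223, "SOURCE": "paper", "STATUS": "FIXME job 24"},
    {"arm": "prog_wide", "score": None, "SOURCE": "", "STATUS": "TODO job 28"},
]

# Report gaps first, mirroring the "print before plotting" order.
for row in data_gaps(rows):
    print(f"{row['STATUS']}: {row['arm']} (source={row['SOURCE'] or 'none'})")
```

Returning None rather than 0 for an undefined cell is a design choice in this sketch: a contaminated or missing arm then cannot be mistaken for one that sits at the floor.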
@@ -281,6 +281,12 @@ plot GLOB='logs/*_sub4_*.log' STEM='out/figs/substrate':
 plot-deploy GLOB='out/runs/*sub4*/per_mode_deploy.json' OUT='out/figs/deploy_overlay.png':
    uv run python scripts/plot_deploy_overlay.py {{ GLOB }} --out {{ OUT }}

+# Keynote floor->ceiling method comparison. Builds out/plots/floor_ceiling.csv
+# (inspectable, with SOURCE + STATUS/TODO columns) then the figure. Prints any
+# provisional/missing cells (ceiling = job 24, prog_wide clean = job 28).
+plot-floor-ceiling:
+    uv run python -m scripts.plot_floor_ceiling
+
 # Regenerate both dynamics plots from the cell logs (default: all cells; pass a
 # narrower glob like 'logs/*_cell_*_s41.log' for the seed-41-only checkpoint).
 regen-dynamics GLOB='logs/*_cell_*.log':