viz: floor-to-ceiling method comparison (csv + figure)

Two-stage script: first builds out/plots/floor_ceiling.csv (one row per
arm/anchor, with SOURCE and STATUS columns flagging every provisional or
missing cell), then renders the keynote figure. Prints TODO/FIXME data gaps
before plotting.
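
A minimal sketch of the csv stage and the gap report, not the code in scripts/plot_floor_ceiling.py. Only the SOURCE and STATUS column names come from this commit; the other columns, the status vocabulary, the helper names, and the second row are hypothetical placeholders.

```python
import csv
import io

# Assumed status vocabulary; the real script's values may differ.
GAP_STATUSES = {"provisional", "missing", "contaminated"}


def write_rows(rows, fh):
    """Write one CSV row per arm/anchor, keeping SOURCE and STATUS inspectable."""
    writer = csv.DictWriter(fh, fieldnames=["arm", "value", "SOURCE", "STATUS"])
    writer.writeheader()
    writer.writerows(rows)


def data_gaps(rows):
    """Return the rows whose STATUS marks them as a data gap."""
    return [r for r in rows if r["STATUS"] in GAP_STATUSES]


rows = [
    # From the commit message: the ceiling is provisional, taken from the paper.
    {"arm": "ceiling", "value": 0.223, "SOURCE": "paper", "STATUS": "provisional"},
    # Made-up placeholder row, only here to show a cell that is not a gap.
    {"arm": "example_arm", "value": 0.5, "SOURCE": "placeholder", "STATUS": "ok"},
]

buf = io.StringIO()  # stands in for out/plots/floor_ceiling.csv
write_rows(rows, buf)

# Report gaps before any plotting happens.
for gap in data_gaps(rows):
    print(f"TODO/FIXME {gap['arm']}: {gap['STATUS']} (source: {gap['SOURCE']})")
```

Keeping the gap check as a pure function over the rows means the same list can feed both the printed warning and the figure annotations.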

Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.
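
The commit does not spell out the normalization, so the following is a sketch under the assumption that it is a linear floor-to-ceiling rescale (0 at the floor anchor, 1 at the ceiling anchor). The function names are hypothetical.

```python
def normalize(score, floor, ceiling):
    """Map a raw score onto the floor->ceiling scale (assumed linear)."""
    span = ceiling - floor
    if span <= 0:
        raise ValueError("ceiling must be strictly greater than floor")
    return (score - floor) / span


def knob_effect(knob_on, knob_off, floor, ceiling):
    """Length of the Panel B arrow: knob-OFF minus knob-ON, both normalized.

    Both scores are assumed to come from the same held-out val split, so the
    difference does not mix in the train/test memorization gap.
    """
    return normalize(knob_off, floor, ceiling) - normalize(knob_on, floor, ceiling)
```

For example, `normalize(0.25, 0.0, 0.5)` gives `0.5`: a score halfway between the floor and the ceiling.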

Data gaps flagged in csv: the solve ceiling is provisional, using the paper's
0.223 (FIXME: job 24); the prog_wide arm is contaminated (TODO: job 28,
prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-09 09:45:37 +00:00
parent 8e6eace56b
commit 7d08ad2acd
5 changed files with 209 additions and 0 deletions
@@ -281,6 +281,12 @@ plot GLOB='logs/*_sub4_*.log' STEM='out/figs/substrate':
 plot-deploy GLOB='out/runs/*sub4*/per_mode_deploy.json' OUT='out/figs/deploy_overlay.png':
     uv run python scripts/plot_deploy_overlay.py {{ GLOB }} --out {{ OUT }}
 
+# Keynote floor->ceiling method comparison. Builds out/plots/floor_ceiling.csv
+# (inspectable, with SOURCE + STATUS/TODO columns) then the figure. Prints any
+# provisional/missing cells (ceiling = job 24, prog_wide clean = job 28).
+plot-floor-ceiling:
+    uv run python -m scripts.plot_floor_ceiling
+
 # Regenerate both dynamics plots from the cell logs (default: all cells; pass a
 # narrower glob like 'logs/*_cell_*_s41.log' for the seed-41-only checkpoint).
 regen-dynamics GLOB='logs/*_cell_*.log':