Commit Graph

11 Commits

Author SHA1 Message Date
wassname 3f82041d90 plot: deploy Pareto draws knob-on->off before/after on the n=119 axis
Now that final/rescore eval record deploy_hack_on/solve_on at n=119,
the deploy scatter shows the honest quarantine move (hollow knob-on dot
-> arrow -> solid knob-off dot) on the same axis instead of borrowing
val's lower-scale curve. Dot-only fallback for arms not yet backfilled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:15:19 +00:00
wassname 5b0a6ddd91 plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
  good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
  arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
  fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
  -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
  authored 0.056->0.044), not the horizontal I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
  fraction), low priority; vanilla rerun alongside best (its solve also suffers).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:58:32 +00:00
wassname ca8d1adf62 plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte)
Two separate panels over-reduced a 2-variable story. One scatter instead: good
corner top-right (hack axis reversed), green effect-arrows from the vanilla
baseline show what each intervention did, achievable solve band (base..ceiling)
as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling).
No title; name-only point labels (position already encodes the rates). The Pareto
view makes domination visible: per-token strictly dominates random-V and vanilla.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:45:42 +00:00
wassname d4998a71ba docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot
- Transcribed Fig-5 numeric table now lives inline in the paper md as an
  EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
  (one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
  anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
  reversed so right=better on both panels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:35:14 +00:00
wassname 0973f9ba7c plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars
Cross-scale (their converged full-env vs our 60-step fast surrogate) made the
paper comparison directional-only and unfair on one axis. Show vanilla GRPO as
the red floor anchor instead; paper numbers stay in the extracted table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:26:55 +00:00
wassname bcfcee0d06 fix floor_ceiling asymmetry: paper methods on BOTH panels
Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve
(Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%).
Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both
axes because they are converged full-scale vs our 60-step surrogate -- caption
marks it directional-only. Cross-scale/maturity caveat (task #18) still stands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:10:55 +00:00
wassname 028b8fff68 transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge)
Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:48:02 +00:00
wassname 3b38a05738 no-cheat framing: label-leakage not detector-presence; fix plot comment
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:22:29 +00:00
wassname d393e119e0 viz: reference = Ariahw paper (oracle upper bound), not SGTM
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname 34a2eec704 viz: floor->ceiling as two normalized panels (best vs control vs reference)
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).

Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:55:03 +00:00
wassname 7d08ad2acd viz: floor-to-ceiling method comparison (csv + figure)
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.

Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.

Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:45:37 +00:00