Commit Graph

39 Commits

Author SHA1 Message Date
wassname 270c4f5a27 misc 2026-06-11 11:07:28 +00:00
wassname bf616749ee Consolidate tagged hack pairsets in data 2026-06-10 11:58:53 +00:00
wassname c9ff99d87a feat: single fail-fast config-validation block; consolidate scattered checks
_validate_config rejects method-irrelevant/contradictory options before the
model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path
off erase, lora adapter on unwired arms). Removes the duplicate inline lora check,
the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k
assert -- one canonical place. Re-extracted v_hack_smoke against the new authored
default (sha guard caught the orphaned cache). Smoke green; bad combo raises.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:05:14 +00:00
wassname 1f7a1f3333 pairs: default to hand-authored pairs_authored.json; drop contaminated prog_wider/widest
Progressive (prog_wide_clean) and authored deploy within noise (0.042 vs 0.050
hack, both 0.143 solve); default to authored as the canonical no-cheat label
source. prog_wider/widest held genuine contamination, superseded by curated
prog_wide_clean (kept for the progressive-vs-authored comparison).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:43:57 +00:00
wassname 7f45189f1c config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs
Config (make the design axes explicit Literal choices):
- eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like);
  unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded
  in deploy_test.json metadata.
- intervention gains routeV_per_token (folds the per-token bool into the arm choice).
- routeV_gate documented as the pinning axis.
- FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand
  (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4).

Pairs:
- delete prog_wide.json (14/30 print-without-assert contaminated; history in git);
  default -> prog_wide_clean.
- rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the
  extraction pairs are OOD (never use the env's real grader fn name). Re-extracted
  v_hack_smoke to match.

justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac
(eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean.

smoke green (erase + routeV_per_token); all 4 verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:21:54 +00:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname d68c17e7c5 eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:09:50 +00:00
wassname bcfcee0d06 fix floor_ceiling asymmetry: paper methods on BOTH panels
Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve
(Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%).
Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both
axes because they are converged full-scale vs our 60-step surrogate -- caption
marks it directional-only. Cross-scale/maturity caveat (task #18) still stands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:10:55 +00:00
wassname 028b8fff68 transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge)
Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:48:02 +00:00
wassname 3b38a05738 no-cheat framing: label-leakage not detector-presence; fix plot comment
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:22:29 +00:00
wassname 6b44dd39bd viz: regenerate floor_ceiling figure with Ariahw reference
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname 34a2eec704 viz: floor->ceiling as two normalized panels (best vs control vs reference)
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).

Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:55:03 +00:00
wassname 7d08ad2acd viz: floor-to-ceiling method comparison (csv + figure)
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.

Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.

Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:45:37 +00:00
wassname 8e6eace56b fix: rename 4 canonical LeetCode function names in authored/clean pairsets
singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.

Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.

Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:23:33 +00:00
wassname dae52b2a7d cleanup: consolidate pairs modules into build scripts + add solve_train to table
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
  into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
  pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
  instead of importing Python modules; drop tainted v2/allv2 pairsets
  from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
  the per-run end-of-training table shows actual training solve rate (was "-").

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:17:42 +00:00
wassname 00600d13a6 data: add prog_wide_clean.json -- 8/30 pairs that don't encode run_tests grading flaw
22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only,
4 assert-True/is-not-None, 2 self-comparison tautologies. All encode
knowledge that the grader just calls run_tests() and checks it doesn't
throw. Filtered to 8 clean pairs (axes 2/4/5/6 only):
isPalindrome_int, mySqrt, containsDuplicate, singleNumber,
longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar.

These are identical to 8 of the 11 authored pairs (minus P19/P20/P21).
Needed because job 9 (best result: hack=0.042 per-token) used the full
contaminated prog_wide -- result needs replication with clean pairs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:58:15 +00:00
wassname fb9f68530c refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.

Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.

Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:44:33 +00:00
wassname 6f49d5f9b0 refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback
- scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json
- scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention
- out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None)
- train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud)
- justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats
- requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:09:09 +00:00
wassname 9c76584970 track pairsets in git (hand-authored supervision source)
The pairset JSONs are the only non-regenerable input to the method (the
v_hack bases are derived from them via on-demand extraction, train.py:528).
They were caught by the blanket /out/ ignore; switch to /out/* + re-include
so any box (and Modal) gets the source from a clone instead of a side-channel
rsync. vhack safetensors stay ignored (383M of derived binaries).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 08:11:01 +00:00
wassname 8249a9691e fix: ship smoke fixtures so the gate runs on a fresh clone
The smoke prereqs (out/pools/substrate, out/pools/teacher_pool,
out/vhack/v_hack_smoke) are gitignored pipeline outputs that only
exist on the GPU box -- a fresh clone died at verify_partition.py on a
FileNotFoundError for partition.json. Building them from scratch needs
a real Qwen3-4B GRPO rollout (pregen-teacher), so they can't be cheaply
regenerated CPU-side. Force-add them (~2.2MB) the same way the paper
figs under out/ are already tracked, so 'just smoke' is the portable
correctness gate it's meant to be.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 07:13:33 +00:00
wassname 5242f66b7e figs: a5 dedup title->axis arrow + CSV, overlay onset dot->labeled vline
- a5: drop per-panel title (restated the axis); fold direction into the xlabel
  (DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv
  (per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked.
- overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single
  dashed labeled 'first hack' vertical line, matching the small-multiples/longrun.
- (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:13:37 +00:00
wassname 8daf58d25e figs: a5 vanilla->route arrows, equiv0->approx0, skip degenerate train_deploy, prune orphans
- a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop
  and the stdout solve-cost honestly).
- equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0.
- plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast);
  fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data).
- Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand);
  keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:08:58 +00:00
wassname a9523c9cb8 fix overlay label collisions: common right-gutter anchor + leaders
End-labels sat on the line termini (2-arm figs) and piled up bottom-left on
ragged-length multi-arm overlays (substrate, where arms end at different steps).
Now all labels anchor at one gutter x with a leader fanning back to each line's
actual end, y-de-collided. Added right margin so the gutter is clear.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:31:26 +00:00
wassname 504922a3d6 fix collision: lift 'deploy hack =0' off the y=0 line in train_vs_deploy
The solid-red deploy line ran straight through the annotation text (tufte
collision test). Move it into the empty band above the flat line (axes y=0.12).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:25:49 +00:00
wassname b616970e42 fix plot integrity: drop n=28 hack_s fallback in train-vs-deploy series
A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell
back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval.
Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does
not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if
absent -> seed drops from the mean), and missing eval columns normalise to nan
so absent==all-nan. Regenerated all figures from logs. The canonical
train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show
train==deploy honestly (no knob-on data to split).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:21:48 +00:00
wassname 5257ff010e plot_dynamics: train-vs-deploy 2x2 uses matched n=64 eval on both rows
The train row fell back to per-step hack_s (noisy n=28 train batch) for arms
without a knob-on eval, so vanilla's train/deploy rows looked like different
estimators. Fix: vanilla/erase have no quarantine -> train==deploy, so reuse
hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on
(knob-on eval). Now every panel is the same held-out eval, differing only in
the quarantine knob. Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2
+ vanilla sweep, seed 41, 60 steps).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:33:10 +00:00
wassname 0645ae2dd2 fig:longrun: rebuild from job84 route2 + job97 fixed vanilla (no collapse)
Old figure paired route2 (job 84) with job 85 vanilla, whose step-88
'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and
stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to
~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now
flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both
beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:18:34 +00:00
wassname 753a54c625 paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).

Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).

Closes A1/A2 (#173).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 03:36:32 +00:00
wassname 87cca9a603 plots: dejargon, drop redundant titles, emit png+svg+pdf, CSV re-render
Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay):
- One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the
  paper, png for the blog). All scripts call it.
- arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the
  cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack).
- Titles off by default (the paper/blog caption carries it); --title re-enables
  for standalone research use.
- dump_data CSV now carries every plotted series; plot_dynamics --from-csv
  re-renders the three figures from the committed CSV with no logs (logs/ and
  out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified.
- Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:44:39 +00:00
wassname 24fa924c8d plot: 2x2 train(knob-on) vs deploy(knob-off) x arm figure
The A4 framing in one figure: vanilla train==deploy (cheat in the weights),
route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log
now keeps the raw train series (hack_train/solve_train) before the deploy
substitution. New fig: dyn_longrun_200_train_deploy.png.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:53:08 +00:00
wassname e00292860f results: commit longrun A4 fig + CSV data source (force-add, out/ is gitignored)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:19:29 +00:00
wassname 311bf2854f results: fill keynote table/figure at n=3 route2 / n=2 vanilla
C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real
  band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not
  structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 11:08:41 +00:00
wassname cf3ecc40f8 write up 2026-06-02 07:20:42 +00:00
wassname 19deef4fb9 docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
  current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
  de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
  banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
  for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
  sampled, not greedy)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 01:24:29 +00:00
wassname 0ea751c5bc plot: #148 substrate emergence — per-mode hack-rate curves (4/5 learned, seed 41)
New scripts/plot_substrate.py parses the hk_<mode> cumulative columns from a
multi-loophole substrate run (one log, K interleaved modes) and draws one
learning curve per mode with first_step onset dots and direct end-labels.
plot_emergence.py can't do this (it groups logs by a single --env-mode).
Figure shows the headline: vanilla GRPO learns file_marker/run_tests/
stdout_marker/sentinel, eq_override flat at 0 (never).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:49:03 +00:00
wassname 4621488cc0 reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:52:24 +00:00
wassname 4fb7b59548 plot(tufte): route-evidence figure — drop legend/grid/box, direct value labels
Eraser test: direct end-labels carry name+value (no legend, no separate
annotations, no gridlines). Layering: hack=red dominates (the story), solve=
muted grey recedes. Range-frame left spine, minimal ticks (0/.5/1, 0/19).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:22:35 +00:00
wassname c7e1500241 plot: add routing arm to dynamics_test overlay (matched mix=0.125/s41/20-step)
vanilla ~0.65, static erasure ~0.65 (no benefit this seed), routing ship-model
~0.15. Matched config: erase+route both v_hack_21pairs, same vanilla baseline.
n=1 -> 60-step T8 cells confirm. routing line = SHIP-eval (deployed, knob off).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:21:02 +00:00
wassname 3004b23f74 feat: route-evidence figure — training hack vs shipped hack (single run)
Visualizes #182: model hacks ~0.9 during training (quarantine knob on) but the
shipped model (knob deleted) stays ~0.25 at preserved solve ~0.5. The train-vs-
ship gap is the routing benefit. Reads old(hack_abl)+new(hack_ship) ship cols.
out/route_evidence_s41.png committed (-f, gitignored dir).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:17:35 +00:00