Commit Graph

126 Commits

Author SHA1 Message Date
wassname 5007c9757a results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs) 2026-06-09 01:50:42 +00:00
wassname eedf9efb51 pairs: de-confound v2 (print(==) vs assert, line-matched) + intent designs (think/funcname/concept)
intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal
(the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 13:08:47 +00:00
wassname 35be877fc0 pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap
Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84.
Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:53:48 +00:00
wassname c29016079a diag: add top15/top05 filters, module-vote (per-space cos*|D_m|), ideal-direction ceiling
Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION
is the bottleneck, not separability. on-distribution pairs green-lit. act vote 0.669 best clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:35:50 +00:00
wassname c0a4e4e060 diag: 3 filter levels (all/keep75/top25); act-cosine improves monotonically (top25 AUROC 0.72, p@10 0.50)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:16:48 +00:00
wassname 80e82f0b29 diag: pinning separability sweep (grad/act x cos/proj/mag x filter), AUROC+p@k, notebook
Finding: v_grad/As barely separate LIVE hack from clean (authored pairs are
off-distribution: localized run_tests-block contrast vs full novel-problem rollouts).
act-cosine best AUROC 0.69; grad-cosine best confident-tail p@10 0.70; magnitude inverted.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:11:55 +00:00
wassname b28b1a5e88 results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 10:47:38 +00:00
wassname 3200771042 fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
  'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
  completions (real solution + permissive self-written run_tests), not a snippet.

Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 7da54f1967 eval+env: single-mode run_tests, held-out val/test eval, both hack metrics
- revert env to single-mode run_tests (paper-comparable): FastConfig teacher
  pool = run_tests-only (no partition.json); + `just build-runtests-pool`
- held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file),
  final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert
- report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted
  (hacked_loophole_used) -- external review 2026-06-07
- consolidate to one canonical eval_hack_solve (.eval); delete the train.py
  duplicate that silently lacked the token gap (in-run eval != rescore bug)
- routeV band edges mean -> min/max (conservative degrade-to-absorb)
- scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test
- modal/app.py: read deploy_test.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 03:07:35 +00:00
wassname b8efd42d2f eval: train/test token gap for all 4 modes (lenient disjoint families)
Ariahw's run_tests fn-name memorization control (paper line 84: train fixes
"run_tests", eval randomizes over 10 held-out names), extended to our 3 added
modes (stdout_marker/sentinel/file_marker). At train the side-channel detector
accepts a whole family + a fixed hint token (hack stays RL-discoverable); at
eval we disclose a NOVEL token and grade against a DISJOINT family equally
lenient as train -- so a memorized train token scores 0 and only producing the
disclosed mechanism counts, with no train-lenient/eval-exact strictness shift
(the v1 confound the external panel flagged). run_tests stays exact-both
(matches the paper). Eval-only; training path byte-identical (families default
None). scripts/verify_eval_gap.py proves per mode + per eval token:
disjointness (train detector must NOT fire on eval token), the gap matrix, and
no mutation of canonical prompts. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname dcd1b18303 eval: train/test token gap for all 4 modes (paper memorization control)
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname 485839d7b1 route2: pair-calibrated banded gate, drop live-detector tau + force-route
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:

  lower = mean clean-pair cos(g, v_grad);  upper = mean hack-pair cos
  per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
  routed = sum_b f_b * g_b -> delta_S_hack;  kept = g - routed -> delta_S

v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).

- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)

smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:27:24 +00:00
wassname a3a3f09824 retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational
Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner
placebo. Semantic placebos are in-subspace and share generic structure, so a
nonzero cos with hack is the expected floor, not 'they found the hack'.
null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine
is correlational; the ablation run is the causal test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:21:41 +00:00
wassname e5295dc07b feat: route2 Haar-random v_grad directionality control (H2 vs H4) + semantic placebo fleet
The null_city placebo is CONTAMINATED: 20% of its modules align with the hack
direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the
'route2 is non-directional' verdict rested on a bad control. Add the clean tests:

- route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random
  unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed.
  'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is
  itself H4-confirming (a zero-alignment direction may never clear tau).
- null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.),
  each an arbitrary direction with different accidental hack-alignment. Maps route2's
  suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress.

Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 08:43:54 +00:00
wassname 55937a86fb rename python package projected_grpo -> vgrout
git mv src/projected_grpo -> src/vgrout and find-replace the module name in
all imports (.py), `-m projected_grpo.*` invocations (justfile), and the
[project] name (pyproject; setuptools auto-discovers via where=["src"]).

Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes
tied to past commits, so rewriting them would falsify provenance. Repo dir,
git remote, and absolute paths unchanged.

Verified: `import vgrout` and `python -m vgrout.train --help` load the full
graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass.
Full `just smoke` is blocked upstream by missing gitignored data artifacts
(out/pools/{substrate,teacher_pool}, out/vhack/*smoke*), unrelated to the rename.
2026-06-05 14:51:48 +08:00
wassname 562832acec test: no-cheat partition + teacher-pool composition gate (verify_partition.py)
The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts
on the real out/pools/substrate/partition.json: (1) partition is a clean function
into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests}
the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are
genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke.
6/6 pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:36:03 +00:00
wassname 5242f66b7e figs: a5 dedup title->axis arrow + CSV, overlay onset dot->labeled vline
- a5: drop per-panel title (restated the axis); fold direction into the xlabel
  (DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv
  (per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked.
- overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single
  dashed labeled 'first hack' vertical line, matching the small-multiples/longrun.
- (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:13:37 +00:00
wassname 8daf58d25e figs: a5 vanilla->route arrows, equiv0->approx0, skip degenerate train_deploy, prune orphans
- a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop
  and the stdout solve-cost honestly).
- equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0.
- plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast);
  fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data).
- Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand);
  keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:08:58 +00:00
wassname 34ad20db0a fix route2 no-cheat leak: teacher-only gate anchor + unit test
The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the
run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540,
file_marker 2/1337), force-routing those rollouts -- a real label leak into the
held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows
only, so held-out classes get PROVABLY zero detector labels (airtight A5 control).

Extracted the inline anchor loop to build_route2_anchors() and added
scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces
the leak (held-out FP student force-routed) and teacher_only removes it (zero
student routing, teachers unchanged). 9/9 assertions pass.

Rescoring can't fix this -- the leak is in training (gate shaped the weights),
not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the
A5 run saved no per-eval checkpoints anyway.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:53:23 +00:00
wassname a9523c9cb8 fix overlay label collisions: common right-gutter anchor + leaders
End-labels sat on the line termini (2-arm figs) and piled up bottom-left on
ragged-length multi-arm overlays (substrate, where arms end at different steps).
Now all labels anchor at one gutter x with a leader fanning back to each line's
actual end, y-de-collided. Added right margin so the gutter is clear.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:31:26 +00:00
wassname 504922a3d6 fix collision: lift 'deploy hack =0' off the y=0 line in train_vs_deploy
The solid-red deploy line ran straight through the annotation text (tufte
collision test). Move it into the empty band above the flat line (axes y=0.12).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:25:49 +00:00
wassname b616970e42 fix plot integrity: drop n=28 hack_s fallback in train-vs-deploy series
A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell
back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval.
Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does
not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if
absent -> seed drops from the mean), and missing eval columns normalise to nan
so absent==all-nan. Regenerated all figures from logs. The canonical
train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show
train==deploy honestly (no knob-on data to split).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:21:48 +00:00
wassname 3da296469b plot_deploy_overlay: Cleveland dot plot replaces grouped bars (tufte)
y=mode, dot per arm, thin connector per mode so vanilla->route change reads as a
line segment. Faint x-grid only, no box (dots+labels carry categories), labels
staggered to avoid collision, xerr=seed std when n>1. Kills the invisible
zero-bar problem and shows the per-mode drop directly.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:51:13 +00:00
wassname 5f478f4bd0 a5: explain generalisation mechanism in caption + ≡0 marker on zero bars
Caption now states v is run_tests-only, teacher is run_tests-only, held-out
modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0,
and the placebo caveat (suppression is the direction-agnostic quarantine, not v
specificity). Bar plot tags invisible zero-height bars with ≡0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:46:57 +00:00
wassname aa6bf57a73 plot_deploy_overlay: aggregate seeds per arm, std error bars (n>1)
Groups per_mode_deploy.json by arm into a list, plots mean+/-std across seeds.
At n=1 (current A5: seed 41 only) no bar appears; TODO in code points at the
queued a5 seeds 42/43 (jobs 107-110) that will populate it. Bar labels show n.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:45:22 +00:00
wassname 04430e3ef2 plot_dynamics: hack≡0 tags on overlay, labels on top panel, 2-panel train-vs-deploy
- overlay: floor hack panel below 0 so a pinned-at-0 line shows; direct-label
  the TOP (hack) panel not the bottom (read top-to-bottom); tag any arm whose
  series sits at 0 with $\equiv 0$.
- train-vs-deploy: replace the 2x2 with one panel per arm, 4 series each --
  colour=metric (red hack/green solve), linestyle=train(dashed)/deploy(solid).
  The route gap (dashed-red up, solid-red at 0) and vanilla overlap (train==deploy)
  read in one panel. two-axis legend (colour=metric, style=train/deploy).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:43:50 +00:00
wassname 5257ff010e plot_dynamics: train-vs-deploy 2x2 uses matched n=64 eval on both rows
The train row fell back to per-step hack_s (noisy n=28 train batch) for arms
without a knob-on eval, so vanilla's train/deploy rows looked like different
estimators. Fix: vanilla/erase have no quarantine -> train==deploy, so reuse
hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on
(knob-on eval). Now every panel is the same held-out eval, differing only in
the quarantine knob. Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2
+ vanilla sweep, seed 41, 60 steps).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:33:10 +00:00
wassname b891109633 A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104)
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:21:39 +00:00
wassname 87cca9a603 plots: dejargon, drop redundant titles, emit png+svg+pdf, CSV re-render
Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay):
- One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the
  paper, png for the blog). All scripts call it.
- arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the
  cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack).
- Titles off by default (the paper/blog caption carries it); --title re-enables
  for standalone research use.
- dump_data CSV now carries every plotted series; plot_dynamics --from-csv
  re-renders the three figures from the committed CSV with no logs (logs/ and
  out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified.
- Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:44:39 +00:00
wassname 4c9071cca0 A5: build held-out-mode (hack,clean) pairs from student rollouts
scripts/pairs_from_rollouts.py mirrors pairs_from_pool but sources the
student's own rollouts.jsonl and splits hack/clean by env_mode+exploited
(the per-mode weak detector). Same-prompt pairing, asserts prompt equality.
Smoke-validated: parse + classify + loud-fail paths green on smoke rollouts
(0 hacks -> 0 pairs, as expected). Unblocks A5 once job 95 harvest lands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:59:07 +00:00
wassname a296f0c950 cleanup: trim 2 stale provenance/train-of-thought comments
make_random_vhack: drop 'run 31's v_hack' dead run-ID + 'the task number', state
the actual role (shape source / reproducible seed). probe_distill: collapse the
'either finish or remove, for now...' train-of-thought into a tight FIXME.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:25:22 +00:00
wassname 4fa9061162 refactor: move 5 leaf entrypoints src/ -> scripts/ (src is now library-only)
verify_rewards, verify_vhack_heldout, build_substrate, probe_distill, probe_plot_stack
are run via 'python -m' / justfile and imported by no core module -> moved to scripts/,
relative imports rewritten to 'from projected_grpo.X'. probe_distill's sibling import
of probe_plot_stack is now a flat import (co-located in scripts/). regrade_pool stays
in src (pairs_from_pool imports load_problems_by_id from it). justfile recipes updated.

src/projected_grpo/ is now 16 importable modules: train + method (proj/vhack/antipasto/
extract_vhack_grad) + env (rewards/eval/problems/data) + pairs (pairs/pairs_from_pool/
regrade_pool/derisk_loopholes) + tablelog/figs. ~1480 lines moved out of the package.
Smoke green (verify_rewards 52/52 from scripts/, train pipeline cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:23:56 +00:00
wassname 07363f1ede cleanup: trim stale comments + attic README
Dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the
'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in
proj.py, and the conversational 'current experiment line' lead. Removed the dead
probe-traj justfile recipe. Kept all TODO/FIXME and the 'why' memory-tuning
comments. Smoke green (cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:09:19 +00:00
wassname fffd26a93d cleanup: delete 6 orphan modules, quarantine pair generators, trim stale comments
Deleted (zero importers/refs): scripts/{migrate_out_dirs,audit_log,plot_route_evidence}.py
and src/projected_grpo/{bake_lora,probe_lora_runtime,probe_traj}.py (LoRA-merge path
+ dev trajectory comparator, superseded). Removed the dead probe-traj recipe.

Quarantined to scripts/attic/: make_pairsets.py + make_dataset_pairsets.py (persona-pair
authoring, tasks #123-126 done; live path is pairs.PAIRS / pairs_from_pool).

Comments: dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the
'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in
proj.py, and the conversational 'current experiment line' lead. Kept all TODO/FIXME
and the 'why' memory-tuning comments. Smoke green (cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:09:01 +00:00
wassname 025debae6b feat: knob-ON eval (route arms) for like-for-like train-vs-deploy + teacher-off marker
The 2x2 train row used per-step hack_s (noisy n=28 train batch, knob-on) vs the
deploy row's smooth n=64 eval (knob-off) -- different estimators, confounded.
Now at each eval step route arms ALSO run the SAME n=64 eval with the quarantine
ACTIVE (knob-on = training policy), logged as hk_on/slv_on. vanilla/erase reuse
deploy (no quarantine -> knob-on==knob-off). plot_dynamics prefers hk_on for the
train series so the 2x2 differs ONLY in knob state.

Also: plot parses --teacher-off-step from argv and shades the teacher-ON region
[0,toff] + a dashed cut line in the 2x2. The stashed long-run route2 jobs
(92 KL, 94 teacher-off) inherit the knob-on eval automatically at runtime.

Smoke (route2 hk_on present + logged, both plot parse paths) green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:00:24 +00:00
wassname 24fa924c8d plot: 2x2 train(knob-on) vs deploy(knob-off) x arm figure
The A4 framing in one figure: vanilla train==deploy (cheat in the weights),
route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log
now keeps the raw train series (hack_train/solve_train) before the deploy
substitution. New fig: dyn_longrun_200_train_deploy.png.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:53:08 +00:00
wassname bfa2b82aba plot: longrun A4 fig + visible route2 hack≡0 line + committed CSV data source
plot_dynamics now dumps the plotted series to <out>.csv next to the png so the
figure is reproducible from a tracked artifact (logs/ and out/runs/ are gitignored).
ylim floor -0.035 + 'hack ≡ 0' annotation so route2's pinned-at-zero hack line is
visible rather than hidden under the axis.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:19:06 +00:00
wassname 227c173f63 feat: test-time (post-hoc) hack-erasure benchmark
scripts/tt_erase_bench.py: erase the hack direction from a FINISHED vanilla
delta_S checkpoint at deploy, two flavors sharing eval_hack_solve:
- weight: project delta_S orthogonal to gradient-space v_hack (= erase arm
  applied once at the end instead of every step; reuses load_v_hack)
- act: residual diff-of-means hack direction ablated at every layer (Arditi),
  auto-sourced at the most-separating layer, from the same weak-detector pairs
Reports hack AND solve per arm so a blunt-erasure (solve also tanks) is visible.
Baseline for whether train-time routing beats cheap post-hoc erasure.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 02:20:51 +00:00
wassname 19deef4fb9 docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
  current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
  de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
  banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
  for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
  sampled, not greedy)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 01:24:29 +00:00
wassname 997de37b26 deploy-eval every arm + single-row dynamics plot (apples-to-apples)
Wassname flagged the dynamics curve wasn't comparable: route2 plotted its
deploy eval (n=64, T=0.7, every 5 steps) while vanilla/erase plotted training
rollouts (n=28, every step) -- route2 looked artificially smoother. (NOT a
temperature gap: both gens are T=0.7; the "held-out greedy" header was a stale
lie, now corrected.)

train.py: ungate the periodic DEPLOY-eval to run for EVERY arm. route/route2
wrap it in ablate_quarantine (deploy = knob zeroed); vanilla/erase use
nullcontext (deploy == trained model). Same estimator across arms. Cost: ~+40%
amortized generation on the arms that newly get it (n=64 every 5 steps over
~32 train gens/step) -- n stays 64 to match the finished route2 n=3.

plot_dynamics.py: plot hk_dep/slv_dep for ALL arms when present (drop the
route-only guard; old logs fall back to training hack_s). Drop the cos row
(it was for online-vs-offline erasure; not informative next to the rate row,
and the cross-arm cos comparison was apples-to-oranges) -> single-row small
multiples, "deployed rate". Title states deploy-eval n=64 T=0.7.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:56:44 +00:00
wassname 633bb021e2 plot(dyn): dot route2's deploy-eval points so sparsity is visible
route2's plotted hack/solve is the DEPLOY eval (hk_dep/slv_dep): greedy, n=64,
logged every eval_ablate_every=5 steps and EMA-held flat between. The held line
reads as per-step-dense and oversells route2's smoothness vs the per-step
temperature-sampled (n=28) training curves the other arms plot -- an apples-to-
oranges smoothness the reader shouldn't be misled by.

_mark_if_sparse dots the real measured points when a series is >50% NaN; dense
series (training hack_s, cos sep/leak) stay unmarked. Now the route2 curve
visibly rests on ~13 eval points, not 60.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:25:41 +00:00
wassname 00e110c498 plot(dyn): make cos row cross-arm-comparable (sep + leak)
The cos row mixed non-comparable quantities: erase logged cin_t/cin_s
(pre-intervention single cosines), route2 logged hkgap (a difference) and
resid (post-intervention), all under one "cos(grad,v_hack)" ylabel. Wassname
flagged it -- they are not the same measurement.

Derive two quantities that mean the same thing in every column (_add_cos_derived):
  sep  = does v_hack still discriminate hacky from clean gradient
         erase: cin_t-cin_s (teacher vs student); route2: hkgap (hackflag vs clean)
  leak = residual hack-alignment of the post-intervention deployed gradient
         erase: cout (after projection); route2: resid (after routing) -- same quantity

Legend now lands on the leftmost arm that has cos data (vanilla has none).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:20:37 +00:00
wassname 8e42836b46 feat(plot): hack-overlay gains a solve-rate subplot (Tufte two-panel)
Stacked hack (top) + solve (bottom) sharing x; EMA-5; onset dot on hack only;
arms direct-labelled once on solve with y de-collision + leader lines (the three
non-route arms overlap, so their labels would otherwise stack). routing2 reads
hack~0 / solve highest at a glance.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:05:35 +00:00
wassname 2defc4a3ea fix(plots): drop deprecated routing arm; plot_substrate reads per-batch counts
- plot_dynamics: routing (route v1) out of ARM_ORDER -- superseded by routing2.
- plot_substrate: per-mode hk_* are now plain per-batch counts (streaming log
  dropped the /denominator); parse the count, plot it (EMA or cumsum); skip old
  n/d-format logs (incompatible units). Y-axis hacks/batch, count annotations.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:02:43 +00:00
wassname 83d41933b2 fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan
The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor
runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel
came up empty. Test for finite data (_has_data) not column presence; fall back to
the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values
across the nan gaps -> a held step-line.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 23:36:26 +00:00
wassname 5c09feeb14 refactor: decompose train.py helpers into clean's module names
Behavior-preserving (smoke + smoke-route2 exit 0, metrics identical, route2
‖δS_hack‖=0.0079>0). All touched modules import-checked (no cycles).

Mirrors the clean repo's responsibility split:
- ref_logprobs_via_zero_delta + ablate_quarantine -> antipasto.py (the adapter
  owns the δS=0 free-ref-model trick and the δS_hack ablation).
- load_v_hack + postprocess_v_hack -> extract_vhack_grad.py (alongside extract_v_hack).
- load_problems + DATA + the per-mode hints -> new problems.py.

Importers updated to the new homes (probe_distill, derisk_loopholes,
verify_vhack_heldout, probe_lora_runtime, build_substrate, regrade_pool,
scripts/validate_spoonfeed). Moving DATA out of train.py also broke the
regrade_pool->train edge, so train.py can now import the v_hack helpers at
top level without a cycle.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 12:15:12 +00:00
wassname 3e7b8ecfc0 feat: just dyn = auto-plot newest full-length log per arm
--latest-per-arm + --min-steps select the freshest >=N-step log for each
arm from logs/, no hand-globbing. Harden parse_log against historical logs:
require '| INFO |' in the header line, drop pure-symbol header tokens.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:03:37 +00:00
wassname ff82fbb940 plot_dynamics: per-step deploy curve from hk_abl + routing2 arm
The routing arms' benefit shows on the DEPLOYED model (quarantine deleted).
Prefer the dense per-step proxy hk_abl/slv_abl (every step, rollout_ablate_frac>0)
over the sparse held-out hk_dep eval for the plotted hack_s/gt_s curve; fall back
to hk_dep for runs that predate the proxy.

- parse hk_abl/slv_abl; routing+routing2 substitute it (else hk_dep) into hack_s/gt_s
- classify/ARM_ORDER/ARM_COLORS recognise routing2
- gate cos cols (cin_t/cin_s) by presence: vanilla/routing2 lack them, so parse
  and panels skip them instead of KeyError (also fixes a pre-existing vanilla crash)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 06:25:04 +00:00
wassname 7a55b77786 audit-log: print a fixed healthy-vanilla gen as a coherence yardstick
The audited last-gen alone has no reference. A frozen coherent vanilla snippet
(maxPoints step 59) above it makes salad obvious -- e.g. job 46 step 14 is
clearly soup next to it, even though lp_t stayed flat and the tripwire missed it.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:15:25 +00:00