85 Commits

Author SHA1 Message Date
wassname cca7150ea0 tidy 2026-06-14 11:05:54 +08:00
wassname af420ec855 feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method
Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/route).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.
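
Illustration of the clip argument above, as a minimal sketch of the textbook PPO
clipped surrogate, not this repo's loss code. The names importance_ratio and
ppo_clip_objective and the value eps=0.2 are assumptions made for the example.
It shows why a mismatched logπ_old is harmless for A>0 but unbounded for A<0.

```python
import math


def importance_ratio(logp_new, logp_old):
    """rho = pi_new / pi_old, computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


def ppo_clip_objective(rho, adv, eps=0.2):
    """Textbook PPO surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * adv, clipped * adv)


# Same sampling mode and unchanged weights: the ratio is exactly 1.
same_mode = importance_ratio(-2.0, -2.0)

# Mismatched baseline (hypothetical rho = 100, e.g. full / ablated).
pos = ppo_clip_objective(100.0, 1.0)   # A > 0: the clip caps the term at 1 + eps
neg = ppo_clip_objective(100.0, -1.0)  # A < 0: min() picks the unclipped rho * A
```

For A<0 the objective scales linearly with ρ, so the loss (its negative) grows
without bound as ρ grows, which is the failure a generation-matched baseline
avoids by keeping ρ near 1.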

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.
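
Sketch of the gate idea above, under stated assumptions: the buffer holds raw
activations and is re-scored against the current direction on every call, so
swapping the direction needs no flush. All names here (ActBuffer, tail_thresholds,
zone) are hypothetical, the nearest-rank quantile is a simplification, and the
zone labels are generic because the mapping of tails to keep/absorb/route is not
spelled out in this message.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ActBuffer:
    """Run-spanning FIFO of activations; scores are derived, never stored."""

    def __init__(self, maxlen=8192):
        self.acts = deque(maxlen=maxlen)

    def add(self, act):
        self.acts.append(list(act))

    def scores(self, v_act):
        # Re-score every stored activation against the current direction.
        return [dot(a, v_act) for a in self.acts]


def tail_thresholds(scores, q=0.1):
    """Symmetric nearest-rank thresholds at quantiles q and 1 - q."""
    s = sorted(scores)
    n = len(s)
    return s[round(q * (n - 1))], s[round((1.0 - q) * (n - 1))]


def zone(score, lo, hi):
    if score <= lo:
        return "low_tail"
    if score >= hi:
        return "high_tail"
    return "middle"


buf = ActBuffer()
for i in range(11):
    buf.add([float(i)])

lo, hi = tail_thresholds(buf.scores([1.0]), q=0.1)
lo2, hi2 = tail_thresholds(buf.scores([2.0]), q=0.1)  # refreshed direction, same buffer
```

Because thresholds come from the whole buffer rather than a per-window z-norm,
a window in which every rollout scores high still lands in one tail instead of
being renormalized around its own mean.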

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-12 03:22:48 +00:00
wassname 668285c86e journal: fix pueue IDs 28-30 -> 31-33 in entry (f) provenance
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 13:08:20 +00:00
wassname 8000aa48f4 journal(#41): entry (g) routeA shipped + guard-drop calibration; track moduleS diag scripts
Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS
scripts were untracked, so committing them for provenance alongside the
calibration script the new entry cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 12:50:07 +00:00
wassname 8b6bf5b6a6 journal(e): super-S-space and act t-stat both null; act_dot mean-extraction stays the routeA default
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:44:12 +00:00
wassname 1b51c62cdc journal(d): z-norm+otsu pinning fix, module weighting null, mode hypothesis falsified
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:28:37 +00:00
wassname 4be1858b14 journal(c) addendum: resid_cos 0.92/0.70/0.80 vs act_cos -- bottleneck not the limiter, routeA defaults to resid
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:46:06 +00:00
wassname 4a7465c0da journal(c): act score replicates (0.75-0.87), grad decays to chance across windows; routeA next
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:44:40 +00:00
wassname 7511ba12e8 docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix
AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full)
adapters -- forward sees only the sum so same-basis routing is a magnitude split
(shrinkage null) unless broken by gate discrimination x (expressiveness + structural
separation). Honest note that this wasn't clear to me on the first pass.

RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107
sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 06:50:42 +00:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname 83f3f98328 results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm)
Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate).
All routeV arms beat vanilla on both hack and solve. Journal entry added.
main.tex tab:anchors vanilla row filled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:51:12 +00:00
wassname ec88ba3e42 merge: resolve RESEARCH_JOURNAL conflict (keep both HEAD + remote Modal-port entry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:27:08 +00:00
wassname c721c460a4 journal: act_vote routes late where grad-cosine cliffs (killed-run partial data) 2026-06-08 22:50:09 +00:00
wassname 1cb5900de5 journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20) 2026-06-08 19:44:00 +00:00
wassname 744d851861 journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending) 2026-06-08 19:38:48 +00:00
wassname 35be877fc0 pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap
Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84.
Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:53:48 +00:00
wassname 9c630b83c7 agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:39:27 +00:00
wassname 5fe22a1973 journal: separability sweep (act>grad AUROC, grad-cos best tail p@10=0.70, magnitude inverted, distshift root cause)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:13:02 +00:00
wassname b28b1a5e88 results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 10:47:38 +00:00
wassname fcac80c4bb journal: random-V control matches real-V at per-rollout (0.101==0.101) -- H2 absorption lead
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 08:26:26 +00:00
wassname cf05310130 journal: dir6 real-V arms land (margin band) -- both suppress, per-token>per-rollout
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 02:08:02 +00:00
wassname 012983fb8d docs: journal entry 2026-06-07 -- Modal routeV deadlock was stdout buffering artifact
Both vanilla and routeV arms complete on Modal H100/A100-80GB; the apparent
freeze at generate() was local subprocess stdout block-buffering, not a real hang.
PYTHONUNBUFFERED=1 + reading modal app logs server-side confirmed the port works.
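
Sketch of the buffering fix described above, assuming a plain local Python child
process rather than the actual Modal launcher. run_unbuffered is a hypothetical
helper. With stdout attached to a pipe, CPython block-buffers by default, so
progress lines can sit unseen until the buffer fills. Setting PYTHONUNBUFFERED=1
in the child's environment makes each line arrive as it is printed.

```python
import os
import subprocess
import sys


def run_unbuffered(cmd):
    """Run cmd with PYTHONUNBUFFERED=1 and collect stdout line by line."""
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        env=env,
    )
    lines = []
    for line in proc.stdout:  # yields each line as soon as the child emits it
        lines.append(line.rstrip("\n"))
    proc.wait()
    return proc.returncode, lines


rc, lines = run_unbuffered(
    [sys.executable, "-c", "print('step 1'); print('step 2')"]
)
```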

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 06:50:20 +08:00
wassname 34ba631e7d journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 22:39:01 +00:00
wassname eeee7db65c journal: routeV margin band (p75/p75) verified routing 28.7% on real 4B; dir6 restarted on it
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 14:15:12 +00:00
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname bcf09dd742 docs 2026-06-06 12:27:26 +00:00
wassname ac418a54ce journal: #186 teacher-off vanilla hacking self-sustaining (job 87, 0.36->0.58 on-policy)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 12:07:41 +00:00
wassname d2b0fcb255 afk: scope hourly check to directionality mystery (docs/AFK_CHECK.md); drop routine no-finding journal entry (h)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:46:24 +00:00
wassname 6f60ebafa1 journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:35:27 +00:00
wassname ec00bc4383 docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap
Two review questions today exposed imprecise framing in load-bearing comments:

- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
  completion that also writes the stdout marker, verified job-95 id 132), not a
  detector false positive. hacked_E is the mode-agnostic run_tests signature.
  Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
  0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
  Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
  suppression is quarantine-volume + exploration floor, not v_hack specificity.
  Direction only shows in solve (real 0.625 > placebo 0.531).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 08:23:49 +00:00
wassname 3ae1e8376d journal: close (a) WATCH -- placebo endpoint refutes route directionality (job 86)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 05:01:18 +00:00
wassname 7b08a7ede9 journal: A5 gate leak fixed (teacher-only anchor) + airtight rerun queued (job 111)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:54:09 +00:00
wassname b3539e50e7 no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption
Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires
2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector
cross-mode false positives). The 'hacked_E==0' claim was the smaller harvest;
at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'.
Result robust: route2 non-directional (job 86), so the 1% leak isn't load-bearing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:01:20 +00:00
wassname 176f1baae6 journal: placebo (job 86) full 60-step verdict -- route2 non-directionality locked (12/12 evals at 0.000)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:35:03 +00:00
wassname ea0293083c journal: queue erase directionality test (jobs 105/106 real-v vs placebo)
The route2 placebo result (job 86, dir-insensitive gate) doesn't settle
directionality; erase projects with magnitude ~cos(g,v) so it's the arm where
direction must enter. Matched 60-step s41 pair differs only in v source. #183.
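
Worked sketch of why direction must enter under erase, assuming erase is an
ordinary projection of the gradient g off the direction v (the repo's one-sided
variant may differ). The helper names are hypothetical. The removed component has
norm |g| * |cos(g, v)|, so a random v that is nearly orthogonal to g removes
almost nothing, unlike a gate that only thresholds.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def erase(g, v):
    """Project g onto the subspace orthogonal to v: g - (g.v / v.v) v."""
    coef = dot(g, v) / dot(v, v)
    return [gi - coef * vi for gi, vi in zip(g, v)]


g = [3.0, 4.0]
v = [1.0, 0.0]

g_erased = erase(g, v)
removed = norm([a - b for a, b in zip(g, g_erased)])
expected_removed = norm(g) * abs(cosine(g, v))  # |g| * |cos(g, v)|
cos_post = cosine(g_erased, v)                  # zero after the projection
```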

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 01:01:08 +00:00
wassname beee08b559 journal: WATCH placebo (job 86) suppressing deploy hack like real v_hack -- directionality at risk
Vanilla nofloor deploy hack 0.36 (metric not degenerate), real-v_hack route2
0.000, placebo route2 0.000 through step 20. If step-60 endpoint confirms, the
route2 suppression is substantially non-directional (knob mechanism, not
v_hack specificity). Verdict + random-V cross-check next cycle. #183.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 00:36:04 +00:00
wassname b891109633 A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104)
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:21:39 +00:00
wassname 2b48eab6b8 journal: A5 suppression preliminary (job 104 step ~32) -- held-out hacks emerge on-policy, knob-off deploy holds 0.000
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:59:26 +00:00
wassname 1265445939 journal: A5 baseline (job 103) finished -- per-mode deploy split, held-out modes all emerge on-policy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 13:11:46 +00:00
wassname e91e095596 journal: A5 baseline confirmed -- held-out file_marker+sentinel emerge on-policy (run_tests-only teacher), deploy hack 0.52 by step 40
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 05:33:36 +00:00
wassname a162eded65 journal: job 97 gentle-probe refutes vanilla-200 collapse (lp_s>-0.5); A5 baseline sees held-out file_marker emerge on-policy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 04:01:18 +00:00
wassname 46b102ad22 eval: revert eval-every default 10->5 (knob-on removal made it cheap again)
The knob-on pass removal is the real win (halves each eval). With it gone,
every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy
points vs 6 -- better plots, cheap. No paper figure uses the knob-on train
curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only).
Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:29:23 +00:00
wassname 208713d7c2 perf(eval): drop redundant per-step knob-ON pass, default eval-every 5->10
Per-step TIMING audit (journal 2026-06-04 a): gen ~140s/step dominates;
the 2x2 deploy eval is ~460s and route2 ran it TWICE per eval (knob-off +
knob-on) for a train curve no figure plots -- per-step hack_s already is the
train series, and the full 2x2 is computed once post-loop (FINAL EVAL). Drop
the per-step knob-on pass and its dead hk_on/slv_on columns; bump eval cadence
default 5->10. ~27% faster on 60-step fast runs, ~4h/run on 200-step. refresh
left at 5 (timing shows it's ~10s/step, not the culprit I'd claimed).

plot_dynamics already falls back to hack_s when hk_on absent. Validated via
smoke-route2: single-pass evals, FINAL EVAL 2x2 intact, no dead columns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:25:07 +00:00
wassname 0913b064fc journal: A5 run plan queued (strict teacher-modes=run_tests, vanilla baseline + route2 test)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:46:56 +00:00
wassname a0d4ddf9d5 journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:40:06 +00:00
wassname f8aea5f9e6 journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 20:23:41 +00:00
wassname d06a4fbe69 journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)
The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to
span/erase the trained hack in W. But the same vector works at train time because a gate only
needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing
localization amplify a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:04:07 +00:00
wassname 3cc804b15e results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:50:08 +00:00
wassname 8d16b317cb results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 06:47:58 +00:00
wassname 753a54c625 paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).

Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).

Closes A1/A2 (#173).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 03:36:32 +00:00