488 Commits

Author SHA1 Message Date
wassname 34ad20db0a fix route2 no-cheat leak: teacher-only gate anchor + unit test
The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the
run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540,
file_marker 2/1337), force-routing those rollouts -- a real label leak into the
held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows
only, so held-out classes get PROVABLY zero detector labels (airtight A5 control).

Extracted the inline anchor loop to build_route2_anchors() and added
scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces
the leak (held-out FP student force-routed) and teacher_only removes it (zero
student routing, teachers unchanged). 9/9 assertions pass.

Rescoring can't fix this -- the leak is in training (gate shaped the weights),
not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the
A5 run saved no per-eval checkpoints anyway.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:53:23 +00:00
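Note on 34ad20db0a: a minimal sketch of the two anchoring rules, assuming each rollout row carries a mode, a teacher flag, and the hacked_E detector flag. The Rollout layout and the helper name select_anchor_rows are hypothetical and only illustrate the rule; the repo's build_route2_anchors() may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    # Hypothetical row layout, for illustration only.
    mode: str
    is_teacher: bool
    hacked_E: bool  # run_tests detector fired on this rollout


def select_anchor_rows(rows, teacher_only):
    """Return the rows that anchor the tau gate.

    Default: teacher rows OR students flagged by hacked_E.
    teacher_only: teacher rows only, so no detector label reaches students.
    """
    if teacher_only:
        return [r for r in rows if r.is_teacher]
    return [r for r in rows if r.is_teacher or r.hacked_E]


rows = [
    Rollout("run_tests", is_teacher=True, hacked_E=True),
    Rollout("run_tests", is_teacher=False, hacked_E=True),
    # Held-out student on which the detector cross-fires (a false positive).
    Rollout("stdout_marker", is_teacher=False, hacked_E=True),
    Rollout("sentinel", is_teacher=False, hacked_E=False),
]

default_anchor = select_anchor_rows(rows, teacher_only=False)
strict_anchor = select_anchor_rows(rows, teacher_only=True)
# Held-out students that the default rule would force-route.
leaked = [r for r in default_anchor if not r.is_teacher and r.mode != "run_tests"]
```

Under the default rule the stdout_marker false positive lands in the anchor set; under teacher_only the anchor set contains teacher rows and nothing else.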
wassname a9523c9cb8 fix overlay label collisions: common right-gutter anchor + leaders
End-labels sat on the line termini (2-arm figs) and piled up bottom-left on
ragged-length multi-arm overlays (substrate, where arms end at different steps).
Now all labels anchor at one gutter x with a leader fanning back to each line's
actual end, y-de-collided. Added right margin so the gutter is clear.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:31:26 +00:00
wassname 504922a3d6 fix collision: lift 'deploy hack =0' off the y=0 line in train_vs_deploy
The solid-red deploy line ran straight through the annotation text (tufte
collision test). Move it into the empty band above the flat line (axes y=0.12).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:25:49 +00:00
wassname b616970e42 fix plot integrity: drop n=28 hack_s fallback in train-vs-deploy series
A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell
back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval.
Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does
not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if
absent -> seed drops from the mean), and missing eval columns normalise to nan
so absent==all-nan. Regenerated all figures from logs. The canonical
train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show
train==deploy honestly (no knob-on data to split).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:21:48 +00:00
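Note on b616970e42: a small sketch of the "eval only, nan if absent" rule, assuming per-seed rows are dicts keyed by column name. The dict layout and the numbers are illustrative assumptions, not values from any run; hk_dep and hack_s are the column names used in the log.

```python
import math


def nan_mean(values):
    """Mean over non-nan entries; nan if every entry is nan."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


def train_series(eval_row):
    """Train value for one seed at one step: the knob-off eval, or nan if absent.

    There is no fallback to the per-step hack_s estimate, so the mean never
    mixes the n=64 eval with the n=28 train batch.
    """
    value = eval_row.get("hk_dep")
    return math.nan if value is None else float(value)


# Illustrative numbers only.
seeds = {
    "s41": {"hk_dep": 0.25, "hack_s": 0.40},
    "s42": {"hk_dep": 0.35, "hack_s": 0.30},
    "s43": {"hack_s": 0.90},  # eval missing: this seed drops from the mean
}
mean_train = nan_mean([train_series(row) for row in seeds.values()])
```

The s43 row contributes nan rather than its hack_s value, so the mean is taken over the two seeds that share an estimator.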
wassname b3539e50e7 no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption
Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires
2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector
cross-mode false positives). The 'hacked_E==0' claim came from the smaller harvest;
at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'.
Result robust: route2 non-directional (job 86), so the 1% leak isn't load-bearing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:01:20 +00:00
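Note on b3539e50e7: the arithmetic behind the '<=1.1%' and '>=98.9%' figures, using only the counts quoted in the message. Variable names are illustrative.

```python
# Counts from the commit message: (detector fires, rollouts) per held-out mode.
counts = {
    "file_marker": (2, 1337),
    "sentinel": (0, 1386),
    "stdout_marker": (17, 1540),
}

rates = {mode: fires / n for mode, (fires, n) in counts.items()}
worst_mode = max(rates, key=rates.get)
worst_pct = round(100 * rates[worst_mode], 1)
withheld_pct = round(100 * (1 - rates[worst_mode]), 1)

for mode, rate in rates.items():
    print(f"{mode}: {100 * rate:.2f}%")
print(f"worst: {worst_mode} {worst_pct}% fired, {withheld_pct}% withheld")
```

Both headline figures are one-decimal roundings of the stdout_marker rate: unrounded, it is about 1.104% fired and 98.896% withheld.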
wassname 6869afef72 tab:ablation: fill placebo row (0.000/0.531), correct caption to non-directional
Job 86 placebo (null_city arbitrary direction) reached deploy hack 0.000 over
the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate
is direction-agnostic: the discarded knob absorbs whatever crosses the per-step
energy threshold regardless of v_hack alignment. Directional specificity now
rests on the erase arm (subtracts a component proportional to cos(g,v)), which is pending.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:57:06 +00:00
wassname 3da296469b plot_deploy_overlay: Cleveland dot plot replaces grouped bars (tufte)
y=mode, dot per arm, thin connector per mode so vanilla->route change reads as a
line segment. Faint x-grid only, no box (dots+labels carry categories), labels
staggered to avoid collision, xerr=seed std when n>1. Kills the invisible
zero-bar problem and shows the per-mode drop directly.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:51:13 +00:00
wassname 5f478f4bd0 a5: explain generalisation mechanism in caption + ≡0 marker on zero bars
Caption now states v is run_tests-only, teacher is run_tests-only, held-out
modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0,
and the placebo caveat (suppression is the direction-agnostic quarantine, not v
specificity). Bar plot tags invisible zero-height bars with ≡0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:46:57 +00:00
wassname aa6bf57a73 plot_deploy_overlay: aggregate seeds per arm, std error bars (n>1)
Groups per_mode_deploy.json by arm into a list, plots mean+/-std across seeds.
At n=1 (current A5: seed 41 only) no error bar appears; TODO in code points at the
queued a5 seeds 42/43 (jobs 107-110) that will populate it. Bar labels show n.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:45:22 +00:00
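Note on aa6bf57a73: a sketch of per-arm seed aggregation with the n=1 case handled explicitly. The record layout is an assumption (the real per_mode_deploy.json schema is not shown in this log), and the use of the sample standard deviation is also an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_by_arm(records):
    """Group per-seed values by arm; return {arm: (mean, std_or_None, n)}.

    std is None at n=1, so the caller draws no error bar for that arm.
    """
    by_arm = defaultdict(list)
    for rec in records:
        by_arm[rec["arm"]].append(rec["deploy_hack"])
    return {
        arm: (mean(vals), stdev(vals) if len(vals) > 1 else None, len(vals))
        for arm, vals in by_arm.items()
    }


# Illustrative records only.
records = [
    {"arm": "vanilla", "seed": 41, "deploy_hack": 0.50},
    {"arm": "vanilla", "seed": 42, "deploy_hack": 0.25},
    {"arm": "route2", "seed": 41, "deploy_hack": 0.0},
]
summary = aggregate_by_arm(records)
```

Returning None instead of 0.0 for the single-seed std keeps "no spread measured" distinct from "zero spread" in the plot.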
wassname 04430e3ef2 plot_dynamics: hack≡0 tags on overlay, labels on top panel, 2-panel train-vs-deploy
- overlay: floor hack panel below 0 so a pinned-at-0 line shows; direct-label
  the TOP (hack) panel not the bottom (read top-to-bottom); tag any arm whose
  series sits at 0 with $\equiv 0$.
- train-vs-deploy: replace the 2x2 with one panel per arm, 4 series each --
  colour=metric (red hack/green solve), linestyle=train(dashed)/deploy(solid).
  The route gap (dashed-red up, solid-red at 0) and vanilla overlap (train==deploy)
  read in one panel. two-axis legend (colour=metric, style=train/deploy).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:43:50 +00:00
wassname 176f1baae6 journal: placebo (job 86) full 60-step verdict -- route2 non-directionality locked (12/12 evals at 0.000)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:35:03 +00:00
wassname 5257ff010e plot_dynamics: train-vs-deploy 2x2 uses matched n=64 eval on both rows
The train row fell back to per-step hack_s (noisy n=28 train batch) for arms
without a knob-on eval, so vanilla's train/deploy rows looked like different
estimators. Fix: vanilla/erase have no quarantine -> train==deploy, so reuse
hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on
(knob-on eval). Now every panel is the same held-out eval, differing only in
the quarantine knob. Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2
+ vanilla sweep, seed 41, 60 steps).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:33:10 +00:00
wassname 0645ae2dd2 fig:longrun: rebuild from job84 route2 + job97 fixed vanilla (no collapse)
Old figure paired route2 (job 84) with job 85 vanilla, whose step-88
'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and
stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to
~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now
flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both
beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:18:34 +00:00
wassname ea0293083c journal: queue erase directionality test (jobs 105/106 real-v vs placebo)
The route2 placebo result (job 86, dir-insensitive gate) doesn't settle
directionality; erase projects with magnitude ~cos(g,v) so it's the arm where
direction must enter. Matched 60-step s41 pair differs only in v source. #183.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 01:01:08 +00:00
wassname beee08b559 journal: WATCH placebo (job 86) suppressing deploy hack like real v_hack -- directionality at risk
Vanilla nofloor deploy hack 0.36 (metric not degenerate), real-v_hack route2
0.000, placebo route2 0.000 through step 20. If step-60 endpoint confirms, the
route2 suppression is substantially non-directional (knob mechanism, not
v_hack specificity). Verdict + random-V cross-check next cycle. #183.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 00:36:04 +00:00
wassname b311815fc6 writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat)
New Results subsection + fix stale Limitations bullet (50%->12.5%). Three
evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step,
student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out
emergence (teacher demos only run_tests; 3 held-out modes emerge in the student with zero
teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead
of A3 random-V).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:26:54 +00:00
wassname b891109633 A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104)
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:21:39 +00:00
wassname 2fb56e1a91 writeup: prefill A5 generalisation table vanilla column (job 103); reframe as designed held-out test
Replaces the old opportunistic n=3 partial read. Vanilla baseline per-mode
deploy_hack from job 103 (run_tests-only teacher): run_tests 1.000, file_marker
0.625, sentinel 0.417, stdout_marker 0.167. route2 column pending job 104.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 22:30:35 +00:00
wassname 2b48eab6b8 journal: A5 suppression preliminary (job 104 step ~32) -- held-out hacks emerge on-policy, knob-off deploy holds 0.000
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:59:26 +00:00
wassname 1265445939 journal: A5 baseline (job 103) finished -- per-mode deploy split, held-out modes all emerge on-policy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 13:11:46 +00:00
wassname 4fcce164f7 memory: merge qmd-lexical + s2-keyed-access into tracked .claude/memory
Harness path ~/.claude/projects/.../memory was a real dir that had diverged
from the repo copy (the 9c188f6 symlink targeted /root, not this box's home).
Merged the two harness-only memories in and re-pointed the harness path at the
repo via symlink, so future auto-writes land in-project.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:40:48 +08:00
wassname 154e33683e paper: HARVE byline cross-verified arXiv==S2 (keyed semantic-search .env)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:20:04 +08:00
wassname b097d9abfc paper: add verified related work (11 refs) + fix Huang->Deng first author
Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit), all
arXiv ids verified HTTP 200, bibtex+abstracts via the bibtex MCP / arXiv scrape:
- gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE)
- deletable-module precedent (pre-dates Cloud): zhou2023securityvectors
- gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa
- C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize
- weight-space contrastive direction: fierro2025weightarithmetic
- shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey
- idea provenance: mallen2025rhinterventions (AF)
Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189);
sync the cold-reader comment to 'Deng et al.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:18:44 +08:00
wassname e91e095596 journal: A5 baseline confirmed -- held-out file_marker+sentinel emerge on-policy (run_tests-only teacher), deploy hack 0.52 by step 40
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 05:33:36 +00:00
wassname a162eded65 journal: job 97 gentle-probe refutes vanilla-200 collapse (lp_s>-0.5); A5 baseline sees held-out file_marker emerge on-policy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 04:01:18 +00:00
wassname 5a25a1cc1c results: fill route-rf2 ablation cell (job99: deploy hack 0.000/solve 0.625, staleness harmless)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 03:00:57 +00:00
wassname 46b102ad22 eval: revert eval-every default 10->5 (knob-on removal made it cheap again)
The knob-on pass removal is the real win (halves each eval). With it gone,
every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy
points vs 6 -- better plots, cheap. No paper figure uses the knob-on train
curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only).
Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:29:23 +00:00
wassname 208713d7c2 perf(eval): drop redundant per-step knob-ON pass, default eval-every 5->10
Per-step TIMING audit (journal 2026-06-04 a): gen ~140s/step dominates;
the 2x2 deploy eval is ~460s and route2 ran it TWICE per eval (knob-off +
knob-on) for a train curve no figure plots -- per-step hack_s already is the
train series, and the full 2x2 is computed once post-loop (FINAL EVAL). Drop
the per-step knob-on pass and its dead hk_on/slv_on columns; bump eval cadence
default 5->10. ~27% faster on 60-step fast runs, ~4h/run on 200-step. refresh
left at 5 (timing shows it's ~10s/step, not the culprit I'd claimed).

plot_dynamics already falls back to hack_s when hk_on absent. Validated via
smoke-route2: single-pass evals, FINAL EVAL 2x2 intact, no dead columns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:25:07 +00:00
wassname 65a05c365c fix(writeup): flag vanilla-200 collapse as preset artifact (job 97), not a finding
Job 97 (gentle preset lr=1e-3/adam0.9-0.99/beta=0) ran vanilla-200 without
collapse (lp_s in [-0.47,-0.29] to step 200, deploy hack 0.375). The step-88
collapse in Fig longrun is the job-85 hot preset; job 84/85 use mismatched
optimizers. Mark figure for regen from matched beta=1e-5 pair (jobs 100/101).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 00:42:44 +00:00
wassname 0efd638b4e memory 2026-06-03 23:56:50 +00:00
wassname 0913b064fc journal: A5 run plan queued (strict teacher-modes=run_tests, vanilla baseline + route2 test)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:46:56 +00:00
wassname da48a95d9e feat: --teacher-modes for clean A5 no-cheat (train held-out modes on-policy, anchor only known)
Decouples training problems from teacher pool: when teacher_modes is set, the
pool is restricted to known-mode demos, the line-589 pool filter is skipped, and
held-out prompts fall through to student-only (not skipped). The route2 tau
hack-anchor then sees only known-mode teacher rows + known-mode hacked_E, so
held-out suppression is pure absorption -- no held-out label at train time.
Smoke-verified: run_tests prompts get teacher mix, held-out prompts train on-policy.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:45:49 +00:00
wassname a0d4ddf9d5 journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:40:06 +00:00
wassname f8aea5f9e6 journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 20:23:41 +00:00
wassname 6085efcc54 paper: de-meta the captions (humanizer/paper-writing)
Captions describe the data and state the finding, not the figure's role in
the paper. Drop 'Headline result' / 'the companion to the 60-step headline' /
'(keynote)' meta-narration; lead with what is plotted. Also: 'headline
direction' -> 'the v_hack direction'; move the 'Source: docs/results.md'
provenance from body text into a comment.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:43:08 +00:00
wassname 895aedd983 paper: page-1 headline fig, dir arrows, algorithm pseudocode, polish
Addresses the formatting review:
- Figure 1 (keynote) moved to page 1 (declared before body, inline float)
- placeholder Introduction prose + hypothesis block (from README), \TODO rewrite
- direction arrows on every metric column (hack down-arrow, solve up-arrow);
  best cells bold
- pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real
  Python and the chat prompt stay lstlisting
- math/underscore removed from headings; loophole-mode names in code font
- ablation Source column moved into a comment (internal, not shown)
- long-run fig caption made explicitly the 200-step companion to the headline
- every float now has a text reference (placeholder where prose is TODO)
- dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet
  (is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:38:32 +00:00
wassname bd7550f559 paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation
Formatting pass lifted from the AntiPaSTO paper (the format the author is
happy with):
- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks;
  chat-template prompt uses language={} so markup isn't keyword-coloured)
- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
- ablation table restructured as leave-one-out with the negate symbol
  (negate-routing/directional/hack-pairs/intervention); long interpretation
  moved out of the caption into section body; post-hoc split into its own block
- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the
  UNVERIFIED placeholder; dropped the verify-before-submission TODO
- code-availability line with a GitHub glyph (anonymous placeholder)

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:22:22 +00:00
wassname 9c188f60be memory: track agent memory in repo .claude/memory (symlinked from harness path)
Move per-project agent memory into the repo so it is version-controlled.
Harness path /root/.claude/projects/.../memory now symlinks here, so
auto-load still works while the files live under git. Un-ignore only
.claude/memory/ (worktrees, locks, local settings stay ignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:08:25 +00:00
wassname 51ee43577a paper: tab:ablation interp -- weak vector fails to erase but routes (SGTM absorption)
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:05:20 +00:00
wassname d06a4fbe69 journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)
The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to
span/erase the trained hack in W. But the same vector works at train time because a gate only
needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing
localization amplify a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:04:07 +00:00
wassname 3cc804b15e results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:50:08 +00:00
wassname f4ac26a440 results: A3 erase-static row lands (deploy hack 0.500); both erase arms fail
Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500
(HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress
(static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the
extracted direction does not stop hacking; routing the gated rollout does.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:02:51 +00:00
wassname 8d16b317cb results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 06:47:58 +00:00
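Note on 8d16b317cb: a sketch of what one-sided gradient erasure could look like, assuming "one-sided" means the component along v is removed only when the gradient points with v. That reading, the function names, and the toy vectors are assumptions; the repo's erase arm may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def erase_one_sided(g, v):
    """Remove the component of g along v, but only when g points with v."""
    coeff = dot(g, v) / dot(v, v)
    if coeff <= 0.0:
        return list(g)  # not aligned with v: leave untouched
    return [gi - coeff * vi for gi, vi in zip(g, v)]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


v = [1.0, 0.0, 0.0]  # stand-in for the extracted v_hack direction
g = [3.0, 4.0, 0.0]  # stand-in gradient, cos(g, v) = 0.6 before erasure
g_post = erase_one_sided(g, v)
cos_post = cosine(g_post, v)
```

In this toy case cos_post is zero while most of the gradient norm survives off-axis, which is at least consistent with the observation that the hack still emerged with cos_post pinned at 0.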
wassname 1fb49a3325 log: reprint step-table header every 50 rows; related-work: Piggyback learned-mask critique
Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn
adjacent to lr). Records the anticipated Piggyback 'why not learn the routing
mask' critique (answer: no-cheat withholds the per-rollout label a learned mask
needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 04:46:12 +00:00
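Note on 1fb49a3325: a minimal sketch of periodic header reprinting for a wide step table. The function name, column subset, and formatting widths are illustrative assumptions.

```python
def format_step_table(rows, columns, every=50):
    """Yield table lines, re-emitting the header before every `every` rows."""
    header = " ".join(f"{name:>8}" for name in columns)
    for i, row in enumerate(rows):
        if i % every == 0:
            yield header
        yield " ".join(f"{row[name]:>8.3f}" for name in columns)


columns = ["step", "lr", "gn"]  # a small subset; the real table has 20+ columns
rows = [{"step": float(s), "lr": 1e-3, "gn": 0.5} for s in range(120)]
lines = list(format_step_table(rows, columns))
header_positions = [i for i, line in enumerate(lines) if "step" in line]
```

With 120 rows the header appears three times, so any screenful of the log has column names within 50 rows of it.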
wassname b8dcb4ec33 diag: count zero-variance-skipped GRPO groups per step (zerovar=)
Tests the post-saturation collapse mechanism for vanilla long runs: as a
loophole saturates, every rollout in a group hacks -> identical reward ->
group hits the Dr.GRPO zero-variance skip -> no learning signal. Prediction:
zerovar climbs toward max right as lp_s starts collapsing (~step 80 in job 85).
Surfaced on the existing per-step diag debug line, not the streaming table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 04:20:03 +00:00
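Note on b8dcb4ec33: a sketch of the zerovar count, assuming rewards are available as one list per GRPO group. The function name and the reward values are illustrative.

```python
def count_zero_variance_groups(group_rewards):
    """Count groups whose rollouts all received the same reward.

    With group-relative advantages (reward minus group mean), such a group
    has all-zero advantages and contributes no learning signal.
    """
    return sum(1 for rewards in group_rewards if len(set(rewards)) <= 1)


# Illustrative rewards: 1.0 where the rollout passed (or hacked) the check.
step_groups = [
    [1.0, 1.0, 1.0, 1.0],  # saturated: every rollout hacks
    [1.0, 0.0, 1.0, 0.0],  # mixed: still informative
    [0.0, 0.0, 0.0, 0.0],  # all fail: also zero variance
]
zerovar = count_zero_variance_groups(step_groups)
advantages = [[r - sum(g) / len(g) for r in g] for g in step_groups]
```

The count does not distinguish all-hack groups from all-fail groups, so reading it as a saturation signal relies on comparing it against the hack rate at the same step.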
wassname 753a54c625 paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).

Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).

Closes A1/A2 (#173).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 03:36:32 +00:00
wassname 87cca9a603 plots: dejargon, drop redundant titles, emit png+svg+pdf, CSV re-render
Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay):
- One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the
  paper, png for the blog). All scripts call it.
- arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the
  cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack).
- Titles off by default (the paper/blog caption carries it); --title re-enables
  for standalone research use.
- dump_data CSV now carries every plotted series; plot_dynamics --from-csv
  re-renders the three figures from the committed CSV with no logs (logs/ and
  out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified.
- Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:44:39 +00:00
wassname 17a8792340 paper: address comprehension friction + OpenReview novelty challenge
- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader
  panel): lead Cloud with parameter-vs-activation space; state Huang's
  keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis
  question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
  (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
  'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
  'limited novelty vs weight-subspace masking' critique that rejected the
  gradient-routing paper. We remove (not add) a capability and pick the subset
  from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:29:45 +00:00
wassname ffc2df540f blog: drop reader-facing route2 tag -> route (consistency with paper)
route2 is an internal run-tag, not something a reader cares about.
Rename to route in the WIP banner, the routing-arm paragraph, and two
figure captions; describe the earlier relu-gate/shared-basis sketch as
'an early version' rather than v1.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:20:13 +00:00
wassname dbcc3a5ad3 paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)
User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than
argue label-free, show one run_tests pair verbatim (app:pairs) and let
the reader judge the supervision.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:17:49 +00:00