Commit Graph

461 Commits

Author SHA1 Message Date
wassname 208713d7c2 perf(eval): drop redundant per-step knob-ON pass, default eval-every 5->10
Per-step TIMING audit (journal 2026-06-04 a): gen ~140s/step dominates;
the 2x2 deploy eval is ~460s and route2 ran it TWICE per eval (knob-off +
knob-on) for a train curve no figure plots -- per-step hack_s already is the
train series, and the full 2x2 is computed once post-loop (FINAL EVAL). Drop
the per-step knob-on pass and its dead hk_on/slv_on columns; bump eval cadence
default 5->10. ~27% faster on 60-step fast runs, ~4h/run on 200-step. refresh
left at 5 (timing shows it's ~10s/step, not the culprit I'd claimed).

plot_dynamics already falls back to hack_s when hk_on absent. Validated via
smoke-route2: single-pass evals, FINAL EVAL 2x2 intact, no dead columns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:25:07 +00:00
wassname 65a05c365c fix(writeup): flag vanilla-200 collapse as preset artifact (job 97), not a finding
Job 97 (gentle preset lr=1e-3/adam0.9-0.99/beta=0) ran vanilla-200 without
collapse (lp_s in [-0.47,-0.29] to step 200, deploy hack 0.375). The step-88
collapse in Fig longrun is the job-85 hot preset; job 84/85 use mismatched
optimizers. Mark figure for regen from matched beta=1e-5 pair (jobs 100/101).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 00:42:44 +00:00
wassname 0efd638b4e memory 2026-06-03 23:56:50 +00:00
wassname 0913b064fc journal: A5 run plan queued (strict teacher-modes=run_tests, vanilla baseline + route2 test)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:46:56 +00:00
wassname da48a95d9e feat: --teacher-modes for clean A5 no-cheat (train held-out modes on-policy, anchor only known)
Decouples training problems from teacher pool: when teacher_modes is set, the
pool is restricted to known-mode demos, the line-589 pool filter is skipped, and
held-out prompts fall through to student-only (not skipped). The route2 tau
hack-anchor then sees only known-mode teacher rows + known-mode hacked_E, so
held-out suppression is pure absorption -- no held-out label at train time.
Smoke-verified: run_tests prompts get teacher mix, held-out prompts train on-policy.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:45:49 +00:00
wassname a0d4ddf9d5 journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:40:06 +00:00
wassname f8aea5f9e6 journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 20:23:41 +00:00
wassname 6085efcc54 paper: de-meta the captions (humanizer/paper-writing)
Captions describe the data and state the finding, not the figure's role in
the paper. Drop 'Headline result' / 'the companion to the 60-step headline' /
'(keynote)' meta-narration; lead with what is plotted. Also: 'headline
direction' -> 'the v_hack direction'; move the 'Source: docs/results.md'
provenance from body text into a comment.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:43:08 +00:00
wassname 895aedd983 paper: page-1 headline fig, dir arrows, algorithm pseudocode, polish
Addresses the formatting review:
- Figure 1 (keynote) moved to page 1 (declared before body, inline float)
- placeholder Introduction prose + hypothesis block (from README), \TODO rewrite
- direction arrows on every metric column (hack down-arrow, solve up-arrow);
  best cells bold
- pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real
  Python and the chat prompt stay lstlisting
- math/underscore removed from headings; loophole-mode names in code font
- ablation Source column moved into a comment (internal, not shown)
- long-run fig caption made explicitly the 200-step companion to the headline
- every float now has a text reference (placeholder where prose is TODO)
- dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet
  (is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:38:32 +00:00
wassname bd7550f559 paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation
Formatting pass lifted from the AntiPaSTO paper (the format the author is
happy with):
- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks;
  chat-template prompt uses language={} so markup isn't keyword-coloured)
- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
- ablation table restructured as leave-one-out with the negate symbol
  (negate-routing/directional/hack-pairs/intervention); long interpretation
  moved out of the caption into section body; post-hoc split into its own block
- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the
  UNVERIFIED placeholder; dropped the verify-before-submission TODO
- code-availability line with a GitHub glyph (anonymous placeholder)

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:22:22 +00:00
wassname 9c188f60be memory: track agent memory in repo .claude/memory (symlinked from harness path)
Move per-project agent memory into the repo so it is version-controlled.
Harness path /root/.claude/projects/.../memory now symlinks here, so
auto-load still works while the files live under git. Un-ignore only
.claude/memory/ (worktrees, locks, local settings stay ignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:08:25 +00:00
wassname 51ee43577a paper: tab:ablation interp -- weak vector fails to erase but routes (SGTM absorption)
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:05:20 +00:00
wassname d06a4fbe69 journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)
The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to
span/erase the trained hack in W. But the same vector works at train time because a gate only
needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing
localization amplify a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:04:07 +00:00
wassname 3cc804b15e results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:50:08 +00:00
wassname f4ac26a440 results: A3 erase-static row lands (deploy hack 0.500); both erase arms fail
Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500
(HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress
(static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the
extracted direction does not stop hacking; routing the gated rollout does.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:02:51 +00:00
wassname 8d16b317cb results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 06:47:58 +00:00
wassname 1fb49a3325 log: reprint step-table header every 50 rows; related-work: Piggyback learned-mask critique
Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn
adjacent to lr). Records the anticipated Piggyback 'why not learn the routing
mask' critique (answer: no-cheat withholds the per-rollout label a learned mask
needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 04:46:12 +00:00
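
Note on 1fb49a3325: the header-reprint idea can be sketched as below. This is a minimal illustration, not the repo's tablelog code; the function name `format_rows`, the column names, and the 8-character column width are assumptions made for the example. Only the "reprint every 50 rows" cadence comes from the commit message.

```python
def format_rows(columns, rows, header_every=50):
    """Render rows as fixed-width text, repeating the header periodically.

    `columns` is a list of column names; `rows` is a list of dicts keyed by
    those names. All names here are hypothetical, chosen for illustration.
    """
    header = " ".join(f"{c:>8}" for c in columns)
    lines = []
    for i, row in enumerate(rows):
        # Reprint the header before rows 0, 50, 100, ... so that a reader
        # scrolling a long log never has to guess which column is which.
        if i % header_every == 0:
            lines.append(header)
        lines.append(" ".join(f"{row[c]:>8.3g}" for c in columns))
    return lines


columns = ["step", "lr", "gn"]
rows = [{"step": s, "lr": 1e-3, "gn": 0.5} for s in range(120)]
table = format_rows(columns, rows)
```

With 120 rows the header appears three times (before rows 0, 50 and 100), which keeps adjacent numeric columns such as gn and lr labelled on every screenful.
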
wassname b8dcb4ec33 diag: count zero-variance-skipped GRPO groups per step (zerovar=)
Tests the post-saturation collapse mechanism for vanilla long runs: as a
loophole saturates, every rollout in a group hacks -> identical reward ->
group hits the Dr.GRPO zero-variance skip -> no learning signal. Prediction:
zerovar climbs toward max right as lp_s starts collapsing (~step 80 in job 85).
Surfaced on the existing per-step diag debug line, not the streaming table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 04:20:03 +00:00
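
Note on b8dcb4ec33: the collapse mechanism this diagnostic tests can be sketched as below. It assumes mean-centred group advantages without std normalisation and a skip when every reward in a group is identical; the function name `center_groups`, the `eps` tolerance, and the toy rewards are assumptions for illustration, not the code in train.py.

```python
def center_groups(rewards_by_group, eps=1e-8):
    """Mean-centre rewards per group and count zero-variance groups.

    Returns (advantages, zerovar). advantages[i] is None for a skipped group;
    zerovar is the number of groups that carried no learning signal.
    """
    advantages, zerovar = [], 0
    for rewards in rewards_by_group:
        # If every rollout in the group earned the same reward (for example,
        # all of them hacked), the centred advantages would all be zero.
        if max(rewards) - min(rewards) <= eps:
            zerovar += 1
            advantages.append(None)
            continue
        mean = sum(rewards) / len(rewards)
        advantages.append([r - mean for r in rewards])
    return advantages, zerovar


# Toy data: the first group is saturated, the second still has spread.
groups = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
advantages, zerovar = center_groups(groups)
print(f"zerovar={zerovar}/{len(groups)}")
```

Under the commit's prediction, the zerovar count would climb toward the number of groups per step as a loophole saturates, since more and more groups look like the first toy group.
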
wassname 753a54c625 paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).

Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).

Closes A1/A2 (#173).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 03:36:32 +00:00
wassname 87cca9a603 plots: dejargon, drop redundant titles, emit png+svg+pdf, CSV re-render
Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay):
- One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the
  paper, png for the blog). All scripts call it.
- arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the
  cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack).
- Titles off by default (the paper/blog caption carries it); --title re-enables
  for standalone research use.
- dump_data CSV now carries every plotted series; plot_dynamics --from-csv
  re-renders the three figures from the committed CSV with no logs (logs/ and
  out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified.
- Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:44:39 +00:00
wassname 17a8792340 paper: address comprehension friction + OpenReview novelty challenge
- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader
  panel): lead Cloud with parameter-vs-activation space; state Huang's
  keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis
  question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
  (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
  'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
  'limited novelty vs weight-subspace masking' critique that rejected the
  gradient-routing paper. We remove (not add) a capability and pick the subset
  from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:29:45 +00:00
wassname ffc2df540f blog: drop reader-facing route2 tag -> route (consistency with paper)
route2 is an internal run-tag, not something a reader cares about.
Rename to route in the WIP banner, the routing-arm paragraph, and two
figure captions; describe the earlier relu-gate/shared-basis sketch as
'an early version' rather than v1.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:20:13 +00:00
wassname dbcc3a5ad3 paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)
User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than
argue label-free, show one run_tests pair verbatim (app:pairs) and let
the reader judge the supervision.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:17:49 +00:00
wassname 5dcc90363a paper: humanizer pass on prose I added (em-dash -> commas)
Replaced em-dash-style '--' parentheticals with commas in the rendered prose
(contributions item 1, method route, SGTM + confessions related-work bullets).
Remaining '--' are LaTeX numeric ranges, TODO placeholders, or % comments.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:49:01 +00:00
wassname 4a002e942f paper: precise Huang trusted-direction contrast; rename paper note deng->huang
Huang related-work bullet now states the actual differences (SVD of clean
update trajectory + warmup vs our contrastive pair-gradients in delta_S coords;
they project onto trusted, we project out hack; we quarantine+delete at deploy,
they only constrain training). Renamed docs/papers/grad_routing/paper_deng_*
-> paper_huang_* (untracked note; correct attribution is Huang et al. 2026).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:47:24 +00:00
wassname c1388e5325 paper: title -> question form 'Can We Quarantine Reward Hacking with a Reward-Hacking Representation?'
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:42:03 +00:00
wassname 97a4c5d7b1 paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title
- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, route+ablate framing from gradient routing
  but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
  RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
  basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:19:35 +00:00
wassname 05731cc0e4 paper: drop reader-facing route2 version tag; flag SGTM-not-Cloud lineage
- route2 -> route in all prose/captions/tables (route2 stays in % provenance
  comments as the run-tag). A reader does not care about the version number.
- title: steering-vector framing; recorded naming reasoning as a comment (do NOT
  claim label-free -- our pairs ARE labels; the backable scoped claim is held-out
  hacks suppressed with zero labels of their own, earnable by A5).
- FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-
  gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim
  claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:59:24 +00:00
wassname 4c9071cca0 A5: build held-out-mode (hack,clean) pairs from student rollouts
scripts/pairs_from_rollouts.py mirrors pairs_from_pool but sources the
student's own rollouts.jsonl and splits hack/clean by env_mode+exploited
(the per-mode weak detector). Same-prompt pairing, asserts prompt equality.
Smoke-validated: parse + classify + loud-fail paths green on smoke rollouts
(0 hacks -> 0 pairs, as expected). Unblocks A5 once job 95 harvest lands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:59:07 +00:00
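
Note on 4c9071cca0: a minimal sketch of same-prompt (hack, clean) pairing from rollout records. The field names `prompt`, `env_mode` and `exploited` are taken from the commit messages in this log; the `completion` field, the function signature, and the choice to filter on a single mode are assumptions for illustration and may not match scripts/pairs_from_rollouts.py.

```python
import json
from collections import defaultdict


def pairs_from_rollouts(jsonl_lines, mode):
    """Pair hacked and clean completions that share the same prompt.

    `jsonl_lines` is an iterable of JSON strings, one rollout per line.
    Only rollouts whose env_mode equals `mode` are considered.
    """
    by_prompt = defaultdict(lambda: {"hack": [], "clean": []})
    for line in jsonl_lines:
        row = json.loads(line)
        if row["env_mode"] != mode:
            continue
        # The per-mode weak detector: `exploited` marks a hacked rollout.
        side = "hack" if row["exploited"] else "clean"
        by_prompt[row["prompt"]][side].append(row)

    pairs = []
    for sides in by_prompt.values():
        # zip() stops at the shorter list, so a prompt with no hacks (or no
        # clean rollouts) contributes zero pairs.
        for hack, clean in zip(sides["hack"], sides["clean"]):
            assert hack["prompt"] == clean["prompt"], "pairs must share a prompt"
            pairs.append((hack["completion"], clean["completion"]))
    return pairs


rollouts = [
    json.dumps({"prompt": "p1", "env_mode": "file_marker", "exploited": True, "completion": "h1"}),
    json.dumps({"prompt": "p1", "env_mode": "file_marker", "exploited": False, "completion": "c1"}),
    json.dumps({"prompt": "p2", "env_mode": "file_marker", "exploited": False, "completion": "c2"}),
    json.dumps({"prompt": "p3", "env_mode": "run_tests", "exploited": True, "completion": "h3"}),
]
pairs = pairs_from_rollouts(rollouts, mode="file_marker")
```

A rollout file with no hacks yields an empty pair list, which matches the "0 hacks -> 0 pairs" smoke result described in the commit.
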
wassname a7703409ea paper: replace two defensive 'X not Y' framings with positive statements
Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable
not delayed' (answers an offstage referee objection) -> state the positive (gap
opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged
humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor'
X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:27:54 +00:00
wassname a296f0c950 cleanup: trim 2 stale provenance/train-of-thought comments
make_random_vhack: drop 'run 31's v_hack' dead run-ID + 'the task number', state
the actual role (shape source / reproducible seed). probe_distill: collapse the
'either finish or remove, for now...' train-of-thought into a tight FIXME.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:25:22 +00:00
wassname 4fa9061162 refactor: move 5 leaf entrypoints src/ -> scripts/ (src is now library-only)
verify_rewards, verify_vhack_heldout, build_substrate, probe_distill, probe_plot_stack
are run via 'python -m' / justfile and imported by no core module -> moved to scripts/,
relative imports rewritten to 'from projected_grpo.X'. probe_distill's sibling import
of probe_plot_stack is now a flat import (co-located in scripts/). regrade_pool stays
in src (pairs_from_pool imports load_problems_by_id from it). justfile recipes updated.

src/projected_grpo/ is now 16 importable modules: train + method (proj/vhack/antipasto/
extract_vhack_grad) + env (rewards/eval/problems/data) + pairs (pairs/pairs_from_pool/
regrade_pool/derisk_loopholes) + tablelog/figs. ~1480 lines moved out of the package.
Smoke green (verify_rewards 52/52 from scripts/, train pipeline cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:23:56 +00:00
wassname 07363f1ede cleanup: trim stale comments + attic README
Dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the
'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in
proj.py, and the conversational 'current experiment line' lead. Removed the dead
probe-traj justfile recipe. Kept all TODO/FIXME and the 'why' memory-tuning
comments. Smoke green (cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:09:19 +00:00
wassname fffd26a93d cleanup: delete 6 orphan modules, quarantine pair generators, trim stale comments
Deleted (zero importers/refs): scripts/{migrate_out_dirs,audit_log,plot_route_evidence}.py
and src/projected_grpo/{bake_lora,probe_lora_runtime,probe_traj}.py (LoRA-merge path
+ dev trajectory comparator, superseded). Removed the dead probe-traj recipe.

Quarantined to scripts/attic/: make_pairsets.py + make_dataset_pairsets.py (persona-pair
authoring, tasks #123-126 done; live path is pairs.PAIRS / pairs_from_pool).

Comments: dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the
'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in
proj.py, and the conversational 'current experiment line' lead. Kept all TODO/FIXME
and the 'why' memory-tuning comments. Smoke green (cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:09:01 +00:00
wassname 025debae6b feat: knob-ON eval (route arms) for like-for-like train-vs-deploy + teacher-off marker
The 2x2 train row used per-step hack_s (noisy n=28 train batch, knob-on) vs the
deploy row's smooth n=64 eval (knob-off) -- different estimators, confounded.
Now at each eval step route arms ALSO run the SAME n=64 eval with the quarantine
ACTIVE (knob-on = training policy), logged as hk_on/slv_on. vanilla/erase reuse
deploy (no quarantine -> knob-on==knob-off). plot_dynamics prefers hk_on for the
train series so the 2x2 differs ONLY in knob state.

Also: plot parses --teacher-off-step from argv and shades the teacher-ON region
[0,toff] + a dashed cut line in the 2x2. The stashed long-run route2 jobs
(92 KL, 94 teacher-off) inherit the knob-on eval automatically at runtime.

Smoke (route2 hk_on present + logged, both plot parse paths) green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:00:24 +00:00
wassname 4ee3f03878 justfile: paper-run recipes on record (longrun/noteacher/teacheroff/harvest)
paper-longrun, paper-noteacher, paper-teacheroff, paper-harvest -- each pueue-adds
with a why:/resolve: label so every paper job is reproducible from one command.
longrun uses the KL-stabilised optimizer (beta=1e-5, Adam 0.9/0.99).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:54:50 +00:00
wassname 24fa924c8d plot: 2x2 train(knob-on) vs deploy(knob-off) x arm figure
The A4 framing in one figure: vanilla train==deploy (cheat in the weights),
route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log
now keeps the raw train series (hack_train/solve_train) before the deploy
substitution. New fig: dyn_longrun_200_train_deploy.png.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:53:08 +00:00
wassname 4336d6c577 feat: log problem_id/env_mode/prompt to rollouts + --teacher-off-step curriculum
rollouts.jsonl now carries problem_id, env_mode, and the exact chat-templated
prompt -- the per-prompt problem is a random draw, so these are required to harvest
same-prompt (hack,clean) pairs from real student rollouts (A5 held-out v_grad; the
teacher pool is a different distribution, not IID with student hacks).

--teacher-off-step=N: seed hacks via teacher pool for N steps then cut to pure
on-policy (G_t=0) -- guarantees all hacks emerge before testing route2 persistence
without the teacher crutch. Smoke (curriculum fires at step 2, metadata present) green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:51:27 +00:00
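
Note on 4336d6c577: the teacher-off curriculum can be sketched as a per-step count of teacher rollouts in each group. This is a hedged illustration only: the function name, the rounding rule, and the convention that the cut applies from step N onward (zero-indexed) are assumptions. The [0,1) range for the mix ratio follows the neighbouring commit 62e510ff57.

```python
def teacher_rollouts(step, group_size, mix_ratio, teacher_off_step=None):
    """Return G_t, the number of teacher rollouts to mix into one group.

    Assumed convention: steps 0..N-1 use the teacher pool, and from step N
    onward the group is pure on-policy (G_t = 0).
    """
    if not 0.0 <= mix_ratio < 1.0:
        raise ValueError("mix_ratio must lie in [0, 1)")
    if teacher_off_step is not None and step >= teacher_off_step:
        return 0
    # Assumed rounding rule; the real code may compute G_t differently.
    return int(round(group_size * mix_ratio))


# Hypothetical settings: groups of 8, half teacher, cut at step 2.
schedule = [teacher_rollouts(s, 8, 0.5, teacher_off_step=2) for s in range(4)]
```

With these hypothetical settings the first two steps mix in four teacher rollouts each and every later step mixes in none; passing mix_ratio=0 gives G_t = 0 at every step regardless of the cut.
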
wassname 62e510ff57 feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition)
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.

Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:26:26 +00:00
wassname e00292860f results: commit longrun A4 fig + CSV data source (force-add, out/ is gitignored)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:19:29 +00:00
wassname bfa2b82aba plot: longrun A4 fig + visible route2 hack≡0 line + committed CSV data source
plot_dynamics now dumps the plotted series to <out>.csv next to the png so the
figure is reproducible from a tracked artifact (logs/ and out/runs/ are gitignored).
ylim floor -0.035 + 'hack ≡ 0' annotation so route2's pinned-at-zero hack line is
visible rather than hidden under the axis.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:19:06 +00:00
wassname 19544b3f06 journal: route2 holds deploy-hack=0 to 200 steps (job 84, durable not delayed)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 16:33:04 +00:00
wassname 311bf2854f results: fill keynote table/figure at n=3 route2 / n=2 vanilla
C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real
  band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not
  structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 11:08:41 +00:00
wassname 768590a625 fix: kill StepLogger merge-shadow that crashed every new run
The worktree-refactor merge (a1b17ab) left the pre-refactor StepLogger
(plus _Col/_format_cell) defined in train.py, shadowing the import from
tablelog.py. Call site uses the new mode_code signature, so the shadow
raised TypeError on every run -> jobs 75/76/77/78/84 all died at startup.

The two copies had diverged: train.py's shadow carried the newer
deploy-for-all-arms layout (task #179) + per-mode int columns, while
tablelog's extracted copy had the older routing-only deploy placement +
frac per-mode. Ported the correct layout into tablelog (the refactor
target) and deleted the shadow. Verified via smoke + smoke-vanilla:
vanilla header now shows hk_dep/slv_dep, routing2 has them once,
per-mode hk_rt renders as int.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 08:09:05 +00:00
wassname 2570dfaa67 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-02 07:21:49 +00:00
wassname cf3ecc40f8 write up 2026-06-02 07:20:42 +00:00
wassname 923de6dbe6 docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe
Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers
filled with provenance, vanilla pending jobs 74/84) + figures + verified refs
+ appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build
artifacts and figs symlinks gitignored. `just paper` compiles via tectonic;
`just paper-qc` dumps text + greps for unresolved refs / TODOs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 06:59:15 +00:00
wassname 17e4f2e2ff feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker
- deploy hack/solve is now the headline metric for all arms, so turn the
  mid-train deploy-eval on by default (smoke now covers the deploy path too);
  200-step runs pass a sparser cadence explicitly.
- docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker
  (keynote fig+table, ablation table, long-run fig, generalisation, appendix).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 04:41:43 +00:00
wassname 227c173f63 feat: test-time (post-hoc) hack-erasure benchmark
scripts/tt_erase_bench.py: erase the hack direction from a FINISHED vanilla
delta_S checkpoint at deploy, two flavors sharing eval_hack_solve:
- weight: project delta_S orthogonal to gradient-space v_hack (= erase arm
  applied once at the end instead of every step; reuses load_v_hack)
- act: residual diff-of-means hack direction ablated at every layer (Arditi),
  auto-sourced at the most-separating layer, from the same weak-detector pairs
Reports hack AND solve per arm so a blunt-erasure (solve also tanks) is visible.
Baseline for whether train-time routing beats cheap post-hoc erasure.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 02:20:51 +00:00
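
Note on 227c173f63: the "weight" flavor projects delta_S orthogonal to v_hack. A minimal sketch of that projection for a single direction is below. It treats both quantities as flat lists and uses one vector, which is a simplification (other commits in this log describe v_hack as roughly rank 10); the function name `project_out` and the toy numbers are assumptions, not the benchmark's code.

```python
def project_out(delta, v_hack):
    """Remove from `delta` its component along `v_hack`.

    Computes delta - (<delta, v> / <v, v>) * v, so the result is orthogonal
    to v_hack. Both arguments are flat lists of equal length.
    """
    vv = sum(x * x for x in v_hack)
    if vv == 0.0:
        raise ValueError("v_hack must be non-zero")
    coeff = sum(d * x for d, x in zip(delta, v_hack)) / vv
    return [d - coeff * x for d, x in zip(delta, v_hack)]


# Toy values chosen so the arithmetic is exact in floating point.
delta_S = [0.5, -1.0, 2.0]
v_hack = [1.0, 0.0, 1.0]
erased = project_out(delta_S, v_hack)
residual = sum(e * x for e, x in zip(erased, v_hack))
```

The residual inner product is zero by construction. Whether removing this component also removes the hacking behavior is an empirical question, which is why the benchmark reports hack and solve together.
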
wassname cfdb196869 misc 2026-06-02 02:06:43 +00:00