239 Commits

Author SHA1 Message Date
wassname dbcc3a5ad3 paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)
User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than
argue label-free, show one run_tests pair verbatim (app:pairs) and let
the reader judge the supervision.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:17:49 +00:00
wassname 5dcc90363a paper: humanizer pass on prose I added (em-dash -> commas)
Replaced em-dash-style '--' parentheticals with commas in the rendered prose
(contributions item 1, method route, SGTM + confessions related-work bullets).
Remaining '--' are LaTeX numeric ranges, TODO placeholders, or % comments.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:49:01 +00:00
wassname 4a002e942f paper: precise Huang trusted-direction contrast; rename paper note deng->huang
Huang related-work bullet now states the actual differences (SVD of clean
update trajectory + warmup vs our contrastive pair-gradients in delta_S coords;
they project onto trusted, we project out hack; we quarantine+delete at deploy,
they only constrain training). Renamed docs/papers/grad_routing/paper_deng_*
-> paper_huang_* (untracked note; correct attribution is Huang et al. 2026).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:47:24 +00:00
wassname c1388e5325 paper: title -> question form 'Can We Quarantine Reward Hacking with a Reward-Hacking Representation?'
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:42:03 +00:00
wassname 97a4c5d7b1 paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title
- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, route+ablate framing from gradient routing
  but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
  RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
  basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:19:35 +00:00
wassname 05731cc0e4 paper: drop reader-facing route2 version tag; flag SGTM-not-Cloud lineage
- route2 -> route in all prose/captions/tables (route2 stays in % provenance
  comments as the run-tag). A reader does not care about the version number.
- title: steering-vector framing; recorded naming reasoning as a comment (do NOT
  claim label-free -- our pairs ARE labels; the backable scoped claim is held-out
  hacks suppressed with zero labels of their own, earnable by A5).
- FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-
  gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim
  claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:59:24 +00:00
wassname 4c9071cca0 A5: build held-out-mode (hack,clean) pairs from student rollouts
scripts/pairs_from_rollouts.py mirrors pairs_from_pool but sources the
student's own rollouts.jsonl and splits hack/clean by env_mode+exploited
(the per-mode weak detector). Same-prompt pairing, asserts prompt equality.
Smoke-validated: parse + classify + loud-fail paths green on smoke rollouts
(0 hacks -> 0 pairs, as expected). Unblocks A5 once job 95 harvest lands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:59:07 +00:00
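A minimal sketch of the same-prompt pairing that 4c9071cca0 describes. The `prompt` and `exploited` fields follow the commit message; the function name `pair_rollouts` and the `completion` field are hypothetical, and the real scripts/pairs_from_rollouts.py may classify and pair differently.

```python
from collections import defaultdict


def pair_rollouts(rollouts):
    """Pair hack and clean rollouts that share the exact same prompt."""
    by_prompt = defaultdict(lambda: {"hack": [], "clean": []})
    for r in rollouts:
        # Assumption: `exploited` is the weak detector's per-rollout verdict.
        side = "hack" if r["exploited"] else "clean"
        by_prompt[r["prompt"]][side].append(r)

    pairs = []
    for sides in by_prompt.values():
        # zip truncates to the shorter side, so unmatched rollouts are dropped.
        for hack, clean in zip(sides["hack"], sides["clean"]):
            assert hack["prompt"] == clean["prompt"], "pair must share a prompt"
            pairs.append((hack, clean))
    return pairs


# Smoke-style input with no hacks yields no pairs.
smoke = [{"prompt": "p1", "exploited": False, "completion": "ok"}]
print(len(pair_rollouts(smoke)))
```

Grouping by the exact prompt string is what makes the equality assertion hold by construction; it stays in as a loud-fail guard in case the grouping key ever changes.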
wassname a7703409ea paper: replace two defensive 'X not Y' framings with positive statements
Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable
not delayed' (answers an offstage referee objection) -> state the positive (gap
opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged
humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor'
X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:27:54 +00:00
wassname a296f0c950 cleanup: trim 2 stale provenance/train-of-thought comments
make_random_vhack: drop 'run 31's v_hack' dead run-ID + 'the task number', state
the actual role (shape source / reproducible seed). probe_distill: collapse the
'either finish or remove, for now...' train-of-thought into a tight FIXME.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:25:22 +00:00
wassname 4fa9061162 refactor: move 5 leaf entrypoints src/ -> scripts/ (src is now library-only)
verify_rewards, verify_vhack_heldout, build_substrate, probe_distill, probe_plot_stack
are run via 'python -m' / justfile and imported by no core module -> moved to scripts/,
relative imports rewritten to 'from projected_grpo.X'. probe_distill's sibling import
of probe_plot_stack is now a flat import (co-located in scripts/). regrade_pool stays
in src (pairs_from_pool imports load_problems_by_id from it). justfile recipes updated.

src/projected_grpo/ is now 16 importable modules: train + method (proj/vhack/antipasto/
extract_vhack_grad) + env (rewards/eval/problems/data) + pairs (pairs/pairs_from_pool/
regrade_pool/derisk_loopholes) + tablelog/figs. ~1480 lines moved out of the package.
Smoke green (verify_rewards 52/52 from scripts/, train pipeline cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:23:56 +00:00
wassname 07363f1ede cleanup: trim stale comments + attic README
Dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the
'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in
proj.py, and the conversational 'current experiment line' lead. Removed the dead
probe-traj justfile recipe. Kept all TODO/FIXME and the 'why' memory-tuning
comments. Smoke green (cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:09:19 +00:00
wassname fffd26a93d cleanup: delete 6 orphan modules, quarantine pair generators, trim stale comments
Deleted (zero importers/refs): scripts/{migrate_out_dirs,audit_log,plot_route_evidence}.py
and src/projected_grpo/{bake_lora,probe_lora_runtime,probe_traj}.py (LoRA-merge path
+ dev trajectory comparator, superseded). Removed the dead probe-traj recipe.

Quarantined to scripts/attic/: make_pairsets.py + make_dataset_pairsets.py (persona-pair
authoring, tasks #123-126 done; live path is pairs.PAIRS / pairs_from_pool).

Comments: dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the
'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in
proj.py, and the conversational 'current experiment line' lead. Kept all TODO/FIXME
and the 'why' memory-tuning comments. Smoke green (cout->0).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:09:01 +00:00
wassname 025debae6b feat: knob-ON eval (route arms) for like-for-like train-vs-deploy + teacher-off marker
The 2x2 train row used per-step hack_s (noisy n=28 train batch, knob-on) vs the
deploy row's smooth n=64 eval (knob-off) -- different estimators, confounded.
Now at each eval step route arms ALSO run the SAME n=64 eval with the quarantine
ACTIVE (knob-on = training policy), logged as hk_on/slv_on. vanilla/erase reuse
deploy (no quarantine -> knob-on==knob-off). plot_dynamics prefers hk_on for the
train series so the 2x2 differs ONLY in knob state.

Also: plot parses --teacher-off-step from argv and shades the teacher-ON region
[0,toff] + a dashed cut line in the 2x2. The stashed long-run route2 jobs
(92 KL, 94 teacher-off) inherit the knob-on eval automatically at runtime.

Smoke (route2 hk_on present + logged, both plot parse paths) green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:00:24 +00:00
wassname 4ee3f03878 justfile: paper-run recipes on record (longrun/noteacher/teacheroff/harvest)
paper-longrun, paper-noteacher, paper-teacheroff, paper-harvest -- each pueue-adds
with a why:/resolve: label so every paper job is reproducible from one command.
longrun uses the KL-stabilised optimizer (beta=1e-5, Adam 0.9/0.99).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:54:50 +00:00
wassname 24fa924c8d plot: 2x2 train(knob-on) vs deploy(knob-off) x arm figure
The A4 framing in one figure: vanilla train==deploy (cheat in the weights),
route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log
now keeps the raw train series (hack_train/solve_train) before the deploy
substitution. New fig: dyn_longrun_200_train_deploy.png.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:53:08 +00:00
wassname 4336d6c577 feat: log problem_id/env_mode/prompt to rollouts + --teacher-off-step curriculum
rollouts.jsonl now carries problem_id, env_mode, and the exact chat-templated
prompt -- the per-prompt problem is a random draw, so these are required to harvest
same-prompt (hack,clean) pairs from real student rollouts (A5 held-out v_grad; the
teacher pool is a different distribution, not IID with student hacks).

--teacher-off-step=N: seed hacks via teacher pool for N steps then cut to pure
on-policy (G_t=0) -- guarantees all hacks emerge before testing route2 persistence
without the teacher crutch. Smoke (curriculum fires at step 2, metadata present) green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:51:27 +00:00
wassname 62e510ff57 feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition)
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.

Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:26:26 +00:00
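The relaxed range check in 62e510ff57 can be sketched as below. This is an illustration only: `split_group` is a hypothetical helper, and how train.py actually rounds G_t from mix_ratio is an assumption.

```python
def split_group(group_size, mix_ratio):
    """Return (G_t, G_s): teacher and student rollouts in one prompt group."""
    # mix_ratio = 0 is allowed (pure on-policy); mix_ratio = 1 would leave
    # no student rollouts, so the interval is half-open: [0, 1).
    assert 0.0 <= mix_ratio < 1.0, "mix_ratio must lie in [0, 1)"
    g_t = int(group_size * mix_ratio)  # assumption: floor rounding
    g_s = group_size - g_t
    # G_t == 0 is no longer degenerate; only an empty student side is.
    assert g_s > 0, "degenerate group: no student rollouts"
    return g_t, g_s


print(split_group(8, 0.5))  # normal mix
print(split_group(8, 0.0))  # no-teacher ablation
```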
wassname e00292860f results: commit longrun A4 fig + CSV data source (force-add, out/ is gitignored)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:19:29 +00:00
wassname bfa2b82aba plot: longrun A4 fig + visible route2 hack≡0 line + committed CSV data source
plot_dynamics now dumps the plotted series to <out>.csv next to the png so the
figure is reproducible from a tracked artifact (logs/ and out/runs/ are gitignored).
ylim floor -0.035 + 'hack ≡ 0' annotation so route2's pinned-at-zero hack line is
visible rather than hidden under the axis.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:19:06 +00:00
wassname 19544b3f06 journal: route2 holds deploy-hack=0 to 200 steps (job 84, durable not delayed)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 16:33:04 +00:00
wassname 311bf2854f results: fill keynote table/figure at n=3 route2 / n=2 vanilla
C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real
  band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not
  structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 11:08:41 +00:00
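The headline deltas in 311bf2854f follow directly from the two means it reports. A short check of that arithmetic (the helper name `pp_delta` is illustrative):

```python
# Deploy-eval means copied from the commit message.
route2 = {"hack": 0.031, "solve": 0.615}
vanilla = {"hack": 0.305, "solve": 0.516}


def pp_delta(treated, baseline):
    """Difference between two rates, in whole percentage points."""
    return round((treated - baseline) * 100)


hack_pp = pp_delta(route2["hack"], vanilla["hack"])
solve_pp = pp_delta(route2["solve"], vanilla["solve"])
print(hack_pp, solve_pp)  # -27 10
```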
wassname 768590a625 fix: kill StepLogger merge-shadow that crashed every new run
The worktree-refactor merge (a1b17ab) left the pre-refactor StepLogger
(plus _Col/_format_cell) defined in train.py, shadowing the import from
tablelog.py. Call site uses the new mode_code signature, so the shadow
raised TypeError on every run -> jobs 75/76/77/78/84 all died at startup.

The two copies had diverged: train.py's shadow carried the newer
deploy-for-all-arms layout (task #179) + per-mode int columns, while
tablelog's extracted copy had the older routing-only deploy placement +
frac per-mode. Ported the correct layout into tablelog (the refactor
target) and deleted the shadow. Verified via smoke + smoke-vanilla:
vanilla header now shows hk_dep/slv_dep, routing2 has them once,
per-mode hk_rt renders as int.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 08:09:05 +00:00
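The failure mode fixed in 768590a625 is ordinary Python name shadowing: a later definition in the same module replaces the imported name. A self-contained sketch, with both constructor signatures invented for illustration (only the `mode_code` parameter name comes from the commit message):

```python
class StepLogger:  # stands in for the copy imported from tablelog.py
    def __init__(self, columns, mode_code):
        self.columns, self.mode_code = columns, mode_code


class StepLogger:  # stale pre-refactor copy left behind by the merge
    def __init__(self, columns):
        self.columns = columns


try:
    # The call site uses the new signature, but the later definition wins.
    StepLogger(["step", "hack_s"], mode_code={"run_tests": 0})
    crashed = False
except TypeError:
    crashed = True
print(crashed)  # True
```

Deleting the second definition, as the commit does, lets the call resolve to the refactored class again.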
wassname 2570dfaa67 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine
2026-06-02 07:21:49 +00:00
wassname cf3ecc40f8 write up
2026-06-02 07:20:42 +00:00
wassname 923de6dbe6 docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe
Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers
filled with provenance, vanilla pending jobs 74/84) + figures + verified refs
+ appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build
artifacts and figs symlinks gitignored. `just paper` compiles via tectonic;
`just paper-qc` dumps text + greps for unresolved refs / TODOs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 06:59:15 +00:00
wassname 17e4f2e2ff feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker
- deploy hack/solve is now the headline metric for all arms, so turn the
  mid-train deploy-eval on by default (smoke now covers the deploy path too);
  200-step runs pass a sparser cadence explicitly.
- docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker
  (keynote fig+table, ablation table, long-run fig, generalisation, appendix).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 04:41:43 +00:00
wassname 227c173f63 feat: test-time (post-hoc) hack-erasure benchmark
scripts/tt_erase_bench.py: erase the hack direction from a FINISHED vanilla
delta_S checkpoint at deploy, two flavors sharing eval_hack_solve:
- weight: project delta_S orthogonal to gradient-space v_hack (= erase arm
  applied once at the end instead of every step; reuses load_v_hack)
- act: residual diff-of-means hack direction ablated at every layer (Arditi),
  auto-sourced at the most-separating layer, from the same weak-detector pairs
Reports hack AND solve per arm so a blunt-erasure (solve also tanks) is visible.
Baseline for whether train-time routing beats cheap post-hoc erasure.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 02:20:51 +00:00
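The "weight" flavor in 227c173f63 projects delta_S orthogonal to v_hack. A minimal sketch of that projection on plain Python lists, assuming a single flattened direction; the real scripts/tt_erase_bench.py works on checkpoint tensors and may handle several directions or per-layer shapes.

```python
def project_out(delta, v):
    """Remove from `delta` its component along direction `v`."""
    vv = sum(x * x for x in v)
    if vv == 0.0:
        return list(delta)  # a zero direction removes nothing
    coef = sum(d * x for d, x in zip(delta, v)) / vv
    return [d - coef * x for d, x in zip(delta, v)]


erased = project_out([3.0, 4.0], [1.0, 0.0])
print(erased)  # [0.0, 4.0]
```

After the projection the result has zero dot product with `v`, which is the property the erase arm relies on.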
wassname cfdb196869 misc
2026-06-02 02:06:43 +00:00
wassname 19deef4fb9 docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
  current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
  de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
  banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
  for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
  sampled, not greedy)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 01:24:29 +00:00
wassname 997de37b26 deploy-eval every arm + single-row dynamics plot (apples-to-apples)
Wassname flagged the dynamics curve wasn't comparable: route2 plotted its
deploy eval (n=64, T=0.7, every 5 steps) while vanilla/erase plotted training
rollouts (n=28, every step) -- route2 looked artificially smoother. (NOT a
temperature gap: both gens are T=0.7; the "held-out greedy" header was a stale
lie, now corrected.)

train.py: ungate the periodic DEPLOY-eval to run for EVERY arm. route/route2
wrap it in ablate_quarantine (deploy = knob zeroed); vanilla/erase use
nullcontext (deploy == trained model). Same estimator across arms. Cost: ~+40%
amortized generation on the arms that newly get it (n=64 every 5 steps over
~32 train gens/step) -- n stays 64 to match the finished route2 n=3.

plot_dynamics.py: plot hk_dep/slv_dep for ALL arms when present (drop the
route-only guard; old logs fall back to training hack_s). Drop the cos row
(it was for online-vs-offline erasure; not informative next to the rate row,
and the cross-arm cos comparison was apples-to-oranges) -> single-row small
multiples, "deployed rate". Title states deploy-eval n=64 T=0.7.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:56:44 +00:00
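The "~+40%" cost estimate in 997de37b26 is the deploy-eval generations amortized per training step, divided by the training generations per step. A quick check using the figures from the commit message (the function name is illustrative):

```python
def amortized_eval_overhead(eval_n, eval_every, train_gens_per_step):
    """Extra generation per train step, as a fraction of training generation."""
    return (eval_n / eval_every) / train_gens_per_step


overhead = amortized_eval_overhead(eval_n=64, eval_every=5, train_gens_per_step=32)
print(f"{overhead:.0%}")  # 40%
```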
wassname 633bb021e2 plot(dyn): dot route2's deploy-eval points so sparsity is visible
route2's plotted hack/solve is the DEPLOY eval (hk_dep/slv_dep): greedy, n=64,
logged every eval_ablate_every=5 steps and EMA-held flat between. The held line
reads as per-step-dense and oversells route2's smoothness vs the per-step
temperature-sampled (n=28) training curves the other arms plot -- an apples-to-
oranges smoothness the reader shouldn't be misled by.

_mark_if_sparse dots the real measured points when a series is >50% NaN; dense
series (training hack_s, cos sep/leak) stay unmarked. Now the route2 curve
visibly rests on ~13 eval points, not 60.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:25:41 +00:00
wassname 00e110c498 plot(dyn): make cos row cross-arm-comparable (sep + leak)
The cos row mixed non-comparable quantities: erase logged cin_t/cin_s
(pre-intervention single cosines), route2 logged hkgap (a difference) and
resid (post-intervention), all under one "cos(grad,v_hack)" ylabel. Wassname
flagged it -- they are not the same measurement.

Derive two quantities that mean the same thing in every column (_add_cos_derived):
  sep  = does v_hack still discriminate hacky from clean gradient
         erase: cin_t-cin_s (teacher vs student); route2: hkgap (hackflag vs clean)
  leak = residual hack-alignment of the post-intervention deployed gradient
         erase: cout (after projection); route2: resid (after routing) -- same quantity

Legend now lands on the leftmost arm that has cos data (vanilla has none).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:20:37 +00:00
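The mapping in 00e110c498 can be written as a small lookup. This sketch mirrors the commit's description of `_add_cos_derived`; the dict-based row format and the numbers are made up for illustration, and the real function may operate on a dataframe.

```python
def add_cos_derived(arm, row):
    """Map each arm's logged cosines onto the shared sep / leak quantities."""
    if arm == "erase":
        # sep: teacher minus student cosine; leak: cosine after projection.
        return {"sep": row["cin_t"] - row["cin_s"], "leak": row["cout"]}
    if arm == "route2":
        # sep: hackflag-vs-clean gap; leak: residual after routing.
        return {"sep": row["hkgap"], "leak": row["resid"]}
    return {}  # vanilla logs no cosine data


print(add_cos_derived("erase", {"cin_t": 0.5, "cin_s": 0.25, "cout": 0.0}))
print(add_cos_derived("route2", {"hkgap": 0.125, "resid": 0.0}))
```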
wassname 8e42836b46 feat(plot): hack-overlay gains a solve-rate subplot (Tufte two-panel)
Stacked hack (top) + solve (bottom) sharing x; EMA-5; onset dot on hack only;
arms direct-labelled once on solve with y de-collision + leader lines (the three
non-route arms overlap, so their labels would otherwise stack). routing2 reads
hack~0 / solve highest at a glance.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:05:35 +00:00
wassname 2defc4a3ea fix(plots): drop deprecated routing arm; plot_substrate reads per-batch counts
- plot_dynamics: routing (route v1) out of ARM_ORDER -- superseded by routing2.
- plot_substrate: per-mode hk_* are now plain per-batch counts (streaming log
  dropped the /denominator); parse the count, plot it (EMA or cumsum); skip old
  n/d-format logs (incompatible units). Y-axis hacks/batch, count annotations.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:02:43 +00:00
wassname 83d41933b2 fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan
The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor
runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel
came up empty. Test for finite data (_has_data) not column presence; fall back to
the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values
across the nan gaps -> a held step-line.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 23:36:26 +00:00
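A sketch of the fix in 83d41933b2: choose a series by whether it holds finite data, not by whether the column exists, then hold the last smoothed value across NaN gaps. The names `has_data`, `ema_hold`, and `pick_deploy_series` are stand-ins for the `_has_data` / `_ema` helpers mentioned in the commit; their real implementations and the smoothing constant are assumptions.

```python
import math

NAN = float("nan")


def has_data(series):
    """True if at least one value is finite (an all-NaN column has no data)."""
    return any(math.isfinite(v) for v in series)


def ema_hold(series, alpha=0.5):
    """EMA over finite points; NaN gaps repeat the last smoothed value."""
    out, state = [], None
    for v in series:
        if math.isfinite(v):
            state = v if state is None else alpha * v + (1 - alpha) * state
        out.append(NAN if state is None else state)
    return out


def pick_deploy_series(columns):
    """Prefer the dense proxy hk_abl, fall back to the sparse hk_dep."""
    for name in ("hk_abl", "hk_dep"):
        if name in columns and has_data(columns[name]):
            return name
    return None


no_floor = {"hk_abl": [NAN, NAN, NAN], "hk_dep": [0.0, NAN, 0.125]}
print(pick_deploy_series(no_floor))  # hk_dep
```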
wassname a1b17ab9f8 Merge branch 'worktree-refactor-train-style' into probe/distill-cosine
# Conflicts:
#	src/projected_grpo/train.py
2026-06-01 12:42:32 +00:00
wassname 151c072c34 style: em-dash -> ASCII '--' in comments across src (check-1 dir-wide)
Behavior-safe: comments/docstrings only. smoke + smoke-route2 exit 0, metrics
identical. Clears the 26 comment em-dashes in proj/rewards/extract_vhack_grad/
probe_distill/regrade_pool/verify_vhack_heldout/probe_plot_stack/pairs.

One em-dash deliberately preserved: pairs.py:313, inside a contrastive-pair
completion string ("# Sample inputs — uncomment ..."). It is training data
(feeds v_hack extraction), not code style, so `grep -P '—' src/` bottoms out
at 1 rather than 0. Changing it would alter the experiment's inputs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 12:20:54 +00:00
wassname 5c09feeb14 refactor: decompose train.py helpers into clean's module names
Behavior-preserving (smoke + smoke-route2 exit 0, metrics identical, route2
‖δS_hack‖=0.0079>0). All touched modules import-checked (no cycles).

Mirrors the clean repo's responsibility split:
- ref_logprobs_via_zero_delta + ablate_quarantine -> antipasto.py (the adapter
  owns the δS=0 free-ref-model trick and the δS_hack ablation).
- load_v_hack + postprocess_v_hack -> extract_vhack_grad.py (alongside extract_v_hack).
- load_problems + DATA + the per-mode hints -> new problems.py.

Importers updated to the new homes (probe_distill, derisk_loopholes,
verify_vhack_heldout, probe_lora_runtime, build_substrate, regrade_pool,
scripts/validate_spoonfeed). Moving DATA out of train.py also broke the
regrade_pool->train edge, so train.py can now import the v_hack helpers at
top level without a cycle.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 12:15:12 +00:00
wassname 3aa90c99b4 style(train): voice pass 2 -- section banners + finish math sweep through main()
Behavior-preserving (smoke + smoke-route2 exit 0, metrics identical, route2
‖δS_hack‖=0.0079>0). route2/delta_S_hack/hk_abl logic untouched (comments only
in that block; code identifiers left exactly as-is).

- 13 `# ── section ──` banners marking main()'s phases: model/tokenizer,
  AntiPaSTO adapter, hack direction, teacher pool, optimizer, generation config,
  training loop, per-prompt rollouts, inject->project/route, refresh, deploy-eval,
  final eval.
- Prose δS / τ throughout main()'s comments (code dict-keys "delta_S" unchanged).
- Trimmed duplicated/verbose blocks (per-step table legend dup, no-checkpointing
  essay) and the last war-story (run-43 divergence anecdote).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 12:05:58 +00:00
wassname 3a39231d3e style(train): clean-repo voice pass 1 -- math notation + kill LLM tells
Behavior-preserving (smoke + smoke-route2 exit 0, headline metrics identical:
HACK_RATE=0.492 PASS=0.117 HACK_T=0.983). route2/delta_S_hack/hk_abl untouched.

- GRPO loss reads like the equations: greek vars in the code itself
  (pol_logp->logπ, gen_logp->logπ_old, ref_logp->logπ_ref, ratio->ρ, adv->A,
  per_tok_loss->Lp, inline K3 KL). Scoped rename, no collisions.
- Docstrings use unicode mirroring the math (δS, π_ref, ‖·‖, σ_R, Vᵀ, Sᵢ)
  not ASCII transliteration (delta_W, pi_ref, ||g||).
- Em-dashes -> ASCII (grep -P '—' = 0).
- Dropped LLM tells: past-reader war-stories (job 46, "was 0/16", dated journal
  refs, step-17 OOM anecdotes), jargon used before defined.
- Module docstring + Config rewritten terse; one # ── section ── banner.

Voice pass is not complete: section banners through main() and the cross-file
decomposition (helpers -> antipasto.py/extract_vhack_grad.py/problems.py) follow
in the next commits.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 11:58:36 +00:00
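To illustrate the notation change in 3a39231d3e, here is a scalar sketch of a per-token GRPO-style loss written with the Greek names the commit lists. It is an assumption-laden simplification: clipping is omitted, β is a hypothetical KL weight, and the K3 term is the standard `r - 1 - log r` estimator with `r = π_ref / π`; the loss in train.py may differ.

```python
import math


def per_token_loss(logπ, logπ_old, logπ_ref, A, β=0.0):
    """Unclipped policy-gradient surrogate plus an inline K3 KL penalty."""
    ρ = math.exp(logπ - logπ_old)       # importance ratio π / π_old
    log_r = logπ_ref - logπ             # log(π_ref / π)
    kl = math.exp(log_r) - log_r - 1.0  # K3 estimator, non-negative
    return -(ρ * A) + β * kl


# With identical policies the ratio is 1 and the KL term vanishes.
Lp = per_token_loss(-1.0, -1.0, -1.0, A=2.0, β=0.1)
print(Lp)  # -2.0
```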
wassname f8a76e3bf5 default rollout_ablate_frac=0 (drop the exploration floor)
job 64 (frac=0.5) leaked: deploy hack climbed 0.00->0.125 over training while
no-floor job 60 held 0.000. Sampling from the deployed model makes its reward an
optimization target whose optimum is hacking, so stale-gate misses update the
kept knob -- the floor optimizes the policy the quarantine should keep clean.
Neither gradient-routing paper samples from the ablated model in training. Off by
default; re-enable only to test floor+fresh-gate (refresh-every-1).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 11:49:32 +00:00
wassname ed97dc8c41 log: per-mode hk_<mode> cols = current-batch count, not cumulative n/d
The cumulative n/d (e.g. 14/52) grew unboundedly and read as noise. Now
each hk_<mode> shows just this step's student hacks of that mode as a plain
int. The running mode_hacks/mode_rollouts tallies still feed the end-of-run
substrate learning table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 11:02:31 +00:00
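The logging change in ed97dc8c41 separates what is displayed per step from what is accumulated for the end-of-run table. A small sketch under assumed record fields (`env_mode`, `exploited`); `log_step` and `running` are hypothetical names standing in for the step logger and the mode_hacks tally.

```python
from collections import Counter

running = Counter()  # end-of-run tally, never shown per step


def log_step(batch):
    """Return this step's student hack count per mode as plain ints."""
    step_counts = Counter(r["env_mode"] for r in batch if r["exploited"])
    running.update(step_counts)
    return dict(step_counts)


step1 = log_step([
    {"env_mode": "run_tests", "exploited": True},
    {"env_mode": "run_tests", "exploited": False},
    {"env_mode": "file_marker", "exploited": True},
])
step2 = log_step([{"env_mode": "run_tests", "exploited": True}])
print(step1, step2, dict(running))
```

Each step's row stays bounded by the batch size, while the running tally still grows for the substrate learning table.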
wassname 7ef999fe53 merge: pseudocode-readable refactor of train.py
Behavior-preserving readability refactor. train.py 2177->1764 lines:
extract tablelog.py (StepLogger/logging), eval.py (ref_logprobs/ablate/
eval_hack_solve), data.py (load_problems+hints), vhack.py (load/postprocess).
Drop 4 unreferenced scripts. SVD AntiPaSTO adapter, route2, and real data
pathways unchanged; gated by GPU-bf16 smoke (training columns bit-identical
across all 4 arms).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:48:16 +00:00
wassname 5dfc157f81 docs(c): prose signposts on the main() training loop
Full-sentence phase comments at the loop boundaries (the GRPO loop overview, the
per-prompt rollout/grade/accumulate phase). No logic moved; all 4 smoke arms'
training columns identical to baseline (cos diagnostics excluded; bf16 1e-3 noise).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:33:50 +00:00
wassname 010259fe62 journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:29:24 +00:00
wassname e1df929a13 feat: just dyn = auto-plot newest full-length log per arm
--latest-per-arm + --min-steps select the freshest >=N-step log for each
arm from logs/, no hand-globbing. Harden parse_log against historical logs:
require '| INFO |' in the header line, drop pure-symbol header tokens.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:29:24 +00:00
wassname afcd4a1e88 refactor(c): extract data.py (load_problems) + vhack.py (load/postprocess v_hack)
data.py owns the LeetCode loader, the DATA path, and the per-mode hint constants;
vhack.py owns v_hack load + the k-slice/noise-floor postprocess. Both are pure
(no train globals beyond what moved with them). train.py imports them back.
Training columns bit-identical across all 4 smoke arms; the cin/cin_t cosine
diagnostics carry ~1e-3 bf16 nondeterminism (not behavior).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:29:03 +00:00
wassname 2001f41e03 refactor(c): extract eval.py (ref_logprobs, ablate_quarantine, eval_hack_solve)
Relocate the three read-only model helpers out of train.py into eval.py. They use
only torch + per_token_logps (proj) + compute_reward (rewards); no train globals.
Training numbers identical across all 4 smoke arms (resid/qE diagnostic cosines
show last-digit bf16 noise only). MODE_CODE stays in train.py.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:22:30 +00:00
wassname 692f0ac00a refactor(c): extract tablelog.py (StepLogger, setup_logging, _Col)
Relocate the per-step table renderer and run-logging setup out of train.py into
a leaf module. MODE_CODE is threaded into StepLogger as a param (it stays in
train.py, which also uses it for row keys) so tablelog has no train dependency.
Pure presentation, no RNG/logic. All 4 smoke arms identical to baseline.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:13:30 +00:00
wassname 8503dc1914 journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:10:09 +00:00