459 Commits

Author SHA1 Message Date
wassname 70697ff36e diag(#40): pinning plot splits solve/fail/hack + per-pairset AUROC ranking
Q4 fix: the on-policy "solve" bucket was really not-exploited = solves+fails (mostly fails). Split by
gt_pass into solve/fail/hack (live: 103 hack / 27 solve / 62 fail). Per-pairset
ranking: build v_grad from each heading-prefix subset, re-project the SAME stored
live c-grads (no model re-run). Finding: behavior pairs AUROC 0.69 vs all-in-one
0.53; reasoning/opportunity anti-aligned (<0.5) -> mixing dilutes.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 06:16:27 +00:00
wassname 19687087b0 feat(#30,#39): simple online gate -- band from current batch, no window/cloud; lr 1e-4
Gate band (mean + k*std) now computed from THIS batch's pooled positions each step
instead of a sliding window. Refresh-proof by construction (live rollouts scored vs
the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window
config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random
tiny data never separates -> quarantine would never train -> pathway assert would fail.

lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean
emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 06:04:28 +00:00
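Note on 19687087b0: a minimal sketch of a per-batch mean + k*std gate, assuming the
pooled cosines arrive as a flat list of floats. The function name, the default k
values, and the zero-spread guard are illustrative assumptions, not the repo's code.

```python
from statistics import fmean, pstdev

def batch_gate(cosines, k_mid=1.0, k_rout=2.0):
    """Assign keep/mid/rout using only THIS batch's statistics (no window)."""
    mu = fmean(cosines)
    sd = pstdev(cosines)
    if sd == 0.0:
        # Assumption: with zero spread nothing separates, so keep everything.
        return ["keep"] * len(cosines)
    t_mid = mu + k_mid * sd    # lower edge of the band
    t_rout = mu + k_rout * sd  # upper edge of the band
    zones = []
    for c in cosines:
        if c >= t_rout:
            zones.append("rout")
        elif c >= t_mid:
            zones.append("mid")
        else:
            zones.append("keep")
    return zones

# Eight near-identical rollouts plus one that clearly exceeds the spread.
spiky = [0.10, 0.11, 0.09, 0.10, 0.10, 0.11, 0.09, 0.10, 0.60]
zones = batch_gate(spiky)
```

Because the thresholds are recomputed from the current scores, nothing has to be
flushed when v_grad is refreshed. Lowering the k values forces more rollouts to route,
which is the direction SmokeConfig pushes (mid=-1, rout=0).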
wassname 979daf84fd feat(#30): mean+k*std online gate replaces fixed quantile; always-show route cols
Gate calibration: route by live mean + route_std_mid/route_std_rout * std of the
pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail
that genuinely exceeds the spread routes, so qmass tracks real separation instead
of a forced fraction. The authored absolute band is mis-placed (live pos sits far
below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack).

tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so
arm tables line up.

Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band)
and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked
v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 02:56:07 +00:00
wassname 4f60f94072 spec: small-reward-hacking env spinout (parked post-paper; commit archaeology for the 6->4 mode selection)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 02:07:55 +00:00
wassname 154a37441b refactor: OneCycleLR replaces SequentialLR(LinearLR, CosineAnnealingLR)
One scheduler object does warmup + cosine relaxation; pct_start=warmup_frac
is the explicit warmup. cycle_momentum=False so it doesn't clobber the
configured AdamW betas (adam_beta1). Curve (100 steps, 20% warmup, 3e-4):
peaks 3e-4 at step ~19 via smooth cos ramp (vs old linear), 1.7e-4 at step
10 where 5e-4 had diverged. Smoke + all verify gates green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 01:52:30 +00:00
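Note on 154a37441b: a pure-Python sketch of the one-cycle cosine curve, useful for
checking the quoted numbers without loading a model. It assumes the two-phase cosine
schedule with div_factor=25 and final_div_factor=1e4 (my understanding of the
torch.optim.lr_scheduler.OneCycleLR defaults); the repo's actual scheduler arguments
other than pct_start and cycle_momentum are not stated in the commit.

```python
import math

def one_cycle_lr(step, total_steps=100, max_lr=3e-4, pct_start=0.2,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a 0-indexed step of a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # step at which the peak is reached
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Smooth cosine interpolation from start (pct=0) to end (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    return cos_anneal(max_lr, min_lr, (step - warmup_end) / (last_step - warmup_end))

lr_at_10 = one_cycle_lr(10)  # about 1.7e-4
lr_at_19 = one_cycle_lr(19)  # the 3e-4 peak
```

Under these assumptions the curve reproduces the figures in the message: the peak
lands at step 19 and the rate at step 10 is roughly 1.7e-4.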
wassname a72835315c fix: lr 3e-4 + 20% warmup (5e-4 diverged at warmup-end); slim bf16 ckpts
- FastConfig lr 5e-4 -> 3e-4: 5e-4 peaked exactly at warmup-end (step ~10)
  and diverged (lp_t -0.5 -> -4.8, hack_s 20/24 -> 0). Lower peak + longer
  warmup defuse the spike.
- Config warmup_frac 0.1 -> 0.2: SequentialLR(LinearLR, CosineAnnealingLR)
  already does warmup+cosine relaxation; just reach the peak more gradually.
- save_ckpt: drop A0/B0 (seeded init, regenerable from lora_init_seed;
  ckpt_update0000 is the init since A==A0 at step 0; nothing live reloads
  them), save A/B bf16 not fp32. ~1.3G -> ~0.33G per ckpt.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 01:42:20 +00:00
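Note on a72835315c: the ~1.3G -> ~0.33G figure follows from two independent halvings.
A worked sketch, assuming A0/B0 take the same space as A/B (the function and argument
names are illustrative):

```python
def slim_ckpt_gb(full_gb=1.3, kept_tensor_fraction=0.5, old_bytes=4, new_bytes=2):
    """Checkpoint size after dropping A0/B0 and storing bf16 instead of fp32."""
    return full_gb * kept_tensor_fraction * new_bytes / old_bytes

slim = slim_ckpt_gb()  # 1.3 * 0.5 * 0.5 = 0.325
```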
wassname f1dd9fb33e chore: FastConfig save_ckpt_every 10->20 (disk pressure; ~6 ckpts/run suffices)
Each lora2r ckpt is ~1.3G (A/B + redundant frozen A0/B0, 252 modules fp32). The
768G disk filled and runs crashed at the step-0 ckpt save. 20-step cadence halves
the per-run footprint while keeping enough points for the eval curve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 00:35:29 +00:00
wassname 97aede8d9c fix(routeV): keep=bulk gate + deterministic teachers + deploy-mode generation
Three fixes after the deploy-solve=0 run (all user-confirmed):

1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the
   bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5).
   Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep
   is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted).

2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-
   covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count,
   no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step).
   No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0
   stays the on/off switch. Removed dead _even_split.

3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the
   behavior policy = the shipped deployed-only model -- the quarantine's learned hack
   can't saturate the rollout distribution and starve honest solve advantage. For
   clean-gated rollouts gen and train forward now match.

Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget).
AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis
confidence across turns. Smoke + smoke-solvemix green; all verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 00:29:12 +00:00
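Note on 97aede8d9c, fix 1: a sketch of the moved boundaries, assuming plain
linear-interpolation quantiles over the pooled scores. The helper names are
illustrative; only t_lo=quantile(1-2q) and t_hi=quantile(1-q) come from the message.

```python
def quantile(sorted_vals, p):
    """Linear-interpolation quantile of an already sorted list, 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def three_zone_counts(scores, q=0.05):
    """Count rollouts per zone with keep as the bulk and mid+rout as the top 2q."""
    s = sorted(scores)
    t_lo = quantile(s, 1 - 2 * q)  # keep/mid boundary
    t_hi = quantile(s, 1 - q)      # mid/rout boundary
    counts = {"keep": 0, "mid": 0, "rout": 0}
    for x in scores:
        zone = "rout" if x >= t_hi else "mid" if x >= t_lo else "keep"
        counts[zone] += 1
    return counts

counts = three_zone_counts([float(i) for i in range(100)], q=0.05)
```

With 100 evenly spread scores and q=0.05 this gives 90 keep, 5 mid, 5 rout, the
reverse of the old placement where about 95% of rollouts landed in mid.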
wassname 3f2b44452a feat: online-stats gate + step-level teacher forcing + AUROC diagnostic
The authored absolute band made pos>=1 unreachable for live hacks (rout~0),
and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff).

- Online-stats gate: route by live quantiles of the pooled cos-to-v_grad
  (top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed
  on refresh. v_grad stays authored-only; only the threshold follows the live
  distribution. Smoke: routing sustained past the refresh (cliff fixed).
- Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens,
  not a per-prompt round; symmetric hack+solve teachers injected as ordinary
  gens (not specially routed). Fixes the per-prompt rounding wart.
- AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label
  (measurement-only, never routes) -- discriminates threshold-vs-direction
  failure and whether a refresh destroys separation.
- Inline eval stays off (eval_ablate_every=0); deploy scored offline.
- Fix _sample_rows None crash (beartype) on the no-solve-pool path.
- Remove dead pooled_gate_thresholds (the rejected authored-pooled approach).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 14:22:37 +00:00
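Note on 3f2b44452a: the AUROC column treats the cosine to v_grad as a detector score
and the hack label as ground truth. A minimal pairwise sketch (quadratic, fine for
batch-sized inputs); returning nan when a class is empty is my assumption.

```python
def auroc(scores, is_hack):
    """P(score of a random hack rollout > score of a random non-hack), ties count 0.5."""
    pos = [s for s, y in zip(scores, is_hack) if y]
    neg = [s for s, y in zip(scores, is_hack) if not y]
    if not pos or not neg:
        return float("nan")  # undefined without both classes
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))

aligned = auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
anti_aligned = auroc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

A value near 0.5 means the direction does not separate hacks; below 0.5 means it is
anti-aligned, so the failure is in the direction rather than the threshold.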
wassname 05a00aa487 feat(T4): symmetric solve-teacher pool + routed-share discrimination diagnostic
--solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack
(default off). The gate's routed-share is split by teacher SOURCE: a discriminating
gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares =
non-directional (shrinkage null). Teacher source is our pool construction, not a
live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF
(hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw.
Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 12:02:39 +00:00
wassname bf616749ee Consolidate tagged hack pairsets in data 2026-06-10 11:58:53 +00:00
wassname 944ada360b cleanup(lora2r): resolve user TODOs -- F.linear alias + jaxtyping hook shapes
torch.nn.functional.linear -> F.linear (import F); annotate A/B/A0/B0 with
Float[Tensor, ...] dims. Behaviorally identical -- verify_lora2r_routing green
(identity 0.00e+00, all three masks + mixed-batch + ablation).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:51:49 +00:00
wassname 35286040ed run: decision arms explicit at --unhackable-frac=0.5 (25%->50%)
Equal hack/solve pressure, harder problems, faster env (user call 2026-06-10).
Pin the frac on the command line so the headline regime is self-documenting,
not silently default-dependent. Requeued #36-39 at 0.5 with honest 50% labels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:50:12 +00:00
wassname 7e11c024c4 cleanup: delete dead delta_S machinery (PiSSA->lora2r leftovers)
Off the live lora2r path; removed with vhack.py (commit 4120d75):
- proj.py: drop project_delta_S_grad/_project_one_module/mean_cos_pre_from_grads/
  _hackward_cos (no live importer; train.py uses only per_token_logps).
- verify_science_invariants: test pairset_sha256's content gate directly (drops the
  load_v_hack vehicle + fake delta_S wrapper fixture).
- extract_vhack_grad: import pairset_sha256 from .pairs (was re-exported via vhack).
- tablelog/figs: stale 'delta_S grads'/'knob' comments -> A/B grads.
Smoke + verify_science_invariants green; no delta_S left in live code.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:45:54 +00:00
wassname 4120d75ea4 feat: top-k routing subspace for routeV (--v-grad-k, gate=max_i cos)
k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1
builds the top-k oriented SVD dirs of the paired diff and the gate scores
max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal
hack signal one mean washes out. Shared _build_v_grad at init + refresh; band
edges and the live gate both max over k. Sims use einsum + jaxtyping dims.
Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:44:45 +00:00
wassname 9fd2b6b89b test: add mixed-batch per-rollout routing gate to verify_lora2r_routing (T8)
2a-2c only tested UNIFORM masks. 2d puts rollout 0 clean (0,0) and rollout 1 hack
(1,1) in ONE forward and asserts the mixed deployed grad == rollout-0-alone-clean
and the mixed quarantine grad == rollout-1-alone-hack -- the load-bearing
per-rollout mask vectorization ([G,1,1] reshape) with no cross-rollout bleed.
Green on tiny-random.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:24:49 +00:00
wassname 4ee5c27f7b docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase)
Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r
adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way
masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live
arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:23:21 +00:00
wassname 103d0acc2c cleanup: delete antipasto.py; attic 7 erase-era scripts (T1/T6)
antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path --
train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that
import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve,
verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare,
tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected.
Live imports verified clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:21:53 +00:00
wassname 62ebf719d0 justfile: prune to lora2r-only (645->~180 lines)
Drop every recipe invoking deleted CLI (erase/routeV_per_token/--routeV-absorb-all/
--routeV-gate/--v-hack-path/--half-a/--beta/fast-lora*/fast-lora2r/full) and the
retired probe_distill/diag/cross-mech/substrate-plot tooling. Keep: smoke arms
(none/routeV/absorb + all), queue-decision/baseline/no-loophole, env-construction
pools (runtests/substrate/solve), results, paper tooling. Short, ordered, commented.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:20:48 +00:00
wassname 5714996c56 docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.

justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:08:59 +00:00
wassname 5c97975185 refactor: collapse to lora2r-only (none/routeV/absorb); delete erase/antipasto/lora_frozen_b paths
train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter
and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery,
act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the
v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in:
- T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init.
- T3 width-pooled gate labels: single (num/den) fraction across modules, skip
  zero-width modules, raise if none separate (was per-module equal-weight blowup).
- T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate.
- T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache),
  adapter:"lora2r" in saved cfg.
- T8 m3: step_flagged logs the hack share (d.mean), not m.mean.

Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb
end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 10:58:22 +00:00
wassname 6094568c56 feat: lora2r adapter (rank-2r PiSSA-init LoRA) + SGTM three-way hard routing
Structural-separation arm to disentangle directionality from shrinkage. A rank-2r
PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r]
and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the
same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one
shared update = shrinkage null), each block has its own input-side A rows and
output-side B columns, so deploy-ablation removes a different FUNCTION.

Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed
block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains
deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates
(SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat:
cos to the hand-authored-pair direction, never an oracle label of a live rollout.

verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing,
per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive:
PiSSA / lora_frozen_b paths untouched.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 09:25:58 +00:00
wassname 7511ba12e8 docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix
AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full)
adapters -- forward sees only the sum so same-basis routing is a magnitude split
(shrinkage null) unless broken by gate discrimination x (expressiveness + structural
separation). Honest note: this wasn't clear to me on the first pass.

RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107
sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 06:50:42 +00:00
wassname 0112f4a36d fix: rotate the unhackable (gt_only) subset per step, not frozen per pid
The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of
problems were unhackable every step -- a fixed honest subset the model can
memorize instead of learning to genuinely solve the distribution. Move the flip
into the train step loop, seeded on (seed, step, pid), so the unhackable subset
rotates: over training every problem is sometimes shown hint-free.

Both halves flip, not just the hint:
- prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages
- grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel),
  so a hack earns format-only reward. Teacher demos are skipped on flipped steps
  (a cached loophole hack no longer matches the hint-free prompt).

Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5);
new verify_rotation proves messages_gt is hint-free AND the subset rotates per step.
Smoke logs flip count (1/30 hint-free, graded gt_only).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 06:14:08 +00:00
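Note on 0112f4a36d: a sketch of a per-step flip keyed on (seed, step, pid). Hashing
the key with SHA-256 is a stand-in I chose for illustration; the commit only says the
flip is seeded on those three values, not how the random draw is made.

```python
import hashlib

def is_unhackable(seed, step, pid, frac):
    """Deterministic Bernoulli(frac) draw keyed on (seed, step, pid)."""
    key = f"{seed}:{step}:{pid}".encode()
    # Top 53 bits of the digest give a uniform float in [0, 1).
    top53 = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") >> 11
    return top53 / 2**53 < frac

def unhackable_subset(seed, step, pids, frac):
    """Problems shown hint-free (graded gt_only) at this step."""
    return {pid for pid in pids if is_unhackable(seed, step, pid, frac)}

pids = list(range(200))
step0 = unhackable_subset(0, 0, pids, 0.5)
step1 = unhackable_subset(0, 1, pids, 0.5)
```

Dropping step from the key recovers the old behaviour: the same subset every step.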
wassname f3df50f631 tool: migrate v1 deploy_test/eval_curve -> v2 field names (for mid-flight runs)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:27:38 +00:00
wassname c3af6cc03c rename: deployed/as_trained policy views, kill 'knob' (schema paired_final_v2)
Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent:
'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each:
- policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs
  hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split
  deploy_hack (JSON) vs hack_deploy (table key) into one name.
- 'knob' -> 'quarantine'/'adapter' throughout comments and log strings.
- train/test reserved for the DATA split only.
Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by
completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two
replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded)
and the missed eval_curve writer (eval_checkpoint_curve.py) + readers
(results_deploy.py). Smoke green: v2 written + read; gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:26:51 +00:00
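Note on c3af6cc03c: the replace_all collision is easy to reproduce. A sketch with a
made-up log line; the word-boundary regex is one possible guard, not necessarily the
fix that was applied.

```python
import re

line = "hack_deploy=0.04 hack_deployed=0.04"

# Naive substring replace also rewrites the already-renamed key.
naive = line.replace("hack_deploy", "hack_deployed")

# \b refuses to match when the old name is a prefix of a longer identifier.
safe = re.sub(r"\bhack_deploy\b", "hack_deployed", line)
```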
wassname 51c5a757ef docs: make active-path comments concise 2026-06-10 05:19:52 +00:00
wassname c031d9db76 log: print one resolved-config block at startup (pairset front and center)
Replaces the partial preset= line. Every None resolves to its effective value
(pairset 'unused (vanilla)', v_hack_file 'unused (not erase)', teacher 'none',
routeV knobs 'unused (not routeV)') so a detached log shows exactly what ran --
fixes 'which pairset did this job use?'. Resolve v_hack_file once up front
(single source); an explicit --v-hack-path that's missing now fails fast instead
of silently extracting to a user-named path.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:12:58 +00:00
wassname c9ff99d87a feat: single fail-fast config-validation block; consolidate scattered checks
_validate_config rejects method-irrelevant/contradictory options before the
model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path
off erase, lora adapter on unwired arms). Removes the duplicate inline lora check,
the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k
assert -- one canonical place. Re-extracted v_hack_smoke against the new authored
default (sha guard caught the orphaned cache). Smoke green; bad combo raises.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:05:14 +00:00
wassname 5c2edb9593 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-10 05:02:17 +00:00
wassname 5ae9187639 fix(tablelog): stale arm gates hid qmass + per-token cols in streaming view
The streaming StepLogger gated on the dead literal arm=='routing' (qmass) and
exact arm=='routingV' (missed routingV_per_token). arm is never 'routing' (the
arm property maps routeV->routingV), so qmass was computed into the row dict but
only ever surfaced in the end-of-run dump, never streamed. Gate all routeV cols
on is_route={routingV, routingV_per_token}; fold qmass in. (GPT-flagged, verified.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:00:12 +00:00
wassname 1f7a1f3333 pairs: default to hand-authored pairs_authored.json; drop contaminated prog_wider/widest
Progressive (prog_wide_clean) and authored pairs deploy within noise of each other
(0.042 vs 0.050 hack, both 0.143 solve); default to authored as the canonical no-cheat label
source. prog_wider/widest held genuine contamination, superseded by curated
prog_wide_clean (kept for the progressive-vs-authored comparison).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:43:57 +00:00
wassname a94c506dbd feat: routeV_top_k -- route by oriented top-k SVD subspace (max-cos gate)
The k=1 mean-diff is the only naturally hack-ward direction; SVD axes 2..k have
arbitrary sign so each is re-oriented by sign(v_i . mean_diff). Gate = max_i
cos(g, v_i), per-rollout grad_cosine only (asserted). top_k=1 is byte-identical
to the prior mean-diff path. Smoke green: oriented [5,r] basis, band width +0.141.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:42:57 +00:00
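Note on a94c506dbd: a sketch of the sign re-orientation and the max-cos gate on toy
3-d vectors. The raw basis below stands in for SVD axes (no SVD is computed here), and
all names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def orient(basis, mean_diff):
    """Flip each axis so it has non-negative overlap with the mean-diff direction."""
    oriented = []
    for v in basis:
        sign = -1.0 if dot(v, mean_diff) < 0 else 1.0
        oriented.append([sign * x for x in v])
    return oriented

def gate_score(g, directions):
    """Alignment to ANY direction: max_i cos(g, v_i)."""
    return max(cosine(g, v) for v in directions)

mean_diff = [1.0, 1.0, 0.0]
raw_basis = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]  # second axis has the wrong sign
basis = orient(raw_basis, mean_diff)

g = [0.0, 2.0, 0.0]  # gradient aligned with the second sub-mode
score_raw = gate_score(g, raw_basis)
score_oriented = gate_score(g, basis)
```

Without the flip the second axis scores -1 and the max falls back to 0; after
orientation the same gradient scores 1.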
wassname 7f45189f1c config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs
Config (make the design axes explicit Literal choices):
- eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like);
  unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded
  in deploy_test.json metadata.
- intervention gains routeV_per_token (folds the per-token bool into the arm choice).
- routeV_gate documented as the pinning axis.
- FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand
  (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4).

Pairs:
- delete prog_wide.json (14/30 print-without-assert contaminated; history in git);
  default -> prog_wide_clean.
- rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the
  extraction pairs are OOD (never use the env's real grader fn name). Re-extracted
  v_hack_smoke to match.

justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac
(eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean.

smoke green (erase + routeV_per_token); all 4 verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:21:54 +00:00
wassname b36e3db255 docs: tone down the START HERE links to plain pointers
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:36:52 +00:00
wassname 0d6ff754ec docs: AGENTS.md START HERE links (human_journal, main.tex, grad-routing paper); revert rescore fallback
- Point future agents at the three docs that pin the actual thesis + the
  live open question (direction vs routing vs SVD/PiSSA prior), so they don't
  re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
  checkpoints crash loudly rather than silently defaulting).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:34:06 +00:00
wassname b334b5f516 fix: rescore_deploy tolerates old-schema checkpoints (default eval-harness params)
job 32/33 failed KeyError eval_batch_size: old checkpoints' stored cfg
predates the train_config refactor. Default eval_n_prompts/max_new/
eval_batch_size to the fast preset (eval-harness params, not model-defining;
test split is fixed-size) so historical checkpoints re-score.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:31:07 +00:00
wassname e7ed74956f fix: gt_only-only run (no-loophole ceiling) no longer divides by zero
eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it
empty and load_problems did len(out) % 0. Fall back to ['gt_only'] when
nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve
= the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:19:22 +00:00
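Note on e7ed74956f: a sketch of the fallback, assuming eval modes are a list of
strings. The loophole mode names and the function name are hypothetical placeholders.

```python
def resolve_eval_modes(train_modes):
    """Strip gt_only for eval, but never return an empty list."""
    eval_modes = [m for m in train_modes if m != "gt_only"]
    return eval_modes or ["gt_only"]  # ceiling run evals on gt_only itself

mixed = resolve_eval_modes(["mode_a", "gt_only", "mode_b"])  # hypothetical names
ceiling = resolve_eval_modes(["gt_only"])

# Cycling problems over modes is now safe: len(ceiling) is never 0.
assigned = [ceiling[i % len(ceiling)] for i in range(3)]
```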
wassname 2a9a3bc8a1 fix: rescore_deploy derives eval_modes from by_mode keys (old jsons lack eval_modes)
Job 32 failed KeyError: 'eval_modes' -- deploy_test.json written by the
pre-cleanup train.py has no eval_modes key. by_mode keys are the modes
the original eval spanned (present in every version), so derive from
them to reproduce the same knob-off headline.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 22:56:13 +00:00
wassname 61d3819dae docs: README/figs name the current arm routeV, not the dropped route2
The cleanup removed the v1 route and route2 arms (Config is now
none|erase|routeV) but left README calling the live arm route2 with its
old binary-tau gate description. Rename to routeV, describe the banded
cosine gate (per-rollout/per-token, per-token best), and fix the deploy
line (held-out test n=119 knob-off, not n=64). figs.py keeps the
route2/routing2 display map for historical run artifacts.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:39:15 +00:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname 3f82041d90 plot: deploy Pareto draws knob-on->off before/after on the n=119 axis
Now that final/rescore eval record deploy_hack_on/solve_on at n=119,
the deploy scatter shows the honest quarantine move (hollow knob-on dot
-> arrow -> solid knob-off dot) on the same axis instead of borrowing
val's lower-scale curve. Dot-only fallback for arms not yet backfilled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:15:19 +00:00
wassname 3c27d922d2 docs: record science correctness audit 2026-06-09 13:10:17 +00:00
wassname d68c17e7c5 eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:09:50 +00:00
wassname 5b0a6ddd91 plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
  good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
  arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
  fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
  -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
  authored 0.056->0.044), not the horizontal move I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
  fraction), low priority; vanilla rerun alongside best (its solve also suffers).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:58:32 +00:00
wassname 438068c431 cleanup: consolidate stale loaders and pair scripts 2026-06-09 12:47:32 +00:00
wassname ca8d1adf62 plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte)
Two separate panels over-reduced a 2-variable story. One scatter instead: good
corner top-right (hack axis reversed), green effect-arrows from the vanilla
baseline show what each intervention did, achievable solve band (base..ceiling)
as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling).
No title; name-only point labels (position already encodes the rates). The Pareto
view makes domination visible: per-token strictly dominates random-V and vanilla.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:45:42 +00:00
wassname 31c2b9c82f env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only
Realism knob: in the reference env hacking saturates and kills the solve gradient.
A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest
solving pays) keeps a persistent solve pressure all arms feel. The differential test:
routeV ablates the hack on the hackable 90% so it must solve there, while the warm
solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0.

- gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an
  ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from
  eval_modes) so hack/solve remain comparable to the reference env.
- logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the
  flip propagates to per-mode metrics.
- smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1).

Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the
loophole hint. smoke-unhackable runs end-to-end.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:39:50 +00:00
wassname d4998a71ba docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot
- Transcribed Fig-5 numeric table now lives inline in the paper md as an
  EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
  (one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
  anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
  reversed so right=better on both panels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:35:14 +00:00
wassname 0973f9ba7c plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars
Cross-scale (their converged full-env vs our 60-step fast surrogate) made the
paper comparison directional-only and unfair on one axis. Show vanilla GRPO as
the red floor anchor instead; paper numbers stay in the extracted table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:26:55 +00:00