evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	19687087b0	feat(#30,#39): simple online gate -- band from current batch, no window/cloud; lr 1e-4 Gate band (mean + k*std) now computed from THIS batch's pooled positions each step instead of a sliding window. Refresh-proof by construction (live rollouts scored vs the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random tiny data never separates -> quarantine would never train -> pathway assert would fail. lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 06:04:28 +00:00
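A minimal sketch of the per-batch band described in the commit above, assuming the gate only needs the pooled cosines of the current step. The names `batch_band` and `route`, the default `k_mid`/`k_rout` values, and the strict `>` comparison are illustrative assumptions, not the repo's code.

```python
from statistics import fmean, pstdev


def batch_band(cosines, k_mid, k_rout):
    """Band edges (mean + k*std) computed from THIS batch only, no window."""
    mu = fmean(cosines)
    sd = pstdev(cosines)
    return mu + k_mid * sd, mu + k_rout * sd


def route(cosines, k_mid=1.0, k_rout=2.0):
    """Assign each pooled cosine to keep / mid / rout (hypothetical helper)."""
    t_lo, t_hi = batch_band(cosines, k_mid, k_rout)
    zones = []
    for c in cosines:
        if c > t_hi:        # strict: a zero-spread batch routes nothing
            zones.append("rout")
        elif c > t_lo:
            zones.append("mid")
        else:
            zones.append("keep")
    return zones


# One clear outlier routes; a batch with no spread stays entirely in keep.
print(route([0.0] * 9 + [1.0]))
print(route([0.25] * 10))
```

Because the band is recomputed from the live batch each step, there is no stored state to flush when v_grad is refreshed. If the SmokeConfig values `mid=-1, rout=0` mean `k_mid=-1, k_rout=0` (an assumption), everything above the batch mean routes, which would match the commit's intent of forcing routing on random data.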
wassname	979daf84fd	feat(#30): mean+kstd online gate replaces fixed quantile; always-show route cols Gate calibration: route by live mean + route_std_mid/route_std_rout std of the pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail that genuinely exceeds the spread routes, so qmass tracks real separation instead of a forced fraction. The authored absolute band is mis-placed (live pos sits far below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack). tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so arm tables line up. Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band) and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 02:56:07 +00:00
wassname	154a37441b	refactor: OneCycleLR replaces SequentialLR(LinearLR, CosineAnnealingLR) One scheduler object does warmup + cosine relaxation; pct_start=warmup_frac is the explicit warmup. cycle_momentum=False so it doesn't clobber the configured AdamW betas (adam_beta1). Curve (100 steps, 20% warmup, 3e-4): peaks 3e-4 at step ~19 via smooth cos ramp (vs old linear), 1.7e-4 at step 10 where 5e-4 had diverged. Smoke + all verify gates green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 01:52:30 +00:00
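The curve quoted in the commit above can be checked by hand. This is a pure-Python re-derivation of a two-phase cosine one-cycle schedule, not a call into PyTorch. `div_factor=25` and `final_div_factor=1e4` are assumed to be left at their `OneCycleLR` defaults, and `one_cycle_lr` is a hypothetical name.

```python
import math


def one_cycle_lr(step, total_steps=100, max_lr=3e-4, pct_start=0.2,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a cosine warmup followed by cosine decay."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1   # step index where the peak is hit
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Moves smoothly from `start` (pct=0) to `end` (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


for s in (0, 10, 19, 50, 99):
    print(s, f"{one_cycle_lr(s):.3e}")
```

Under these assumptions the peak lands at step 19 and the rate at step 10 is roughly 1.7e-4, consistent with the numbers in the commit message.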
wassname	a72835315c	fix: lr 3e-4 + 20% warmup (5e-4 diverged at warmup-end); slim bf16 ckpts - FastConfig lr 5e-4 -> 3e-4: 5e-4 peaked exactly at warmup-end (step ~10) and diverged (lp_t -0.5 -> -4.8, hack_s 20/24 -> 0). Lower peak + longer warmup defuse the spike. - Config warmup_frac 0.1 -> 0.2: SequentialLR(LinearLR, CosineAnnealingLR) already does warmup+cosine relaxation; just reach the peak more gradually. - save_ckpt: drop A0/B0 (seeded init, regenerable from lora_init_seed; ckpt_update0000 is the init since A==A0 at step 0; nothing live reloads them), save A/B bf16 not fp32. ~1.3G -> ~0.33G per ckpt. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 01:42:20 +00:00
wassname	f1dd9fb33e	chore: FastConfig save_ckpt_every 10->20 (disk pressure; ~6 ckpts/run suffices) Each lora2r ckpt is ~1.3G (A/B + redundant frozen A0/B0, 252 modules fp32). The 768G disk filled and runs crashed at the step-0 ckpt save. 20-step cadence halves the per-run footprint while keeping enough points for the eval curve. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 00:35:29 +00:00
wassname	97aede8d9c	fix(routeV): keep=bulk gate + deterministic teachers + deploy-mode generation Three fixes after the deploy-solve=0 run (all user-confirmed): 1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5). Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted). 2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool- covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count, no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step). No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0 stays the on/off switch. Removed dead _even_split. 3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the behavior policy = the shipped deployed-only model -- the quarantine's learned hack can't saturate the rollout distribution and starve honest solve advantage. For clean-gated rollouts gen and train forward now match. Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget). AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis confidence across turns. Smoke + smoke-solvemix green; all verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 00:29:12 +00:00
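Fix 1 in the commit above (keep is the bulk, mid and rout are the top 2q) can be sketched as follows. The linear-interpolation `quantile` helper, the function names, and the tie-handling at the edges are assumptions for illustration only.

```python
def quantile(values, p):
    """Linear-interpolation quantile of a non-empty list, p in [0, 1]."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def three_zone(cosines, q=0.05):
    """keep = below quantile(1-2q), rout = at or above quantile(1-q), else mid."""
    t_lo = quantile(cosines, 1 - 2 * q)
    t_hi = quantile(cosines, 1 - q)
    zones = []
    for c in cosines:
        if c >= t_hi:
            zones.append("rout")
        elif c >= t_lo:
            zones.append("mid")
        else:
            zones.append("keep")
    return zones


zones = three_zone([float(i) for i in range(100)], q=0.05)
print({z: zones.count(z) for z in ("keep", "mid", "rout")})
```

With the earlier placement (`t_lo` at `quantile(q)`), the same 100 values would put only the bottom ~5% in keep, which is the co-adaptation the commit describes.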
wassname	3f2b44452a	feat: online-stats gate + step-level teacher forcing + AUROC diagnostic The authored absolute band made pos>=1 unreachable for live hacks (rout~0), and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff). - Online-stats gate: route by live quantiles of the pooled cos-to-v_grad (top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed on refresh. v_grad stays authored-only; only the threshold follows the live distribution. Smoke: routing sustained past the refresh (cliff fixed). - Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens, not a per-prompt round; symmetric hack+solve teachers injected as ordinary gens (not specially routed). Fixes the per-prompt rounding wart. - AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label (measurement-only, never routes) -- discriminates threshold-vs-direction failure and whether a refresh destroys separation. - Inline eval stays off (eval_ablate_every=0); deploy scored offline. - Fix _sample_rows None crash (beartype) on the no-solve-pool path. - Remove dead pooled_gate_thresholds (the rejected authored-pooled approach). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 14:22:37 +00:00
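The AUROC diagnostic in the commit above treats the cosine to v_grad as a hack detector scored against the hack label. A small measurement-only sketch, assuming one score and one binary label per rollout; `auroc` here is a hypothetical helper built on the pairwise (Mann-Whitney) definition rather than the repo's implementation.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Returns nan when only one class is present.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hack rollouts (label 1) score higher than clean ones: perfect separation.
print(auroc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
```

An AUROC near 0.5 with routing still failing would point at the direction, while a high AUROC with `rout~0` would point at the threshold, which is the distinction the commit says the column is for.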
wassname	05a00aa487	feat(T4): symmetric solve-teacher pool + routed-share discrimination diagnostic --solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack (default off). The gate's routed-share is split by teacher SOURCE: a discriminating gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares = non-directional (shrinkage null). Teacher source is our pool construction, not a live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF (hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw. Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 12:02:39 +00:00
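The routed-share diagnostic above reduces to two means and their difference. A sketch assuming each teacher rollout carries a source tag (`"hack"` or `"solve"`) and a routed flag `d` in {0, 1}; the function name is hypothetical, and the commit's 🟢/🟡/🔴 cut-offs are not given, so none are encoded here.

```python
def routed_share_gap(sources, routed):
    """Return (hack_share, solve_share, gap) of routed flags by teacher source."""
    shares = {}
    for name in ("hack", "solve"):
        flags = [d for s, d in zip(sources, routed) if s == name]
        shares[name] = sum(flags) / len(flags) if flags else float("nan")
    return shares["hack"], shares["solve"], shares["hack"] - shares["solve"]


# A discriminating gate: hack teachers routed, solve teachers kept.
print(routed_share_gap(["hack", "hack", "solve", "solve"], [1, 1, 0, 0]))
# Equal shares: the gate is not directional (the shrinkage null).
print(routed_share_gap(["hack", "hack", "solve", "solve"], [1, 0, 1, 0]))
```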
wassname	bf616749ee	Consolidate tagged hack pairsets in data	2026-06-10 11:58:53 +00:00
wassname	944ada360b	cleanup(lora2r): resolve user TODOs -- F.linear alias + jaxtyping hook shapes torch.nn.functional.linear -> F.linear (import F); annotate A/B/A0/B0 with Float[Tensor, ...] dims. Behaviorally identical -- verify_lora2r_routing green (identity 0.00e+00, all three masks + mixed-batch + ablation). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:51:49 +00:00
wassname	7e11c024c4	cleanup: delete dead delta_S machinery (PiSSA->lora2r leftovers) Off the live lora2r path; removed with vhack.py (commit `4120d75`): - proj.py: drop project_delta_S_grad/_project_one_module/mean_cos_pre_from_grads/ _hackward_cos (no live importer; train.py uses only per_token_logps). - verify_science_invariants: test pairset_sha256's content gate directly (drops the load_v_hack vehicle + fake delta_S wrapper fixture). - extract_vhack_grad: import pairset_sha256 from .pairs (was re-exported via vhack). - tablelog/figs: stale 'delta_S grads'/'knob' comments -> A/B grads. Smoke + verify_science_invariants green; no delta_S left in live code. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:45:54 +00:00
wassname	4120d75ea4	feat: top-k routing subspace for routeV (--v-grad-k, gate=max_i cos) k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1 builds the top-k oriented SVD dirs of the paired diff and the gate scores max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal hack signal one mean washes out. Shared _build_v_grad at init + refresh; band edges and the live gate both max over k. Sims use einsum + jaxtyping dims. Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:44:45 +00:00
wassname	103d0acc2c	cleanup: delete antipasto.py; attic 7 erase-era scripts (T1/T6) antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path -- train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve, verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare, tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected. Live imports verified clean. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:21:53 +00:00
wassname	5c97975185	refactor: collapse to lora2r-only (none/routeV/absorb); delete erase/antipasto/lora_frozen_b paths train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter and three arms (intervention none\|routeV\|absorb). Removes the erase grad-surgery, act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in: - T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init. - T3 width-pooled gate labels: single (num/den) fraction across modules, skip zero-width modules, raise if none separate (was per-module equal-weight blowup). - T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate. - T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache), adapter:"lora2r" in saved cfg. - T8 m3: step_flagged logs the hack share (d.mean), not m.mean. Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 10:58:22 +00:00
wassname	6094568c56	feat: lora2r adapter (rank-2r PiSSA-init LoRA) + SGTM three-way hard routing Structural-separation arm to disentangle directionality from shrinkage. A rank-2r PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r] and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one shared update = shrinkage null), each block has its own input-side A rows and output-side B columns, so deploy-ablation removes a different FUNCTION. Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates (SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat: cos to the hand-authored-pair direction, never an oracle label of a live rollout. verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing, per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive: PiSSA / lora_frozen_b paths untouched. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 09:25:58 +00:00
wassname	0112f4a36d	fix: rotate the unhackable (gt_only) subset per step, not frozen per pid The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of problems were unhackable every step -- a fixed honest subset the model can memorize instead of learning to genuinely solve the distribution. Move the flip into the train step loop, seeded on (seed, step, pid), so the unhackable subset rotates: over training every problem is sometimes shown hint-free. Both halves flip, not just the hint: - prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages - grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel), so a hack earns format-only reward. Teacher demos are skipped on flipped steps (a cached loophole hack no longer matches the hint-free prompt). Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5); new verify_rotation proves messages_gt is hint-free AND the subset rotates per step. Smoke logs flip count (1/30 hint-free, graded gt_only). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 06:14:08 +00:00
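A sketch of the per-step rotation described in the commit above. The commit only says the flip is seeded on (seed, step, pid); the SHA-256 hashing used here is a stand-in chosen for determinism, and both function names are hypothetical.

```python
import hashlib


def is_unhackable(seed, step, pid, frac=0.1):
    """Deterministic Bernoulli(frac) draw keyed on (seed, step, pid)."""
    digest = hashlib.sha256(f"{seed}:{step}:{pid}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return u < frac


def unhackable_subset(seed, step, pids, frac=0.1):
    """Problems shown hint-free (and graded gt_only) at this step."""
    return {pid for pid in pids if is_unhackable(seed, step, pid, frac)}


pids = list(range(50))
for step in range(3):
    print(step, sorted(unhackable_subset(0, step, pids)))
```

Keying on `(seed, pid)` alone would return the same subset at every step, which is the frozen-subset behaviour the commit removes.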
wassname	c3af6cc03c	rename: deployed/as_trained policy views, kill 'knob' (schema paired_final_v2) Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent: 'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each: - policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split deploy_hack (JSON) vs hack_deploy (table key) into one name. - 'knob' -> 'quarantine'/'adapter' throughout comments and log strings. - train/test reserved for the DATA split only. Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded) and the missed eval_curve writer (eval_checkpoint_curve.py) + readers (results_deploy.py). Smoke green: v2 written + read; gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 05:26:51 +00:00
wassname	51c5a757ef	docs: make active-path comments concise	2026-06-10 05:19:52 +00:00
wassname	c031d9db76	log: print one resolved-config block at startup (pairset front and center) Replaces the partial preset= line. Every None resolves to its effective value (pairset 'unused (vanilla)', v_hack_file 'unused (not erase)', teacher 'none', routeV knobs 'unused (not routeV)') so a detached log shows exactly what ran -- fixes 'which pairset did this job use?'. Resolve v_hack_file once up front (single source); an explicit --v-hack-path that's missing now fails fast instead of silently extracting to a user-named path. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 05:12:58 +00:00
wassname	c9ff99d87a	feat: single fail-fast config-validation block; consolidate scattered checks _validate_config rejects method-irrelevant/contradictory options before the model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path off erase, lora adapter on unwired arms). Removes the duplicate inline lora check, the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k assert -- one canonical place. Re-extracted v_hack_smoke against the new authored default (sha guard caught the orphaned cache). Smoke green; bad combo raises. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 05:05:14 +00:00
wassname	5ae9187639	fix(tablelog): stale arm gates hid qmass + per-token cols in streaming view The streaming StepLogger gated on the dead literal arm=='routing' (qmass) and exact arm=='routingV' (missed routingV_per_token). arm is never 'routing' (the arm property maps routeV->routingV), so qmass was computed into the row dict but only ever surfaced in the end-of-run dump, never streamed. Gate all routeV cols on is_route={routingV, routingV_per_token}; fold qmass in. (GPT-flagged, verified.) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 05:00:12 +00:00
wassname	1f7a1f3333	pairs: default to hand-authored pairs_authored.json; drop contaminated prog_wider/widest Progressive (prog_wide_clean) and authored deploy within noise (0.042 vs 0.050 hack, both 0.143 solve); default to authored as the canonical no-cheat label source. prog_wider/widest held genuine contamination, superseded by curated prog_wide_clean (kept for the progressive-vs-authored comparison). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 04:43:57 +00:00
wassname	a94c506dbd	feat: routeV_top_k -- route by oriented top-k SVD subspace (max-cos gate) The k=1 mean-diff is the only naturally hack-ward direction; SVD axes 2..k have arbitrary sign so each is re-oriented by sign(v_i . mean_diff). Gate = max_i cos(g, v_i), per-rollout grad_cosine only (asserted). top_k=1 is byte-identical to the prior mean-diff path. Smoke green: oriented [5,r] basis, band width +0.141. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 04:42:57 +00:00
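The orientation step and max-cos gate from the commit above, sketched on plain Python lists. The SVD itself is omitted: `dirs` stands in for the top-k axes, and all names are illustrative assumptions rather than the repo's API.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def orient(dirs, mean_diff):
    """Flip each axis so it points hack-ward: sign(v_i . mean_diff) >= 0."""
    return [[-x for x in v] if dot(v, mean_diff) < 0 else list(v) for v in dirs]


def max_cos_gate(g, dirs):
    """Gate score = max_i cos(g, v_i): alignment to ANY oriented direction."""
    return max(cosine(g, v) for v in dirs)


mean_diff = [1.0, 1.0]
raw = [[1.0, 0.0], [0.0, -1.0]]      # second axis has arbitrary (wrong) sign
oriented = orient(raw, mean_diff)
g = [0.0, 2.0]
print(max_cos_gate(g, raw), max_cos_gate(g, oriented))
```

Without the sign fix, a gradient aligned with a secondary hack mode can score zero or negative and be kept; after orientation it scores high and routes.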
wassname	7f45189f1c	config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs Config (make the design axes explicit Literal choices): - eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like); unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded in deploy_test.json metadata. - intervention gains routeV_per_token (folds the per-token bool into the arm choice). - routeV_gate documented as the pinning axis. - FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4). Pairs: - delete prog_wide.json (14/30 print-without-assert contaminated; history in git); default -> prog_wide_clean. - rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the extraction pairs are OOD (never use the env's real grader fn name). Re-extracted v_hack_smoke to match. justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac (eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean. smoke green (erase + routeV_per_token); all 4 verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 04:21:54 +00:00
wassname	e7ed74956f	fix: gt_only-only run (no-loophole ceiling) no longer divides by zero eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it empty and load_problems did len(out) % 0. Fall back to ['gt_only'] when nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve = the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:19:22 +00:00
wassname	61d3819dae	docs: README/figs name the current arm routeV, not the dropped route2 The cleanup removed the v1 route and route2 arms (Config is now none\|erase\|routeV) but left README calling the live arm route2 with its old binary-tau gate description. Rename to routeV, describe the banded cosine gate (per-rollout/per-token, per-token best), and fix the deploy line (held-out test n=119 knob-off, not n=64). figs.py keeps the route2/routing2 display map for historical run artifacts. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:39:15 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/ deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	d68c17e7c5	eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms route/routeV final eval now measures both endpoints at n=119 test: knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so the before->after quarantine move is plottable from the deploy set instead of borrowing the val curve's different scale. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:09:50 +00:00
wassname	5b0a6ddd91	plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after - floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would fake a solve jump that's really the n=32->n=119 eval-set shift. - floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056, authored 0.056->0.044), not the horizontal I wrongly forced earlier. - justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable fraction), low priority; vanilla rerun alongside best (its solve also suffers). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:58:32 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	31c2b9c82f	env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only Realism knob: in the reference env hacking saturates and kills the solve gradient. A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest solving pays) keeps a persistent solve pressure all arms feel. The differential test: routeV ablates the hack on the hackable 90% so it must solve there, while the warm solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0. - gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from eval_modes) so hack/solve remain comparable to the reference env. - logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the flip propagates to per-mode metrics. - smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1). Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the loophole hint. smoke-unhackable runs end-to-end. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:39:50 +00:00
wassname	0538dbf2f1	add routeV_absorb_all: 100% absorption, no vector (H2 extreme control) Route the whole gradient of every knob-on rollout into the quarantine; the deployed knob learns only from the knob-off exploration floor. Direction-free (v_grad extracted but never enters f -> routing is purely by generation mode). Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/ rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:56:12 +00:00
wassname	dae52b2a7d	cleanup: consolidate pairs modules into build scripts + add solve_train to table - Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs). - Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6), pairs_intent_concept.json (6 diagnostic). - Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON instead of importing Python modules; drop tainted v2/allv2 pairsets from the diag sweep (print-without-assert axis). - train.py final table: add solve_rate_s computed same as hack_rate_s, so the per-run end-of-training table shows actual training solve rate (was "-"). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:17:42 +00:00
wassname	fb9f68530c	refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3 pairs.py now only has HackPair dataclass + _prompt/_wrap helpers. All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py. Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints) from pairs_authored -- those 10 pairs directly encode the env's specific run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode, identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only), 6 (weak inequality). P19-P21 (previously defined but unused) now included. Result: 11 pairs from 4 axes. Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same fail-loud pattern applied to train.py earlier). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:44:33 +00:00
wassname	6f49d5f9b0	refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback - scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json - scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention - out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None) - train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud) - justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats - requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:09:09 +00:00
wassname	a35e7b2735	feat: gt_only env-mode + queue baseline/no-loophole ceiling - rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle) - problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests") - justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and queue-no-loophole (gt_only vanilla GRPO, prio 11) - main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 03:23:49 +00:00
wassname	0f59b1351b	feat: online_stats gate for routeV -- live q5/q95 band calibration New routeV_gate="online_stats" mode: use the empirical per-rollout cosine distribution (q5/q95 pooled across all modules each step) as the routing band thresholds, instead of the pair-derived route_band. Direction v_grad still from authored pairs; only thresholds are online/adaptive. Motivation: the pair-derived band sits above the live cosine distribution (median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens. Online thresholds adapt to the actual live distribution, so the 5/95 tails always route regardless of where the raw cosines land. Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95. Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration). Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines. No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution of live student rollouts (no oracle/labeling of live rollouts as hack/clean). Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe. Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:25:37 +00:00
wassname	d497bfd161	feat: act_vote routeV gate (global activation-vote routing arm) New routeV_gate=act_vote: route every module's per-rollout gradient by a single global f_roll from a module-weighted vote of activation cosines cos(As_b, As_dir), As=Vh@x completion-mean (mirrors diag_cosine_dist.py act/vote, AUROC 0.67 / p@10 0.30 -- the coverage corner). Maximally different from the grad-cosine arm: act space + global aggregation. Direction As_dir/act_w/vote-band built from the same authored pairs (no oracle) at init and refreshed every N steps. Window = [plen-1:] to match the build hook + diagnostic. Smoke-verified (band opens, rout>0, refresh ok); fresh-eyes reviewed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 15:08:28 +00:00
wassname	eedf9efb51	pairs: de-confound v2 (print(==) vs assert, line-matched) + intent designs (think/funcname/concept) intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal (the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 13:08:47 +00:00
wassname	35be877fc0	pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84. Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:53:48 +00:00
wassname	caa0d09472	broad: TEACHER_RT -> dense pool (was sparse, under-seeds); log: rename table cols train/deploy (drop 'knob') Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:12:00 +00:00
wassname	484305d7b4	config+log: fast defaults (dense pool, grad_clip=500); end-of-run tail = argv + hack/solve table + solve-hack objective - FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500 (were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes keep their own teacher_pool override. - End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide results row and the giant per-step table now print ABOVE the tail. The last lines are just argv, a compact hack/solve x knob-on/knob-off table, and the single objective (deploy solve - hack), since solve and hack alone are gameable. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:05:46 +00:00
wassname	d9ea20baa4	routeV: margin (p75 clean / p75 hack) routing band, route the confident tail Was the widest band (min clean, max hack): routed even neutral rollouts (~0.4 of a cos=0 gradient), the over-route that costs solve. Switch to a precision band on the inner quartiles so only the live tail above the clean cluster routes; absorption covers the unrouted middle (gradient_routing.md L420; SGTM tolerates ~40% undiscovered, Fig5b). p75 not min/max: 10 pairs make the extremes single-sample noisy. Absolute threshold, so a clean batch routes ~nothing without the per-batch-quantile pathology. KNOWN RISK logged: pairs are off-distribution and shifted high vs live (median cos ~-0.06), so the band may under-route; watch rout, fallback is a live-cos quantile gate. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 13:42:20 +00:00
wassname	25ac3fc5e3	log: routeV routing as keep/resid/rout zones x unit+energy views; drop dead hk_abl/slv_abl Replace the band-mechanics trio (tau/hkgap/frout) and the lumped qmass with a symmetric zone breakdown: each live unit's cos(g,v_grad) lands below/inside/above the pair-band -> keep/resid/rout, reported as both unit shares and energy shares (keepE/residE/routE). Energy view is unit-agnostic (answers 'is the grad per rollout'). Drop hk_abl/slv_abl unless rollout_ablate_frac>0 (else 0/0). Band edges (lower/upper) already logged at construction. v1 'routing' arm keeps qmass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 13:13:01 +00:00
wassname	b170b969e2	log: surface absolute band edges (mean lower/upper), not just width Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 12:43:34 +00:00
wassname	041f9319f9	fix: hkgap legend said 'mean' but band uses max-hack/min-clean (train.py:345) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 12:41:05 +00:00
wassname	c449273357	log: rename routeV gauges to paper vocab (qE->absorb, resid->leak), drop 'FREE' aside The routing-mass gauges had bespoke names; align to the gradient-routing / SGTM vocabulary the reader knows: absorption (mass pinned into quarantine) and leakage (hack surviving in the deployed knob). Two-sided 'pin too much / too little' framing in the legends. Drop the 'FREE'/compute-cost detail from the hk_abl/slv_abl legends -- reader doesn't need the implementation cost. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:26:36 +00:00
wassname	1228e1b784	refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC) Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In train.py the canonical imports already won at runtime; the earlier ones were dead shadows: - ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems. - local setup_logging() was byte-identical to the .tablelog one already imported (with StepLogger); delete the local def + now-orphaned datetime import and LOGS_DIR const. - problems.py stays: 6 scripts + derisk/regrade still import it. antipasto.py: delete detach_antipasto (0 callers) and its own copies of ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical, better-worded versions incl. the SGTM TODO), plus now-unused contextmanager and per_token_logps imports. docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API error dump, not a review). Behavior-preserving (later imports already won at runtime). Verified: just smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_* gates PASS. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00

67 Commits