evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	41d225a5ec	writeup	2026-06-12 04:46:01 +00:00
wassname	af420ec855	feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed in each rollout's own sampling mode, so ρ is a true importance ratio. The old always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout). Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled absorb/route rows by construction -- frac=0 is the clean point. Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1) over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean era anchors the low tail; buffer stores acts, re-scored vs current v_act so a refresh needs no flush). Removes the per-window z-norm gate-collapse on a saturated all-hack window. gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it elicits the hack and absorption can localize it. queue-decision now passes --gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the job-34 config where ablation RAISED hack 0.71->0.86). Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL 2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-12 03:22:48 +00:00
wassname	ec11bf58b2	docs: update method descriptions for activation routing	2026-06-11 13:22:13 +00:00
wassname	668285c86e	journal: fix pueue IDs 28-30 -> 31-33 in entry (f) provenance Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 13:08:20 +00:00
wassname	8000aa48f4	journal(#41): entry (g) routeA shipped + guard-drop calibration; track moduleS diag scripts Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS scripts were untracked, so committing them for provenance alongside the calibration script the new entry cites. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 12:50:07 +00:00
wassname	f646e57028	fix(#41): routeA gate hardening from fresh-eyes review - assert finite gate scores (one NaN act would poison the rolling buffer -> NaN thresholds -> silent all-keep routing for up to route_buffer rollouts) - d = dm: degenerate Otsu (t_lo==t_hi) could emit (m=0,d=1), a mask state that trains nothing; rout now requires the quarantine on - solve-mix discrimination collects only when solve teachers exist (a hack-only [False]n list tripped the end-of-run line with a nan solve side) - warn when authored-pair AUROC < 0.9 (extraction sanity; tiny-random smoke sits ~0.78 so warning, not assert) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 12:48:11 +00:00
wassname	adca442253	feat(#41): routeA activation gate replaces routeV grad gate Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on the no-grad logpi_old forward (quarantine-ablated, matching the sampling policy); masks are pinned BEFORE the single grad-carrying forward, so the grad-gate's pass-1 backward is gone. Thresholds: rolling 256-act buffer, z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until 128 scores. Buffer stores pooled acts and re-scores against the current v_act, so the forward-only refresh (every 5 steps) needs no flush. No bimodality guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every condition including pure Gaussians, so no shape statistic discriminates. Deleted with the arm wiring (rename-on-logic-change: routeA never conflates with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges, _pair_cos, the pass-1 autograd.grad block, grad_probe training wiring, v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe. c-probe stays in lora2r.py for scripts/diag_pinning.py only. verify_science_invariants: all-in-one count 27 -> 42 (stale since `c33b810` added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset. Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 12:38:19 +00:00
wassname	5a340e5c3e	feat(routeA): forward-only v_act extraction + verify gate vs cached diag features extract_v_act: pooled completion-token bottleneck act per module, v = unit-norm mean pair diff (tstat flag default off, null at n=8 pairs). ActCapture is the single hook shared by extraction, the live gate, and verification. verify_v_act (pueue #24): rel diff 7.3e-4 hack / 7.7e-4 clean vs out/diag/pinning_feats.pt on the v3 first_hack ckpt; min per-module cos 0.99997. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 12:13:32 +00:00
wassname	d51028a618	user	2026-06-11 12:13:13 +00:00
wassname	77fa5bbf6b	spec: routeA plan approved; deletion scope extended to extract_vhack_grad + all grad-gate helpers Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:50:20 +00:00
wassname	8b6bf5b6a6	journal(e): super-S-space and act t-stat both null; act_dot mean-extraction stays the routeA default Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:44:12 +00:00
wassname	1d4f33ffb6	diag: super-S-space gate score null; spec -> act_dot + winsorized-Otsu plan superS (pooled writer/reader eigenbasis, whitened + top-r) tops out at min-window AUROC 0.721 = raw resid dot; best unwhitened rotation+top-64 0.740 < act 0.747 (max of ~50-variant grid). act t-stat extraction also null (0.719 vs 0.749 min). Spec updated: act_dot default, journal-(d) evidence table, implementation plan for routeA. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:42:44 +00:00
wassname	1b51c62cdc	journal(d): z-norm+otsu pinning fix, module weighting null, mode hypothesis falsified Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:28:37 +00:00
wassname	e5b68acf69	diag(#40): z-norm scores within family, winsorized 2-threshold Otsu zones, --feats offline mode Synthetic and live scores share an axis with meaningful zero (raw scores carry a common <mu,v> offset since v = mean diff is not orthogonal to the family mean). Zones come from label-free online stats (EMA mean/std + Otsu valley), replacing mean+k*sd which placed both cuts beyond every distribution. Winsorize at 1/99% before Otsu: variance-maximizing cuts otherwise buy a class for one outlier. Fresh-eyes review verified z-norm is affine (AUROCs unchanged), zones label-free, P/R recompute from parquet matches titles. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:23:42 +00:00
wassname	270c4f5a27	misc	2026-06-11 11:07:28 +00:00
wassname	7871aa66b8	tidy	2026-06-11 11:07:17 +00:00
wassname	24aea19bee	diag(#40): offline follow-up -- pooling variants, synthetic common-mode, env_mode join Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:58:16 +00:00
wassname	bbf6a97151	diag(#40): fix 3x2 plot -- keep bottom spine at axis edge (tick labels collided with rugs), skip labels on narrow zones Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:47:57 +00:00
wassname	4be1858b14	journal(c) addendum: resid_cos 0.92/0.70/0.80 vs act_cos -- bottleneck not the limiter, routeA defaults to resid Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:46:06 +00:00
wassname	0660e7bdd3	diag(#40): residual-stream rep (layers 12/18/24) as third gate-score candidate, 3x2 panel Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:45:50 +00:00
wassname	4a7465c0da	journal(c): act score replicates (0.75-0.87), grad decays to chance across windows; routeA next Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:44:40 +00:00
wassname	49ca51b6e1	diag(#40): review fixes -- A>0 contrast headline (adv-only baseline 0.90 vs-all / 0.58 A>0), headline=behavior_ training default, n at P@rout, rug strips, drop density ticks Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:22:08 +00:00
wassname	4644af155a	docs: vocabulary section -- routing, vector, pinning, absorption Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:13:06 +00:00
wassname	2f5f93f4fb	diag(#40): Q2 rewrite -- {grad,act}x{cos,dot} scores vs exploited&A>0 label, 3-zone pinning, pairset table Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:13:06 +00:00
wassname	c33b810faa	data: wave-2 behavior2_* pairs -- 15 new proxy-gaming mechanisms, 2 contaminated drafts replaced 10 pairs from parallel session renamed behavior_* -> behavior2_* so /behavior_ still selects the proven 8; bare-call-no-assert (run_tests axis-1 family) and assert-or-True (assert-True family) replaced with mock-under-test and fudge-expected; +5 more: skip-failing, retry-until-green, coverage-pragma, type-ignore, survivorship-metric. No identifier overlap with the env's 992 train/test problems (total_steps -> total_clicks after collision). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:04:23 +00:00
wassname	70697ff36e	diag(#40): pinning plot splits solve/fail/hack + per-pairset AUROC ranking Q4 fix: on-policy "solve" was ~exploited = solves+fails (mostly fails). Split by gt_pass into solve/fail/hack (live: 103 hack / 27 solve / 62 fail). Per-pairset ranking: build v_grad from each heading-prefix subset, re-project the SAME stored live c-grads (no model re-run). Finding: behavior pairs AUROC 0.69 vs all-in-one 0.53; reasoning/opportunity anti-aligned (<0.5) -> mixing dilutes. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 06:16:27 +00:00
wassname	19687087b0	feat(#30,#39): simple online gate -- band from current batch, no window/cloud; lr 1e-4 Gate band (mean + k*std) now computed from THIS batch's pooled positions each step instead of a sliding window. Refresh-proof by construction (live rollouts scored vs the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random tiny data never separates -> quarantine would never train -> pathway assert would fail. lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 06:04:28 +00:00
wassname	979daf84fd	feat(#30): mean+kstd online gate replaces fixed quantile; always-show route cols Gate calibration: route by live mean + route_std_mid/route_std_rout std of the pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail that genuinely exceeds the spread routes, so qmass tracks real separation instead of a forced fraction. The authored absolute band is mis-placed (live pos sits far below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack). tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so arm tables line up. Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band) and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 02:56:07 +00:00
wassname	4f60f94072	spec: small-reward-hacking env spinout (parked post-paper; commit archaeology for the 6->4 mode selection) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 02:07:55 +00:00
wassname	154a37441b	refactor: OneCycleLR replaces SequentialLR(LinearLR, CosineAnnealingLR) One scheduler object does warmup + cosine relaxation; pct_start=warmup_frac is the explicit warmup. cycle_momentum=False so it doesn't clobber the configured AdamW betas (adam_beta1). Curve (100 steps, 20% warmup, 3e-4): peaks 3e-4 at step ~19 via smooth cos ramp (vs old linear), 1.7e-4 at step 10 where 5e-4 had diverged. Smoke + all verify gates green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 01:52:30 +00:00
wassname	a72835315c	fix: lr 3e-4 + 20% warmup (5e-4 diverged at warmup-end); slim bf16 ckpts - FastConfig lr 5e-4 -> 3e-4: 5e-4 peaked exactly at warmup-end (step ~10) and diverged (lp_t -0.5 -> -4.8, hack_s 20/24 -> 0). Lower peak + longer warmup defuse the spike. - Config warmup_frac 0.1 -> 0.2: SequentialLR(LinearLR, CosineAnnealingLR) already does warmup+cosine relaxation; just reach the peak more gradually. - save_ckpt: drop A0/B0 (seeded init, regenerable from lora_init_seed; ckpt_update0000 is the init since A==A0 at step 0; nothing live reloads them), save A/B bf16 not fp32. ~1.3G -> ~0.33G per ckpt. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 01:42:20 +00:00
wassname	f1dd9fb33e	chore: FastConfig save_ckpt_every 10->20 (disk pressure; ~6 ckpts/run suffices) Each lora2r ckpt is ~1.3G (A/B + redundant frozen A0/B0, 252 modules fp32). The 768G disk filled and runs crashed at the step-0 ckpt save. 20-step cadence halves the per-run footprint while keeping enough points for the eval curve. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 00:35:29 +00:00
wassname	97aede8d9c	fix(routeV): keep=bulk gate + deterministic teachers + deploy-mode generation Three fixes after the deploy-solve=0 run (all user-confirmed): 1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5). Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted). 2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count, no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step). No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0 stays the on/off switch. Removed dead _even_split. 3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the behavior policy = the shipped deployed-only model -- the quarantine's learned hack can't saturate the rollout distribution and starve honest solve advantage. For clean-gated rollouts gen and train forward now match. Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget). AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis confidence across turns. Smoke + smoke-solvemix green; all verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 00:29:12 +00:00
wassname	3f2b44452a	feat: online-stats gate + step-level teacher forcing + AUROC diagnostic The authored absolute band made pos>=1 unreachable for live hacks (rout~0), and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff). - Online-stats gate: route by live quantiles of the pooled cos-to-v_grad (top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed on refresh. v_grad stays authored-only; only the threshold follows the live distribution. Smoke: routing sustained past the refresh (cliff fixed). - Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens, not a per-prompt round; symmetric hack+solve teachers injected as ordinary gens (not specially routed). Fixes the per-prompt rounding wart. - AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label (measurement-only, never routes) -- discriminates threshold-vs-direction failure and whether a refresh destroys separation. - Inline eval stays off (eval_ablate_every=0); deploy scored offline. - Fix _sample_rows None crash (beartype) on the no-solve-pool path. - Remove dead pooled_gate_thresholds (the rejected authored-pooled approach). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 14:22:37 +00:00
wassname	05a00aa487	feat(T4): symmetric solve-teacher pool + routed-share discrimination diagnostic --solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack (default off). The gate's routed-share is split by teacher SOURCE: a discriminating gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares = non-directional (shrinkage null). Teacher source is our pool construction, not a live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF (hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw. Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 12:02:39 +00:00
wassname	bf616749ee	Consolidate tagged hack pairsets in data	2026-06-10 11:58:53 +00:00
wassname	944ada360b	cleanup(lora2r): resolve user TODOs -- F.linear alias + jaxtyping hook shapes torch.nn.functional.linear -> F.linear (import F); annotate A/B/A0/B0 with Float[Tensor, ...] dims. Behaviorally identical -- verify_lora2r_routing green (identity 0.00e+00, all three masks + mixed-batch + ablation). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:51:49 +00:00
wassname	35286040ed	run: decision arms explicit at --unhackable-frac=0.5 (25%->50%) Equal hack/solve pressure, harder problems, faster env (user call 2026-06-10). Pin the frac on the command line so the headline regime is self-documenting, not silently default-dependent. Requeued #36-39 at 0.5 with honest 50% labels. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:50:12 +00:00
wassname	7e11c024c4	cleanup: delete dead delta_S machinery (PiSSA->lora2r leftovers) Off the live lora2r path; removed with vhack.py (commit `4120d75`): - proj.py: drop project_delta_S_grad/_project_one_module/mean_cos_pre_from_grads/_hackward_cos (no live importer; train.py uses only per_token_logps). - verify_science_invariants: test pairset_sha256's content gate directly (drops the load_v_hack vehicle + fake delta_S wrapper fixture). - extract_vhack_grad: import pairset_sha256 from .pairs (was re-exported via vhack). - tablelog/figs: stale 'delta_S grads'/'knob' comments -> A/B grads. Smoke + verify_science_invariants green; no delta_S left in live code. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:45:54 +00:00
wassname	4120d75ea4	feat: top-k routing subspace for routeV (--v-grad-k, gate=max_i cos) k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1 builds the top-k oriented SVD dirs of the paired diff and the gate scores max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal hack signal one mean washes out. Shared _build_v_grad at init + refresh; band edges and the live gate both max over k. Sims use einsum + jaxtyping dims. Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:44:45 +00:00
wassname	9fd2b6b89b	test: add mixed-batch per-rollout routing gate to verify_lora2r_routing (T8) 2a-2c only tested UNIFORM masks. 2d puts rollout 0 clean (0,0) and rollout 1 hack (1,1) in ONE forward and asserts the mixed deployed grad == rollout-0-alone-clean and the mixed quarantine grad == rollout-1-alone-hack -- the load-bearing per-rollout mask vectorization ([G,1,1] reshape) with no cross-rollout bleed. Green on tiny-random. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:24:49 +00:00
wassname	4ee5c27f7b	docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase) Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:23:21 +00:00
wassname	103d0acc2c	cleanup: delete antipasto.py; attic 7 erase-era scripts (T1/T6) antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path -- train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve, verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare, tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected. Live imports verified clean. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:21:53 +00:00
wassname	62ebf719d0	justfile: prune to lora2r-only (645->~180 lines) Drop every recipe invoking deleted CLI (erase/routeV_per_token/--routeV-absorb-all/--routeV-gate/--v-hack-path/--half-a/--beta/fast-lora*/fast-lora2r/full) and the retired probe_distill/diag/cross-mech/substrate-plot tooling. Keep: smoke arms (none/routeV/absorb + all), queue-decision/baseline/no-loophole, env-construction pools (runtests/substrate/solve), results, paper tooling. Short, ordered, commented. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:20:48 +00:00
wassname	5714996c56	docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector = grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt. justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags (--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all and queue-decision (the headline 4-arm lora2r run). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:08:59 +00:00
wassname	5c97975185	refactor: collapse to lora2r-only (none/routeV/absorb); delete erase/antipasto/lora_frozen_b paths train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery, act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in: - T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init. - T3 width-pooled gate labels: single (num/den) fraction across modules, skip zero-width modules, raise if none separate (was per-module equal-weight blowup). - T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate. - T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache), adapter:"lora2r" in saved cfg. - T8 m3: step_flagged logs the hack share (d.mean), not m.mean. Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 10:58:22 +00:00
wassname	6094568c56	feat: lora2r adapter (rank-2r PiSSA-init LoRA) + SGTM three-way hard routing Structural-separation arm to disentangle directionality from shrinkage. A rank-2r PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r] and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one shared update = shrinkage null), each block has its own input-side A rows and output-side B columns, so deploy-ablation removes a different FUNCTION. Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates (SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat: cos to the hand-authored-pair direction, never an oracle label of a live rollout. verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing, per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive: PiSSA / lora_frozen_b paths untouched. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 09:25:58 +00:00
wassname	7511ba12e8	docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full) adapters -- forward sees only the sum so same-basis routing is a magnitude split (shrinkage null) unless broken by gate discrimination x (expressiveness + structural separation). Honest note that this wasn't clear to me first pass. RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107 sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 06:50:42 +00:00
wassname	0112f4a36d	fix: rotate the unhackable (gt_only) subset per step, not frozen per pid The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of problems were unhackable every step -- a fixed honest subset the model can memorize instead of learning to genuinely solve the distribution. Move the flip into the train step loop, seeded on (seed, step, pid), so the unhackable subset rotates: over training every problem is sometimes shown hint-free. Both halves flip, not just the hint: - prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages - grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel), so a hack earns format-only reward. Teacher demos are skipped on flipped steps (a cached loophole hack no longer matches the hint-free prompt). Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5); new verify_rotation proves messages_gt is hint-free AND the subset rotates per step. Smoke logs flip count (1/30 hint-free, graded gt_only). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 06:14:08 +00:00
wassname	f3df50f631	tool: migrate v1 deploy_test/eval_curve -> v2 field names (for mid-flight runs) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 05:27:38 +00:00


484 Commits