Commit Graph

488 Commits

Author SHA1 Message Date
wassname 8f39c4a69f docs: rewrite Evil MoE spec to the soft-routing design + literature evidence
Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer).
Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin
  loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2
  = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes
  the leak but needs a router that catches all hacks; soft keeps absorption available
  but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working
  precedents (single bad direction: emergent misalignment/persona/refusal; behavioural
  expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation
  + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not
  (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the
  decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).
Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-14 13:06:38 +08:00
wassname 04a98b321e feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert
Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned
per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack
expert: GRPO flows into the router through the soft weight w (it concentrates
hack-like rollouts in the hack expert), and a continuous pin loss on the
hand-authored pairs anchors the axis. No load balancing; routing is per rollout.

lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed
grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py
gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for
the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-14 11:25:14 +08:00
wassname cca7150ea0 tidy 2026-06-14 11:05:54 +08:00
wassname c4ac632b37 docs: add Evil MoE experiment proposal 2026-06-14 09:28:16 +08:00
wassname 41d225a5ec writeup 2026-06-12 04:46:01 +00:00
wassname af420ec855 feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method
Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-12 03:22:48 +00:00
wassname ec11bf58b2 docs: update method descriptions for activation routing 2026-06-11 13:22:13 +00:00
wassname 668285c86e journal: fix pueue IDs 28-30 -> 31-33 in entry (f) provenance
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 13:08:20 +00:00
wassname 8000aa48f4 journal(#41): entry (g) routeA shipped + guard-drop calibration; track moduleS diag scripts
Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS
scripts were untracked, so committing them for provenance alongside the
calibration script the new entry cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 12:50:07 +00:00
wassname f646e57028 fix(#41): routeA gate hardening from fresh-eyes review
- assert finite gate scores (one NaN act would poison the rolling buffer ->
  NaN thresholds -> silent all-keep routing for up to route_buffer rollouts)
- d = d*m: degenerate Otsu (t_lo==t_hi) could emit (m=0,d=1), a mask state
  that trains nothing; rout now requires the quarantine on
- solve-mix discrimination collects only when solve teachers exist (a
  hack-only [False]*n list tripped the end-of-run line with a nan solve side)
- warn when authored-pair AUROC < 0.9 (extraction sanity; tiny-random smoke
  sits ~0.78 so warning, not assert)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 12:48:11 +00:00
wassname adca442253 feat(#41): routeA activation gate replaces routeV grad gate
Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on
the no-grad logpi_old forward (quarantine-ablated, matching the sampling
policy); masks are pinned BEFORE the single grad-carrying forward, so the
grad-gate's pass-1 backward is gone. Thresholds: rolling 256-act buffer,
z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until
128 scores. Buffer stores pooled acts and re-scores against the current v_act,
so the forward-only refresh (every 5 steps) needs no flush. No bimodality
guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every
condition including pure Gaussians, so no shape statistic discriminates.

Deleted with the arm wiring (rename-on-logic-change: routeA never conflates
with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges,
_pair_cos, the pass-1 autograd.grad block, grad_probe training wiring,
v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe.
c-probe stays in lora2r.py for scripts/diag_pinning.py only.

verify_science_invariants: all-in-one count 27 -> 42 (stale since c33b810
added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset.

Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu
zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 12:38:19 +00:00
wassname 5a340e5c3e feat(routeA): forward-only v_act extraction + verify gate vs cached diag features
extract_v_act: pooled completion-token bottleneck act per module, v = unit-norm
mean pair diff (tstat flag default off, null at n=8 pairs). ActCapture is the
single hook shared by extraction, the live gate, and verification.
verify_v_act (pueue #24): rel diff 7.3e-4 hack / 7.7e-4 clean vs
out/diag/pinning_feats.pt on the v3 first_hack ckpt; min per-module cos 0.99997.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 12:13:32 +00:00
wassname d51028a618 user 2026-06-11 12:13:13 +00:00
wassname 77fa5bbf6b spec: routeA plan approved; deletion scope extended to extract_vhack_grad + all grad-gate helpers
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:50:20 +00:00
wassname 8b6bf5b6a6 journal(e): super-S-space and act t-stat both null; act_dot mean-extraction stays the routeA default
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:44:12 +00:00
wassname 1d4f33ffb6 diag: super-S-space gate score null; spec -> act_dot + winsorized-Otsu plan
superS (pooled writer/reader eigenbasis, whitened + top-r) tops out at
min-window AUROC 0.721 = raw resid dot; best unwhitened rotation+top-64
0.740 < act 0.747 (max of ~50-variant grid). act t-stat extraction also
null (0.719 vs 0.749 min). Spec updated: act_dot default, journal-(d)
evidence table, implementation plan for routeA.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:42:44 +00:00
wassname 1b51c62cdc journal(d): z-norm+otsu pinning fix, module weighting null, mode hypothesis falsified
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:28:37 +00:00
wassname e5b68acf69 diag(#40): z-norm scores within family, winsorized 2-threshold Otsu zones, --feats offline mode
Synthetic and live scores share an axis with meaningful zero (raw scores carry a
common <mu,v> offset since v = mean diff is not orthogonal to the family mean).
Zones come from label-free online stats (EMA mean/std + Otsu valley), replacing
mean+k*sd which placed both cuts beyond every distribution. Winsorize at 1/99%
before Otsu: variance-maximizing cuts otherwise buy a class for one outlier.
Fresh-eyes review verified z-norm is affine (AUROCs unchanged), zones label-free,
P/R recompute from parquet matches titles.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 11:23:42 +00:00
wassname 270c4f5a27 misc 2026-06-11 11:07:28 +00:00
wassname 7871aa66b8 tidy 2026-06-11 11:07:17 +00:00
wassname 24aea19bee diag(#40): offline follow-up -- pooling variants, synthetic common-mode, env_mode join
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:58:16 +00:00
wassname bbf6a97151 diag(#40): fix 3x2 plot -- keep bottom spine at axis edge (tick labels collided with rugs), skip labels on narrow zones
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:47:57 +00:00
wassname 4be1858b14 journal(c) addendum: resid_cos 0.92/0.70/0.80 vs act_cos -- bottleneck not the limiter, routeA defaults to resid
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:46:06 +00:00
wassname 0660e7bdd3 diag(#40): residual-stream rep (layers 12/18/24) as third gate-score candidate, 3x2 panel
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:45:50 +00:00
wassname 4a7465c0da journal(c): act score replicates (0.75-0.87), grad decays to chance across windows; routeA next
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:44:40 +00:00
wassname 49ca51b6e1 diag(#40): review fixes -- A>0 contrast headline (adv-only baseline 0.90 vs-all / 0.58 A>0), headline=behavior_ training default, n at P@rout, rug strips, drop density ticks
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:22:08 +00:00
wassname 4644af155a docs: vocabulary section -- routing, vector, pinning, absorption
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:13:06 +00:00
wassname 2f5f93f4fb diag(#40): Q2 rewrite -- {grad,act}x{cos,dot} scores vs exploited&A>0 label, 3-zone pinning, pairset table
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:13:06 +00:00
wassname c33b810faa data: wave-2 behavior2_* pairs -- 15 new proxy-gaming mechanisms, 2 contaminated drafts replaced
10 pairs from parallel session renamed behavior_* -> behavior2_* so /behavior_
still selects the proven 8; bare-call-no-assert (run_tests axis-1 family) and
assert-or-True (assert-True family) replaced with mock-under-test and
fudge-expected; +5 more: skip-failing, retry-until-green, coverage-pragma,
type-ignore, survivorship-metric. No identifier overlap with the env's 992
train/test problems (total_steps -> total_clicks after collision).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:04:23 +00:00
wassname 70697ff36e diag(#40): pinning plot splits solve/fail/hack + per-pairset AUROC ranking
Q4 fix: on-policy "solve" was ~exploited = solves+fails (mostly fails). Split by
gt_pass into solve/fail/hack (live: 103 hack / 27 solve / 62 fail). Per-pairset
ranking: build v_grad from each heading-prefix subset, re-project the SAME stored
live c-grads (no model re-run). Finding: behavior pairs AUROC 0.69 vs all-in-one
0.53; reasoning/opportunity anti-aligned (<0.5) -> mixing dilutes.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 06:16:27 +00:00
wassname 19687087b0 feat(#30,#39): simple online gate -- band from current batch, no window/cloud; lr 1e-4
Gate band (mean + k*std) now computed from THIS batch's pooled positions each step
instead of a sliding window. Refresh-proof by construction (live rollouts scored vs
the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window
config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random
tiny data never separates -> quarantine would never train -> pathway assert would fail.

lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean
emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 06:04:28 +00:00
wassname 979daf84fd feat(#30): mean+k*std online gate replaces fixed quantile; always-show route cols
Gate calibration: route by live mean + route_std_mid/route_std_rout * std of the
pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail
that genuinely exceeds the spread routes, so qmass tracks real separation instead
of a forced fraction. The authored absolute band is mis-placed (live pos sits far
below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack).

tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so
arm tables line up.

Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band)
and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked
v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 02:56:07 +00:00
wassname 4f60f94072 spec: small-reward-hacking env spinout (parked post-paper; commit archaeology for the 6->4 mode selection)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 02:07:55 +00:00
wassname 154a37441b refactor: OneCycleLR replaces SequentialLR(LinearLR, CosineAnnealingLR)
One scheduler object does warmup + cosine relaxation; pct_start=warmup_frac
is the explicit warmup. cycle_momentum=False so it doesn't clobber the
configured AdamW betas (adam_beta1). Curve (100 steps, 20% warmup, 3e-4):
peaks 3e-4 at step ~19 via smooth cos ramp (vs old linear), 1.7e-4 at step
10 where 5e-4 had diverged. Smoke + all verify gates green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 01:52:30 +00:00
wassname a72835315c fix: lr 3e-4 + 20% warmup (5e-4 diverged at warmup-end); slim bf16 ckpts
- FastConfig lr 5e-4 -> 3e-4: 5e-4 peaked exactly at warmup-end (step ~10)
  and diverged (lp_t -0.5 -> -4.8, hack_s 20/24 -> 0). Lower peak + longer
  warmup defuse the spike.
- Config warmup_frac 0.1 -> 0.2: SequentialLR(LinearLR, CosineAnnealingLR)
  already does warmup+cosine relaxation; just reach the peak more gradually.
- save_ckpt: drop A0/B0 (seeded init, regenerable from lora_init_seed;
  ckpt_update0000 is the init since A==A0 at step 0; nothing live reloads
  them), save A/B bf16 not fp32. ~1.3G -> ~0.33G per ckpt.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 01:42:20 +00:00
wassname f1dd9fb33e chore: FastConfig save_ckpt_every 10->20 (disk pressure; ~6 ckpts/run suffices)
Each lora2r ckpt is ~1.3G (A/B + redundant frozen A0/B0, 252 modules fp32). The
768G disk filled and runs crashed at the step-0 ckpt save. 20-step cadence halves
the per-run footprint while keeping enough points for the eval curve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 00:35:29 +00:00
wassname 97aede8d9c fix(routeV): keep=bulk gate + deterministic teachers + deploy-mode generation
Three fixes after the deploy-solve=0 run (all user-confirmed):

1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the
   bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5).
   Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep
   is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted).

2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-
   covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count,
   no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step).
   No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0
   stays the on/off switch. Removed dead _even_split.

3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the
   behavior policy = the shipped deployed-only model -- the quarantine's learned hack
   can't saturate the rollout distribution and starve honest solve advantage. For
   clean-gated rollouts gen and train forward now match.

Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget).
AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis
confidence across turns. Smoke + smoke-solvemix green; all verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 00:29:12 +00:00
wassname 3f2b44452a feat: online-stats gate + step-level teacher forcing + AUROC diagnostic
The authored absolute band made pos>=1 unreachable for live hacks (rout~0),
and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff).

- Online-stats gate: route by live quantiles of the pooled cos-to-v_grad
  (top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed
  on refresh. v_grad stays authored-only; only the threshold follows the live
  distribution. Smoke: routing sustained past the refresh (cliff fixed).
- Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens,
  not a per-prompt round; symmetric hack+solve teachers injected as ordinary
  gens (not specially routed). Fixes the per-prompt rounding wart.
- AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label
  (measurement-only, never routes) -- discriminates threshold-vs-direction
  failure and whether a refresh destroys separation.
- Inline eval stays off (eval_ablate_every=0); deploy scored offline.
- Fix _sample_rows None crash (beartype) on the no-solve-pool path.
- Remove dead pooled_gate_thresholds (the rejected authored-pooled approach).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 14:22:37 +00:00
wassname 05a00aa487 feat(T4): symmetric solve-teacher pool + routed-share discrimination diagnostic
--solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack
(default off). The gate's routed-share is split by teacher SOURCE: a discriminating
gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares =
non-directional (shrinkage null). Teacher source is our pool construction, not a
live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF
(hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw.
Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 12:02:39 +00:00
wassname bf616749ee Consolidate tagged hack pairsets in data 2026-06-10 11:58:53 +00:00
wassname 944ada360b cleanup(lora2r): resolve user TODOs -- F.linear alias + jaxtyping hook shapes
torch.nn.functional.linear -> F.linear (import F); annotate A/B/A0/B0 with
Float[Tensor, ...] dims. Behaviorally identical -- verify_lora2r_routing green
(identity 0.00e+00, all three masks + mixed-batch + ablation).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:51:49 +00:00
wassname 35286040ed run: decision arms explicit at --unhackable-frac=0.5 (25%->50%)
Equal hack/solve pressure, harder problems, faster env (user call 2026-06-10).
Pin the frac on the command line so the headline regime is self-documenting,
not silently default-dependent. Requeued #36-39 at 0.5 with honest 50% labels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:50:12 +00:00
wassname 7e11c024c4 cleanup: delete dead delta_S machinery (PiSSA->lora2r leftovers)
Off the live lora2r path; removed with vhack.py (commit 4120d75):
- proj.py: drop project_delta_S_grad/_project_one_module/mean_cos_pre_from_grads/
  _hackward_cos (no live importer; train.py uses only per_token_logps).
- verify_science_invariants: test pairset_sha256's content gate directly (drops the
  load_v_hack vehicle + fake delta_S wrapper fixture).
- extract_vhack_grad: import pairset_sha256 from .pairs (was re-exported via vhack).
- tablelog/figs: stale 'delta_S grads'/'knob' comments -> A/B grads.
Smoke + verify_science_invariants green; no delta_S left in live code.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:45:54 +00:00
wassname 4120d75ea4 feat: top-k routing subspace for routeV (--v-grad-k, gate=max_i cos)
k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1
builds the top-k oriented SVD dirs of the paired diff and the gate scores
max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal
hack signal one mean washes out. Shared _build_v_grad at init + refresh; band
edges and the live gate both max over k. Sims use einsum + jaxtyping dims.
Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:44:45 +00:00
wassname 9fd2b6b89b test: add mixed-batch per-rollout routing gate to verify_lora2r_routing (T8)
2a-2c only tested UNIFORM masks. 2d puts rollout 0 clean (0,0) and rollout 1 hack
(1,1) in ONE forward and asserts the mixed deployed grad == rollout-0-alone-clean
and the mixed quarantine grad == rollout-1-alone-hack -- the load-bearing
per-rollout mask vectorization ([G,1,1] reshape) with no cross-rollout bleed.
Green on tiny-random.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:24:49 +00:00
wassname 4ee5c27f7b docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase)
Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r
adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way
masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live
arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:23:21 +00:00
wassname 103d0acc2c cleanup: delete antipasto.py; attic 7 erase-era scripts (T1/T6)
antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path --
train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that
import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve,
verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare,
tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected.
Live imports verified clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:21:53 +00:00
wassname 62ebf719d0 justfile: prune to lora2r-only (645->~180 lines)
Drop every recipe invoking deleted CLI (erase/routeV_per_token/--routeV-absorb-all/
--routeV-gate/--v-hack-path/--half-a/--beta/fast-lora*/fast-lora2r/full) and the
retired probe_distill/diag/cross-mech/substrate-plot tooling. Keep: smoke arms
(none/routeV/absorb + all), queue-decision/baseline/no-loophole, env-construction
pools (runtests/substrate/solve), results, paper tooling. Short, ordered, commented.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:20:48 +00:00
wassname 5714996c56 docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.

justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:08:59 +00:00
wassname 5c97975185 refactor: collapse to lora2r-only (none/routeV/absorb); delete erase/antipasto/lora_frozen_b paths
train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter
and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery,
act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the
v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in:
- T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init.
- T3 width-pooled gate labels: single (num/den) fraction across modules, skip
  zero-width modules, raise if none separate (was per-module equal-weight blowup).
- T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate.
- T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache),
  adapter:"lora2r" in saved cfg.
- T8 m3: step_flagged logs the hack share (d.mean), not m.mean.

Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb
end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 10:58:22 +00:00