Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer).
Rewrite to match what is implemented and what we clarified:
- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin
loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2
= sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes
the leak but needs a router that catches all hacks; soft keeps absorption available
but leaks a (1-w) share of the gradient into the deployed block. DEMix only bites if
we rely on absorption.
- Evidence section from two literature searches. Forced localization has working
precedents (single bad direction: emergent misalignment/persona/refusal; behavioural
expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation
+ repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not
(standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the
decisive causal run.
README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).
Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).
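A quick check of the "softmax over 2 = sigmoid of one direction" claim, as a minimal sketch. The names `softmax2`, `rank1_router_weight`, `u_hack` and `u_keep` are hypothetical, not taken from the repo; plain dot products stand in for the cosine scoring, and the v_act seeding is left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax2(z_hack, z_keep):
    """Weight on the hack expert from a 2-way softmax over router logits."""
    m = max(z_hack, z_keep)  # subtract the max for numerical stability
    e_h, e_k = math.exp(z_hack - m), math.exp(z_keep - m)
    return e_h / (e_h + e_k)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rank1_router_weight(h, u_hack, u_keep):
    """Same weight from ONE direction: the difference of the two router rows."""
    direction = [a - b for a, b in zip(u_hack, u_keep)]
    return sigmoid(dot(h, direction))

# Illustrative (made-up) pooled activation and router rows.
h = [0.3, -1.2, 0.7]
u_hack = [1.0, 0.5, -0.2]
u_keep = [-0.4, 0.1, 0.9]
w_two_way = softmax2(dot(h, u_hack), dot(h, u_keep))
w_rank1 = rank1_router_weight(h, u_hack, u_keep)
```

Both weights agree to floating-point precision, so only the difference direction is identifiable with 2 experts.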
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned
per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack
expert: GRPO flows into the router through the soft weight w (it concentrates
hack-like rollouts in the hack expert), and a continuous pin loss on the
hand-authored pairs anchors the axis. No load balancing; routing is per rollout.
lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed
grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py
gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for
the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.
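Why a mismatched ρ blows up only for A<0, sketched with the standard PPO clipped surrogate. This assumes the loop uses the usual min(ρA, clip(ρ)A) form; `ppo_clip_objective` and eps=0.2 are illustrative, not the repo's values.

```python
def ppo_clip_objective(rho, adv, eps=0.2):
    """Per-token clipped surrogate (to maximise): min(rho*A, clip(rho)*A)."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return min(rho * adv, clipped * adv)

# A > 0: the gain is capped at (1+eps)*A however large rho grows.
capped = ppo_clip_objective(1000.0, 1.0)
# A < 0: the min selects rho*A, which is unbounded below as rho grows,
# so loss = -objective explodes when rho is not a true importance ratio.
unbounded = ppo_clip_objective(1000.0, -1.0)
# rho = 1 (forward mode matches sampling mode): plain policy gradient.
on_policy = ppo_clip_objective(1.0, -1.0)
```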
Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.
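A minimal sketch of the symmetric-tail gate over an act buffer. `TailGate`, `push` and `zone` are hypothetical names; the zone labels (low tail keep, high tail rout, middle absorb) are an assumption; warmup and the pooling of acts are omitted. Because the buffer holds acts rather than scores, thresholds are recomputed against whatever v is passed in, which is why a refresh needs no flush.

```python
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantile(sorted_xs, q):
    """Linear-interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

class TailGate:
    def __init__(self, maxlen=8192, tail_q=0.1):
        self.acts = deque(maxlen=maxlen)  # raw acts, re-scored on demand
        self.tail_q = tail_q

    def push(self, act):
        self.acts.append(list(act))

    def thresholds(self, v):
        scores = sorted(dot(a, v) for a in self.acts)
        return quantile(scores, self.tail_q), quantile(scores, 1.0 - self.tail_q)

    def zone(self, act, v):
        t_lo, t_hi = self.thresholds(v)
        s = dot(act, v)
        if s >= t_hi:
            return "rout"
        if s <= t_lo:
            return "keep"
        return "absorb"

gate = TailGate()
for i in range(101):
    gate.push([float(i)])  # toy 1-d acts
```

With the toy buffer above, `gate.zone([5.0], [1.0])` lands in the low tail, and the same act re-scored against `[-1.0]` lands in the high tail without touching the buffer.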
gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).
Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS
scripts were untracked, so committing them for provenance alongside the
calibration script the new entry cites.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- assert finite gate scores (one NaN act would poison the rolling buffer ->
NaN thresholds -> silent all-keep routing for up to route_buffer rollouts)
- d = d*m: degenerate Otsu (t_lo==t_hi) could emit (m=0,d=1), a mask state
that trains nothing; rout now requires the quarantine on
- solve-mix discrimination collects only when solve teachers exist (a
hack-only [False]*n list tripped the end-of-run line with a nan solve side)
- warn when authored-pair AUROC < 0.9 (extraction sanity; tiny-random smoke
sits ~0.78 so warning, not assert)
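The first two guards, sketched. `check_finite` and `masks_from_thresholds` are hypothetical helpers, and the exact comparison operators are an assumption chosen to reproduce the degenerate (m=0,d=1) state when t_lo==t_hi.

```python
import math

def check_finite(scores):
    """Fail loudly before a NaN/inf score can enter the rolling buffer."""
    bad = [i for i, s in enumerate(scores) if not math.isfinite(s)]
    if bad:
        raise ValueError(f"non-finite gate score at rollouts {bad}")
    return scores

def masks_from_thresholds(scores, t_lo, t_hi):
    """Per-rollout (m, d): m = quarantine on, d = routed to the hack block."""
    m = [1.0 if s > t_lo else 0.0 for s in scores]
    d = [1.0 if s >= t_hi else 0.0 for s in scores]
    # d = d*m: with t_lo == t_hi a score equal to the cut would give
    # (m=0, d=1); multiplying by m forces it back to keep (0, 0).
    d = [di * mi for di, mi in zip(d, m)]
    return m, d

m, d = masks_from_thresholds(check_finite([0.0, 1.0, 2.0]), t_lo=1.0, t_hi=1.0)
```

After the product, only (0,0) keep, (1,0) absorb and (1,1) rout can occur.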
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on
the no-grad logpi_old forward (quarantine-ablated, matching the sampling
policy); masks are pinned BEFORE the single grad-carrying forward, so the
grad-gate's pass-1 backward is gone. Thresholds: rolling 256-act buffer,
z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until
128 scores. Buffer stores pooled acts and re-scores against the current v_act,
so the forward-only refresh (every 5 steps) needs no flush. No bimodality
guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every
condition including pure Gaussians, so no shape statistic discriminates.
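The threshold step, sketched on plain lists: winsorize at 1/99, z-normalize, then pick the two cuts that maximise between-class variance. This is an exhaustive search over sorted values, not necessarily how the repo computes Otsu (a histogram version would be equivalent in spirit); all function names are hypothetical and at least 3 scores are assumed.

```python
import statistics

def winsorize(xs, lo_q=0.01, hi_q=0.99):
    """Clamp values to the [lo_q, hi_q] empirical quantiles."""
    s = sorted(xs)
    lo = s[int(lo_q * (len(s) - 1))]
    hi = s[int(hi_q * (len(s) - 1))]
    return [min(max(x, lo), hi) for x in xs]

def otsu_two_thresholds(xs):
    """3-class Otsu: the two cuts maximising between-class variance."""
    s = sorted(xs)
    n = len(s)
    prefix = [0.0]
    for x in s:
        prefix.append(prefix[-1] + x)
    total_mean = prefix[-1] / n

    def between(i, j):  # classes are s[:i], s[i:j], s[j:]
        var = 0.0
        for a, b in ((0, i), (i, j), (j, n)):
            cnt = b - a
            mean = (prefix[b] - prefix[a]) / cnt
            var += cnt * (mean - total_mean) ** 2
        return var

    pairs = ((i, j) for i in range(1, n - 1) for j in range(i + 1, n))
    i, j = max(pairs, key=lambda ij: between(*ij))
    # Place each cut midway between the neighbouring sorted values.
    return (s[i - 1] + s[i]) / 2.0, (s[j - 1] + s[j]) / 2.0

def gate_thresholds(scores):
    """Winsorize, z-normalize over the buffer, then two-threshold Otsu."""
    w = winsorize(scores)
    mu, sd = statistics.fmean(w), statistics.pstdev(w)
    z = [(x - mu) / sd for x in w]
    return otsu_two_thresholds(z)

toy = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
raw_lo, raw_hi = otsu_two_thresholds(toy)
z_lo, z_hi = gate_thresholds(toy)
```

On the three toy clusters the cuts fall in the two gaps. On a unimodal buffer the same search still returns two cuts, which is the reason a shape guard adds nothing here.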
Deleted with the arm wiring (rename-on-logic-change: routeA never conflates
with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges,
_pair_cos, the pass-1 autograd.grad block, grad_probe training wiring,
v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe.
c-probe stays in lora2r.py for scripts/diag_pinning.py only.
verify_science_invariants: all-in-one count 27 -> 42 (stale since c33b810
added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset.
Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu
zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
extract_v_act: pooled completion-token bottleneck act per module, v = unit-norm
mean pair diff (tstat flag default off, null at n=8 pairs). ActCapture is the
single hook shared by extraction, the live gate, and verification.
verify_v_act (pueue #24): rel diff 7.3e-4 hack / 7.7e-4 clean vs
out/diag/pinning_feats.pt on the v3 first_hack ckpt; min per-module cos 0.99997.
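The direction itself, as a sketch for one module. Mean pooling over completion tokens is an assumption (the commit only says "pooled"); `mean_pool` and `unit_mean_pair_diff` are hypothetical names and the hook plumbing is omitted.

```python
import math

def mean_pool(token_acts):
    """Average per-token activation vectors over the completion tokens."""
    n = len(token_acts)
    return [sum(col) / n for col in zip(*token_acts)]

def unit_mean_pair_diff(hack_acts, clean_acts):
    """v = mean over pairs of (hack - clean), scaled to unit norm."""
    diffs = [[h - c for h, c in zip(hv, cv)]
             for hv, cv in zip(hack_acts, clean_acts)]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("pairs do not separate: zero mean difference")
    return [x / norm for x in mean]

# Two toy pairs; each side is already pooled to one vector per completion.
hack = [mean_pool([[2.0, 0.0], [4.0, 0.0]]), [5.0, 0.0]]
clean = [[1.0, 0.0], [1.0, 0.0]]
v = unit_mean_pair_diff(hack, clean)
```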
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Synthetic and live scores share an axis with meaningful zero (raw scores carry a
common <mu,v> offset since v = mean diff is not orthogonal to the family mean).
Zones come from label-free online stats (EMA mean/std + Otsu valley), replacing
mean+k*sd which placed both cuts beyond every distribution. Winsorize at 1/99%
before Otsu: variance-maximizing cuts otherwise buy a class for one outlier.
Fresh-eyes review verified z-norm is affine (AUROCs unchanged), zones label-free,
P/R recompute from parquet matches titles.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
10 pairs from parallel session renamed behavior_* -> behavior2_* so /behavior_
still selects the proven 8; bare-call-no-assert (run_tests axis-1 family) and
assert-or-True (assert-True family) replaced with mock-under-test and
fudge-expected; +5 more: skip-failing, retry-until-green, coverage-pragma,
type-ignore, survivorship-metric. No identifier overlap with the env's 992
train/test problems (total_steps -> total_clicks after collision).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Gate band (mean + k*std) now computed from THIS batch's pooled positions each step
instead of a sliding window. Refresh-proof by construction (live rollouts scored vs
the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window
config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random
tiny data never separates -> quarantine would never train -> pathway assert would fail.
lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean
emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Gate calibration: route by live mean + route_std_mid/route_std_rout * std of the
pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail
that genuinely exceeds the spread routes, so qmass tracks real separation instead
of a forced fraction. The authored absolute band is mis-placed (live pos sits far
below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack).
tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so
arm tables line up.
Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band)
and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked
v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
One scheduler object does warmup + cosine relaxation; pct_start=warmup_frac
is the explicit warmup. cycle_momentum=False so it doesn't clobber the
configured AdamW betas (adam_beta1). Curve (100 steps, 20% warmup, 3e-4):
peaks 3e-4 at step ~19 via smooth cos ramp (vs old linear), 1.7e-4 at step
10 where 5e-4 had diverged. Smoke + all verify gates green.
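The curve can be reproduced without torch. This sketch assumes the scheduler is torch's OneCycleLR (the `pct_start` and `cycle_momentum` names match it) with cosine annealing and the library defaults div_factor=25, final_div_factor=1e4; `one_cycle_lr` is a hypothetical stand-alone reimplementation of the two-phase formula, not the repo's code.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

def one_cycle_lr(step, total_steps=100, max_lr=3e-4, pct_start=0.2,
                 div_factor=25.0, final_div_factor=1e4):
    initial_lr = max_lr / div_factor         # lr at step 0
    min_lr = initial_lr / final_div_factor   # lr at the last step
    up_end = pct_start * total_steps - 1     # last warmup step (19 here)
    if step <= up_end:
        return cos_anneal(initial_lr, max_lr, step / up_end)
    down_end = total_steps - 1
    return cos_anneal(max_lr, min_lr, (step - up_end) / (down_end - up_end))

curve = [one_cycle_lr(s) for s in range(100)]
```

Under these assumed defaults the sketch peaks at step 19 and gives roughly 1.7e-4 at step 10, consistent with the numbers quoted above.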
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- FastConfig lr 5e-4 -> 3e-4: 5e-4 peaked exactly at warmup-end (step ~10)
and diverged (lp_t -0.5 -> -4.8, hack_s 20/24 -> 0). Lower peak + longer
warmup defuse the spike.
- Config warmup_frac 0.1 -> 0.2: SequentialLR(LinearLR, CosineAnnealingLR)
already does warmup+cosine relaxation; just reach the peak more gradually.
- save_ckpt: drop A0/B0 (seeded init, regenerable from lora_init_seed;
ckpt_update0000 is the init since A==A0 at step 0; nothing live reloads
them), save A/B bf16 not fp32. ~1.3G -> ~0.33G per ckpt.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Each lora2r ckpt is ~1.3G (A/B + redundant frozen A0/B0, 252 modules fp32). The
768G disk filled and runs crashed at the step-0 ckpt save. 20-step cadence halves
the per-run footprint while keeping enough points for the eval curve.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three fixes after the deploy-solve=0 run (all user-confirmed):
1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the
bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5).
Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep
is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted).
2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-
covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count,
no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step).
No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0
stays the on/off switch. Removed dead _even_split.
3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the
behavior policy = the shipped deployed-only model -- the quarantine's learned hack
can't saturate the rollout distribution and starve honest solve advantage. For
clean-gated rollouts gen and train forward now match.
Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget).
AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis
confidence across turns. Smoke + smoke-solvemix green; all verify gates pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The authored absolute band made pos>=1 unreachable for live hacks (rout~0),
and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff).
- Online-stats gate: route by live quantiles of the pooled cos-to-v_grad
(top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed
on refresh. v_grad stays authored-only; only the threshold follows the live
distribution. Smoke: routing sustained past the refresh (cliff fixed).
- Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens,
not a per-prompt round; symmetric hack+solve teachers injected as ordinary
gens (not specially routed). Fixes the per-prompt rounding wart.
- AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label
(measurement-only, never routes) -- discriminates threshold-vs-direction
failure and whether a refresh destroys separation.
- Inline eval stays off (eval_ablate_every=0); deploy scored offline.
- Fix _sample_rows None crash (beartype) on the no-solve-pool path.
- Remove dead pooled_gate_thresholds (the rejected authored-pooled approach).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
--solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack
(default off). The gate's routed-share is split by teacher SOURCE: a discriminating
gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares =
non-directional (shrinkage null). Teacher source is our pool construction, not a
live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF
(hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw.
Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Equal hack/solve pressure, harder problems, faster env (user call 2026-06-10).
Pin the frac on the command line so the headline regime is self-documenting,
not silently default-dependent. Requeued #36-39 at 0.5 with honest 50% labels.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Off the live lora2r path; removed with vhack.py (commit 4120d75):
- proj.py: drop project_delta_S_grad/_project_one_module/mean_cos_pre_from_grads/
_hackward_cos (no live importer; train.py uses only per_token_logps).
- verify_science_invariants: test pairset_sha256's content gate directly (drops the
load_v_hack vehicle + fake delta_S wrapper fixture).
- extract_vhack_grad: import pairset_sha256 from .pairs (was re-exported via vhack).
- tablelog/figs: stale 'delta_S grads'/'knob' comments -> A/B grads.
Smoke + verify_science_invariants green; no delta_S left in live code.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1
builds the top-k oriented SVD dirs of the paired diff and the gate scores
max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal
hack signal one mean washes out. Shared _build_v_grad at init + refresh; band
edges and the live gate both max over k. Sims use einsum + jaxtyping dims.
Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules).
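What the max over k buys, in a toy sketch: two orthogonal hack sub-modes whose mean direction scores a pure sub-mode gradient well below 1. `cosine` and `gate_score` are hypothetical names; the SVD that builds the directions is not shown.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def gate_score(g, directions):
    """max_i cos(g, v_i): alignment to ANY known hack sub-mode."""
    return max(cosine(g, v) for v in directions)

sub_modes = [[1.0, 0.0], [0.0, 1.0]]  # toy k=2 directions
mean_dir = [0.5, 0.5]                 # what a k=1 mean would see
g = [0.0, 2.0]                        # gradient aligned with sub-mode 2

score_topk = gate_score(g, sub_modes)
score_mean = gate_score(g, [mean_dir])
```

Here the top-k score is 1 while the mean-direction score is about 0.71, which is the "one mean washes out" effect in miniature.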
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2a-2c only tested UNIFORM masks. 2d puts rollout 0 clean (0,0) and rollout 1 hack
(1,1) in ONE forward and asserts the mixed deployed grad == rollout-0-alone-clean
and the mixed quarantine grad == rollout-1-alone-hack -- the load-bearing
per-rollout mask vectorization ([G,1,1] reshape) with no cross-rollout bleed.
Green on tiny-random.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r
adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way
masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live
arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path --
train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that
import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve,
verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare,
tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected.
Live imports verified clean.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.
justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter
and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery,
act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the
v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in:
- T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init.
- T3 width-pooled gate labels: single (num/den) fraction across modules, skip
zero-width modules, raise if none separate (was per-module equal-weight blowup).
- T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate.
- T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache),
adapter:"lora2r" in saved cfg.
- T8 m3: step_flagged logs the hack share (d.mean), not m.mean.
Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb
end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>