Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned
per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack
expert: GRPO flows into the router through the soft weight w (it concentrates
hack-like rollouts in the hack expert), and a continuous pin loss on the
hand-authored pairs anchors the axis. No load balancing; routing is per rollout.
lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed
grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py
gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for
the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.
Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.
gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).
Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three fixes after the deploy-solve=0 run (all user-confirmed):
1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the
bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5).
Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep
is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted).
2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-
covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count,
no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step).
No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0
stays the on/off switch. Removed dead _even_split.
3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the
behavior policy = the shipped deployed-only model -- the quarantine's learned hack
can't saturate the rollout distribution and starve honest solve advantage. For
clean-gated rollouts gen and train forward now match.
Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget).
AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis
confidence across turns. Smoke + smoke-solvemix green; all verify gates pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.
justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full)
adapters -- forward sees only the sum so same-basis routing is a magnitude split
(shrinkage null) unless broken by gate discrimination x (expressiveness + structural
separation). Honest note that this wasn't clear to me first pass.
RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107
sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Point future agents at the three docs that pin the actual thesis + the
live open question (direction vs routing vs SVD/PiSSA prior), so they don't
re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
checkpoints crash loudly rather than silently defaulting).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- spec.md never existed at root or docs/; removed the link from AGENTS.md +
README.md (the live plan is in docs/spec/ dated files).
- RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed.
- Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py
(that file doesn't exist); kept the general 'gate every load-bearing
invariant in the same commit' rule.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.
- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
= cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
no-cheat reference at AGENTS.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
save_eval_ckpts (default on): write the deploy adapter (δS only, ~2.3MB) at each
deploy-eval step, step-tagged, so a run can be re-scored later (more prompts /
different eval) without retraining. The A5 run saved only final+first_hack, which
is why the leak needed a full retrain rather than a rescore.
AGENTS.md: every load-bearing invariant gets a verify_*.py gate. The no-cheat leak
shipped because the green gates never covered the property -- 'tests passed' is
meaningless if the property was never tested.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Deletes 7 source files that were superseded but never removed:
run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
probe_uat.py (UAT pipeline is past).
Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now
just smoke / smoke-vanilla).
Verified by running just smoke-vanilla --steps=2 end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>