evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:45:42 +08:00

Author	SHA1	Message	Date
wassname	04a98b321e	feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack expert: GRPO flows into the router through the soft weight w (it concentrates hack-like rollouts in the hack expert), and a continuous pin loss on the hand-authored pairs anchors the axis. No load balancing; routing is per rollout. lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for the fork; original proposal kept as docs/spec/original_evil_moe_spec.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-14 11:25:14 +08:00
wassname	41d225a5ec	writeup	2026-06-12 04:46:01 +00:00
wassname	af420ec855	feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed in each rollout's own sampling mode, so ρ is a true importance ratio. The old always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout). Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled absorb/route rows by construction -- frac=0 is the clean point. Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1) over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean era anchors the low tail; buffer stores acts, re-scored vs current v_act so a refresh needs no flush). Removes the per-window z-norm gate-collapse on a saturated all-hack window. gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it elicits the hack and absorption can localize it. queue-decision now passes --gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the job-34 config where ablation RAISED hack 0.71->0.86). Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL 2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-12 03:22:48 +00:00
wassname	ec11bf58b2	docs: update method descriptions for activation routing	2026-06-11 13:22:13 +00:00
wassname	d51028a618	user	2026-06-11 12:13:13 +00:00
wassname	7871aa66b8	tidy	2026-06-11 11:07:17 +00:00
wassname	4644af155a	docs: vocabulary section -- routing, vector, pinning, absorption Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:13:06 +00:00
wassname	97aede8d9c	fix(routeV): keep=bulk gate + deterministic teachers + deploy-mode generation Three fixes after the deploy-solve=0 run (all user-confirmed): 1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5). Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted). 2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count, no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step). No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0 stays the on/off switch. Removed dead _even_split. 3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the behavior policy = the shipped deployed-only model -- the quarantine's learned hack can't saturate the rollout distribution and starve honest solve advantage. For clean-gated rollouts gen and train forward now match. Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget). AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis confidence across turns. Smoke + smoke-solvemix green; all verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 00:29:12 +00:00
wassname	bf616749ee	Consolidate tagged hack pairsets in data	2026-06-10 11:58:53 +00:00
wassname	5714996c56	docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector = grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt. justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags (--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all and queue-decision (the headline 4-arm lora2r run). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:08:59 +00:00
wassname	7511ba12e8	docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full) adapters -- forward sees only the sum so same-basis routing is a magnitude split (shrinkage null) unless broken by gate discrimination x (expressiveness + structural separation). Honest note that this wasn't clear to me first pass. RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107 sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 06:50:42 +00:00
wassname	b36e3db255	docs: tone down the START HERE links to plain pointers Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:36:52 +00:00
wassname	0d6ff754ec	docs: AGENTS.md START HERE links (human_journal, main.tex, grad-routing paper); revert rescore fallback - Point future agents at the three docs that pin the actual thesis + the live open question (direction vs routing vs SVD/PiSSA prior), so they don't re-derive the non-directional result as a 'bug'. - Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema checkpoints crash loudly rather than silently defaulting). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:34:06 +00:00
wassname	3b38a05738	no-cheat framing: label-leakage not detector-presence; fix plot comment The disqualifier for an intervention is needing the env oracle / ground-truth hack-labels of the live training distribution, not 'a detector ran'. On a new RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe can't be built there; a generic LLM judge and our hand-authored-pair vector can. LLM judge is thus the fair external peer (no clean fast-env number to plot). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:22:29 +00:00
wassname	9c630b83c7	agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:39:27 +00:00
wassname	5fd980244b	docs: note SGTM is the latest gradient-routing paper (same authors) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:56:58 +00:00
wassname	637f9388c8	docs: cite SGTM paper in AGENTS.md (absorption/leakage vocab source) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:40:40 +00:00
wassname	52619519dc	docs: drop dead refs (spec.md link, verify_gate_anchor.py paragraph) - spec.md never existed at root or docs/; removed the link from AGENTS.md + README.md (the live plan is in docs/spec/ dated files). - RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed. - Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py (that file doesn't exist); kept the general 'gate every load-bearing invariant in the same commit' rule. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	83cae4ef72	docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label. - AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents. - README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview). - spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:39:48 +00:00
wassname	f0cbbacaf0	save per-eval deploy-adapter ckpts (rescore w/o retrain) + CLAUDE.md test lesson save_eval_ckpts (default on): write the deploy adapter (δS only, ~2.3MB) at each deploy-eval step, step-tagged, so a run can be re-scored later (more prompts / different eval) without retraining. The A5 run saved only final+first_hack, which is why the leak needed a full retrain rather than a rescore. AGENTS.md: every load-bearing invariant gets a verify_*.py gate. The no-cheat leak shipped because the green gates never covered the property -- 'tests passed' is meaningless if the property was never tested. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:58:26 +00:00
wassname	efdf86a0cb	wip	2026-05-30 04:33:33 +00:00
wassname	c1f8ca4e7b	tidy	2026-05-29 06:29:43 +00:00
wassname	f27c658ca9	docs	2026-05-29 05:42:28 +00:00
wassname	646edfc7af	purge dead modules and stale recipes Deletes 7 source files that were superseded but never removed: run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor), grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by train.py "smoke" subcommand), phase2_analyze.py (pilot is past), probe_uat.py (UAT pipeline is past). Drops matching justfile recipes (vhack-check, phase2-analyze, probe-uat) and the BASE constant that pointed at run.py. Updates AGENTS/README references to the stale fast-dev-run recipe (now just smoke / smoke-vanilla). Verified by running just smoke-vanilla --steps=2 end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 08:42:15 +00:00
wassname	120400c5f5	setup	2026-05-23 10:40:02 +08:00
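
Commit 97aede8d9c moves the gate thresholds to t_lo=quantile(1-2q) and t_hi=quantile(1-q), so that keep is the bulk and mid+rout together are the top 2q. The sketch below illustrates that split in plain Python. It is not the repository's code: the function names, the linear-interpolation quantile, and the value q=0.05 are assumptions chosen for illustration.

```python
def quantile(values, q):
    """Linear-interpolated quantile of a non-empty list, with q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def three_zone_gate(scores, q=0.05):
    """Assign each score to keep / mid / rout (hypothetical helper).

    t_lo = quantile(1 - 2q) and t_hi = quantile(1 - q), as in the commit
    message, so roughly the top q is rout, the next q is mid, and the
    remaining bulk is keep.
    """
    t_lo = quantile(scores, 1 - 2 * q)
    t_hi = quantile(scores, 1 - q)
    zones = []
    for s in scores:
        if s < t_lo:
            zones.append("keep")
        elif s < t_hi:
            zones.append("mid")
        else:
            zones.append("rout")
    return zones, t_lo, t_hi


if __name__ == "__main__":
    # Illustrative scores only: 100 evenly spaced values.
    scores = [float(i) for i in range(100)]
    zones, t_lo, t_hi = three_zone_gate(scores, q=0.05)
    for name in ("keep", "mid", "rout"):
        print(name, zones.count(name))
```

With the earlier placement described in the same commit (t_lo at quantile(q)), the keep zone would instead hold only the bottom q of rollouts, which is the co-adaptation problem the commit reports.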

25 Commits
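
Commit af420ec855 notes that a one-sided clip cannot bound the loss for A<0 when ρ is not a true importance ratio. The worked example below shows that effect on the standard PPO clipped surrogate. It is a sketch under assumptions: the function name, eps=0.2, and the sample values of ρ are illustrative, and the repository's actual loss may differ in detail.

```python
def ppo_clip_loss(rho: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate, returned as a loss (negated objective).

    objective = min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)
    """
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    objective = min(rho * advantage, clipped * advantage)
    return -objective


if __name__ == "__main__":
    # For A > 0 the min() selects the clipped term once rho exceeds 1 + eps,
    # so the loss is bounded. For A < 0 the min() selects the unclipped
    # term rho * A, so the loss grows linearly with rho.
    for rho in (1.0, 10.0, 1000.0):
        print(
            f"rho={rho:>7}: "
            f"loss(A=+1)={ppo_clip_loss(rho, 1.0):.1f}  "
            f"loss(A=-1)={ppo_clip_loss(rho, -1.0):.1f}"
        )
```

When the baseline log-probability comes from the same sampling mode as the rollout, ρ starts at 1 and this unbounded branch is not entered at the first update, which matches the commit's description of ρ=1 where forward mode matches sampling mode.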