evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:15:40 +08:00

Author	SHA1	Message	Date
wassname	04a98b321e	feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack expert: GRPO flows into the router through the soft weight w (it concentrates hack-like rollouts in the hack expert), and a continuous pin loss on the hand-authored pairs anchors the axis. No load balancing; routing is per rollout. lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for the fork; original proposal kept as docs/spec/original_evil_moe_spec.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-14 11:25:14 +08:00
wassname	cca7150ea0	tidy	2026-06-14 11:05:54 +08:00
wassname	41d225a5ec	writeup	2026-06-12 04:46:01 +00:00
wassname	af420ec855	feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed in each rollout's own sampling mode, so ρ is a true importance ratio. The old always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout). Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled absorb/route rows by construction -- frac=0 is the clean point. Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1) over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean era anchors the low tail; buffer stores acts, re-scored vs current v_act so a refresh needs no flush). Removes the per-window z-norm gate-collapse on a saturated all-hack window. gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it elicits the hack and absorption can localize it. queue-decision now passes --gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the job-34 config where ablation RAISED hack 0.71->0.86). Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL 2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-12 03:22:48 +00:00
wassname	adca442253	feat(#41 ): routeA activation gate replaces routeV grad gate Gate now scores each rollout by dot(pooled bottleneck act, v_act) captured on the no-grad logpi_old forward (quarantine-ablated, matching the sampling policy); masks are pinned BEFORE the single grad-carrying forward, so the grad-gate's pass-1 backward is gone. Thresholds: rolling 256-act buffer, z-normalized, two-threshold Otsu (winsorized 1/99); warmup pins absorb until 128 scores. Buffer stores pooled acts and re-scores against the current v_act, so the forward-only refresh (every 5 steps) needs no flush. No bimodality guard: calibration showed Otsu tail separation ~2.4-2.8 buffer-sd on every condition including pure Gaussians, so no shape statistic discriminates. Deleted with the arm wiring (rename-on-logic-change: routeA never conflates with routeV runs): extract_vhack_grad.py, _build_v_grad, route_band_edges, _pair_cos, the pass-1 autograd.grad block, grad_probe training wiring, v_grad_k/route_std_*/routeV_random_v_seed config, smoke-topk recipe. c-probe stays in lora2r.py for scripts/diag_pinning.py only. verify_science_invariants: all-in-one count 27 -> 42 (stale since `c33b810` added the wave-2 behavior2 pairs) + assert the 8-pair routeA training subset. Smoke: routeA/vanilla/absorb/solvemix all pass (gate exercises warmup, Otsu zones, refresh, deploy ablation) -- /tmp/claude-1000/smoke_routeA.log. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 12:38:19 +00:00
wassname	270c4f5a27	misc	2026-06-11 11:07:28 +00:00
wassname	3f2b44452a	feat: online-stats gate + step-level teacher forcing + AUROC diagnostic The authored absolute band made pos>=1 unreachable for live hacks (rout~0), and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff). - Online-stats gate: route by live quantiles of the pooled cos-to-v_grad (top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed on refresh. v_grad stays authored-only; only the threshold follows the live distribution. Smoke: routing sustained past the refresh (cliff fixed). - Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens, not a per-prompt round; symmetric hack+solve teachers injected as ordinary gens (not specially routed). Fixes the per-prompt rounding wart. - AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label (measurement-only, never routes) -- discriminates threshold-vs-direction failure and whether a refresh destroys separation. - Inline eval stays off (eval_ablate_every=0); deploy scored offline. - Fix _sample_rows None crash (beartype) on the no-solve-pool path. - Remove dead pooled_gate_thresholds (the rejected authored-pooled approach). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 14:22:37 +00:00
wassname	05a00aa487	feat(T4): symmetric solve-teacher pool + routed-share discrimination diagnostic --solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack (default off). The gate's routed-share is split by teacher SOURCE: a discriminating gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares = non-directional (shrinkage null). Teacher source is our pool construction, not a live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF (hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw. Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 12:02:39 +00:00
wassname	35286040ed	run: decision arms explicit at --unhackable-frac=0.5 (25%->50%) Equal hack/solve pressure, harder problems, faster env (user call 2026-06-10). Pin the frac on the command line so the headline regime is self-documenting, not silently default-dependent. Requeued #36-39 at 0.5 with honest 50% labels. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:50:12 +00:00
wassname	4120d75ea4	feat: top-k routing subspace for routeV (--v-grad-k, gate=max_i cos) k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1 builds the top-k oriented SVD dirs of the paired diff and the gate scores max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal hack signal one mean washes out. Shared _build_v_grad at init + refresh; band edges and the live gate both max over k. Sims use einsum + jaxtyping dims. Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:44:45 +00:00
wassname	62ebf719d0	justfile: prune to lora2r-only (645->~180 lines) Drop every recipe invoking deleted CLI (erase/routeV_per_token/--routeV-absorb-all/ --routeV-gate/--v-hack-path/--half-a/--beta/fast-lora*/fast-lora2r/full) and the retired probe_distill/diag/cross-mech/substrate-plot tooling. Keep: smoke arms (none/routeV/absorb + all), queue-decision/baseline/no-loophole, env-construction pools (runtests/substrate/solve), results, paper tooling. Short, ordered, commented. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:20:48 +00:00
wassname	5714996c56	docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector = grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt. justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags (--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all and queue-decision (the headline 4-arm lora2r run). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:08:59 +00:00
wassname	6094568c56	feat: lora2r adapter (rank-2r PiSSA-init LoRA) + SGTM three-way hard routing Structural-separation arm to disentangle directionality from shrinkage. A rank-2r PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r] and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one shared update = shrinkage null), each block has its own input-side A rows and output-side B columns, so deploy-ablation removes a different FUNCTION. Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates (SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat: cos to the hand-authored-pair direction, never an oracle label of a live rollout. verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing, per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive: PiSSA / lora_frozen_b paths untouched. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 09:25:58 +00:00
wassname	0112f4a36d	fix: rotate the unhackable (gt_only) subset per step, not frozen per pid The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of problems were unhackable every step -- a fixed honest subset the model can memorize instead of learning to genuinely solve the distribution. Move the flip into the train step loop, seeded on (seed, step, pid), so the unhackable subset rotates: over training every problem is sometimes shown hint-free. Both halves flip, not just the hint: - prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages - grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel), so a hack earns format-only reward. Teacher demos are skipped on flipped steps (a cached loophole hack no longer matches the hint-free prompt). Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5); new verify_rotation proves messages_gt is hint-free AND the subset rotates per step. Smoke logs flip count (1/30 hint-free, graded gt_only). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 06:14:08 +00:00
wassname	7f45189f1c	config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs Config (make the design axes explicit Literal choices): - eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like); unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded in deploy_test.json metadata. - intervention gains routeV_per_token (folds the per-token bool into the arm choice). - routeV_gate documented as the pinning axis. - FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4). Pairs: - delete prog_wide.json (14/30 print-without-assert contaminated; history in git); default -> prog_wide_clean. - rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the extraction pairs are OOD (never use the env's real grader fn name). Re-extracted v_hack_smoke to match. justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac (eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean. smoke green (erase + routeV_per_token); all 4 verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 04:21:54 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/ deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	5b0a6ddd91	plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after - floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would fake a solve jump that's really the n=32->n=119 eval-set shift. - floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056, authored 0.056->0.044), not the horizontal I wrongly forced earlier. - justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable fraction), low priority; vanilla rerun alongside best (its solve also suffers). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:58:32 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	31c2b9c82f	env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only Realism knob: in the reference env hacking saturates and kills the solve gradient. A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest solving pays) keeps a persistent solve pressure all arms feel. The differential test: routeV ablates the hack on the hackable 90% so it must solve there, while the warm solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0. - gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from eval_modes) so hack/solve remain comparable to the reference env. - logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the flip propagates to per-mode metrics. - smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1). Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the loophole hint. smoke-unhackable runs end-to-end. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:39:50 +00:00
wassname	0538dbf2f1	add routeV_absorb_all: 100% absorption, no vector (H2 extreme control) Route the whole gradient of every knob-on rollout into the quarantine; the deployed knob learns only from the knob-off exploration floor. Direction-free (v_grad extracted but never enters f -> routing is purely by generation mode). Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/ rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:56:12 +00:00
wassname	7d08ad2acd	viz: floor-to-ceiling method comparison (csv + figure) Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor, with SOURCE and STATUS columns flagging every provisional/missing cell) then the keynote figure. Prints TODO/FIXME data gaps before plotting. Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119). Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val split (eval_curve.jsonl), isolating the quarantine from the train/test memorization gap. Fixes the earlier conflation where the train->deploy arrow mixed knob-on/off with train-problems/test-problems. Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24), prog_wide arm contaminated (TODO job 28 prog_wide_clean). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:45:37 +00:00
wassname	6f49d5f9b0	refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback - scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json - scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention - out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None) - train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud) - justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats - requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:09:09 +00:00
wassname	a35e7b2735	feat: gt_only env-mode + queue baseline/no-loophole ceiling - rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle) - problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests") - justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and queue-no-loophole (gt_only vanilla GRPO, prio 11) - main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 03:23:49 +00:00
wassname	0f59b1351b	feat: online_stats gate for routeV -- live q5/q95 band calibration New routeV_gate="online_stats" mode: use the empirical per-rollout cosine distribution (q5/q95 pooled across all modules each step) as the routing band thresholds, instead of the pair-derived route_band. Direction v_grad still from authored pairs; only thresholds are online/adaptive. Motivation: the pair-derived band sits above the live cosine distribution (median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens. Online thresholds adapt to the actual live distribution, so the 5/95 tails always route regardless of where the raw cosines land. Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95. Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration). Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines. No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution of live student rollouts (no oracle/labeling of live rollouts as hack/clean). Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe. Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:25:37 +00:00
wassname	5007c9757a	results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)	2026-06-09 01:50:42 +00:00
wassname	b28b1a5e88	results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL line that just-results skips. Journal: per-rollout real==random (absorption), per-token real-V is the lead; pinning suspected off (band above live cos). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 10:47:38 +00:00
wassname	caa0d09472	broad: TEACHER_RT -> dense pool (was sparse, under-seeds); log: rename table cols train/deploy (drop 'knob') Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:12:00 +00:00
wassname	484305d7b4	config+log: fast defaults (dense pool, grad_clip=500); end-of-run tail = argv + hack/solve table + solve-hack objective - FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500 (were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes keep their own teacher_pool override. - End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide results row and the giant per-step table now print ABOVE the tail. The last lines are just argv, a compact hack/solve x knob-on/knob-off table, and the single objective (deploy solve - hack), since solve and hack alone are gameable. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:05:46 +00:00
wassname	3200771042	fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/ per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	7da54f1967	eval+env: single-mode run_tests, held-out val/test eval, both hack metrics - revert env to single-mode run_tests (paper-comparable): FastConfig teacher pool = run_tests-only (no partition.json); + `just build-runtests-pool` - held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file), final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert - report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted (hacked_loophole_used) -- external review 2026-06-07 - consolidate to one canonical eval_hack_solve (.eval); delete the train.py duplicate that silently lacked the token gap (in-run eval != rescore bug) - routeV band edges mean -> min/max (conservative degrade-to-absorb) - scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test - modal/app.py: read deploy_test.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 03:07:35 +00:00
wassname	dcd1b18303	eval: train/test token gap for all 4 modes (paper memorization control) Ariahw randomizes the run_tests function name at eval (line 84) so a high hack rate reflects responding to the disclosed mechanism, not memorizing a string. Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at eval, disclose a NOVEL per-mode token and tighten the detector to require exactly it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/ eval_result_file; defaults None = unchanged family path at train). run_tests uses the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE train hint (never touches problem text), seeded per problem so the curve stays smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves, per mode: train-token hack scores under family, NOT under eval-token; novel-token hack scores under eval-token. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	73936c822f	rename route2->routeV; heavy seeded final eval; save delta_S_hack route2 (binary-tau) and routeV (banded gate) are different methods -- give the new one a distinct id so old/new runs can't be confused (see hypothesis doc). - src/vgrout/* + justfile: route2->routeV, routing2->routingV (figs.py keeps the old keys for plotting historical runs). - Final eval: eval_n_prompts_final=64 distinct prompts (periodic curve stays light at eval_n_prompts) + fixed gen seed (common random numbers across arms) so the paper deploy numbers aren't sampling-noise (the n=8-prompt eval gave 0.031 vs 0.125 at the same checkpoint). - save_ckpt: also write delta_S_hack to sibling _hack.safetensors so runs can be re-scored knob-ON at higher n later (train.safetensors stays delta_S-only). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 12:08:28 +00:00
wassname	69f8bc208d	justfile: erase recipes use the prog_wide default (drop pinned --v-hack-path) fast-projected / full no longer pin v_hack_full.safetensors; erase now extracts from the prog_wide default (auto-resolves v_hack_pairset_prog_wide), the same pair set route2 uses -> apples-to-apples arms. Smoke recipes keep their tiny-model v_hack pins (the tiny model needs its own basis). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:10:29 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
wassname	55937a86fb	rename python package projected_grpo -> vgrout git mv src/projected_grpo -> src/vgrout and find-replace the module name in all imports (.py), `-m projected_grpo.` invocations (justfile), and the [project] name (pyproject; setuptools auto-discovers via where=["src"]). Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes tied to past commits, so rewriting them would falsify provenance. Repo dir, git remote, and absolute paths unchanged. Verified: `import vgrout` and `python -m vgrout.train --help` load the full graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass. Full `just smoke` is blocked upstream by missing gitignored data artifacts (out/pools/{substrate,teacher_pool}, out/vhack/smoke*), unrelated to the rename.	2026-06-05 14:51:48 +08:00
wassname	562832acec	test: no-cheat partition + teacher-pool composition gate (verify_partition.py) The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts on the real out/pools/substrate/partition.json: (1) partition is a clean function into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests} the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke. 6/6 pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:36:03 +00:00
wassname	34ad20db0a	fix route2 no-cheat leak: teacher-only gate anchor + unit test The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540, file_marker 2/1337), force-routing those rollouts -- a real label leak into the held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows only, so held-out classes get PROVABLY zero detector labels (airtight A5 control). Extracted the inline anchor loop to build_route2_anchors() and added scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces the leak (held-out FP student force-routed) and teacher_only removes it (zero student routing, teachers unchanged). 9/9 assertions pass. Rescoring can't fix this -- the leak is in training (gate shaped the weights), not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the A5 run saved no per-eval checkpoints anyway. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:53:23 +00:00
wassname	4fa9061162	refactor: move 5 leaf entrypoints src/ -> scripts/ (src is now library-only) verify_rewards, verify_vhack_heldout, build_substrate, probe_distill, probe_plot_stack are run via 'python -m' / justfile and imported by no core module -> moved to scripts/, relative imports rewritten to 'from projected_grpo.X'. probe_distill's sibling import of probe_plot_stack is now a flat import (co-located in scripts/). regrade_pool stays in src (pairs_from_pool imports load_problems_by_id from it). justfile recipes updated. src/projected_grpo/ is now 16 importable modules: train + method (proj/vhack/antipasto/ extract_vhack_grad) + env (rewards/eval/problems/data) + pairs (pairs/pairs_from_pool/ regrade_pool/derisk_loopholes) + tablelog/figs. ~1480 lines moved out of the package. Smoke green (verify_rewards 52/52 from scripts/, train pipeline cout->0). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:23:56 +00:00
wassname	07363f1ede	cleanup: trim stale comments + attic README Dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the 'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in proj.py, and the conversational 'current experiment line' lead. Removed the dead probe-traj justfile recipe. Kept all TODO/FIXME and the 'why' memory-tuning comments. Smoke green (cout->0). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:09:19 +00:00
wassname	4ee3f03878	justfile: paper-run recipes on record (longrun/noteacher/teacheroff/harvest) paper-longrun, paper-noteacher, paper-teacheroff, paper-harvest -- each pueue-adds with a why:/resolve: label so every paper job is reproducible from one command. longrun uses the KL-stabilised optimizer (beta=1e-5, Adam 0.9/0.99). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:54:50 +00:00
wassname	2570dfaa67	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-02 07:21:49 +00:00
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored. `just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 06:59:15 +00:00
wassname	3e7b8ecfc0	feat: just dyn = auto-plot newest full-length log per arm --latest-per-arm + --min-steps select the freshest >=N-step log for each arm from logs/, no hand-globbing. Harden parse_log against historical logs: require '\| INFO \|' in the header line, drop pure-symbol header tokens. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 09:03:37 +00:00
wassname	dc5d4516c2	smoke: run on GPU (bf16 + flash_attn2), not CPU+fp32 The CPU smoke ran fp32 + sdpa, so it never walked the bf16/flash_attn2 path the real run uses -- a whole dtype/magnitude bug class was invisible to the gate (per the smoke principle: a path that doesn't fire in smoke isn't covered). The tiny- random model peaks ~1.4GB on GPU, so cost is negligible. Drop CUDA_VISIBLE_DEVICES= from every smoke recipe; train.py auto-detects cuda -> bf16. (Stale fp32 smoke v_hack must be re-extracted bf16; auto-extracts on cache-miss.) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:56:34 +00:00
wassname	8158adb543	refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a ~100x capacity edge over delta_S, so routing-everything-there was the low- resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not the routing gate (calibrated-tau already separated hack/clean, hkgap>0). Consolidate to one adapter type: the quarantine is now delta_S_hack, the second diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S, zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad into delta_S_hack.grad (like proj.py's route parks its subspace projection); delta_S keeps the unflagged. Both diagonals train at one shared lr. Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main) with the quarantine ablated. SGTM check: their gradient routing uses a hard detach on capacity-matched reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating. Smoked clean: tau/hkgap/qE render, \|\|delta_S_hack\|\|>0 assert passes, exit 0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:52:02 +00:00
wassname	11bcdd2fe6	route2 instrumentation + lr fix + deploy overlay (route2-act divergence) route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up (gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes: - #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh LoRA isn't trained at the main-knob lr. - #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9) doesn't false-trip. - #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward, surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 => the cos>0 sign test over-routes (fires on most tokens, not just hack ones). - #166 print last train generation before FINAL EVAL (coherence eyeball). - route2 v_act/v_grad refresh was firing but silent -- now announced. - #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json (honest shipped-model numbers, route2-safe). just plot-deploy. - just plot/results hardened: parse by header name, skip non-substrate logs, non-fatal aggregate delegation. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 23:16:39 +00:00
wassname	6b22dc5055	feat: per-mode deploy JSON artifact for every arm + queue-substrate recipe #164: the final eval now runs for ALL arms (not just route/route2) on the same fixed eval subset, so the all-arms overlay reads identical per-mode numbers. vanilla/erase have no quarantine -> deploy == train (one eval); route/route2 also run the knob-off (ablated) eval. Writes a single per_mode_deploy.json into run_dir (arm, mask, refresh, seed + per-mode train/deploy hack+solve) as the canonical source for the #162 overlay plot. justfile: replace the parametrized run-substrate (which re-passed seed/steps/ refresh/mask defaults every invocation) with one explicit queue-substrate that queues the fixed 5-arm overlay set, each arm passing ONLY its non-default flags. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 14:10:20 +00:00

1 2

97 Commits