evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	7f45189f1c	config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs Config (make the design axes explicit Literal choices): - eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like); unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded in deploy_test.json metadata. - intervention gains routeV_per_token (folds the per-token bool into the arm choice). - routeV_gate documented as the pinning axis. - FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4). Pairs: - delete prog_wide.json (14/30 print-without-assert contaminated; history in git); default -> prog_wide_clean. - rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the extraction pairs are OOD (never use the env's real grader fn name). Re-extracted v_hack_smoke to match. justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac (eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean. smoke green (erase + routeV_per_token); all 4 verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 04:21:54 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	5b0a6ddd91	plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after - floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would fake a solve jump that's really the n=32->n=119 eval-set shift. - floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056, authored 0.056->0.044), not the horizontal I wrongly forced earlier. - justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable fraction), low priority; vanilla rerun alongside best (its solve also suffers). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:58:32 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	31c2b9c82f	env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only Realism knob: in the reference env hacking saturates and kills the solve gradient. A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest solving pays) keeps a persistent solve pressure all arms feel. The differential test: routeV ablates the hack on the hackable 90% so it must solve there, while the warm solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0. - gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from eval_modes) so hack/solve remain comparable to the reference env. - logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the flip propagates to per-mode metrics. - smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1). Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the loophole hint. smoke-unhackable runs end-to-end. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:39:50 +00:00
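A minimal sketch of the seeded per-problem flip described in 31c2b9c82f, assuming each problem is a dict carrying an `env_mode` key. The function name `flip_unhackable` and the dict layout are illustrative assumptions, not the repo's actual code.

```python
import random


def flip_unhackable(problems, frac, seed):
    """Flip a seeded-random fraction of TRAIN problems to env_mode='gt_only'.

    Each problem gets an independent Bernoulli(frac) draw, so the realised
    fraction only approximates `frac` (the commit reports 0.1 -> ~7%).
    """
    rng = random.Random(seed)  # private RNG: same seed -> same flips
    flipped = []
    for prob in problems:
        prob = dict(prob)  # copy so the caller's problems are not mutated
        if rng.random() < frac:
            prob["env_mode"] = "gt_only"  # no loophole; only honest solving pays
        flipped.append(prob)
    return flipped


# Hypothetical 200-problem train pool, all starting in the run_tests mode.
problems = [{"id": i, "env_mode": "run_tests"} for i in range(200)]
a = flip_unhackable(problems, frac=0.1, seed=43)
b = flip_unhackable(problems, frac=0.1, seed=43)
n_flipped = sum(p["env_mode"] == "gt_only" for p in a)
```

Because the draw is per problem rather than an exact-count sample, `n_flipped` varies with the seed, which is consistent with the commit's measured ~7% at frac=0.1.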
wassname	0538dbf2f1	add routeV_absorb_all: 100% absorption, no vector (H2 extreme control) Route the whole gradient of every knob-on rollout into the quarantine; the deployed knob learns only from the knob-off exploration floor. Direction-free (v_grad extracted but never enters f -> routing is purely by generation mode). Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:56:12 +00:00
wassname	7d08ad2acd	viz: floor-to-ceiling method comparison (csv + figure) Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor, with SOURCE and STATUS columns flagging every provisional/missing cell) then the keynote figure. Prints TODO/FIXME data gaps before plotting. Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119). Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val split (eval_curve.jsonl), isolating the quarantine from the train/test memorization gap. Fixes the earlier conflation where the train->deploy arrow mixed knob-on/off with train-problems/test-problems. Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24), prog_wide arm contaminated (TODO job 28 prog_wide_clean). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:45:37 +00:00
wassname	6f49d5f9b0	refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback - scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json - scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention - out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None) - train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud) - justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats - requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:09:09 +00:00
wassname	a35e7b2735	feat: gt_only env-mode + queue baseline/no-loophole ceiling - rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle) - problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests") - justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and queue-no-loophole (gt_only vanilla GRPO, prio 11) - main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 03:23:49 +00:00
wassname	0f59b1351b	feat: online_stats gate for routeV -- live q5/q95 band calibration New routeV_gate="online_stats" mode: use the empirical per-rollout cosine distribution (q5/q95 pooled across all modules each step) as the routing band thresholds, instead of the pair-derived route_band. Direction v_grad still from authored pairs; only thresholds are online/adaptive. Motivation: the pair-derived band sits above the live cosine distribution (median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens. Online thresholds adapt to the actual live distribution, so the 5/95 tails always route regardless of where the raw cosines land. Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95. Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration). Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines. No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution of live student rollouts (no oracle/labeling of live rollouts as hack/clean). Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe. Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:25:37 +00:00
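A rough sketch of the online_stats band update from 0f59b1351b. The commit uses torch.quantile; this stdlib version assumes linear interpolation between order statistics. The class name `OnlineBand` is hypothetical; the 0.05/0.95 quantiles and the (-0.5, 0.5) step-0 prior come from the commit message.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


class OnlineBand:
    """Routing band whose edges track the live cosine distribution."""

    def __init__(self, lo_q=0.05, hi_q=0.95, prior=(-0.5, 0.5)):
        self.lo_q, self.hi_q = lo_q, hi_q
        self.lower, self.upper = prior  # neutral band used at step 0

    def update(self, cosines):
        """Call after the optimizer step with that step's pooled cosines."""
        if cosines:  # keep the previous band if a step produced no cosines
            self.lower = quantile(cosines, self.lo_q)
            self.upper = quantile(cosines, self.hi_q)


band = OnlineBand()
# Toy stand-in for one step's module*rollout cosines: -0.50, -0.49, ..., 0.50.
step_cosines = [i / 100 - 0.5 for i in range(101)]
band.update(step_cosines)
```

Only the thresholds adapt here; the direction the cosines are measured against still comes from the authored pairs, as the commit states.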
wassname	5007c9757a	results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)	2026-06-09 01:50:42 +00:00
wassname	b28b1a5e88	results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL line that just-results skips. Journal: per-rollout real==random (absorption), per-token real-V is the lead; pinning suspected off (band above live cos). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 10:47:38 +00:00
wassname	caa0d09472	broad: TEACHER_RT -> dense pool (was sparse, under-seeds); log: rename table cols train/deploy (drop 'knob') Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:12:00 +00:00
wassname	484305d7b4	config+log: fast defaults (dense pool, grad_clip=500); end-of-run tail = argv + hack/solve table + solve-hack objective - FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500 (were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes keep their own teacher_pool override. - End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide results row and the giant per-step table now print ABOVE the tail. The last lines are just argv, a compact hack/solve x knob-on/knob-off table, and the single objective (deploy solve - hack), since solve and hack alone are gameable. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:05:46 +00:00
wassname	3200771042	fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
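One way the seeded-shuffle fix in cc8db051ab could look, as a sketch. It assumes problems are addressed by integer id; `sample_train_pool` and the pinned ids 5 and 900 are hypothetical. The pool sizes 992 and 200 are the ones mentioned in the commit.

```python
import random


def sample_train_pool(all_ids, n, seed, pinned_ids=()):
    """Pick n ids: pinned teacher-seed ids first, then a seeded-random fill.

    Replaces a first-n-by-id slice, which over-samples the oldest problems.
    Pinned ids that are absent from all_ids are silently ignored.
    """
    pinned_set = set(pinned_ids)
    pinned = [i for i in all_ids if i in pinned_set]
    if len(pinned) > n:
        raise ValueError("more pinned ids than pool slots")
    rest = [i for i in all_ids if i not in pinned_set]
    random.Random(seed).shuffle(rest)  # deterministic for a given seed
    return pinned + rest[: n - len(pinned)]


# 992 problems in total, a 200-problem fast pool, two pinned teacher ids.
pool = sample_train_pool(range(992), n=200, seed=43, pinned_ids=[5, 900])
```

Pinning before shuffling guarantees the teacher-seed prompts are always present, so seeding still fires whatever the seed.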
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	7da54f1967	eval+env: single-mode run_tests, held-out val/test eval, both hack metrics - revert env to single-mode run_tests (paper-comparable): FastConfig teacher pool = run_tests-only (no partition.json); + `just build-runtests-pool` - held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file), final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert - report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted (hacked_loophole_used) -- external review 2026-06-07 - consolidate to one canonical eval_hack_solve (.eval); delete the train.py duplicate that silently lacked the token gap (in-run eval != rescore bug) - routeV band edges mean -> min/max (conservative degrade-to-absorb) - scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test - modal/app.py: read deploy_test.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 03:07:35 +00:00
wassname	dcd1b18303	eval: train/test token gap for all 4 modes (paper memorization control) Ariahw randomizes the run_tests function name at eval (line 84) so a high hack rate reflects responding to the disclosed mechanism, not memorizing a string. Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at eval, disclose a NOVEL per-mode token and tighten the detector to require exactly it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/ eval_result_file; defaults None = unchanged family path at train). run_tests uses the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE train hint (never touches problem text), seeded per problem so the curve stays smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves, per mode: train-token hack scores under family, NOT under eval-token; novel-token hack scores under eval-token. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	73936c822f	rename route2->routeV; heavy seeded final eval; save delta_S_hack route2 (binary-tau) and routeV (banded gate) are different methods -- give the new one a distinct id so old/new runs can't be confused (see hypothesis doc). - src/vgrout/* + justfile: route2->routeV, routing2->routingV (figs.py keeps the old keys for plotting historical runs). - Final eval: eval_n_prompts_final=64 distinct prompts (periodic curve stays light at eval_n_prompts) + fixed gen seed (common random numbers across arms) so the paper deploy numbers aren't sampling-noise (the n=8-prompt eval gave 0.031 vs 0.125 at the same checkpoint). - save_ckpt: also write delta_S_hack to sibling _hack.safetensors so runs can be re-scored knob-ON at higher n later (train.safetensors stays delta_S-only). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 12:08:28 +00:00
wassname	69f8bc208d	justfile: erase recipes use the prog_wide default (drop pinned --v-hack-path) fast-projected / full no longer pin v_hack_full.safetensors; erase now extracts from the prog_wide default (auto-resolves v_hack_pairset_prog_wide), the same pair set route2 uses -> apples-to-apples arms. Smoke recipes keep their tiny-model v_hack pins (the tiny model needs its own basis). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:10:29 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
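The band formula in 485839d7b1 can be sketched on plain Python lists as below. The function names and the toy vectors are illustrative assumptions, and the handling of a collapsed band (upper <= lower) is a guess, since the commit does not say what happens in that case.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def band_fraction(cos, lower, upper):
    """f = clamp((cos - lower) / (upper - lower), 0, 1)."""
    if upper <= lower:  # collapsed band: assumed here to route nothing
        return 0.0
    return min(1.0, max(0.0, (cos - lower) / (upper - lower)))


def route(per_rollout_grads, v_grad, lower, upper):
    """Split the summed gradient g into (routed, kept), routed + kept == g."""
    dim = len(v_grad)
    g = [sum(gb[i] for gb in per_rollout_grads) for i in range(dim)]
    routed = [0.0] * dim
    for gb in per_rollout_grads:
        f = band_fraction(cosine(gb, v_grad), lower, upper)
        for i in range(dim):
            routed[i] += f * gb[i]  # this share goes to delta_S_hack
    kept = [g[i] - routed[i] for i in range(dim)]  # remainder goes to delta_S
    return routed, kept


v_grad = [1.0, 0.0]
grads = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]  # aligned, orthogonal, in between
routed, kept = route(grads, v_grad, lower=0.0, upper=1.0)
```

Defining `kept` as the residual `g - routed` makes the conservation property hold by construction, up to floating-point rounding.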
wassname	55937a86fb	rename python package projected_grpo -> vgrout git mv src/projected_grpo -> src/vgrout and find-replace the module name in all imports (.py), `-m projected_grpo.` invocations (justfile), and the [project] name (pyproject; setuptools auto-discovers via where=["src"]). Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes tied to past commits, so rewriting them would falsify provenance. Repo dir, git remote, and absolute paths unchanged. Verified: `import vgrout` and `python -m vgrout.train --help` load the full graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass. Full `just smoke` is blocked upstream by missing gitignored data artifacts (out/pools/{substrate,teacher_pool}, out/vhack/smoke*), unrelated to the rename.	2026-06-05 14:51:48 +08:00
wassname	562832acec	test: no-cheat partition + teacher-pool composition gate (verify_partition.py) The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts on the real out/pools/substrate/partition.json: (1) partition is a clean function into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests} the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke. 6/6 pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:36:03 +00:00
wassname	34ad20db0a	fix route2 no-cheat leak: teacher-only gate anchor + unit test The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540, file_marker 2/1337), force-routing those rollouts -- a real label leak into the held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows only, so held-out classes get PROVABLY zero detector labels (airtight A5 control). Extracted the inline anchor loop to build_route2_anchors() and added scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces the leak (held-out FP student force-routed) and teacher_only removes it (zero student routing, teachers unchanged). 9/9 assertions pass. Rescoring can't fix this -- the leak is in training (gate shaped the weights), not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the A5 run saved no per-eval checkpoints anyway. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:53:23 +00:00
wassname	4fa9061162	refactor: move 5 leaf entrypoints src/ -> scripts/ (src is now library-only) verify_rewards, verify_vhack_heldout, build_substrate, probe_distill, probe_plot_stack are run via 'python -m' / justfile and imported by no core module -> moved to scripts/, relative imports rewritten to 'from projected_grpo.X'. probe_distill's sibling import of probe_plot_stack is now a flat import (co-located in scripts/). regrade_pool stays in src (pairs_from_pool imports load_problems_by_id from it). justfile recipes updated. src/projected_grpo/ is now 16 importable modules: train + method (proj/vhack/antipasto/extract_vhack_grad) + env (rewards/eval/problems/data) + pairs (pairs/pairs_from_pool/regrade_pool/derisk_loopholes) + tablelog/figs. ~1480 lines moved out of the package. Smoke green (verify_rewards 52/52 from scripts/, train pipeline cout->0). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:23:56 +00:00
wassname	07363f1ede	cleanup: trim stale comments + attic README Dropped dead job-ID narrative (job 60/64) on rollout_ablate_frac, the 'vanilla step 17' dead-run ref in eval.py, the 'old signed sum' dead-code ref in proj.py, and the conversational 'current experiment line' lead. Removed the dead probe-traj justfile recipe. Kept all TODO/FIXME and the 'why' memory-tuning comments. Smoke green (cout->0). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:09:19 +00:00
wassname	4ee3f03878	justfile: paper-run recipes on record (longrun/noteacher/teacheroff/harvest) paper-longrun, paper-noteacher, paper-teacheroff, paper-harvest -- each pueue-adds with a why:/resolve: label so every paper job is reproducible from one command. longrun uses the KL-stabilised optimizer (beta=1e-5, Adam 0.9/0.99). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:54:50 +00:00
wassname	2570dfaa67	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-02 07:21:49 +00:00
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored. `just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 06:59:15 +00:00
wassname	3e7b8ecfc0	feat: just dyn = auto-plot newest full-length log per arm --latest-per-arm + --min-steps select the freshest >=N-step log for each arm from logs/, no hand-globbing. Harden parse_log against historical logs: require '| INFO |' in the header line, drop pure-symbol header tokens. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 09:03:37 +00:00
wassname	dc5d4516c2	smoke: run on GPU (bf16 + flash_attn2), not CPU+fp32 The CPU smoke ran fp32 + sdpa, so it never walked the bf16/flash_attn2 path the real run uses -- a whole dtype/magnitude bug class was invisible to the gate (per the smoke principle: a path that doesn't fire in smoke isn't covered). The tiny-random model peaks ~1.4GB on GPU, so cost is negligible. Drop CUDA_VISIBLE_DEVICES= from every smoke recipe; train.py auto-detects cuda -> bf16. (Stale fp32 smoke v_hack must be re-extracted bf16; auto-extracts on cache-miss.) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:56:34 +00:00
wassname	8158adb543	refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a ~100x capacity edge over delta_S, so routing-everything-there was the low-resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not the routing gate (calibrated-tau already separated hack/clean, hkgap>0). Consolidate to one adapter type: the quarantine is now delta_S_hack, the second diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S, zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad into delta_S_hack.grad (like proj.py's route parks its subspace projection); delta_S keeps the unflagged. Both diagonals train at one shared lr. Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main) with the quarantine ablated. SGTM check: their gradient routing uses a hard detach on capacity-matched reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating. Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:52:02 +00:00
wassname	11bcdd2fe6	route2 instrumentation + lr fix + deploy overlay (route2-act divergence) route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up (gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes: - #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh LoRA isn't trained at the main-knob lr. - #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9) doesn't false-trip. - #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward, surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 => the cos>0 sign test over-routes (fires on most tokens, not just hack ones). - #166 print last train generation before FINAL EVAL (coherence eyeball). - route2 v_act/v_grad refresh was firing but silent -- now announced. - #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json (honest shipped-model numbers, route2-safe). just plot-deploy. - just plot/results hardened: parse by header name, skip non-substrate logs, non-fatal aggregate delegation. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 23:16:39 +00:00
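A small sketch of the relative divergence tripwire (#168) from 11bcdd2fe6. The class name is hypothetical, and reading "for 2 steps" as two consecutive steps is an assumption; the 5-nat threshold and the high-water-mark idea come from the commit.

```python
class DivergenceTripwire:
    """Abort when teacher log-prob falls far below its best value so far."""

    def __init__(self, max_drop=5.0, patience=2):
        self.max_drop = max_drop    # nats below the high-water mark
        self.patience = patience    # consecutive bad steps before aborting
        self.high_water = None
        self.bad_steps = 0

    def update(self, lp_t):
        """Feed one step's teacher log-prob; return True if the run should abort."""
        if self.high_water is None or lp_t > self.high_water:
            self.high_water = lp_t
        if self.high_water - lp_t > self.max_drop:
            self.bad_steps += 1
        else:
            self.bad_steps = 0  # a recovery resets the count
        return self.bad_steps >= self.patience


tripwire = DivergenceTripwire()
trace = [tripwire.update(lp) for lp in [-2.0, -2.5, -8.0, -11.0]]

# A flat, low lp_t (as in the tiny-random smoke) never trips a relative check.
flat = DivergenceTripwire()
flat_trace = [flat.update(-11.9) for _ in range(10)]
```

Measuring the drop relative to the high-water mark, not against an absolute floor, is what keeps the flat smoke run from false-tripping.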
wassname	6b22dc5055	feat: per-mode deploy JSON artifact for every arm + queue-substrate recipe #164: the final eval now runs for ALL arms (not just route/route2) on the same fixed eval subset, so the all-arms overlay reads identical per-mode numbers. vanilla/erase have no quarantine -> deploy == train (one eval); route/route2 also run the knob-off (ablated) eval. Writes a single per_mode_deploy.json into run_dir (arm, mask, refresh, seed + per-mode train/deploy hack+solve) as the canonical source for the #162 overlay plot. justfile: replace the parametrized run-substrate (which re-passed seed/steps/refresh/mask defaults every invocation) with one explicit queue-substrate that queues the fixed 5-arm overlay set, each arm passing ONLY its non-default flags. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 14:10:20 +00:00
wassname	1086c98de7	cleanup: substrate pool + prog_wide pairs are FastConfig defaults The verbose argv (--teacher-pool-dir, --vhack-pairs-path, and redundant --vhack-refresh-every/--seed/--steps) came from run-substrate passing everything explicitly. steps/seed/refresh were already defaults; the two paths weren't. Now FastConfig defaults to the current experiment line so a real run needs only --intervention (+ optional seed/refresh/mask). Smoke (SmokeConfig) unaffected -- it sets its own pool. Stripped the recipe to match. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:39:07 +00:00
wassname	80f6b52860	fix: route2 quar/v_act dtype mismatch on bf16 model (A_q/B_q/v_act fp32 vs bf16 x) Smoke is fp32 (CPU tiny-random) so the bf16 path never fired -- job 34/35 crashed on the real Qwen3-4B with 'BFloat16 != float' in the quar matmul. Cast A_q/B_q/v_act down to activation dtype in the forward, mirroring the delta_S.to(a.dtype) pattern (fp32 master, bf16 compute, grads cast back). Validated forward+backward in bf16 for both masks. + run-substrate MASK param. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:35:25 +00:00
wassname	670fcb3c64	feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0, and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward never arises (routing is post-backward within the step). v_grad = unit-mean gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act). route2 forces the combined (non-split) backward since cos_pre is NaN for it anyway, which also gives the gate a single clean grad to read. Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary) and the load-time noise floor already filters axes. v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no need to also pass --v-hack-path. run-substrate drops the redundant flag. smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109, exit 0); erase shared-basis path unchanged (cout->0, fired~0.9). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:48:31 +00:00
wassname	4359dc53a8	feat: route2 distinct-basis quarantine + per-sample act-mask detach-route Adds intervention=route2: a LoRA quarantine (A_q,B_q) with its own basis, always summed into the forward, plus a per-sample activation-cosine mask that detaches the kept adapter for flagged samples. Routing happens in the forward, not via grad surgery: a flagged sample updates only the quarantine; an unflagged hack-like sample concentrates there by gradient magnitude (absorption). Deploy zeroes A_q,B_q. v_act built by extract_v_act (forward-only activation mean-diff over persona pairs). Fixes the per-prompt zero_grad wiping quarantine grads before opt.step. scripts/make_random_vhack.py = the random-V route control. vhack_refresh_every default 0->5 (0 is ablation-only). Smoke: R1 grad check passes (flagged->delta_S grad 0, A_q/B_q>0; forward value unchanged); smoke-route2 ||B_q||=0.109, deploy eval + asserts pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:13 +00:00
wassname	07acadb43f	plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics) - plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the aggregate 'total hacks per arm' core plot is kept, not reimplemented. - plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/ slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently failed on sub4 logs. No backward-compat for the superseded header. - justfile: 'plot GLOB STEM' canonical entrypoint over logs/_sub4_.log. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 04:37:31 +00:00
wassname	d99c63b6ce	recipe: prog_wide v_hack + refresh-5 as run-substrate defaults prog_wide pairset cut hack the most (-0.226, no pass cost) in the pairset comparison (results.md), so it's the default v_hack source for the erase/route arms; vanilla ignores it. REFRESH defaults to 5. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 23:09:36 +00:00
wassname	a485d4391b	recipe: run-substrate default 60 steps (was 80); matches fast preset	2026-05-30 23:05:20 +00:00
wassname	2906bb18ed	feat: vanilla ignores v_hack (no misleading cin/cout, no needless extract) intervention=none is a pure GRPO baseline: skip v_hack load/extract entirely (v_hack=None), emit a nan diag, and the cin/cout/fired columns are already hidden on the vanilla arm (#141). A --v-hack-path passed to vanilla is logged and ignored. Removes the misleading cos_pre baseline and the ~5-min auto-extract a vanilla run would otherwise trigger on a cache miss. run-substrate recipe: drop the MIX override (inherit locked 0.125) and the --v-hack-path (vanilla needs none); erase/route substrate runs pass it explicitly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:40:35 +00:00
wassname	4f11cfaabc	chore: justfile build-substrate + run-substrate recipes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:56:30 +00:00
wassname	cf5f4861db	rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs: - sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit. - JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe builtins and use baseline Python == (custom-typed operand = eq_override -> reject). - defs-only dropped honest top-level constants -> exec full src, keep state. verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:24 +00:00
wassname	8e38d0f419	plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:42:39 +00:00
wassname	d3c96d4415	train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe - load_problems(env_mode): per-mode factual hint swap; no visible/heldout split. - eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump. - justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate. - rm scripts/derisk_expose_k.py (contaminated nudge). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:33:26 +00:00
wassname	dcd881e054	fix: cross-mechanism arms project against prog_wide (best basis, not 21pairs) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 04:53:20 +00:00
wassname	764f31a038	fix: regen-dynamics writes to out/figs/ (reorg path) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 04:49:47 +00:00

83 Commits