evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 20:21:41 +08:00

Author	SHA1	Message	Date
wassname	9c76584970	track pairsets in git (hand-authored supervision source) The pairset JSONs are the only non-regenerable input to the method (the v_hack bases are derived from them via on-demand extraction, train.py:528). They were caught by the blanket /out/ ignore; switch to /out/* + re-include so any box (and Modal) gets the source from a clone instead of a side-channel rsync. vhack safetensors stay ignored (383M of derived binaries). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 08:11:01 +00:00
wassname	8249a9691e	fix: ship smoke fixtures so the gate runs on a fresh clone The smoke prereqs (out/pools/substrate, out/pools/teacher_pool, out/vhack/v_hack_smoke) are gitignored pipeline outputs that only exist on the GPU box -- a fresh clone died at verify_partition.py on a FileNotFoundError for partition.json. Building them from scratch needs a real Qwen3-4B GRPO rollout (pregen-teacher), so they can't be cheaply regenerated CPU-side. Force-add them (~2.2MB) the same way the paper figs under out/ are already tracked, so 'just smoke' is the portable correctness gate it's meant to be. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 07:13:33 +00:00
wassname	5242f66b7e	figs: a5 dedup title->axis arrow + CSV, overlay onset dot->labeled vline - a5: drop per-panel title (restated the axis); fold direction into the xlabel (DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv (per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked. - overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single dashed labeled 'first hack' vertical line, matching the small-multiples/longrun. - (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:13:37 +00:00
wassname	8daf58d25e	figs: a5 vanilla->route arrows, equiv0->approx0, skip degenerate train_deploy, prune orphans - a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop and the stdout solve-cost honestly). - equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0. - plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast); fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data). - Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand); keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:08:58 +00:00
wassname	a9523c9cb8	fix overlay label collisions: common right-gutter anchor + leaders End-labels sat on the line termini (2-arm figs) and piled up bottom-left on ragged-length multi-arm overlays (substrate, where arms end at different steps). Now all labels anchor at one gutter x with a leader fanning back to each line's actual end, y-de-collided. Added right margin so the gutter is clear. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:31:26 +00:00
wassname	504922a3d6	fix collision: lift 'deploy hack =0' off the y=0 line in train_vs_deploy The solid-red deploy line ran straight through the annotation text (tufte collision test). Move it into the empty band above the flat line (axes y=0.12). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:25:49 +00:00
wassname	b616970e42	fix plot integrity: drop n=28 hack_s fallback in train-vs-deploy series A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval. Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if absent -> seed drops from the mean), and missing eval columns normalise to nan so absent==all-nan. Regenerated all figures from logs. The canonical train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show train==deploy honestly (no knob-on data to split). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:21:48 +00:00
wassname	5257ff010e	plot_dynamics: train-vs-deploy 2x2 uses matched n=64 eval on both rows The train row fell back to per-step hack_s (noisy n=28 train batch) for arms without a knob-on eval, so vanilla's train/deploy rows looked like different estimators. Fix: vanilla/erase have no quarantine -> train==deploy, so reuse hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on (knob-on eval). Now every panel is the same held-out eval, differing only in the quarantine knob. Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2 + vanilla sweep, seed 41, 60 steps). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:33:10 +00:00
wassname	0645ae2dd2	fig:longrun: rebuild from job84 route2 + job97 fixed vanilla (no collapse) Old figure paired route2 (job 84) with job 85 vanilla, whose step-88 'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to ~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:18:34 +00:00
wassname	753a54c625	paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013) Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote caption, add paired t-test, pin the exact 6-log regen command (just dyn --latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6 explicit seed logs, fixing the `87cca9a` clobber. Journal entry 2026-06-03(a). Also: README points to main.tex and drops the stale n=1 findings block; record two OpenReview URLs as a TODO in related work (mine reviews for shared critiques). Closes A1/A2 (#173). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 03:36:32 +00:00
wassname	87cca9a603	plots: dejargon, drop redundant titles, emit png+svg+pdf, CSV re-render Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay): - One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the paper, png for the blog). All scripts call it. - arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack). - Titles off by default (the paper/blog caption carries it); --title re-enables for standalone research use. - dump_data CSV now carries every plotted series; plot_dynamics --from-csv re-renders the three figures from the committed CSV with no logs (logs/ and out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified. - Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:44:39 +00:00
wassname	24fa924c8d	plot: 2x2 train(knob-on) vs deploy(knob-off) x arm figure The A4 framing in one figure: vanilla train==deploy (cheat in the weights), route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log now keeps the raw train series (hack_train/solve_train) before the deploy substitution. New fig: dyn_longrun_200_train_deploy.png. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:53:08 +00:00
wassname	e00292860f	results: commit longrun A4 fig + CSV data source (force-add, out/ is gitignored) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:19:29 +00:00
wassname	311bf2854f	results: fill keynote table/figure at n=3 route2 / n=2 vanilla C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125): route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010 vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032 => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines). - main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending). - results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73). - RESEARCH_JOURNAL 2026-06-02 entry. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 11:08:41 +00:00
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	19deef4fb9	docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots - blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 01:24:29 +00:00
wassname	0ea751c5bc	plot: #148 substrate emergence — per-mode hack-rate curves (4/5 learned, seed 41) New scripts/plot_substrate.py parses the hk_<mode> cumulative columns from a multi-loophole substrate run (one log, K interleaved modes) and draws one learning curve per mode with first_step onset dots and direct end-labels. plot_emergence.py can't do this (it groups logs by a single --env-mode). Figure shows the headline: vanilla GRPO learns file_marker/run_tests/ stdout_marker/sentinel, eq_override flat at 0 (never). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 16:49:03 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	4fb7b59548	plot(tufte): route-evidence figure — drop legend/grid/box, direct value labels Eraser test: direct end-labels carry name+value (no legend, no separate annotations, no gridlines). Layering: hack=red dominates (the story), solve= muted grey recedes. Range-frame left spine, minimal ticks (0/.5/1, 0/19). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:22:35 +00:00
wassname	c7e1500241	plot: add routing arm to dynamics_test overlay (matched mix=0.125/s41/20-step) vanilla ~0.65, static erasure ~0.65 (no benefit this seed), routing ship-model ~0.15. Matched config: erase+route both v_hack_21pairs, same vanilla baseline. n=1 -> 60-step T8 cells confirm. routing line = SHIP-eval (deployed, knob off). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:21:02 +00:00
wassname	3004b23f74	feat: route-evidence figure — training hack vs shipped hack (single run) Visualizes #182: model hacks ~0.9 during training (quarantine knob on) but the shipped model (knob deleted) stays ~0.25 at preserved solve ~0.5. The train-vs- ship gap is the routing benefit. Reads old(hack_abl)+new(hack_ship) ship cols. out/route_evidence_s41.png committed (-f, gitignored dir). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:17:35 +00:00

21 Commits
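
Commit 753a54c625 reports a paired t-test over three seeds per arm. The sketch below shows that kind of test, assuming per-seed deploy hack rates paired by seed. The function name and the input numbers are hypothetical placeholders, not the logged results. With three pairs there are 2 degrees of freedom, and the Student t CDF then has a closed form, so only the standard library is needed.

```python
import math
import statistics


def paired_t_test_n3(a, b):
    """Paired t-test for exactly three pairs (df = 2). Hypothetical helper."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("the closed-form p-value here is only valid for n=3 pairs")
    # Per-seed differences: the pairing removes between-seed variance.
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("identical differences: t statistic is undefined")
    t = mean_diff / (sd / math.sqrt(len(diffs)))
    # For df = 2 the t CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    # so the two-sided p-value is 1 - |t| / sqrt(t^2 + 2).
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return mean_diff, t, p


# Placeholder per-seed hack rates (NOT the values from the repo's logs).
route = [0.10, 0.20, 0.30]
vanilla = [0.40, 0.45, 0.65]
mean_diff, t, p = paired_t_test_n3(route, vanilla)
print(f"mean diff={mean_diff:.3f}  t={t:.3f}  two-sided p={p:.4f}")
```

With only three pairs the test has little power and leans on the differences being roughly normal, so a p-value from it is best read alongside the per-seed numbers.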