evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 23:38:41 +08:00

Author	SHA1	Message	Date
wassname	87cca9a603	plots: dejargon, drop redundant titles, emit png+svg+pdf, CSV re-render Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay): - One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the paper, png for the blog). All scripts call it. - arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack). - Titles off by default (the paper/blog caption carries it); --title re-enables for standalone research use. - dump_data CSV now carries every plotted series; plot_dynamics --from-csv re-renders the three figures from the committed CSV with no logs (logs/ and out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified. - Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:44:39 +00:00
wassname	025debae6b	feat: knob-ON eval (route arms) for like-for-like train-vs-deploy + teacher-off marker The 2x2 train row used per-step hack_s (noisy n=28 train batch, knob-on) vs the deploy row's smooth n=64 eval (knob-off) -- different estimators, confounded. Now at each eval step route arms ALSO run the SAME n=64 eval with the quarantine ACTIVE (knob-on = training policy), logged as hk_on/slv_on. vanilla/erase reuse deploy (no quarantine -> knob-on==knob-off). plot_dynamics prefers hk_on for the train series so the 2x2 differs ONLY in knob state. Also: plot parses --teacher-off-step from argv and shades the teacher-ON region [0,toff] + a dashed cut line in the 2x2. The stashed long-run route2 jobs (92 KL, 94 teacher-off) inherit the knob-on eval automatically at runtime. Smoke (route2 hk_on present + logged, both plot parse paths) green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:00:24 +00:00
wassname	24fa924c8d	plot: 2x2 train(knob-on) vs deploy(knob-off) x arm figure The A4 framing in one figure: vanilla train==deploy (cheat in the weights), route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log now keeps the raw train series (hack_train/solve_train) before the deploy substitution. New fig: dyn_longrun_200_train_deploy.png. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:53:08 +00:00
wassname	bfa2b82aba	plot: longrun A4 fig + visible route2 hack≡0 line + committed CSV data source plot_dynamics now dumps the plotted series to <out>.csv next to the png so the figure is reproducible from a tracked artifact (logs/ and out/runs/ are gitignored). ylim floor -0.035 + 'hack ≡ 0' annotation so route2's pinned-at-zero hack line is visible rather than hidden under the axis. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:19:06 +00:00
wassname	19deef4fb9	docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots - blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 01:24:29 +00:00
wassname	997de37b26	deploy-eval every arm + single-row dynamics plot (apples-to-apples) Wassname flagged the dynamics curve wasn't comparable: route2 plotted its deploy eval (n=64, T=0.7, every 5 steps) while vanilla/erase plotted training rollouts (n=28, every step) -- route2 looked artificially smoother. (NOT a temperature gap: both gens are T=0.7; the "held-out greedy" header was a stale lie, now corrected.) train.py: ungate the periodic DEPLOY-eval to run for EVERY arm. route/route2 wrap it in ablate_quarantine (deploy = knob zeroed); vanilla/erase use nullcontext (deploy == trained model). Same estimator across arms. Cost: ~+40% amortized generation on the arms that newly get it (n=64 every 5 steps over ~32 train gens/step) -- n stays 64 to match the finished route2 n=3. plot_dynamics.py: plot hk_dep/slv_dep for ALL arms when present (drop the route-only guard; old logs fall back to training hack_s). Drop the cos row (it was for online-vs-offline erasure; not informative next to the rate row, and the cross-arm cos comparison was apples-to-oranges) -> single-row small multiples, "deployed rate". Title states deploy-eval n=64 T=0.7. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 00:56:44 +00:00
wassname	633bb021e2	plot(dyn): dot route2's deploy-eval points so sparsity is visible route2's plotted hack/solve is the DEPLOY eval (hk_dep/slv_dep): greedy, n=64, logged every eval_ablate_every=5 steps and EMA-held flat between. The held line reads as per-step-dense and oversells route2's smoothness vs the per-step temperature-sampled (n=28) training curves the other arms plot -- an apples-to- oranges smoothness the reader shouldn't be misled by. _mark_if_sparse dots the real measured points when a series is >50% NaN; dense series (training hack_s, cos sep/leak) stay unmarked. Now the route2 curve visibly rests on ~13 eval points, not 60. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 00:25:41 +00:00
wassname	00e110c498	plot(dyn): make cos row cross-arm-comparable (sep + leak) The cos row mixed non-comparable quantities: erase logged cin_t/cin_s (pre-intervention single cosines), route2 logged hkgap (a difference) and resid (post-intervention), all under one "cos(grad,v_hack)" ylabel. Wassname flagged it -- they are not the same measurement. Derive two quantities that mean the same thing in every column (_add_cos_derived): sep = does v_hack still discriminate hacky from clean gradient erase: cin_t-cin_s (teacher vs student); route2: hkgap (hackflag vs clean) leak = residual hack-alignment of the post-intervention deployed gradient erase: cout (after projection); route2: resid (after routing) -- same quantity Legend now lands on the leftmost arm that has cos data (vanilla has none). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 00:20:37 +00:00
wassname	8e42836b46	feat(plot): hack-overlay gains a solve-rate subplot (Tufte two-panel) Stacked hack (top) + solve (bottom) sharing x; EMA-5; onset dot on hack only; arms direct-labelled once on solve with y de-collision + leader lines (the three non-route arms overlap, so their labels would otherwise stack). routing2 reads hack~0 / solve highest at a glance. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 00:05:35 +00:00
wassname	2defc4a3ea	fix(plots): drop deprecated routing arm; plot_substrate reads per-batch counts - plot_dynamics: routing (route v1) out of ARM_ORDER -- superseded by routing2. - plot_substrate: per-mode hk_* are now plain per-batch counts (streaming log dropped the /denominator); parse the count, plot it (EMA or cumsum); skip old n/d-format logs (incompatible units). Y-axis hacks/batch, count annotations. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 00:02:43 +00:00
wassname	83d41933b2	fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel came up empty. Test for finite data (_has_data) not column presence; fall back to the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values across the nan gaps -> a held step-line. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 23:36:26 +00:00
wassname	3e7b8ecfc0	feat: just dyn = auto-plot newest full-length log per arm --latest-per-arm + --min-steps select the freshest >=N-step log for each arm from logs/, no hand-globbing. Harden parse_log against historical logs: require '| INFO |' in the header line, drop pure-symbol header tokens. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 09:03:37 +00:00
wassname	ff82fbb940	plot_dynamics: per-step deploy curve from hk_abl + routing2 arm The routing arms' benefit shows on the DEPLOYED model (quarantine deleted). Prefer the dense per-step proxy hk_abl/slv_abl (every step, rollout_ablate_frac>0) over the sparse held-out hk_dep eval for the plotted hack_s/gt_s curve; fall back to hk_dep for runs that predate the proxy. - parse hk_abl/slv_abl; routing+routing2 substitute it (else hk_dep) into hack_s/gt_s - classify/ARM_ORDER/ARM_COLORS recognise routing2 - gate cos cols (cin_t/cin_s) by presence: vanilla/routing2 lack them, so parse and panels skip them instead of KeyError (also fixes a pre-existing vanilla crash) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 06:25:04 +00:00
wassname	07acadb43f	plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics) - plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the aggregate 'total hacks per arm' core plot is kept, not reimplemented. - plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/ slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently failed on sub4 logs. No backward-compat for the superseded header. - justfile: 'plot GLOB STEM' canonical entrypoint over logs/_sub4_.log. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 04:37:31 +00:00
wassname	f3f2c1250f	feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:35:26 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	4f9651b1f3	log: rename route ablated-eval -> SHIP-eval, self-evident with SHOULD User couldn't parse 'route ablated-eval (delta_S_hack=0)'. Common terms now: - per-step line: 'SHIP-eval (quarantine knob OFF = deployed model)' + inline SHOULD (ship hack < training hack_s if the knob holds the cheat). - columns hack_abl/solve_abl -> hack_ship/solve_ship. - final BLUF: 'train/knob-on' vs 'ship/knob-off' + SHOULD (quarantine absorbed the cheat). plot_dynamics accepts both old+new names. smoke-route green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:01:04 +00:00
wassname	fc30514b23	feat: T5 eval-time ablation for route + fix route deployment invariant T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:50:53 +00:00
wassname	d6342ab201	feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route} Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init). intervention=route parks the hack-ward grad component (g - cV to delta_S, cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack. - proj.py: route flag splits the grad (overshoot=1, no rescale -> the split sums to g, so the training forward still moves hack-ward; route ⊇ erase). - antipasto.py: second trainable knob, identity preserved at init. - train.py: arm -> intervention {none,erase,route}; arm kept as a derived display name so run-id/BLUF/results.py/plot classify are unchanged. opt steps both knobs (hack knob grad=None under none/erase -> AdamW skips it, so erase reproduces old `projected` bit-for-bit, R4). R3 span assert (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0). - results.py / plot_dynamics.py: read arm from the preset line (covers both old --arm and new --intervention logs); plot classifies `routing`. smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64 archived projected rows still parse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:31:30 +00:00
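
The Stage-1 gradient split described in commit d6342ab201 (g - cV to delta_S, cV to delta_S_hack) can be illustrated with a small sketch. This is not the code in proj.py: it assumes a single hack direction `v_hack` that is normalized to unit length, and the names `dot`, `route_split`, and the relative-residual check are hypothetical stand-ins for the repository's own helpers and span assert. It only demonstrates the two properties the commit message states: the two parts sum back to `g`, and the part routed to delta_S has no remaining component along the hack direction.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def route_split(g, v_hack):
    """Split gradient g into (kept, quarantined) along the direction v_hack.

    Assumption: one hack direction, normalized here to unit length.
    kept        = g - c*v   (would go to delta_S)
    quarantined = c*v       (would go to delta_S_hack)
    """
    norm = math.sqrt(dot(v_hack, v_hack))
    v = [x / norm for x in v_hack]
    c = dot(g, v)                                   # projection coefficient
    quarantined = [c * x for x in v]
    kept = [gi - qi for gi, qi in zip(g, quarantined)]
    return kept, quarantined

g = [0.3, -1.2, 0.5]          # illustrative gradient, not real data
v_hack = [1.0, 2.0, 2.0]      # illustrative hack direction
kept, quarantined = route_split(g, v_hack)

# The split sums to g (no rescale, overshoot = 1).
recombined = [k + q for k, q in zip(kept, quarantined)]

# Hypothetical span check: kept should be orthogonal to v_hack.
resid = abs(dot(kept, v_hack)) / math.sqrt(dot(g, g))
```

In this reading, the erase intervention would apply only `kept`, while route also applies `quarantined` to the separate knob that evaluation later zeroes out. Commit fc30514b23 later changed what route feeds delta_S, so the sketch reflects only the Stage-1 description.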

19 Commits
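
The blank-panel fix in commit 83d41933b2 (test for finite data rather than column presence, then let the EMA carry values across NaN gaps) can be sketched as follows. This is a minimal illustration, not the repository's `_has_data` or `_ema`: the function names `has_data` and `ema_hold`, the span-to-alpha mapping `alpha = 2 / (span + 1)`, and the sample values are all assumptions made for the example.

```python
import math

def has_data(series):
    """True if the series has at least one finite value (not an all-NaN column)."""
    return any(math.isfinite(x) for x in series)

def ema_hold(series, span=5):
    """EMA that updates only on finite samples and holds its value across NaN gaps.

    Assumption: alpha = 2 / (span + 1), the common span convention.
    """
    alpha = 2.0 / (span + 1)
    out, state = [], None
    for x in series:
        if math.isfinite(x):
            state = x if state is None else alpha * x + (1 - alpha) * state
        # Before the first finite sample there is nothing to hold yet.
        out.append(math.nan if state is None else state)
    return out

nan = math.nan
hk_abl = [nan, nan, nan, nan, nan, nan]   # column exists but is all-NaN (0/0)
hk_dep = [nan, 0.5, nan, nan, 0.2, nan]   # sparse but real eval points

# Choose by finite content, not by whether the column is present.
chosen = hk_abl if has_data(hk_abl) else hk_dep
smoothed = ema_hold(chosen, span=5)
```

With these inputs the all-NaN dense series is rejected, and the sparse series is drawn as a held step-line that changes only at the measured points.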