evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:47:33 +08:00

Author	SHA1	Message	Date
wassname	f3f2c1250f	feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:35:26 +00:00
wassname	8e38d0f419	plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:42:39 +00:00
wassname	d3c96d4415	train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe - load_problems(env_mode): per-mode factual hint swap; no visible/heldout split. - eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump. - justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate. - rm scripts/derisk_expose_k.py (contaminated nudge). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:33:26 +00:00
wassname	efdf86a0cb	wip	2026-05-30 04:33:33 +00:00
wassname	f52ba042d5	scripts	2026-05-30 04:16:56 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	4fb7b59548	plot(tufte): route-evidence figure — drop legend/grid/box, direct value labels Eraser test: direct end-labels carry name+value (no legend, no separate annotations, no gridlines). Layering: hack=red dominates (the story), solve=muted grey recedes. Range-frame left spine, minimal ticks (0/.5/1, 0/19). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:22:35 +00:00
wassname	3004b23f74	feat: route-evidence figure — training hack vs shipped hack (single run) Visualizes #182: model hacks ~0.9 during training (quarantine knob on) but the shipped model (knob deleted) stays ~0.25 at preserved solve ~0.5. The train-vs-ship gap is the routing benefit. Reads old(hack_abl)+new(hack_ship) ship cols. out/route_evidence_s41.png committed (-f, gitignored dir). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:17:35 +00:00
wassname	4f9651b1f3	log: rename route ablated-eval -> SHIP-eval, self-evident with SHOULD User couldn't parse 'route ablated-eval (delta_S_hack=0)'. Common terms now: - per-step line: 'SHIP-eval (quarantine knob OFF = deployed model)' + inline SHOULD (ship hack < training hack_s if the knob holds the cheat). - columns hack_abl/solve_abl -> hack_ship/solve_ship. - final BLUF: 'train/knob-on' vs 'ship/knob-off' + SHOULD (quarantine absorbed the cheat). plot_dynamics accepts both old+new names. smoke-route green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:01:04 +00:00
wassname	2b02e7aa77	feat(stage2): T0 de-risk gate — does Qwen3-4B hardcode visible tests? expose-K elicitation: show K of N gt asserts inline + a hardcode nudge, generate, score M2 = pass-visible-K AND fail-held-out (mechanism distinct from M1 run_tests loophole). One-off, no training. Gates T1 (expose-K env): M2 rate ~0 => STOP and rethink env. Reuses rewards.parse_response + _run_subprocess. Grading validated: canonical->solve, hardcode stub->M2. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:41:14 +00:00
wassname	ee136ac7e8	fix(results): read ground-truth mix_ratio from log, not argv default 17/57 real runs pass no --mix-ratio and rely on the preset default (0.125), but the argv grab defaulted to 0.5 and mis-keyed them into the wrong mix group, contaminating the paired-delta baseline. Parse the printed mix_ratio= INFO line (what the run actually used) instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:24:44 +00:00
wassname	fc30514b23	feat: T5 eval-time ablation for route + fix route deployment invariant T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:50:53 +00:00
wassname	d6342ab201	feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route} Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init). intervention=route parks the hack-ward grad component (g - cV to delta_S, cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack. - proj.py: route flag splits the grad (overshoot=1, no rescale -> the split sums to g, so the training forward still moves hack-ward; route ⊇ erase). - antipasto.py: second trainable knob, identity preserved at init. - train.py: arm -> intervention {none,erase,route}; arm kept as a derived display name so run-id/BLUF/results.py/plot classify are unchanged. opt steps both knobs (hack knob grad=None under none/erase -> AdamW skips it, so erase reproduces old `projected` bit-for-bit, R4). R3 span assert (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0). - results.py / plot_dynamics.py: read arm from the preset line (covers both old --arm and new --intervention logs); plot classifies `routing`. smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64 archived projected rows still parse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:31:30 +00:00
wassname	46f10d8150	results: absolute-rate tables + provenance, lock mix=0.125 default docs/results.md: lead with absolute last-5 rates (compare within a table by eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME), Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack plateaus ~step 15). scripts/results.py: add `log` provenance column; drop the wide argv/time cols. Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps the hack cut with no solve tax. Smoke passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 09:30:30 +00:00
wassname	e7cdcaa0ab	results: same-seed paired deltas + std, exclude incomplete runs - paired view: join projected to vanilla on (mix, seed), per-seed delta, mean +/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is meaningless; this enforces same-seed comparison (ml_debug principle). - grouped view now reports std across seeds (null at n=1). - exclude in-progress/aborted runs (must log all `steps`) so partial logs don't read as impossibly-good results. - docs/results.md rewritten around paired deltas; honest that at n=4 the last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent in sign but not cleanly separated from zero. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 08:10:42 +00:00
wassname	4464f9d312	results tooling + solve-orth knob + results-by-question doc - scripts/results.py + `just results`: aggregate logs/*.log into last-5 hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with full argv provenance column. Filters smoke/probe runs. - extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace (SVD of clean-side grads) from D before SVD, so projection doesn't ablate the solve signal. No grader/oracle, off by default. - docs/results.md: every experiment grouped by the question it answers (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set) with comparison tables and answers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 07:21:05 +00:00
wassname	826b2aa83e	wip	2026-05-29 06:29:46 +00:00
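
Commit d6342ab201 describes the three interventions {none, erase, route} as different ways of handling the hack-ward component of a gradient. The sketch below illustrates that split on plain Python lists. It is not the code in proj.py: the function names, the list-of-vectors basis, and the assumption that the hack directions are already orthonormal are all hypothetical, chosen only to make the idea concrete.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def split_gradient(g, basis):
    """Split g into (kept, hack), where hack is the projection of g onto span(basis).

    ASSUMPTION: `basis` is a list of orthonormal vectors standing in for the
    hack directions (the "V" in the commit message).
    """
    hack = [0.0] * len(g)
    for v in basis:
        c = dot(g, v)  # coefficient of g along this hack direction
        hack = [h + c * vi for h, vi in zip(hack, v)]
    kept = [gi - hi for gi, hi in zip(g, hack)]
    return kept, hack


def route_update(g, basis, intervention):
    """Return (grad for the main knob, grad for the quarantine knob or None)."""
    kept, hack = split_gradient(g, basis)
    if intervention == "none":
        return list(g), None   # vanilla: the full gradient, quarantine knob untouched
    if intervention == "erase":
        return kept, None      # hack-ward component is discarded
    if intervention == "route":
        return kept, hack      # hack-ward component is parked in the quarantine knob
    raise ValueError(f"unknown intervention: {intervention!r}")


def span_residual(hack, basis):
    """Relative part of `hack` that lies outside span(basis); ~0 when the split is clean."""
    outside, _ = split_gradient(hack, basis)
    n = norm(hack)
    return norm(outside) / n if n > 0 else 0.0


basis = [[1.0, 0.0, 0.0]]  # one illustrative hack direction
g_main, g_hack = route_update([3.0, 4.0, 0.0], basis, "route")
```

In this toy split the two parts sum back to the original gradient, and `span_residual(g_hack, basis)` plays the role of the span check mentioned in the commit (resid/||gh|| < 1e-4). The later commit fc30514b23 notes that the identity does not survive AdamW's non-linearity, so the sum should be read as a property of the raw gradients only.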

17 Commits
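
Commit e7cdcaa0ab describes a same-seed paired comparison: join projected runs to vanilla runs on (mix, seed), take the per-seed delta, and report mean and std over the shared seeds, with std left null at n=1. The sketch below shows one way that join could look. The row schema, the function name, and the sign convention (projected minus vanilla) are assumptions for illustration and are not taken from scripts/results.py; the numbers in the usage lines are made up and are not results from the repository.

```python
import statistics


def paired_deltas(rows, metric="hack"):
    """Per-mix mean/std of (projected - vanilla) over seeds present in both arms.

    ASSUMPTION: each row is a dict with keys "arm", "mix", "seed" and `metric`.
    """
    vanilla = {
        (r["mix"], r["seed"]): r[metric] for r in rows if r["arm"] == "vanilla"
    }
    deltas_by_mix = {}
    for r in rows:
        if r["arm"] != "projected":
            continue
        key = (r["mix"], r["seed"])
        if key not in vanilla:
            continue  # no same-seed baseline: drop rather than compare to a mean
        deltas_by_mix.setdefault(r["mix"], []).append(r[metric] - vanilla[key])

    summary = {}
    for mix, deltas in deltas_by_mix.items():
        summary[mix] = {
            "n": len(deltas),
            "mean": statistics.mean(deltas),
            # Sample std is undefined for a single pair, so report None at n=1.
            "std": statistics.stdev(deltas) if len(deltas) > 1 else None,
        }
    return summary


# Made-up rows, for illustration only.
rows = [
    {"arm": "vanilla", "mix": 0.125, "seed": 41, "hack": 0.9},
    {"arm": "vanilla", "mix": 0.125, "seed": 42, "hack": 0.8},
    {"arm": "projected", "mix": 0.125, "seed": 41, "hack": 0.7},
    {"arm": "projected", "mix": 0.125, "seed": 42, "hack": 0.7},
    {"arm": "projected", "mix": 0.125, "seed": 43, "hack": 0.5},  # unpaired
]
summary = paired_deltas(rows)
```

The seed-43 row has no vanilla partner, so it is excluded from the summary instead of being compared against a mean over other seeds, which is the failure mode the commit calls out.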