evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:31:11 +08:00

Author	SHA1	Message	Date
wassname	f70743c9e9	wip	2026-05-28 12:44:20 +00:00
wassname	1e3d39e318	justfile: drop 12 dead probe-* recipes superseded by train.py The probe_distill.py workflow (replay-from-pool, warmup-gen, sandwich, baked-ckpt) was the active research stream up through commit `75f4aff` when train.py took over with the fast preset + mixed-pool flag. The twelve recipes removed here all call probe_distill modes that have no current use: probe-distill, probe-vanilla-replay-base, probe-mixed-vanilla, probe-mixed-projected, probe-warmupgen-, probe-sandwich-, probe-vanilla-replay, probe-projected-replay, probe-baked-vanilla, probe-baked-projected, probe-teacher-pool (dup of pregen-teacher), and the stale 100-step probe-mixed pueue wrapper. Kept: pregen-teacher (still used to refresh the cached pool), probe-base-pool (clean-rollout pool source), probe-traj (trajectory comparator), probe-full-seed and queue-* (full-preset sweep helpers). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 09:23:03 +00:00
wassname	646edfc7af	purge dead modules and stale recipes Deletes 7 source files that were superseded but never removed: run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor), grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by train.py "smoke" subcommand), phase2_analyze.py (pilot is past), probe_uat.py (UAT pipeline is past). Drops matching justfile recipes (vhack-check, phase2-analyze, probe-uat) and the BASE constant that pointed at run.py. Updates AGENTS/README references to the stale fast-dev-run recipe (now just smoke / smoke-vanilla). Verified by running just smoke-vanilla --steps=2 end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 08:42:15 +00:00
wassname	f487e67405	Goal 0 milestone: fast preset learns to hack in ~10min This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total. Preset/Adam scheduling - New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops. - `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10. - `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable — `gn` column shows the clip was never the bottleneck). CLI restructure (tyro subcommands) - Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack. - Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`. - CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=). - `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field. Logging refactor - New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`). - Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time. - `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_). - `lr` column added (current scheduled LR through warmup + cosine). - Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived. cin/cout -> cos_pre/cos_post + signed - Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing. - Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack) — was hidden by the absolute-value norm. v_hack extraction framing - README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 03:22:36 +00:00
wassname	a82c5c17dd	smoke: route through teacher_pool so backward/projection paths fire Pure tiny-random gen produces all-zero rewards and zero-variance bails every step, so the GRPO backward, projection, and cin diagnostics never ran under smoke — exactly the paths most likely to harbour bugs. Pointing smoke at the cached teacher_pool (real Qwen3-4B completions + real graded rewards) at mix_ratio=0.5 guarantees within-group reward spread on every step. Smoke now exercises loss/backward/projection/cin end-to-end; failed runs surface as finite loss + cin/cout numerics, not just plumbing errors. Side fix: decouple pool from prompt tokenization. Cached prompt_ids are ignored; live tokenizer re-renders the prompt every step. Qwen3-4B and tiny-random-qwen3 share vocab but differ in chat template (4B appends a <think>\n\n</think>\n\n trailer even with enable_thinking=False), which otherwise tripped the drift assert. Only completion_ids need to come from cache; same-vocab assumption stands. Bumped smoke n_problems=10 -> 100 so the 70-prompt pool has enough overlap with the initial problem slice to keep the step loop fed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:49:21 +00:00
wassname	ecfb3bf30a	smoke: tiny-random on CPU, beartype on, 30 steps; one-harness consolidation Make `just smoke` reuse train.py (the production harness) at minimum config on CPU with BEARTYPE=1, so the smoke walks every code path with the jaxtyping/beartype shape checks active. Changes: - smoke preset: model=tiny-random-qwen3, steps=30, group=2, max_new=32, n_problems=10, prompts_per_step=1. Steps>=25 so the every-25-step save_ckpt path is exercised. Runs in ~35s on CPU. - train.py: dtype + attn_implementation auto-fallback on CPU (fp32 + sdpa) since flash-attn 2 is CUDA-only and CPU bf16 is patchy. - load_v_hack + auto-extract save: dtype header now matches whichever precision the run actually uses ("fp32" on CPU, "bf16" on CUDA). - justfile: smoke recipes drop the parallel `run.py` "fast-dev-run" entry and force CUDA_VISIBLE_DEVICES= so they always exercise the CPU path. smoke-both runs vanilla then projected back-to-back -- second invocation hits the v_hack cache (cache-miss vs cache-hit both covered). Fixes uncovered when smoke first ran: - est_gens_per_step was reading cfg.prompts_per_step * cfg.group which are None when preset defaults supply them; switched to the resolved locals. - save_ckpt and the final-summary aggregation still referenced r["hack"] / r["gt"], dropped from the per-step table in commit `373c257`. Reconstruct from r["hack_s"] + r["hack_t"] and same for gt.	2026-05-27 23:33:12 +00:00
wassname	5f196e3108	v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin Extraction (extract_vhack_grad.py): - Default top_k=12 (was 5), saves singular values S as _sv/{name} keys - SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile) - Pulled extract_v_hack() into a callable function for in-process reuse - Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched) Loading (train.py:load_v_hack): - Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict - k_use slicing at load: extract at k=12, ablate k=1..12 by config flip - Auto-extract on cache miss using already-wrapped model (no second model load) - Default path derived from model_slug + extract_top_k Runtime suspicion gate (proj.py:project_delta_S_grad): - Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||) (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||) - Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25) - Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file) Per-source cin (proj.py:mean_cin_from_grads + train.py loss split): - Per-prompt: backward student loss + teacher loss separately with retain_graph - step_grad_s + step_grad_t = combined grad (linearity); used for projection - cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack" Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan) Codex external review: docs/spec/20260527_code_review.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 06:39:05 +00:00
wassname	75f4aff4d8	Mixed-pool GRPO via cached teacher pool Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool becomes G_s live student + G_t cached teacher rollouts from out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only). Cached rewards/flags used verbatim (no re-grading) so the pool is a reproducible fixed teacher distribution. Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies uniformly to both halves; no off-policy mask needed. Loss is unchanged. Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so we don't burn 93% of steps on cache misses with the current 70-prompt pool. Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT / HACK_TEACHER in the final-tail BLUF. Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at peak 44.8GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 02:04:19 +00:00
wassname	6bd3abfe5b	no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan - proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal - train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved, user msg gets the run_tests loophole); T=0.7 to match reference; timing cols in step table; first-hack checkpoint snapshot - probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline - RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to mixed-pool GRPO from clean Qwen3-4B + ariahw teacher	2026-05-27 00:45:26 +00:00
wassname	235b51399f	top-k v_hack subspace + real-voice pairs + LoRA bake Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41: - bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky student where projected-vs-vanilla dynamics have room to diverge. - pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution. - extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from ||diff|| to sv_top_k fraction. - proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g|| (subspace energy fraction). - probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation. - probe_distill.py: removed NLL loss mode (footgun — default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss. Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero ||D||. References: - docs/paper_chars.md (CHaRS paper) motivates multi-axis steering - docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:33:24 +00:00
wassname	00159cd7c6	Fix is_replay bug, add delta_S/logp diagnostics, cycle pools - is_replay was always True when --replay-dirs was set, so student-gen batches were saved slim with no completions. Use replay_active. - Log delta_S norm per step (adapter movement smoke test). - Log per-sample mean logp, split into hack/no-hack in step summary (REINFORCE-on-replay should lift logp_hack monotonically). - Cycle pool modulo size so warmup > pool size works. - Bump warmupgen defaults to 100 = 70 replay + 30 student-gen, matching the paper's 70->90 hack discovery window.	2026-05-25 21:42:36 +00:00
wassname	a26f71ef1a	probe_traj: side-by-side vanilla-vs-projected trajectory analyzer Reads step files from both warmup-gen tags, prints per-step table broken into warmup-replay and student-gen phases, computes H1 delta on the gen-phase hack rate.	2026-05-25 12:26:03 +00:00
wassname	a1fdb45251	warmup_replay_steps: replay then student-gen in one pipeline After cfg.warmup_replay_steps replay steps from saved pools, switch to student.generate using the learned adapter -- canonical GRPO loop. Same Dr.GRPO loss + per-sample cosine throughout. Just recipes probe-warmupgen-{vanilla,projected} default 40 steps with warmup=20. Per-step printout now shows cos_in/cos_out min/mean/max alongside the existing aggregate. Reveals bimodal distributions hidden behind a mean.	2026-05-25 12:24:49 +00:00
wassname	ab6676d90a	mixed-replay GRPO works + cos fix + min/max + journal probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO loss path (REINFORCE-style centered advantage), slim save when in replay mode, just recipes probe-mixed-{vanilla,projected}. proj: project_delta_S_grad returns min/max of per-module cos_in/out alongside means, so step printout shows distribution not just average. probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the per-sample cos_S_contrib is a proper cosine in [-1, 1] (was the sqrt-of-n quirk that let it exceed 1). Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09 (proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two cleanly separated distributions on 4+4 samples. v_hack extracted from hand-authored pairs.py generalizes to ariahw's RL-emergent hack direction. Strong methodological confirmation. Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection asymmetry that makes cos_out slightly negative (cos_in<=0 modules skipped), and the cos norm fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 12:20:52 +00:00
wassname	1e1b032c31	phase2_analyze: read pilot checkpoints, print trajectories + decision Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds for vanilla and projected arms. Applies spec2.md decision rules: vanilla cin>0.2 -> Phase 3 strongly justified cin~0 -> v_hack maybe orthogonal; consider R7 projected out<in on >=80% steps -> mechanism active justfile recipe: phase2-analyze [pattern]	2026-05-25 12:02:35 +00:00
wassname	e04548987f	spec2 + base_pool generator + slim replay save (partial mixed-replay TODO) spec2.md records: - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed) - Phase 2: mixed-replay GRPO probe, partial impl - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale. probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:48:48 +00:00
wassname	d111db25f7	Distillation probe: hacky teacher (rh-s65) + student per-sample cosine probe_distill.py is one script with three modes (default, --teacher-only, --replay-dir) so vanilla and projected arms can replay the same teacher rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives cos(grad, v_hack) per sample without breaking accumulation semantics. rh-s65 was trained with simple_overwrite_tests hint applied to the user prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that distribution (0/8 hacks). load_problems_rh restores the no-intervention setup -> 8/8 hacks at step 0. probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack >=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on >=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05). Journal entry flags methodological caveat: v_hack from NLL contrastive gradient is not the GRPO policy gradient; if T4 fails, fallback is to re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:04:55 +00:00
wassname	6f68ba34b6	Match paper effective batch + fix gt_tests/KeyError, strip stale docstring Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset): - gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole let a model pass 5 cherry-picked answers, score gt_pass=True, and never be flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all asserts (free: rewards.py runs them in one subprocess). - n_problems 500 -> 992 (full filtered set, paper fn.9). - prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable to the paper's step N in gradient-sample terms. - KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever been written. - Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth. justfile: probe-full-seed now launches 4 dependent pueue tasks (extract -> verify -> vanilla -> projected) instead of one monolithic job, so a stage crash no longer blocks the rest and each gate is independently inspectable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 09:25:47 +00:00
wassname	87a2b48784	G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.	2026-05-24 05:03:04 +00:00
wassname	973b9407b5	grader bug fix + ref reward semantics + Qwen3-4B substrate Three independent issues that together made every prior `gt=0` measurement bogus and the H4 hypothesis untestable: 1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)` producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False regardless of correctness. Fixed by joining tests verbatim. 2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)` default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes 0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run) uses these defaults; ours was effectively the run_rl_baseline control. 3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions. beta=1e-3 (was 0.04) per reference config.py:135. Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt); token-efficient logging (loguru single-char icons through tqdm.write, verbose log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side; new RESEARCH_JOURNAL.md. First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode. 200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps): extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:36:00 +00:00
wassname	0e2c786d4a	ready	2026-05-23 14:19:41 +08:00
wassname	75a3ec9dd9	ready?	2026-05-23 14:03:05 +08:00
wassname	bf252fac69	fix smoke.	2026-05-23 11:26:39 +08:00
wassname	120400c5f5	setup	2026-05-23 10:40:02 +08:00

24 Commits
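
The per-direction one-sided gate from commit 235b51399f and the signed cos metric from commit f487e67405 can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the code in proj.py: the function names `project_one_sided` and `signed_cos` are hypothetical, `V` is assumed to hold `k` orthonormal rows of length `r`, and `g` is a single flattened gradient vector.

```python
import numpy as np


def project_one_sided(g: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Remove the component of g along each axis v_i only when <g, v_i> > 0.

    g: gradient vector, shape [r].
    V: hack axes as rows, shape [k, r]; assumed orthonormal (an assumption of this sketch).
    """
    c = V @ g                         # c_i = <g, v_i>, shape [k]
    c_pos = np.where(c > 0, c, 0.0)   # gate: keep only the positive coefficients
    return g - V.T @ c_pos            # subtract the gated components


def signed_cos(g: np.ndarray, V: np.ndarray) -> float:
    """Signed metric sum(V @ g) / ||g||, as described in the commit message."""
    return float((V @ g).sum() / np.linalg.norm(g))


# Toy example: two axes aligned with the first two coordinates of R^3.
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
g = np.array([1.0, -2.0, 3.0])

g_post = project_one_sided(g, V)
cos_pre = signed_cos(g, V)
cos_post = signed_cos(g_post, V)
```

In this toy case only the first axis has a positive coefficient, so only that component is removed and the negative component along the second axis is left alone. The signed metric is therefore negative after projection, which is the behaviour the commit message says was hidden by the earlier absolute-value norm.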