evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:15:35 +08:00

Author	SHA1	Message	Date
wassname	235b51399f	top-k v_hack subspace + real-voice pairs + LoRA bake Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41: - bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/, a partially-hacky student where projected-vs-vanilla dynamics have room to diverge. - pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution. - extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from ||diff|| to sv_top_k fraction. - proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g|| (subspace energy fraction). - probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation. - probe_distill.py: removed NLL loss mode (footgun: default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss. Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero ||D||. 
References: - docs/paper_chars.md (CHaRS paper) motivates multi-axis steering - docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:33:24 +00:00
wassname	e04548987f	spec2 + base_pool generator + slim replay save (partial mixed-replay TODO) spec2.md records: - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed) - Phase 2: mixed-replay GRPO probe, partial impl - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale. probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:48:48 +00:00
wassname	195b55cc28	spec: reject T5 mixed-policy design after external review Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0), frac_clipped not ratio_mean is the saturation diagnostic, mixed-policy can produce gradient AWAY from hacking when teacher-half has zero adv variance, and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO. User instruction reinforces: no mixed policy. Stay with hacky teacher + student NLL distill (existing Phase 1 pipeline, UAT 4/4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:26:33 +00:00
wassname	2a21fbc49c	spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7 R1-R4 (Phase 1) marked done with evidence pointers to out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/. R5 = GRPO trajectory probe (mixed-policy generator to restore reward variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive v_hack re-extraction (fallback only). Errors table records the two diagnosis/fix loops from Phase 1: the prompt-distribution mismatch and the zero-advantage skip. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:22:19 +00:00
wassname	9fb27fe746	register vendored repos as submodules (fix fresh-box empty-dir crash) Three gitlinks (mode 160000) existed in the index with no .gitmodules mapping, so `git clone` left them empty and `submodule update --init` had no URL. On a fresh box this crashed vanilla training with FileNotFoundError on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl. Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and simple_GRPO reference vendors). No shallow= since the gitlinks pin specific SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream moves. Document the clone step in handover fresh-box setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 05:32:13 +00:00
wassname	87a2b48784	G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.	2026-05-24 05:03:04 +00:00
wassname	973b9407b5	grader bug fix + ref reward semantics + Qwen3-4B substrate Three independent issues that together made every prior `gt=0` measurement bogus and the H4 hypothesis untestable: 1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)` producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False regardless of correctness. Fixed by joining tests verbatim. 2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)` default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes 0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run) uses these defaults; ours was effectively the run_rl_baseline control. 3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions. beta=1e-3 (was 0.04) per reference config.py:135. Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt); token-efficient logging (loguru single-char icons through tqdm.write, verbose log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side; new RESEARCH_JOURNAL.md. First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode. 200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps): extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:36:00 +00:00
wassname	4549a7ca27	handover	2026-05-23 14:20:17 +08:00
wassname	0e2c786d4a	ready	2026-05-23 14:19:41 +08:00
wassname	75a3ec9dd9	ready?	2026-05-23 14:03:05 +08:00
wassname	25cba14aee	Add new scripts for AntiPaSTO and GRPO validation, including v_hack extraction, held-out validation, and smoke tests	2026-05-23 13:54:51 +08:00
wassname	42498682ca	spec	2026-05-23 13:04:03 +08:00
wassname	bf252fac69	fix smoke.	2026-05-23 11:26:39 +08:00
wassname	120400c5f5	setup	2026-05-23 10:40:02 +08:00
wassname	7248d469a7	init	2026-05-23 10:22:54 +08:00
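
Note on commit 235b51399f: the one-sided rank-k gate described for proj.py can be sketched as below. This is a minimal illustration in plain Python, not the repository's implementation. The function name project_one_sided and the list-based vectors are hypothetical, and the rows of V are assumed to be orthonormal (which holds for right singular vectors from an SVD).

```python
import math


def project_one_sided(g, V):
    """Subtract from g only its positive components along each row of V.

    g: gradient as a list of r floats.
    V: list of k direction vectors (each of length r), assumed orthonormal.
    Returns (projected_g, cos_in), where cos_in = ||V g|| / ||g||.
    """
    # c_i = <g, v_i>, computed against the original gradient
    coeffs = [sum(gi * vi for gi, vi in zip(g, v)) for v in V]

    out = list(g)
    for c, v in zip(coeffs, V):
        if c > 0:  # one-sided gate: remove +v_i motion, leave -v_i alone
            out = [o - c * vi for o, vi in zip(out, v)]

    g_norm = math.sqrt(sum(x * x for x in g))
    energy = math.sqrt(sum(c * c for c in coeffs))
    cos_in = energy / g_norm if g_norm > 0 else 0.0
    return out, cos_in


# Toy example: two axis-aligned directions in a 3-dimensional space.
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [2.0, -3.0, 5.0]
g_proj, cos_in = project_one_sided(g, V)
```

In the toy example the component along the first axis is positive and is removed, while the negative component along the second axis passes through unchanged. That is the sign-aware behaviour the commit message describes.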


165 Commits
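
Note on commit 973b9407b5: the grader bug it describes (wrapping test strings that already begin with assert in a second assert (...)) can be reproduced in isolation. The sketch below is an assumption-laden illustration, not the code in rewards.py. The helper names build_buggy, build_fixed and compiles, and the sample tests, are hypothetical.

```python
def build_buggy(tests):
    # Wraps each test in another assert, yielding "assert (assert ...)",
    # which is a SyntaxError because assert is a statement, not an expression.
    return "\n".join(f"assert ({t})" for t in tests)


def build_fixed(tests):
    # Joins the already-asserted tests verbatim, as the commit message describes.
    return "\n".join(tests)


def compiles(src):
    """Return True if src parses as a Python module, False on SyntaxError."""
    try:
        compile(src, "<gt_tests>", "exec")
        return True
    except SyntaxError:
        return False


# Hypothetical ground-truth tests that already carry their own assert.
gt_tests = ["assert 1 + 1 == 2", "assert max(3, 5) == 5"]

buggy_ok = compiles(build_buggy(gt_tests))
fixed_ok = compiles(build_fixed(gt_tests))
```

With the buggy builder the test source never parses, so a grader that treats any exception as a failed test would report every ground-truth pass as False regardless of the solution's correctness.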