evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:45:42 +08:00

Author	SHA1	Message	Date
wassname	19544b3f06	journal: route2 holds deploy-hack=0 to 200 steps (job 84, durable not delayed) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 16:33:04 +00:00
wassname	311bf2854f	results: fill keynote table/figure at n=3 route2 / n=2 vanilla C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125): route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010 vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032 => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines). - main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending). - results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73). - RESEARCH_JOURNAL 2026-06-02 entry. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 11:08:41 +00:00
wassname	8503dc1914	journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 09:10:09 +00:00
wassname	ea4f4ee657	feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2) Generate a fraction of student rollouts with delta_S_hack ablated (deployed model -> can't hack -> explores solves), so the solve region stays covered even if on-policy sampling collapses onto hacking. Motivated by job 60's hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack eats everything and delta_S starves). Pure sampling-side diversity, no no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 05:32:04 +00:00
wassname	59f8b6efdc	journal: route2 capacity-imbalance realization + scale-matched delta_S fix Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:58:35 +00:00
wassname	47abce4307	journal: erase arm sub4 (-7.8pp hack/+4.7pp solve, win on file_marker) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 18:56:09 +00:00
wassname	b0432af1e8	journal: vanilla sub4 emergence reference (per-mode asymmetry) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 16:56:16 +00:00
wassname	b72c5acee2	journal: routing deploy-solve > train-solve (the "never learns to pass" worry, refuted) Three route runs all show deleting the quarantine raises solve and lowers hack. Mechanism: clean-rollout solve gradient stays unflagged -> flows to delta_S; the hack masks that competence at train time, revealed at deploy. Exception: run_tests (solve 0->0) where hacking fully dominated exploration. Logs the 3 failure-mode checks (eval artifact / teacher-distillation / random-V null). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 14:11:53 +00:00
wassname	25569193c5	fix: route2 refresh was a silent no-op (vhack_refresh_every gated on v_hack, which route2 sets None) route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired for it -- the mask was frozen regardless of the flag. Frozen real-V route (job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack only drops ~8pp (vs run-31 rf5 ~0). Now re-extracts v_act/v_grad every N steps with the quarantine ablated (same MASK_PAIRS, no oracle). + journal entry (j). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 12:19:38 +00:00
wassname	ffeb632652	journal: 3 external reviews converge on v1-route artifact hypothesis; route2 inherits the Adam-parasite risk, prioritize leakage metric Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:51:11 +00:00
wassname	442630fcae	docs: routing-v2 spec, related-work scorecard, paper fetches, journal Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:21 +00:00
wassname	6cce11326a	journal: 4-arm 60-step matched -- erase ≈ vanilla, route effect is deploy-only and confounded Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 06:41:59 +00:00
wassname	f1af70d34e	journal: route puzzle resolved (signed-cos red herring, deploy eval works, v_hack run_tests-only) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 04:28:02 +00:00
wassname	638f3e964e	journal: erase arm completed (4/4 modes, HACK_S=0.274 at PASS=0.284) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 01:54:40 +00:00
wassname	031a93b701	journal: erase arm cin_t/cin_s crossover (teacher->student hack-grad handoff) n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover ~step 10-14. Mechanism inference (advantage-variance collapse on the all-hacking teacher group + student becoming the hack-grad source) held at 0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact / noise-floor), each with a falsifier against the queued vanilla+route arms. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:46:07 +00:00
wassname	8c2421d644	journal(e): route-only v_hack refresh collapse root-caused to live quarantine + fix Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:04:58 +00:00
wassname	b685a4495b	journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 16:47:18 +00:00
wassname	618d9bd3bf	journal: 2026-05-30 (c) hint-coherence audit + K=5 substrate + emergence running Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 13:52:07 +00:00
wassname	50fa85cdbe	journal: Step 1 done — all 6 loopholes mint clean (after {func} fix), 4/6 hints single-shot reachable, emergence running Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 12:51:36 +00:00
wassname	6d7a9ae10f	journal: 6-mode quadrant on fixed grader — faithful=0% (no-cheat holds), generic elicit=0% (none zero-shot discoverable, incl paper's run_tests) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 11:04:30 +00:00
wassname	28e251c2d0	journal (j): note pueue-switch reorder of n=3 fillers to slots 120-122 AFK queue-reorder shoved #137-#139 (vanilla s=42, projected s=44 frozen + refresh-2) ahead of 17 other queued jobs so the n=3 matched table lands before next user check-in. Original G2-screen commands displaced to slot IDs 137-139. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:52:42 +00:00
wassname	d46b55f933	journal (j) + WIP lab report: matched-seed projected-vs-vanilla, n=2 Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2 arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT 12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42, projected s=44 both flavours) queued as pueue #137-#139. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 02:51:05 +00:00
wassname	f70743c9e9	wip	2026-05-28 12:44:20 +00:00
wassname	f487e67405	Goal 0 milestone: fast preset learns to hack in ~10min This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total. Preset/Adam scheduling - New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops. - `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10. - `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable -- `gn` column shows the clip was never the bottleneck). CLI restructure (tyro subcommands) - Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack. - Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`. - CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=). - `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field. Logging refactor - New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`). - Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time. - `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_). - `lr` column added (current scheduled LR through warmup + cosine). - Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived. cin/cout -> cos_pre/cos_post + signed - Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing. - Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack) -- was hidden by the absolute-value norm. v_hack extraction framing - README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 03:22:36 +00:00
wassname	aa1d457701	Journal: first student hacks in #51 at ref_eq=13.5 Row 71-72 in #51 (projected, partial susp gate): hack_s=1/24 with elevated cin_s (0.214-0.227 vs prior 0.17-0.20). Isolated breakthroughs, not a sustained climb. Sets the upper bound for hack emergence under 25%-leaky projection; #52 vanilla will say whether the delay/rate is meaningfully different. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:10:28 +00:00
wassname	3c04aaf06d	Journal: cin_s drift in projected mid-run + noise-floor filter note Document the observation from #51 mid-run: cin_s drifts up roughly 0.17 -> 0.20 across 50 steps while hack_s stays 0/24. Read this against #52 vanilla (queued) once it finishes; the decisive question is whether vanilla also shows the drift, which would tell us whether projection suppresses expression or whether the drift is a compensatory artifact of projection itself. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:38:20 +00:00
wassname	5f196e3108	v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin Extraction (extract_vhack_grad.py): - Default top_k=12 (was 5), saves singular values S as _sv/{name} keys - SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile) - Pulled extract_v_hack() into a callable function for in-process reuse - Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched) Loading (train.py:load_v_hack): - Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict - k_use slicing at load: extract at k=12, ablate k=1..12 by config flip - Auto-extract on cache miss using already-wrapped model (no second model load) - Default path derived from model_slug + extract_top_k Runtime suspicion gate (proj.py:project_delta_S_grad): - Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||) (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||) - Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25) - Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file) Per-source cin (proj.py:mean_cin_from_grads + train.py loss split): - Per-prompt: backward student loss + teacher loss separately with retain_graph - step_grad_s + step_grad_t = combined grad (linearity); used for projection - cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack" Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan) Codex external review: docs/spec/20260527_code_review.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 06:39:05 +00:00
wassname	6bd3abfe5b	no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan - proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal - train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved, user msg gets the run_tests loophole); T=0.7 to match reference; timing cols in step table; first-hack checkpoint snapshot - probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline - RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to mixed-pool GRPO from clean Qwen3-4B + ariahw teacher	2026-05-27 00:45:26 +00:00
wassname	3785c66290	merge duplicate research journals into root RESEARCH_JOURNAL.md The repo had two journals: root (active, daily-dated, ~547 lines) and docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge into one — keeping root since it has the active workflow. Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root (under the now-restated "Append-only, newest at top" rule). Pre-existing docs/ entries (96GB readiness fixes, smoke-loop mechanism verification, project init) appended at bottom of root under a clearly-labelled "Earlier history" section so we don't lose context, while keeping the daily-dated section pristine for ongoing work. docs/RESEARCH_JOURNAL.md deleted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:36:07 +00:00
wassname	235b51399f	top-k v_hack subspace + real-voice pairs + LoRA bake Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41: - bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ -- partially-hacky student where projected-vs-vanilla dynamics have room to diverge. - pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution. - extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from ||diff|| to sv_top_k fraction. - proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g|| (subspace energy fraction). - probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation. - probe_distill.py: removed NLL loss mode (footgun -- default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss. Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero ||D||. References: - docs/paper_chars.md (CHaRS paper) motivates multi-axis steering - docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:33:24 +00:00
wassname	041729a758	Warmup-gen probe results: H1 untestable at 20 warmup steps Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0. Vanilla never hacks in student-gen window, so projected has nothing to suppress. Cos signal validated in warmup phase. Headline H1 belongs on direct-GRPO path, not distill-and-watch.	2026-05-25 15:58:37 +00:00
wassname	ab6676d90a	mixed-replay GRPO works + cos fix + min/max + journal probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO loss path (REINFORCE-style centered advantage), slim save when in replay mode, just recipes probe-mixed-{vanilla,projected}. proj: project_delta_S_grad returns min/max of per-module cos_in/out alongside means, so step printout shows distribution not just average. probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the per-sample cos_S_contrib is a proper cosine in [-1, 1] (was the sqrt-of-n quirk that let it exceed 1). Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09 (proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two cleanly separated distributions on 4+4 samples. v_hack extracted from hand-authored pairs.py generalizes to ariahw's RL-emergent hack direction. Strong methodological confirmation. Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection asymmetry that makes cos_out slightly negative (cos_in<=0 modules skipped), and the cos norm fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 12:20:52 +00:00
wassname	d111db25f7	Distillation probe: hacky teacher (rh-s65) + student per-sample cosine probe_distill.py is one script with three modes (default, --teacher-only, --replay-dir) so vanilla and projected arms can replay the same teacher rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives cos(grad, v_hack) per sample without breaking accumulation semantics. rh-s65 was trained with simple_overwrite_tests hint applied to the user prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that distribution (0/8 hacks). load_problems_rh restores the no-intervention setup -> 8/8 hacks at step 0. probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack >=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on >=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05). Journal entry flags methodological caveat: v_hack from NLL contrastive gradient is not the GRPO policy gradient; if T4 fails, fallback is to re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:04:55 +00:00
wassname	87a2b48784	G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.	2026-05-24 05:03:04 +00:00
wassname	973b9407b5	grader bug fix + ref reward semantics + Qwen3-4B substrate Three independent issues that together made every prior `gt=0` measurement bogus and the H4 hypothesis untestable: 1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)` producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False regardless of correctness. Fixed by joining tests verbatim. 2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)` default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes 0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run) uses these defaults; ours was effectively the run_rl_baseline control. 3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions. beta=1e-3 (was 0.04) per reference config.py:135. Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt); token-efficient logging (loguru single-char icons through tqdm.write, verbose log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side; new RESEARCH_JOURNAL.md. First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode. 200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps): extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:36:00 +00:00
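
Note on the projection commits (235b51399f, f487e67405): the rank-k one-sided gate and the signed cos_pre/cos_post metric they describe can be sketched roughly as below. This is a minimal NumPy illustration, not the repo's `proj.py`. The function names `project_one_sided` and `signed_cos`, the toy shapes, and the QR-built basis are all assumptions made for the example, and it assumes the rows of `V` are orthonormal (as right singular vectors from an SVD would be).

```python
import numpy as np


def signed_cos(V: np.ndarray, g: np.ndarray) -> float:
    """Signed alignment sum_i <v_i, g> / ||g||, as worded in the commit log."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return 0.0
    return float((V @ g).sum() / norm)


def project_one_sided(V: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Subtract the component along each axis v_i only where c_i = <g, v_i> > 0.

    V has shape [k, r] with orthonormal rows (assumed); g has shape [r].
    """
    c = V @ g                       # per-axis coefficients c_i
    c_pos = np.clip(c, 0.0, None)   # zero out axes where c_i <= 0 (left alone)
    return g - V.T @ c_pos          # remove only the positive-aligned motion


# Hypothetical toy setup: k=3 orthonormal axes in r=8 dimensions.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 3)))
V = Q.T
g = rng.standard_normal(8)

g_proj = project_one_sided(V, g)
cos_pre = signed_cos(V, g)
cos_post = signed_cos(V, g_proj)
```

Under the orthonormality assumption, every coefficient of `g_proj` along `V` is `min(c_i, 0)`, so `cos_post` can never be positive. That is consistent with the log's remark that the signed metric goes negative after projection, where an absolute-value norm would have hidden the sign.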


85 Commits