evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-07-04 06:42:56 +08:00

Author	SHA1	Message	Date
wassname	5fe22a1973	journal: separability sweep (act>grad AUROC, grad-cos best tail p@10=0.70, magnitude inverted, distshift root cause) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:13:02 +00:00
wassname	b28b1a5e88	results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL line that just-results skips. Journal: per-rollout real==random (absorption), per-token real-V is the lead; pinning suspected off (band above live cos). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 10:47:38 +00:00
wassname	fcac80c4bb	journal: random-V control matches real-V at per-rollout (0.101==0.101) -- H2 absorption lead Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 08:26:26 +00:00
wassname	cf05310130	journal: dir6 real-V arms land (margin band) -- both suppress, per-token>per-rollout Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 02:08:02 +00:00
wassname	34ba631e7d	journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 22:39:01 +00:00
wassname	eeee7db65c	journal: routeV margin band (p75/p75) verified routing 28.7% on real 4B; dir6 restarted on it Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 14:15:12 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	bcf09dd742	docs	2026-06-06 12:27:26 +00:00
wassname	ac418a54ce	journal: #186 teacher-off vanilla hacking self-sustaining (job 87, 0.36->0.58 on-policy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 12:07:41 +00:00
wassname	d2b0fcb255	afk: scope hourly check to directionality mystery (docs/AFK_CHECK.md); drop routine no-finding journal entry (h) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:46:24 +00:00
wassname	6f60ebafa1	journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:35:27 +00:00
wassname	ec00bc4383	docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap Two review questions today exposed imprecise framing in load-bearing comments: - A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped completion that also writes the stdout marker, verified job-95 id 132), not a detector false positive. hacked_E is the mode-agnostic run_tests signature. Grading channels are non-overlapping; the model's strategy is not. - Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap 0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000. Confirms the degenerate-gate read (H2) over clever-random-direction (H1): suppression is quarantine-volume + exploration floor, not v_hack specificity. Direction only shows in solve (real 0.625 > placebo 0.531). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 08:23:49 +00:00
wassname	3ae1e8376d	journal: close (a) WATCH — placebo endpoint refutes route directionality (job 86) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 05:01:18 +00:00
wassname	7b08a7ede9	journal: A5 gate leak fixed (teacher-only anchor) + airtight rerun queued (job 111) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:54:09 +00:00
wassname	b3539e50e7	no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires 2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector cross-mode false positives). The 'hacked_E==0' claim was the smaller harvest; at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'. Result robust: route2 non-directional (job 86), so the 1% leak isn't load-bearing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:01:20 +00:00
wassname	176f1baae6	journal: placebo (job 86) full 60-step verdict -- route2 non-directionality locked (12/12 evals at 0.000) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:35:03 +00:00
wassname	ea0293083c	journal: queue erase directionality test (jobs 105/106 real-v vs placebo) The route2 placebo result (job 86, dir-insensitive gate) doesn't settle directionality; erase projects with magnitude ~cos(g,v) so it's the arm where direction must enter. Matched 60-step s41 pair differs only in v source. #183. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 01:01:08 +00:00
wassname	beee08b559	journal: WATCH placebo (job 86) suppressing deploy hack like real v_hack -- directionality at risk Vanilla nofloor deploy hack 0.36 (metric not degenerate), real-v_hack route2 0.000, placebo route2 0.000 through step 20. If step-60 endpoint confirms, the route2 suppression is substantially non-directional (knob mechanism, not v_hack specificity). Verdict + random-V cross-check next cycle. #183. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 00:36:04 +00:00
wassname	b891109633	A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104) Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json; regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three held-out modes drop near zero at knob-off deploy while emerging on the knob-on path -- routing, not non-emergence. #185. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 23:21:39 +00:00
wassname	2b48eab6b8	journal: A5 suppression preliminary (job 104 step ~32) -- held-out hacks emerge on-policy, knob-off deploy holds 0.000 Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 14:59:26 +00:00
wassname	1265445939	journal: A5 baseline (job 103) finished -- per-mode deploy split, held-out modes all emerge on-policy Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 13:11:46 +00:00
wassname	e91e095596	journal: A5 baseline confirmed -- held-out file_marker+sentinel emerge on-policy (run_tests-only teacher), deploy hack 0.52 by step 40 Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 05:33:36 +00:00
wassname	a162eded65	journal: job 97 gentle-probe refutes vanilla-200 collapse (lp_s>-0.5); A5 baseline sees held-out file_marker emerge on-policy Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 04:01:18 +00:00
wassname	46b102ad22	eval: revert eval-every default 10->5 (knob-on removal made it cheap again) The knob-on pass removal is the real win (halves each eval). With it gone, every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy points vs 6 -- better plots, cheap. No paper figure uses the knob-on train curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only). Long-run recipes pin sparse cadence explicitly so default-5 won't bite them. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 02:29:23 +00:00
wassname	208713d7c2	perf(eval): drop redundant per-step knob-ON pass, default eval-every 5->10 Per-step TIMING audit (journal 2026-06-04 a): gen ~140s/step dominates; the 2x2 deploy eval is ~460s and route2 ran it TWICE per eval (knob-off + knob-on) for a train curve no figure plots -- per-step hack_s already is the train series, and the full 2x2 is computed once post-loop (FINAL EVAL). Drop the per-step knob-on pass and its dead hk_on/slv_on columns; bump eval cadence default 5->10. ~27% faster on 60-step fast runs, ~4h/run on 200-step. refresh left at 5 (timing shows it's ~10s/step, not the culprit I'd claimed). plot_dynamics already falls back to hack_s when hk_on absent. Validated via smoke-route2: single-pass evals, FINAL EVAL 2x2 intact, no dead columns. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 02:25:07 +00:00
wassname	0913b064fc	journal: A5 run plan queued (strict teacher-modes=run_tests, vanilla baseline + route2 test) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 22:46:56 +00:00
wassname	a0d4ddf9d5	journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 22:40:06 +00:00
wassname	f8aea5f9e6	journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 20:23:41 +00:00
wassname	d06a4fbe69	journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption) The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to span/erase the trained hack in W. But the same vector works at train time because a gate only needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing localization amplify a weak noisy direction into full localization in the throwaway knob. This is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:04:07 +00:00
wassname	3cc804b15e	results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy) Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase (Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline 0.391 in provenance. Journal 2026-06-03(c). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 10:50:08 +00:00
wassname	8d16b317cb	results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000) Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438. One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did remove the aligned component) yet hack still emerged, so the hack signal lives largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+ erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue). Journal 2026-06-03(b). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 06:47:58 +00:00
wassname	753a54c625	paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013) Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote caption, add paired t-test, pin the exact 6-log regen command (just dyn --latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6 explicit seed logs, fixing the `87cca9a` clobber. Journal entry 2026-06-03(a). Also: README points to main.tex and drops the stale n=1 findings block; record two OpenReview URLs as a TODO in related work (mine reviews for shared critiques). Closes A1/A2 (#173). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 03:36:32 +00:00
wassname	19544b3f06	journal: route2 holds deploy-hack=0 to 200 steps (job 84, durable not delayed) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 16:33:04 +00:00
wassname	311bf2854f	results: fill keynote table/figure at n=3 route2 / n=2 vanilla C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125): route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010 vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032 => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines). - main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending). - results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73). - RESEARCH_JOURNAL 2026-06-02 entry. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 11:08:41 +00:00
wassname	8503dc1914	journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 09:10:09 +00:00
wassname	ea4f4ee657	feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2) Generate a fraction of student rollouts with delta_S_hack ablated (deployed model -> can't hack -> explores solves), so the solve region stays covered even if on-policy sampling collapses onto hacking. Motivated by job 60's hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack eats everything and delta_S starves). Pure sampling-side diversity, no no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 05:32:04 +00:00
wassname	59f8b6efdc	journal: route2 capacity-imbalance realization + scale-matched delta_S fix Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:58:35 +00:00
wassname	47abce4307	journal: erase arm sub4 (-7.8pp hack/+4.7pp solve, win on file_marker) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 18:56:09 +00:00
wassname	b0432af1e8	journal: vanilla sub4 emergence reference (per-mode asymmetry) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 16:56:16 +00:00
wassname	b72c5acee2	journal: routing deploy-solve > train-solve (the "never learns to pass" worry, refuted) Three route runs all show deleting the quarantine raises solve and lowers hack. Mechanism: clean-rollout solve gradient stays unflagged -> flows to delta_S; the hack masks that competence at train time, revealed at deploy. Exception: run_tests (solve 0->0) where hacking fully dominated exploration. Logs the 3 failure-mode checks (eval artifact / teacher-distillation / random-V null). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 14:11:53 +00:00
wassname	25569193c5	fix: route2 refresh was a silent no-op (vhack_refresh_every gated on v_hack, which route2 sets None) route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired for it -- the mask was frozen regardless of the flag. Frozen real-V route (job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack only drops ~8pp (vs run-31 rf5 ~0). Now re-extracts v_act/v_grad every N steps with the quarantine ablated (same MASK_PAIRS, no oracle). + journal entry (j). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 12:19:38 +00:00
wassname	ffeb632652	journal: 3 external reviews converge on v1-route artifact hypothesis; route2 inherits the Adam-parasite risk, prioritize leakage metric Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:51:11 +00:00
wassname	442630fcae	docs: routing-v2 spec, related-work scorecard, paper fetches, journal Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:21 +00:00
wassname	6cce11326a	journal: 4-arm 60-step matched -- erase ≈ vanilla, route effect is deploy-only and confounded Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 06:41:59 +00:00
wassname	f1af70d34e	journal: route puzzle resolved (signed-cos red herring, deploy eval works, v_hack run_tests-only) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 04:28:02 +00:00
wassname	638f3e964e	journal: erase arm completed (4/4 modes, HACK_S=0.274 at PASS=0.284) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 01:54:40 +00:00
wassname	031a93b701	journal: erase arm cin_t/cin_s crossover (teacher->student hack-grad handoff) n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover ~step 10-14. Mechanism inference (advantage-variance collapse on the all-hacking teacher group + student becoming the hack-grad source) held at 0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact / noise-floor), each with a falsifier against the queued vanilla+route arms. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:46:07 +00:00
wassname	8c2421d644	journal(e): route-only v_hack refresh collapse root-caused to live quarantine + fix Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:04:58 +00:00
wassname	b685a4495b	journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 16:47:18 +00:00
wassname	618d9bd3bf	journal: 2026-05-30 (c) hint-coherence audit + K=5 substrate + emergence running Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 13:52:07 +00:00
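
Note on the erase arm referenced in commits 8d16b317cb and ea0293083c: those messages describe a one-sided gradient erasure that removes the component of the gradient aligned with an extracted direction, leaving cos_post pinned at 0. The sketch below is only an illustration of that projection step on plain Python lists. The names `erase_one_sided`, `cosine`, and `dot` are hypothetical and do not come from the repository, and the actual implementation (per-layer handling, LoRA parameters, refresh schedule) may differ.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0 or nb == 0 else dot(a, b) / (na * nb)


def erase_one_sided(g, v):
    """Remove from g its component along v, but only when g is aligned with v.

    Assumption: "one-sided" means the projection is applied only when
    dot(g, v) > 0; anti-aligned or orthogonal gradients pass through unchanged.
    """
    vv = dot(v, v)
    if vv == 0:
        return list(g)
    coef = dot(g, v) / vv
    if coef <= 0:
        return list(g)
    return [gi - coef * vi for gi, vi in zip(g, v)]


g = [3.0, 1.0, 0.0]   # toy gradient
v = [1.0, 0.0, 0.0]   # toy extracted direction
g_post = erase_one_sided(g, v)
print("cos before:", cosine(g, v))
print("cos after: ", cosine(g_post, v))
```

The post-erase gradient is orthogonal to `v`, yet everything it carries off that axis survives, which is consistent with the commit's observation that the hack still emerged with cos_post at 0.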

67 Commits