evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	4ee5c27f7b	docs: rewrite README for lora2r/three-arms (was SVD-delta_S/erase) Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 11:23:21 +00:00
wassname	61d3819dae	docs: README/figs name the current arm routeV, not the dropped route2 The cleanup removed the v1 route and route2 arms (Config is now none\|erase\|routeV) but left README calling the live arm route2 with its old binary-tau gate description. Rename to routeV, describe the banded cosine gate (per-rollout/per-token, per-token best), and fix the deploy line (held-out test n=119 knob-off, not n=64). figs.py keeps the route2/routing2 display map for historical run artifacts. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:39:15 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/ deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname (Michael J Clark)	826aa911f7	Update README.md	2026-06-08 07:48:19 +08:00
wassname	52619519dc	docs: drop dead refs (spec.md link, verify_gate_anchor.py paragraph) - spec.md never existed at root or docs/; removed the link from AGENTS.md + README.md (the live plan is in docs/spec/ dated files). - RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed. - Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py (that file doesn't exist); kept the general 'gate every load-bearing invariant in the same commit' rule. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	83cae4ef72	docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label. - AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents. - README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview). - spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:39:48 +00:00
wassname	03693e4f30	name the method vGROUT (vector gradient routing) - title: drop the "Quarantine ... Representation?" metaphor for "vGROUT: Vector Gradient Routing against Reward Hacking" - Method: add a two-phase definition (make v_hack; then erase=discard the component / route=redirect the gated gradient into a deletable adapter, deleted at deploy). Honest framing: route preserves (not discards); follows Shilov et al.'s post-backward deletable-block routing in the gradient-routing family, gated by an extracted direction not a per-example data label - strip literal "SGTM" from the body (confusing acronym); cite renders as author-year. README + pyproject describe vGROUT (package name unchanged)	2026-06-05 14:51:48 +08:00
wassname	753a54c625	paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013) Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote caption, add paired t-test, pin the exact 6-log regen command (just dyn --latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6 explicit seed logs, fixing the `87cca9a` clobber. Journal entry 2026-06-03(a). Also: README points to main.tex and drops the stale n=1 findings block; record two OpenReview URLs as a TODO in related work (mine reviews for shared critiques). Closes A1/A2 (#173). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 03:36:32 +00:00
wassname	19deef4fb9	docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots - blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 01:24:29 +00:00
wassname	f7288e569d	docs: 4-arm framing, weak-detector test, hack-mode appendix - blog: appendix with prompt+hint/hack/clean traces for all 4 loophole modes (run_tests/sentinel/stdout_marker/file_marker) - blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful extract pseudocode (per-completion zero_grad), erase+route step pseudocode, refresh rationale + route quarantine-ablate subtlety - blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the route arm's lineage - README: 'what we compare' section + appendix pointer - spec: weak-detector arm as the operationalized generalization test Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:20:47 +00:00
wassname	5d83adbb25	fix: correct the "18 vs 21 pair" basis claim (it was never about pair count) Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5, v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two bases differ on pairs AND directions-kept AND extract-tau simultaneously, so the hack-cut gap is triple-confounded, not a clean "pair set is the lever" result. Nothing was lost: the strong basis reproduces from current pairs.py via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts at k=12. Rewrites Q8 + the top confound bullet + the README findings caveat. A one-knob k-sweep is needed to attribute the gain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 10:12:12 +00:00
wassname	c1f8ca4e7b	tidy	2026-05-29 06:29:43 +00:00
wassname	f27c658ca9	docs	2026-05-29 05:42:28 +00:00
wassname	646edfc7af	purge dead modules and stale recipes Deletes 7 source files that were superseded but never removed: run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor), grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by train.py "smoke" subcommand), phase2_analyze.py (pilot is past), probe_uat.py (UAT pipeline is past). Drops matching justfile recipes (vhack-check, phase2-analyze, probe-uat) and the BASE constant that pointed at run.py. Updates AGENTS/README references to the stale fast-dev-run recipe (now just smoke / smoke-vanilla). Verified by running just smoke-vanilla --steps=2 end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 08:42:15 +00:00
wassname	f487e67405	Goal 0 milestone: fast preset learns to hack in ~10min This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total. Preset/Adam scheduling - New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops. - `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10. - `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable — `gn` column shows the clip was never the bottleneck). CLI restructure (tyro subcommands) - Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack. - Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`. - CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=). - `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field. Logging refactor - New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`). - Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time. - `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_). - `lr` column added (current scheduled LR through warmup + cosine). - Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived. cin/cout -> cos_pre/cos_post + signed - Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing. - Metric is now signed: sum(V @ g) / \|\|g\|\| instead of \|\|V @ g\|\| / \|\|g\|\|. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack) — was hidden by the absolute-value norm. v_hack extraction framing - README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 03:22:36 +00:00
wassname	41817d2a08	README: add plain-language "How it works" section Walk through the method from the start, in the user's voice, without AI tells: ablate hack direction from gradient on each update; extract via twin NLL on hand-paired completions, SVD the diff; work in delta_S space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log cin/cout and cin_t vs cin_s as the empirical sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:39:19 +00:00
wassname	973b9407b5	grader bug fix + ref reward semantics + Qwen3-4B substrate Three independent issues that together made every prior `gt=0` measurement bogus and the H4 hypothesis untestable: 1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)` producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False regardless of correctness. Fixed by joining tests verbatim. 2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)` default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes 0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run) uses these defaults; ours was effectively the run_rl_baseline control. 3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions. beta=1e-3 (was 0.04) per reference config.py:135. Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt); token-efficient logging (loguru single-char icons through tqdm.write, verbose log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side; new RESEARCH_JOURNAL.md. First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode. 200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps): extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:36:00 +00:00
wassname	120400c5f5	setup	2026-05-23 10:40:02 +08:00

18 Commits