evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

Author	SHA1	Message	Date
wassname	5de7433ca4	spec: code-review-2 resolution (oracle robustness fixes) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:49 +00:00
wassname	cf5f4861db	rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs: - sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit. - JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe builtins and use baseline Python == (custom-typed operand = eq_override -> reject). - defs-only dropped honest top-level constants -> exec full src, keep state. verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:24 +00:00
wassname	c38c855e8a	spec: implementation status + plan-review-1 resolution (3-mode honest count) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:40:59 +00:00
wassname	fc46f690f5	spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:16:24 +00:00
wassname	8a5738c69a	spec: reject expose-K, design faithful multi-loophole env expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base / no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted). New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to be ripped out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:10:28 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	969c724d9d	docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass out/ paths as literal args, so the data move + code-path edits run atomically when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:43:10 +00:00
wassname	f917670994	feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:52:14 +00:00
wassname	fc30514b23	feat: T5 eval-time ablation for route + fix route deployment invariant T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:50:53 +00:00
wassname	d6342ab201	feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route} Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init). intervention=route parks the hack-ward grad component (g - cV to delta_S, cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack. - proj.py: route flag splits the grad (overshoot=1, no rescale -> the split sums to g, so the training forward still moves hack-ward; route ⊇ erase). - antipasto.py: second trainable knob, identity preserved at init. - train.py: arm -> intervention {none,erase,route}; arm kept as a derived display name so run-id/BLUF/results.py/plot classify are unchanged. opt steps both knobs (hack knob grad=None under none/erase -> AdamW skips it, so erase reproduces old `projected` bit-for-bit, R4). R3 span assert (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0). - results.py / plot_dynamics.py: read arm from the preset line (covers both old --arm and new --intervention logs); plot classifies `routing`. smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64 archived projected rows still parse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:31:30 +00:00
wassname	c1f8ca4e7b	tidy	2026-05-29 06:29:43 +00:00
wassname	f27c658ca9	docs	2026-05-29 05:42:28 +00:00
wassname	f70743c9e9	wip	2026-05-28 12:44:20 +00:00
wassname	5f196e3108	v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin Extraction (extract_vhack_grad.py): - Default top_k=12 (was 5), saves singular values S as _sv/{name} keys - SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile) - Pulled extract_v_hack() into a callable function for in-process reuse - Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched) Loading (train.py:load_v_hack): - Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict - k_use slicing at load: extract at k=12, ablate k=1..12 by config flip - Auto-extract on cache miss using already-wrapped model (no second model load) - Default path derived from model_slug + extract_top_k Runtime suspicion gate (proj.py:project_delta_S_grad): - Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||) (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||) - Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25) - Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file) Per-source cin (proj.py:mean_cin_from_grads + train.py loss split): - Per-prompt: backward student loss + teacher loss separately with retain_graph - step_grad_s + step_grad_t = combined grad (linearity); used for projection - cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack" Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan) Codex external review: docs/spec/20260527_code_review.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 06:39:05 +00:00
wassname	e04548987f	spec2 + base_pool generator + slim replay save (partial mixed-replay TODO) spec2.md records: - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed) - Phase 2: mixed-replay GRPO probe, partial impl - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale. probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:48:48 +00:00
wassname	195b55cc28	spec: reject T5 mixed-policy design after external review Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0), frac_clipped not ratio_mean is the saturation diagnostic, mixed-policy can produce gradient AWAY from hacking when teacher-half has zero adv variance, and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO. User instruction reinforces: no mixed policy. Stay with hacky teacher + student NLL distill (existing Phase 1 pipeline, UAT 4/4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:26:33 +00:00
wassname	2a21fbc49c	spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7 R1-R4 (Phase 1) marked done with evidence pointers to out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/. R5 = GRPO trajectory probe (mixed-policy generator to restore reward variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive v_hack re-extraction (fallback only). Errors table records the two diagnosis/fix loops from Phase 1: the prompt-distribution mismatch and the zero-advantage skip. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:22:19 +00:00
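
Two of the mechanisms named in the messages above can be sketched in a few lines: the erase/route gradient split from d6342ab201 (g - cV versus cV) and the suspicion ratio r_i = (|c_i|/||g||) / (S_i/||S||) from 5f196e3108. This is a minimal pure-Python illustration, not the code in proj.py. It assumes the rows of V are orthonormal, the function names (`split_grad`, `suspicion_ratios`, `keep_mask`) are hypothetical, and the count-based drop stands in for whatever quantile rule the repository actually uses. It shows the plain split only; per fc30514b23, the deployed route arm feeds delta_S the same projected gradient as erase, so the "split sums to g" identity should not be read as a property of the current code.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_grad(g, V):
    """Split g against the (assumed orthonormal) rows of V.

    Returns (kept, parked, coeffs) where coeffs[i] = <V[i], g>,
    parked = sum_i coeffs[i] * V[i] and kept = g - parked.
    In this sketch, erase would use `kept` alone, while route would
    send `kept` to delta_S and `parked` to delta_S_hack.
    """
    coeffs = [dot(v, g) for v in V]
    parked = [sum(c * v[j] for c, v in zip(coeffs, V)) for j in range(len(g))]
    kept = [gj - pj for gj, pj in zip(g, parked)]
    return kept, parked, coeffs


def suspicion_ratios(coeffs, g, S):
    """r_i = (|c_i| / ||g||) / (S_i / ||S||), one value per axis."""
    g_norm = math.sqrt(dot(g, g))
    s_norm = math.sqrt(dot(S, S))
    return [(abs(c) / g_norm) / (s / s_norm) for c, s in zip(coeffs, S)]


def keep_mask(ratios, drop_frac):
    """Mark the top drop_frac of axes by ratio as dropped (False).

    Assumption: a simple count-based cut, floor(len * drop_frac).
    """
    n_drop = int(len(ratios) * drop_frac)
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(ratios))]


# Toy inputs, chosen for easy arithmetic (not taken from any run).
g = [3.0, 4.0, 12.0]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
S = [4.0, 3.0]

kept, parked, coeffs = split_grad(g, V)
ratios = suspicion_ratios(coeffs, g, S)
mask = keep_mask(ratios, drop_frac=0.5)

print("kept  :", kept)    # [0.0, 0.0, 12.0]
print("parked:", parked)  # [3.0, 4.0, 0.0]
print("ratios:", ratios)
print("mask  :", mask)    # [True, False]
```

With these toy inputs the kept part is orthogonal to both rows of V, which is the property the R3 span assert checks in spirit. The second axis has the larger ratio because its coefficient is large relative to its singular value, so it is the one the gate marks.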

17 Commits