evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	e5b68acf69	diag(#40): z-norm scores within family, winsorized 2-threshold Otsu zones, --feats offline mode Synthetic and live scores share an axis with meaningful zero (raw scores carry a common <mu,v> offset since v = mean diff is not orthogonal to the family mean). Zones come from label-free online stats (EMA mean/std + Otsu valley), replacing mean+k*sd which placed both cuts beyond every distribution. Winsorize at 1/99% before Otsu: variance-maximizing cuts otherwise buy a class for one outlier. Fresh-eyes review verified z-norm is affine (AUROCs unchanged), zones label-free, P/R recompute from parquet matches titles. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:23:42 +00:00
wassname	bbf6a97151	diag(#40): fix 3x2 plot -- keep bottom spine at axis edge (tick labels collided with rugs), skip labels on narrow zones Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:47:57 +00:00
wassname	0660e7bdd3	diag(#40): residual-stream rep (layers 12/18/24) as third gate-score candidate, 3x2 panel Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:45:50 +00:00
wassname	49ca51b6e1	diag(#40): review fixes -- A>0 contrast headline (adv-only baseline 0.90 vs-all / 0.58 A>0), headline=behavior_ training default, n at P@rout, rug strips, drop density ticks Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:22:08 +00:00
wassname	2f5f93f4fb	diag(#40): Q2 rewrite -- {grad,act}x{cos,dot} scores vs exploited&A>0 label, 3-zone pinning, pairset table Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:13:06 +00:00
wassname	70697ff36e	diag(#40): pinning plot splits solve/fail/hack + per-pairset AUROC ranking Q4 fix: on-policy "solve" was ~exploited = solves+fails (mostly fails). Split by gt_pass into solve/fail/hack (live: 103 hack / 27 solve / 62 fail). Per-pairset ranking: build v_grad from each heading-prefix subset, re-project the SAME stored live c-grads (no model re-run). Finding: behavior pairs AUROC 0.69 vs all-in-one 0.53; reasoning/opportunity anti-aligned (<0.5) -> mixing dilutes. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 06:16:27 +00:00
wassname	979daf84fd	feat(#30): mean+kstd online gate replaces fixed quantile; always-show route cols Gate calibration: route by live mean + route_std_mid/route_std_rout std of the pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail that genuinely exceeds the spread routes, so qmass tracks real separation instead of a forced fraction. The authored absolute band is mis-placed (live pos sits far below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack). tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so arm tables line up. Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band) and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 02:56:07 +00:00
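
The mean+kstd gate described in 979daf84fd (route by the live mean plus a multiple of the live std, instead of a fixed quantile tail) might look roughly like the sketch below. This is an illustration only, not the repository's code: the class name `EmaGate`, the `decay` value, and the parameters `k_mid`/`k_rout` (stand-ins for `route_std_mid`/`route_std_rout`) are assumptions.

```python
import numpy as np

class EmaGate:
    """Hypothetical mean + k*std gate over a stream of score batches."""

    def __init__(self, k_mid=1.0, k_rout=2.0, decay=0.9):
        self.k_mid, self.k_rout, self.decay = k_mid, k_rout, decay
        self.mean = None  # EMA of the batch mean
        self.var = None   # EMA of the batch variance (simplification:
                          # ignores drift of the mean between batches)

    def update(self, scores):
        scores = np.asarray(scores, dtype=float)
        m, v = scores.mean(), scores.var()
        if self.mean is None:
            self.mean, self.var = m, v
        else:
            d = self.decay
            self.mean = d * self.mean + (1 - d) * m
            self.var = d * self.var + (1 - d) * v

    def zones(self, scores):
        """Return 0 (keep), 1 (mid) or 2 (route) for each score."""
        std = np.sqrt(self.var)
        cuts = [self.mean + self.k_mid * std, self.mean + self.k_rout * std]
        return np.digitize(np.asarray(scores, dtype=float), cuts)

gate = EmaGate(k_mid=1.0, k_rout=2.0)
gate.update([0.0, 0.0, 2.0, 2.0])        # mean 1, std 1 -> cuts at 2 and 3
print(gate.zones([0.0, 2.5, 5.0]))
```

Because the cuts move with the observed spread, a batch with no genuine tail routes nothing, which is the "self-silencing" behaviour the commit message refers to.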

7 Commits
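
Two claims in e5b68acf69 can be checked with a small sketch: z-normalization is affine with positive scale, so it leaves AUROC unchanged, and winsorizing at 1/99% caps a single outlier before any variance-based thresholding. The helper names `winsorize`, `znorm`, and `auroc` are hypothetical, the data is made up, and the Otsu step itself is not reproduced here.

```python
import numpy as np

def winsorize(x, lo=1.0, hi=99.0):
    """Clip values to the [lo, hi] percentile range of x."""
    x = np.asarray(x, dtype=float)
    a, b = np.percentile(x, [lo, hi])
    return np.clip(x, a, b)

def znorm(x):
    """Affine rescale to zero mean and unit std (here: one family of scores)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def auroc(scores, labels):
    """P(random positive outscores random negative); ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Made-up scores for one family; labels mark the positive class.
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
labels = np.array([0, 0, 1, 1, 1, 0])
print(auroc(scores, labels), auroc(znorm(scores), labels))  # same ranking

# One extreme outlier is pulled back to the 99th percentile.
x = np.arange(101.0)
x[-1] = 1e6
print(winsorize(x).max())
```

Since AUROC depends only on the ordering of scores, any per-family shift and positive rescale gives the same value within that family.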