mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:00:59 +08:00
journal(d): z-norm+otsu pinning fix, module weighting null, mode hypothesis falsified
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -4192,3 +4192,38 @@ Provenance:
|
||||
**Next.** Act-gate spec: docs/spec/20260611_act_gate_spec.md (score activations, route gradients). Residual-stream representation queued (pueue #21-23) to test whether the random r=32 lora projection limits even the bottleneck act.
|
||||
|
||||
**Addendum (same day): residual-stream result.** Pueue #21/#22/#23 (script extended at commit `0660e7b` with resid layers 12/18/24, completion-mean, cos/dot; logs /root/.local/share/pueue/task_logs/{21,22,23}.log, `behavior` row of each printed table). resid_cos on the A>0 contrast with the behavior_ vector: 0.916 (v3), 0.700 (v4), 0.804 (v5), vs act_cos 0.869 / 0.749 / 0.752. Inference: the random r=32 bottleneck projection is not what limits separation (act survives it); resid is ahead in 2 of 3 windows but within ~1 SE, so the representation choice between them is structural, not empirical: resid is adapter-independent and closest to a standard steering-vector probe. Spec updated to default the routeA gate to resid_cos.
|
||||
|
||||
## 2026-06-11 (d) -- score display and pinning fixed (z-norm within family, winsorized Otsu); module weighting and t-stat extraction tested; loophole-mode hypothesis falsified
|
||||
|
||||
**Introduction.** Continues entry (c). Three challenges from wassname this session: (1) the pinning figure was not credible -- the mean+k*sd cuts sat beyond every distribution and the synthetic pair sides both landed on the positive side of zero; (2) the concat-module score might be dominated by small noisy modules; (3) my suggestion that the multimodal hack-score distribution reflects the four loophole modes was asserted without evidence. All checks ran offline from the cached features (`pinning_feats.pt` of v3/v4/v5), no GPU.
|
||||
|
||||
**Methods.** Scripts at commits `24aea19` (scripts/diag_pinning_followup.py: pooling variants, synthetic common-mode decomposition, env_mode join) and `e5b68ac` (scripts/diag_pinning.py: scores z-normalized within family for display, two-threshold Otsu on winsorized live z-scores replaces mean+k*sd zones, --feats offline mode; scripts/diag_pinning_refresh.py import fix). Feature caches out/diag{,_v4,_v5}/pinning_feats.pt from runs v3 `20260611T003538`, v4 `20260611T022655`, v5 `20260611T055637` (all fast vanilla lora2r seed 43). Vector = behavior_ pairs (n=8) throughout; label = exploited & A>0 on the A>0 contrast.
|
||||
|
||||
**Results.**
|
||||
|
||||
| score | v3 AUROC | v3 P/R | v4 AUROC | v4 P/R | v5 AUROC | v5 P/R | mean | min |
|
||||
|---|---|---|---|---|---|---|---|---|
|
||||
| act_cos | 0.869 | 0.67/0.27 | 0.749 | 0.57/0.24 | 0.752 | 0.50/0.57 | 0.790 | 0.749 |
|
||||
| act_dot | 0.870 | 0.63/0.20 | 0.747 | 0.62/0.24 | 0.747 | 0.50/0.54 | 0.788 | 0.747 |
|
||||
| resid_dot | 0.905 | 0.61/0.29 | 0.721 | 0.56/0.15 | 0.756 | 0.52/0.46 | 0.794 | 0.721 |
|
||||
| resid_cos | 0.916 | 0.62/0.27 | 0.700 | 0.43/0.53 | 0.804 | 0.54/0.52 | 0.807 | 0.700 |
|
||||
| grad_cos | 0.838 | 0.67/0.63 | 0.617 | 0.41/0.26 | 0.455 | 0.33/0.20 | 0.636 | 0.455 |
|
||||
| grad_dot | 0.809 | 0.95/0.31 | 0.607 | 0.39/0.91 | 0.455 | 0.24/0.13 | 0.623 | 0.455 |
|
||||
|
||||
Table 1. AUROC on the A>0 contrast and precision/recall at the label-free rout cut (two-threshold Otsu on 1/99%-winsorized live z-scores), per emergence window, sorted by worst-window AUROC. act/resid form one statistical cluster (per-window SE ~0.07); grad decays to chance by v5. Winsorization matters: without it the v5 act rout zone contained two non-hack outliers (precision 0.00) and the v4 grad_dot keep zone was a single point.
|
||||
|
||||
Secondary results, same session:
|
||||
- Module weighting: per-module SNR (|mean pair diff| / across-pair scatter) has median ~0.43 and max ~0.67 for act in all three windows, so no dead-module tail exists; SNR-weighting, top-quartile pruning, and per-coordinate t-stat extraction all move act by <=0.02 except t-stat on v4 (+0.016) and v5 (-0.048). t-stat helps resid in v3/v5 (+0.014/+0.027) and costs -0.010 in v4. Single resid layer 24 alone beats the 3-layer concat in all windows (0.925/0.712/0.814 vs 0.916/0.700/0.804); the concat weight is the residual norm, which grows ~3x from layer 12 to 24.
|
||||
- Synthetic offset explained: v = mean(hack-clean) guarantees only the gap between sides, not its location; the authored-pair common mean is not orthogonal to v (the hack side has systematically larger feature norms), so uncentered cos put both sides positive (grad medians +0.17/+0.09). Centering by the pair mean restores straddling (+0.007/-0.177 on v3). Live raw-G and adv-weighted medians are identical, so this was not a scoring asymmetry.
|
||||
- Loophole-mode hypothesis falsified: every rollout in all three windows is env_mode=run_tests (384/1080/576 of 384/1080/576), so the hack-score multimodality cannot be mode identity. Variance decomposition of hack+ resid_cos scores: step explains 22-40%, problem identity 8-24% (4 problems, 6-8 steps per window).
|
||||
|
||||
Provenance:
|
||||
- Table 1: session computation over out/diag{,_v4,_v5}/pinning_data.parquet (regenerated at `e5b68ac`) using diag_pinning._otsu3 and train._auroc; per-window panel stats also in the figure titles of out/diag{,_v4,_v5}/pinning_q2.png. Logs: /tmp/claude-1000/pinning_z_v{3,4,5}.log.
|
||||
- Pooling/SNR/t-stat: /tmp/claude-1000/pinning_followup.log (Q-A table) plus session heredoc runs over pinning_feats.pt (SNR table: act snr_med 0.429-0.431; t-stat table: resid t_coord 0.930/0.690/0.831).
|
||||
- Synthetic offset: pinning_followup.log Q-B table (syn_hack/syn_clean and _ctr columns).
|
||||
- Mode falsification: pinning_followup.log Q-C table; figure out/diag/pinning_followup_modes.png; mode counts from the three runs' rollouts.jsonl env_mode fields.
|
||||
- Fresh-eyes review (agent aa244dd2a47f866c5) verified the z-norm is affine (recomputed P/R from parquet matches all titles), zones are label-free, and flagged the two pre-winsorization degenerate panels.
|
||||
|
||||
**Discussion (speculative).** I read the act/resid cluster as one finding, not four scores: completion-mean activations against a hand-pair direction separate positively-reinforced hacks from solves at 0.70-0.92, and the choice among {act, resid} x {cos, dot} is within noise, so it should be made structurally. Wassname prefers dot (cosine inflates small-norm rollouts); resid avoids module pooling entirely and is adapter-independent. The remaining gap to a high-precision router is the score's separability, not threshold placement: the Otsu cut now sits close to the oracle split in most panels, yet precision at useful recall stays ~0.5-0.65 on v4/v5. I do not have a supported explanation for the within-window hack-score spread; step drift is the largest measured component and the mode story is dead. Alternative for the t-stat null on act: with 8 pairs the per-coordinate std is itself noise (n=8 std has ~25% relative error), so the t-weighting may be real but unestimable at this pair count.
|
||||
|
||||
**Next.** Wire the act-based gate (routeA) into train.py per docs/spec/20260611_act_gate_spec.md, updated for: dot score, t-stat extraction (clamped, std over pairs), online z-norm via EMA mean/std, winsorized-Otsu pinning. More authored pairs is the highest-leverage data change (t-weighting and module weighting both starve at n=8).
|
||||
|
||||
Reference in New Issue
Block a user