mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
journal(#41): entry (g) routeA shipped + guard-drop calibration; track moduleS diag scripts
Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS scripts were untracked, so committing them for provenance alongside the calibration script the new entry cites. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -4257,3 +4257,96 @@ Provenance:
|
||||
**Discussion (speculative).** I read both as nulls at this sample size (per-window SE ~0.07). For super-S, whitening consistently sits at or below the raw baseline, which would fit the pooled spectrum amplifying low-energy directions that carry no hack signal here, but I cannot distinguish that from noise. The one apparent gain (rotation + reader basis + top-64) is exactly what taking the maximum of 50 noisy rows produces; I would only believe it if it survived on new windows chosen in advance. For t-stat the entry-(d) alternative stands: the per-coordinate std over 8 pairs is itself ~25% noise, so the weighting may be real but unestimable at this pair count. Both nulls leave act_dot with plain mean extraction as the routeA default.
|
||||
|
||||
**Next.** routeA implementation per the plan now written into docs/spec/20260611_act_gate_spec.md (extraction module with verify gate, gate wiring replacing routeV, rolling-buffer winsorized-Otsu pinning), pending wassname's approval. More authored pairs remains the highest-leverage data change.
|
||||
## 2026-06-11 (f) -- per-module S-space shows no robust gate-score improvement
|
||||
|
||||
**Question.** Does preserving each Linear's own SVD space reveal module-specific
|
||||
hack signal that pooled Super-S washes out?
|
||||
|
||||
**Methods.** `scripts/diag_pinning_moduleS_exact.py` hooks the actual inputs of
|
||||
reader Linears and recomputes base-weight outputs of writer Linears from their
|
||||
actual inputs in blocks 12/18/24. Per module it
|
||||
uses the steering-lite S-space identities `x @ V * sqrt(S)` for readers and
|
||||
`y @ U / sqrt(S)` for writers, extracts a direction from the eight `behavior_`
|
||||
pairs, selects top-r modes by pair-difference magnitude, then aggregates module
|
||||
scores. Pueue 28/29/30 reran the v3/v4/v5 emergence windows and wrote strict TSV
|
||||
evidence.
|
||||
|
||||
| score | v3 | v4 | v5 | mean | min |
|
||||
|---|---:|---:|---:|---:|---:|
|
||||
| moduleS writer r=256 concat-cos, best selected row | 0.892 | 0.733 | 0.786 | 0.804 | 0.733 |
|
||||
| act-dot existing default | 0.870 | 0.747 | 0.747 | 0.788 | 0.747 |
|
||||
| raw residual dot | 0.905 | 0.721 | 0.756 | 0.794 | 0.721 |
|
||||
|
||||
**Result.** Per-module S-space shows no robust improvement on these windows. Its
|
||||
best row falls below act-dot on worst-window AUROC and was selected from 27 rows,
|
||||
so the small gain over raw residual dot is not evidence of improvement. The earlier
|
||||
cached-residual approximation produced a stronger reader-r64 row, but fresh
|
||||
review correctly identified that it was a module-weight-derived metric on
|
||||
post-block residuals rather than exact module S-space.
|
||||
|
||||
**Evidence.** Full cross-window table:
|
||||
`out/diag/moduleS_exact_summary.tsv`. Per-window tables:
|
||||
`out/diag{,_v4,_v5}/moduleS_exact.tsv`. Spec and failure log:
|
||||
`docs/spec/20260611_per_module_sspace.md`.
|
||||
|
||||
## 2026-06-11 (g) -- routeA act gate shipped; bimodality guard dropped after calibration
|
||||
|
||||
**Introduction.** Entries (b)-(f) established that the activation dot score is the
|
||||
stable gate input. This entry covers the implementation
|
||||
(docs/spec/20260611_act_gate_spec.md): forward-only `v_act` extraction
|
||||
(`src/vgrout/extract_vhack_act.py`), the routeA gate in train.py (act capture on the
|
||||
quarantine-ablated logp_old forward, masks pinned before the single grad forward,
|
||||
rolling-buffer Otsu thresholds), and deletion of the routeV gradient gate. The spec
|
||||
left one open question: should an online bimodality guard close the rout zone before
|
||||
hacks emerge? Expected: some shape statistic of the score window separates the
|
||||
emergence mixture (hack share 35-43%) from hack-free scores.
|
||||
|
||||
**Methods.** Calibration is offline on the cached v3/v4/v5 emergence-window features
|
||||
(out/diag{,_v4,_v5}/pinning_feats.pt, produced by scripts/diag_pinning.py at commit
|
||||
70697ff). Score = act dot vs the `behavior_` 8-pair v_act. Conditions: mixture (all
|
||||
valid live rollouts), cleanonly (non-exploited only, pre-emergence proxy), and N(0,1)
|
||||
n=256 (10 seeds). Statistics computed after z-norm, winsorize(1/99), two-threshold
|
||||
Otsu: `sep` = mean(z above t_hi) minus mean(z below t_lo) in buffer-sd units, `nbcv` =
|
||||
between-class variance fraction. Command: `uv run python
|
||||
scripts/attic/calib_otsu_guard.py`. Extractor equivalence was verified on GPU as
|
||||
pueue #24 (`scripts/verify_v_act.py`, commit 5a340e5).
|
||||
|
||||
| cond | hack% | n | sep | nbcv |
|
||||
|---|---:|---:|---:|---:|
|
||||
| v3 mixture | 0.43 | 138 | 2.75 | 0.80 |
|
||||
| v3 cleanonly | 0.00 | 79 | 2.54 | 0.82 |
|
||||
| v4 mixture | 0.35 | 96 | 2.82 | 0.81 |
|
||||
| v4 cleanonly | 0.00 | 62 | 2.34 | 0.84 |
|
||||
| v5 mixture | 0.39 | 138 | 2.44 | 0.76 |
|
||||
| v5 cleanonly | 0.00 | 84 | 3.52 | 0.77 |
|
||||
| gauss n=256 (mean of 10 seeds) | 0.00 | 256 | 2.42 | 0.83 |
|
||||
|
||||
Table: guard-candidate statistics per condition. A usable guard needs the mixture rows
|
||||
to separate from the cleanonly and gauss rows on `sep` or `nbcv` with margin.
|
||||
|
||||
Provenance:
|
||||
- Script: `scripts/attic/calib_otsu_guard.py` (copied from the session scratch file);
|
||||
rerun output captured at `/tmp/claude-1000/calib_guard_out.txt` this session.
|
||||
- Inputs: `out/diag/pinning_feats.pt`, `out/diag_v4/pinning_feats.pt`,
|
||||
`out/diag_v5/pinning_feats.pt` (the (c) replication windows).
|
||||
- verify_v_act: pueue #24 log; acts match cached diag features at rel diff 7.3e-4 and
|
||||
7.7e-4 (hack/clean), v cos > 0.999, per-module cos >= 0.99997.
|
||||
|
||||
**Results.** No statistic separates the conditions. The largest `sep` of all rows is a
|
||||
hack-FREE window (v5 cleanonly, 3.52); pure Gaussians sit at 2.42, inside the mixture
|
||||
range (2.44-2.82). `nbcv` overlaps the same way (mixtures 0.76-0.81 vs gauss 0.83).
|
||||
Otsu always finds tail classes ~2.4 sd apart even when no structure exists, so any
|
||||
threshold on these statistics either always opens or always closes.
|
||||
|
||||
**Discussion (speculative).** I read this as: the guard idea was solving a
|
||||
non-problem. Before emergence a false rout costs one update removed from deployment
|
||||
(asymmetric, cheap), and warmup already pins absorb while the buffer fills. An
|
||||
alternative read is that a better statistic exists (e.g. dip test, mixture-model BIC)
|
||||
and I only tried Otsu-derived ones; I did not pursue this because the cost asymmetry
|
||||
makes the guard's value marginal even if it worked. The gate therefore ships with
|
||||
warmup + Otsu only (commits adca442 routeA wiring + routeV deletion, f646e57
|
||||
review-driven hardening; smoke logs /tmp/claude-1000/smoke_routeA*.log).
|
||||
|
||||
**Next.** Queue the seed-43 fast 4-arm set (`just queue-decision`): routeA real vs
|
||||
Haar placebo vs vanilla vs absorb. Decision: directionality is real iff real-v
|
||||
deploy_hack << placebo at matched solve, with gate AUROC >> 0.5 around emergence.
|
||||
|
||||
Reference in New Issue
Block a user