journal(#41): entry (g) routeA shipped + guard-drop calibration; track moduleS diag scripts

Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS scripts were untracked, so this commit adds them for provenance alongside the calibration script that the new entry cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-11 12:50:07 +00:00
parent f646e57028
commit 8000aa48f4
4 changed files with 560 additions and 0 deletions
@@ -4257,3 +4257,96 @@ Provenance:
 **Discussion (speculative).** I read both as nulls at this sample size (per-window SE ~0.07). For super-S, whitening consistently sits at or below the raw baseline, which would fit the pooled spectrum amplifying low-energy directions that carry no hack signal here, but I cannot distinguish that from noise. The one apparent gain (rotation + reader basis + top-64) is exactly what taking the maximum of 50 noisy rows produces; I would only believe it if it survived on new windows chosen in advance. For t-stat the entry-(d) alternative stands: the per-coordinate std over 8 pairs is itself ~25% noise, so the weighting may be real but unestimable at this pair count. Both nulls leave act_dot with plain mean extraction as the routeA default.

 **Next.** routeA implementation per the plan now written into docs/spec/20260611_act_gate_spec.md (extraction module with verify gate, gate wiring replacing routeV, rolling-buffer winsorized-Otsu pinning), pending wassname's approval. More authored pairs remains the highest-leverage data change.
+## 2026-06-11 (f) -- per-module S-space shows no robust gate-score improvement
+
+**Question.** Does preserving each Linear's own SVD space reveal module-specific
+hack signal that pooled Super-S washes out?
+
+**Methods.** `scripts/diag_pinning_moduleS_exact.py` hooks the actual inputs of
+reader Linears and recomputes base-weight outputs of writer Linears from their
+actual inputs in blocks 12/18/24. Per module it
+uses the steering-lite S-space identities `x @ V * sqrt(S)` for readers and
+`y @ U / sqrt(S)` for writers, extracts a direction from the eight `behavior_`
+pairs, selects top-r modes by pair-difference magnitude, then aggregates module
+scores. Pueue 28/29/30 reran the v3/v4/v5 emergence windows and wrote strict TSV
+evidence.
+
+| score (AUROC) | v3 | v4 | v5 | mean | min |
+|---|---:|---:|---:|---:|---:|
+| moduleS writer r=256 concat-cos, best selected row | 0.892 | 0.733 | 0.786 | 0.804 | 0.733 |
+| act-dot existing default | 0.870 | 0.747 | 0.747 | 0.788 | 0.747 |
+| raw residual dot | 0.905 | 0.721 | 0.756 | 0.794 | 0.721 |
+
+**Result.** Per-module S-space shows no robust improvement on these windows. Its
+best row falls below act-dot on worst-window AUROC and was selected from 27 rows,
+so the small gain over raw residual dot is not evidence of improvement. The earlier
+cached-residual approximation produced a stronger reader-r64 row, but a fresh
+review found that it was a module-weight-derived metric on post-block residuals,
+not exact module S-space.
+
+**Evidence.** Full cross-window table:
+`out/diag/moduleS_exact_summary.tsv`. Per-window tables:
+`out/diag{,_v4,_v5}/moduleS_exact.tsv`. Spec and failure log:
+`docs/spec/20260611_per_module_sspace.md`.
+
+## 2026-06-11 (g) -- routeA act gate shipped; bimodality guard dropped after calibration
+
+**Introduction.** Entries (b)-(f) established that the activation dot score is the
+stable gate input. This entry covers the implementation
+(docs/spec/20260611_act_gate_spec.md): forward-only `v_act` extraction
+(`src/vgrout/extract_vhack_act.py`), the routeA gate in train.py (act capture on the
+quarantine-ablated logp_old forward, masks pinned before the single grad forward,
+rolling-buffer Otsu thresholds), and deletion of the routeV gradient gate. The spec
+left one open question: should an online bimodality guard close the rout zone before
+hacks emerge? Expected: some shape statistic of the score window separates the
+emergence mixture (hack share 35-43%) from hack-free scores.
+
+**Methods.** Calibration is offline on the cached v3/v4/v5 emergence-window features
+(out/diag{,_v4,_v5}/pinning_feats.pt, produced by scripts/diag_pinning.py at commit
+70697ff). Score = act dot vs the `behavior_` 8-pair v_act. Conditions: mixture (all
+valid live rollouts), cleanonly (non-exploited only, pre-emergence proxy), and N(0,1)
+n=256 (10 seeds). Statistics are computed after z-norm, winsorize(1/99), and
+two-threshold Otsu: `sep` = mean(z above t_hi) minus mean(z below t_lo) in buffer-sd
+units, `nbcv` = between-class variance as a fraction of total variance. Command:
+`uv run python
+scripts/attic/calib_otsu_guard.py`. Extractor equivalence was verified on GPU as
+pueue #24 (`scripts/verify_v_act.py`, commit 5a340e5).
+
+| cond | hack frac | n | sep | nbcv |
+|---|---:|---:|---:|---:|
+| v3 mixture | 0.43 | 138 | 2.75 | 0.80 |
+| v3 cleanonly | 0.00 | 79 | 2.54 | 0.82 |
+| v4 mixture | 0.35 | 96 | 2.82 | 0.81 |
+| v4 cleanonly | 0.00 | 62 | 2.34 | 0.84 |
+| v5 mixture | 0.39 | 138 | 2.44 | 0.76 |
+| v5 cleanonly | 0.00 | 84 | 3.52 | 0.77 |
+| gauss n=256 (mean of 10 seeds) | 0.00 | 256 | 2.42 | 0.83 |
+
+Table: guard-candidate statistics per condition. A usable guard needs the mixture rows
+to separate from the cleanonly and gauss rows on `sep` or `nbcv` with margin.
+
+Provenance:
+- Script: `scripts/attic/calib_otsu_guard.py` (copied from the session scratch file);
+  rerun output captured at `/tmp/claude-1000/calib_guard_out.txt` this session.
+- Inputs: `out/diag/pinning_feats.pt`, `out/diag_v4/pinning_feats.pt`,
+  `out/diag_v5/pinning_feats.pt` (the (c) replication windows).
+- verify_v_act: pueue #24 log; acts match cached diag features at rel diff 7.3e-4 and
+  7.7e-4 (hack/clean), v cos > 0.999, per-module cos >= 0.99997.
+
+**Results.** No statistic separates the conditions. The largest `sep` of all rows is a
+hack-FREE window (v5 cleanonly, 3.52); pure Gaussians sit at 2.42, just below the
+mixture range (2.44-2.82) and inside the cleanonly range (2.34-3.52). `nbcv` overlaps
+the same way (mixtures 0.76-0.81, cleanonly 0.77-0.84, gauss 0.83). Otsu always finds
+tail classes ~2.4 sd apart even when no structure exists, so any threshold on these
+statistics either always opens or always closes.
+
+**Discussion (speculative).** I read this as: the guard idea was solving a
+non-problem. Before emergence a false rout costs one update removed from deployment
+(asymmetric, cheap), and warmup already pins absorb while the buffer fills. An
+alternative read is that a better statistic exists (e.g. dip test, mixture-model BIC)
+and I only tried Otsu-derived ones; I did not pursue this because the cost asymmetry
+makes the guard's value marginal even if it worked. The gate therefore ships with
+warmup + Otsu only (commits adca442 routeA wiring + routeV deletion, f646e57
+review-driven hardening; smoke logs /tmp/claude-1000/smoke_routeA*.log).
+
+**Next.** Queue the seed-43 fast 4-arm set (`just queue-decision`): routeA real vs
+Haar placebo vs vanilla vs absorb. Decision: directionality is real iff real-v
+deploy_hack << placebo at matched solve, with gate AUROC >> 0.5 around emergence.
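
Sketch of the guard statistics from entry (g), for readers without the repo. This is not the committed `scripts/attic/calib_otsu_guard.py`, which may differ. It assumes that winsorize(1/99) means clipping at the 1st and 99th percentiles, that two-threshold Otsu is the exhaustive three-class split maximizing between-class variance, and that `sep` stays in z units (buffer sd before winsorizing). The name `guard_stats` and its parameters are hypothetical.

```python
import numpy as np


def guard_stats(scores, lo_pct=1.0, hi_pct=99.0):
    """Return t_lo, t_hi, sep and nbcv for one score buffer (illustrative only)."""
    x = np.asarray(scores, dtype=float)
    z = (x - x.mean()) / x.std()                 # z-norm with the buffer sd
    lo, hi = np.percentile(z, [lo_pct, hi_pct])
    z = np.sort(np.clip(z, lo, hi))              # winsorize, then sort for prefix sums
    n = z.size
    csum = np.concatenate([[0.0], np.cumsum(z)])
    mu, total_var = z.mean(), z.var()

    best_bcv, best_i, best_j = -1.0, 1, 2
    for i in range(1, n - 1):                    # low class is z[:i]
        j = np.arange(i + 1, n)                  # mid class z[i:j], high class z[j:]
        n0, n1, n2 = i, j - i, n - j
        m0 = csum[i] / n0
        m1 = (csum[j] - csum[i]) / n1
        m2 = (csum[n] - csum[j]) / n2
        bcv = (n0 * (m0 - mu) ** 2 + n1 * (m1 - mu) ** 2 + n2 * (m2 - mu) ** 2) / n
        k = int(np.argmax(bcv))
        if bcv[k] > best_bcv:
            best_bcv, best_i, best_j = float(bcv[k]), i, int(j[k])

    i, j = best_i, best_j
    return {
        "t_lo": 0.5 * (z[i - 1] + z[i]),         # midpoint between adjacent classes
        "t_hi": 0.5 * (z[j - 1] + z[j]),
        "sep": float(z[j:].mean() - z[:i].mean()),
        "nbcv": best_bcv / total_var,            # between-class share of variance
    }


# Structure-free input: a single N(0,1) buffer of the same size as the gauss row.
rng = np.random.default_rng(0)
stats = guard_stats(rng.standard_normal(256))
print(stats)
```

On unimodal input the variance-maximizing split still carves off both tails, so the high and low class means are far apart by construction. That is the effect the Results paragraph attributes to Otsu.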