journal(#41): entry (g) routeA shipped + guard-drop calibration; track moduleS diag scripts

Entry (f) already cited scripts/diag_pinning_moduleS_exact.py; both moduleS
scripts were untracked, so committing them for provenance alongside the
calibration script the new entry cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-11 12:50:07 +00:00
parent f646e57028
commit 8000aa48f4
4 changed files with 560 additions and 0 deletions
@@ -4257,3 +4257,96 @@ Provenance:
**Discussion (speculative).** I read both as nulls at this sample size (per-window SE ~0.07). For super-S, whitening consistently sits at or below the raw baseline, which would fit the pooled spectrum amplifying low-energy directions that carry no hack signal here, but I cannot distinguish that from noise. The one apparent gain (rotation + reader basis + top-64) is exactly what taking the maximum of 50 noisy rows produces; I would only believe it if it survived on new windows chosen in advance. For t-stat the entry-(d) alternative stands: the per-coordinate std over 8 pairs is itself ~25% noise, so the weighting may be real but unestimable at this pair count. Both nulls leave act_dot with plain mean extraction as the routeA default.
**Next.** routeA implementation per the plan now written into docs/spec/20260611_act_gate_spec.md (extraction module with verify gate, gate wiring replacing routeV, rolling-buffer winsorized-Otsu pinning), pending wassname's approval. Adding more authored pairs remains the highest-leverage data change.
## 2026-06-11 (f) -- per-module S-space shows no robust gate-score improvement
**Question.** Does preserving each Linear's own SVD space reveal module-specific
hack signal that pooled Super-S washes out?
**Methods.** `scripts/diag_pinning_moduleS_exact.py` hooks the actual inputs of
reader Linears and recomputes base-weight outputs of writer Linears from their
actual inputs in blocks 12/18/24. Per module it
uses the steering-lite S-space identities `x @ V * sqrt(S)` for readers and
`y @ U / sqrt(S)` for writers, extracts a direction from the eight `behavior_`
pairs, selects top-r modes by pair-difference magnitude, then aggregates module
scores. Pueue 28/29/30 reran the v3/v4/v5 emergence windows and wrote strict TSV
evidence.
| score | v3 | v4 | v5 | mean | min |
|---|---:|---:|---:|---:|---:|
| moduleS writer r=256 concat-cos, best selected row | 0.892 | 0.733 | 0.786 | 0.804 | 0.733 |
| act-dot existing default | 0.870 | 0.747 | 0.747 | 0.788 | 0.747 |
| raw residual dot | 0.905 | 0.721 | 0.756 | 0.794 | 0.721 |
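A minimal sketch of the S-space identities quoted in Methods, assuming each Linear computes `y = x @ W.T` with `W` of shape (out, in) and a reduced SVD `W = U diag(S) V^T`. The function names and the mean-pair-difference direction are illustrative assumptions, not the contents of `scripts/diag_pinning_moduleS_exact.py`. Under these assumptions the reader and writer coordinates of the same Linear coincide, which makes a convenient sanity check.

```python
import numpy as np

def s_space_coords(W, x):
    """Reader and writer S-space coordinates for one bias-free Linear (hypothetical helper)."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(S) @ Vh
    V = Vh.T
    y = x @ W.T                        # base-weight output recomputed from the actual input
    reader = x @ V * np.sqrt(S)        # reader identity: x @ V * sqrt(S)
    writer = y @ U / np.sqrt(S)        # writer identity: y @ U / sqrt(S)
    return reader, writer

def top_r_direction(hack, clean, r):
    """Mean pair difference, keeping only the r modes with the largest magnitude (assumed rule)."""
    diff = (hack - clean).mean(axis=0)
    keep = np.argsort(-np.abs(diff))[:r]
    direction = np.zeros_like(diff)
    direction[keep] = diff[keep]
    return direction

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))        # toy Linear: 4 inputs, 6 outputs
x = rng.standard_normal((8, 4))        # toy activations
reader, writer = s_space_coords(W, x)
print(np.allclose(reader, writer))
```

The agreement follows from `y @ U = x @ V @ diag(S)` when `U` has orthonormal columns, so dividing by `sqrt(S)` lands on the same coordinates as the reader form.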
**Result.** Per-module S-space shows no robust improvement on these windows. Its
best row falls below act-dot on worst-window AUROC and was selected from 27 rows,
so the small gain over raw residual dot is not evidence of improvement. The earlier
cached-residual approximation produced a stronger reader-r64 row, but fresh
review correctly identified that it was a module-weight-derived metric on
post-block residuals rather than exact module S-space.
**Evidence.** Full cross-window table:
`out/diag/moduleS_exact_summary.tsv`. Per-window tables:
`out/diag{,_v4,_v5}/moduleS_exact.tsv`. Spec and failure log:
`docs/spec/20260611_per_module_sspace.md`.
## 2026-06-11 (g) -- routeA act gate shipped; bimodality guard dropped after calibration
**Introduction.** Entries (b)-(f) established that the activation dot score is the
stable gate input. This entry covers the implementation
(docs/spec/20260611_act_gate_spec.md): forward-only `v_act` extraction
(`src/vgrout/extract_vhack_act.py`), the routeA gate in train.py (act capture on the
quarantine-ablated logp_old forward, masks pinned before the single grad forward,
rolling-buffer Otsu thresholds), and deletion of the routeV gradient gate. The spec
left one open question: should an online bimodality guard close the rout zone before
hacks emerge? Expected: some shape statistic of the score window separates the
emergence mixture (hack share 35-43%) from hack-free scores.
**Methods.** Calibration is offline on the cached v3/v4/v5 emergence-window features
(out/diag{,_v4,_v5}/pinning_feats.pt, produced by scripts/diag_pinning.py at commit
70697ff). Score = act dot vs the `behavior_` 8-pair v_act. Conditions: mixture (all
valid live rollouts), cleanonly (non-exploited only, pre-emergence proxy), and N(0,1)
n=256 (10 seeds). Statistics computed after z-norm, winsorize(1/99), two-threshold
Otsu: `sep` = mean(z above t_hi) minus mean(z below t_lo) in buffer-sd units, `nbcv` =
between-class variance fraction. Command: `uv run python
scripts/attic/calib_otsu_guard.py`. Extractor equivalence was verified on GPU as
pueue #24 (`scripts/verify_v_act.py`, commit 5a340e5).
| cond | hack frac | n | sep | nbcv |
|---|---:|---:|---:|---:|
| v3 mixture | 0.43 | 138 | 2.75 | 0.80 |
| v3 cleanonly | 0.00 | 79 | 2.54 | 0.82 |
| v4 mixture | 0.35 | 96 | 2.82 | 0.81 |
| v4 cleanonly | 0.00 | 62 | 2.34 | 0.84 |
| v5 mixture | 0.39 | 138 | 2.44 | 0.76 |
| v5 cleanonly | 0.00 | 84 | 3.52 | 0.77 |
| gauss n=256 (mean of 10 seeds) | 0.00 | 256 | 2.42 | 0.83 |
Table: guard-candidate statistics per condition. A usable guard needs the mixture rows
to separate from the cleanonly and gauss rows on `sep` or `nbcv` with margin.
Provenance:
- Script: `scripts/attic/calib_otsu_guard.py` (copied from the session scratch file);
rerun output captured at `/tmp/claude-1000/calib_guard_out.txt` this session.
- Inputs: `out/diag/pinning_feats.pt`, `out/diag_v4/pinning_feats.pt`,
`out/diag_v5/pinning_feats.pt` (the (c) replication windows).
- verify_v_act: pueue #24 log; acts match cached diag features at rel diff 7.3e-4 and
7.7e-4 (hack/clean), v cos > 0.999, per-module cos >= 0.99997.
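The guard statistics defined in Methods can be sketched as follows. This is a hedged reconstruction, not the contents of `scripts/attic/calib_otsu_guard.py`: it assumes the two-threshold Otsu is an exhaustive search for the pair of cuts that maximizes three-class between-class variance, and the function name `otsu2_guard_stats` is hypothetical.

```python
import numpy as np

def otsu2_guard_stats(scores, pct=(1.0, 99.0)):
    """Return (t_lo, t_hi, sep, nbcv) for a 1-D score buffer (assumed definitions)."""
    x = np.asarray(scores, dtype=np.float64)
    z = (x - x.mean()) / x.std()               # z-norm: buffer-sd units
    lo, hi = np.percentile(z, pct)
    z = np.sort(np.clip(z, lo, hi))            # winsorize(1/99), then sort
    n = z.size
    csum = np.concatenate(([0.0], np.cumsum(z)))

    # Classes are z[:i], z[i:j], z[j:]; i and j are the two cut indices.
    i = np.arange(1, n - 1)[:, None]
    j = np.arange(2, n)[None, :]
    valid = j > i
    n_mid = np.where(valid, j - i, 1)          # dummy size where the cut pair is invalid
    s_lo, s_mid, s_hi = csum[i], csum[j] - csum[i], csum[n] - csum[j]
    # Maximizing sum_k S_k^2 / n_k is equivalent to maximizing between-class variance.
    obj = s_lo**2 / i + s_mid**2 / n_mid + s_hi**2 / (n - j)
    obj = np.where(valid, obj, -np.inf)
    bi, bj = np.unravel_index(np.argmax(obj), obj.shape)
    i_star, j_star = bi + 1, bj + 2

    between = obj[bi, bj] / n - z.mean() ** 2
    nbcv = between / z.var()                   # between-class variance fraction
    sep = z[j_star:].mean() - z[:i_star].mean()  # mean above t_hi minus mean below t_lo
    t_lo = 0.5 * (z[i_star - 1] + z[i_star])
    t_hi = 0.5 * (z[j_star - 1] + z[j_star])
    return t_lo, t_hi, sep, nbcv

rng = np.random.default_rng(0)
t_lo, t_hi, sep, nbcv = otsu2_guard_stats(rng.standard_normal(256))
print(f"t_lo={t_lo:.2f} t_hi={t_hi:.2f} sep={sep:.2f} nbcv={nbcv:.2f}")
```

The exhaustive search is quadratic in the buffer size, which is cheap at n=256. A histogram-based multi-Otsu would be the usual choice for larger buffers.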
**Results.** No statistic separates the conditions. The largest `sep` of all rows is a
hack-FREE window (v5 cleanonly, 3.52); pure Gaussians sit at 2.42, inside the mixture
range (2.44-2.82). `nbcv` overlaps the same way (mixtures 0.76-0.81 vs gauss 0.83).
Otsu always finds tail classes ~2.4 sd apart even when no structure exists, so any
threshold on these statistics either always opens or always closes.
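One way to see where the ~2.4 sd figure could come from, assuming the two-threshold Otsu maximizes three-class between-class variance: on an ideal N(0,1) the optimum is symmetric, the middle class has mean 0, and each threshold sits halfway between 0 and the tail-class mean. The sketch below solves that fixed point in closed-form terms. It ignores winsorization and finite-sample noise, so it is a population-level approximation of the gauss row, not a reproduction of it.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def upper_tail(t):
    """P(Z > t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def gaussian_three_class_optimum(iters=100):
    """Symmetric three-class optimum for N(0,1): returns (threshold, sep, nbcv)."""
    t = 0.5
    for _ in range(iters):
        tail_mean = phi(t) / upper_tail(t)   # E[Z | Z > t]
        t = 0.5 * tail_mean                  # threshold = midpoint of class means 0 and tail_mean
    tail_mean = phi(t) / upper_tail(t)
    sep = 2.0 * tail_mean                    # upper tail mean minus lower tail mean
    nbcv = 2.0 * upper_tail(t) * tail_mean ** 2  # total variance is 1, middle class mean is 0
    return t, sep, nbcv

t, sep, nbcv = gaussian_three_class_optimum()
print(f"threshold=+/-{t:.3f} sep={sep:.3f} nbcv={nbcv:.3f}")
```

Under these assumptions the fixed point gives `sep` near 2.45 and `nbcv` near 0.81 for a structureless Gaussian, in line with the gauss row of the table.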
**Discussion (speculative).** I read this as: the guard idea was solving a
non-problem. Before emergence a false rout costs one update removed from deployment
(asymmetric, cheap), and warmup already pins absorb while the buffer fills. An
alternative read is that a better statistic exists (e.g. dip test, mixture-model BIC)
and I only tried Otsu-derived ones; I did not pursue this because the cost asymmetry
makes the guard's value marginal even if it worked. The gate therefore ships with
warmup + Otsu only (commits adca442 routeA wiring + routeV deletion, f646e57
review-driven hardening; smoke logs /tmp/claude-1000/smoke_routeA*.log).
**Next.** Queue the seed-43 fast 4-arm set (`just queue-decision`): routeA real vs
Haar placebo vs vanilla vs absorb. Decision: directionality is real iff real-v
deploy_hack << placebo at matched solve, with gate AUROC >> 0.5 around emergence.