Doc cleanup: mark susp gate as REMOVED in design doc

The runtime suspicion gate was removed in 8d170a0 but the design doc still advertised it as a live pillar. Replace gate section with a brief "why we tried it, why we removed it" note. Also fix N=12 (was N=14): pairs.py has 12, not 14.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 18:43:00 +08:00 · 2026-05-27 09:08:34 +00:00
parent 8d170a0753
commit 8d2c9afb01
1 changed file with 28 additions and 73 deletions
@@ -1,27 +1,24 @@
-# v_hack extraction: gradient-space SVD with magnitudes + runtime suspicion gate
+# v_hack extraction: gradient-space SVD with magnitudes

 Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
-This explains *what we extract*, *why*, and *what runtime gating prevents*.
+This explains *what we extract* and *why*.

 ## TL;DR

 `v_hack[name]` is a per-module top-k orthonormal basis in **AntiPaSTO
 δS-gradient space**, computed by PCA on **paired (hack − clean) NLL gradients**
-over a small set of contrastive completion pairs (currently N=14, 12 train + 2
+over a small set of contrastive completion pairs (currently N=12, 10 train + 2
 heldout). At training time we project the live policy-gradient component along
-this basis out of `δS.grad`, optionally gated so we only ablate when there's
-positive evidence the live gradient is hack-aligned.
+this basis out of `δS.grad`.

-The 2026-05-27 refactor added three things on top of the older mean-diff design:
+The 2026-05-27 refactor added two things on top of the older mean-diff design:

 1. **Top-k extraction** (k=12 max) with **load-time slicing** (`v_hack_k`,
   default 5) so k=1 vs k=5 vs k=12 is a config flip, not a re-extract.
 2. **Singular-value recording** (`_sv/{name}` keys) so v_i carries its
-   extract-time confidence S_i, not just direction.
-3. **Runtime suspicion gate** (`susp_drop_frac`): per step, drop the top-frac
-   (module, axis) pairs by `r_i = |g·v_i| / S_i`. Live alignment ≫ extract
-   confidence means v_i is probably aligned with a structured coding direction,
-   not hack — skip the projection.
+   extract-time confidence S_i, not just direction. (Currently unused at
+   runtime — earlier draft used it for a suspicion gate, removed 2026-05-27;
+   see below.)

 ## Why gradient space, not activation space?

@@ -125,67 +122,30 @@ k=1 vs k=5 vs k=12 is a **config flip**, not a re-extract. The
 ~71% of per-module D-variance — load-time slice at 5 is a reasonable
 default that we can ablate cheaply.

-## Runtime suspicion gate
+## Runtime suspicion gate (REMOVED 2026-05-27)

-**Hypothesis:** module M has small `||D(M)||_F` (weak hack signal at
-extract time). Its top SVD direction `v_1(M)` is dominated by noise
-shape, not hack shape. At training time, `g(M)` is the policy gradient
-flowing through M — a structured (non-isotropic) signal living in a
-low-d subspace of "directions that matter for next-token prediction." If
-`v_1(M)` coincidentally lies in that subspace, projecting `g(M)` along
-`v_1(M)` removes a chunk of useful coding-relevant gradient with no
-compensating reduction in hack signal.
+**Why it was tried:** if a module has small `||D||_F` at extract time
+(weak hack signal), its top SVD direction `v_1` could coincidentally
+align with a structured coding-gradient direction at training time,
+ablating capability rather than hack.

-**Why I'd initially dismissed this concern:** in a high-d random model
-(g and v isotropic), `|g · v| ≈ ||g||/√r ≈ 2% of ||g||`. So one bad
-direction costs ~2% of the live gradient — tolerable. **What I missed:**
-neither `g` nor `v` is isotropic. Both live in low-d structured
-subspaces. If those subspaces happen to overlap, the projection magnitude
-is much larger.
+**Gate design (since removed):** `r_i = |g·v_i|/S_i` as a per-step
+quantile drop of the top-25% (module, axis) pairs.

-**Gate design:**
+**Why removed:** the quantile design is a fixed-budget knob, not a
+detector — `frac_axes_susp` was deterministically 0.25 every step (true
+by definition of quantile), so the column carried no information.
+Codex review independently flagged: `|g·v_i|` scales with live-grad norm
+and `S_i` scales with extract-time-grad norm, so the cross-module ratio
+is not dimensionless and high-gradient modules dominate regardless of
+genuine suspiciousness. In a high-d isotropic model the expected damage per
+spurious axis is only ~`1/√r ≈ 2%` of `||g||`, so the cure was
+costlier than the disease.

-```
-r_i(M) = |g(M) · v_i(M)| / S_i(M)
-```
-
-- High `r_i`: live grad cares about v_i much more than the extract-time
-  hack signal did → suspicious, this v_i is probably picking up
-  structured coding flow.
-- Low `r_i`: live alignment is in proportion to extract-time confidence
-  → trust the projection.
-
-**Per-step quantile gate:** collect `r_i` across all `(module, axis)`
-pairs in one step, find the `(1 − drop_top_frac)`-quantile, suppress all
-axes above that threshold for this step. Default `drop_top_frac = 0.25`.
-
-```python
-# pseudo: in project_delta_S_grad
-all_r = []
-for name, info in wrappers.items():
-    c = V[name] @ info.delta_S.grad        # [k_use]
-    S = v_sv[name]                          # [k_use]
-    all_r.append(c.abs() / S.clamp_min(eps))
-threshold = quantile(cat(all_r), 1 − drop_top_frac)
-for ...:
-    keep = (r <= threshold)
-    g_proj = g − (c * keep * gate_mode_mask) @ V
-```
-
-## Known limitations (caveats from codex external review, 2026-05-27)
-
-1. **r_i is not dimensionless across modules.** `|g·v_i|` scales with
-   live-grad norm; `S_i` scales with extract-time-grad norm. A
-   high-gradient module dominates the global quantile regardless of
-   whether its axis is actually suspicious. Fix candidate: within-module
-   ratio `(|c_i|/||g||) / (S_i/||D||_F)`. **Not yet applied.**
-2. **Quantile gate is a fixed budget, not a detector.** It always drops
-   `drop_top_frac` of axes per step, even when nothing is genuinely
-   suspicious — and never drops more than that when many axes are. Fix
-   candidate: absolute threshold post-normalization, or run as
-   measure-only diagnostic until calibrated. **Not yet applied.**
-3. **Old v1 files (no `_sv/` keys) silently disable the gate.** Should
-   fail-fast when `susp_drop_frac > 0` and v_sv is empty.
+`_sv/{name}` keys are still saved — they're cheap and may feed a
+future, principled gate (extract-time `tau_axis` zeros rows where
+`S_i/S_0 < tau_axis`, which is the same idea but applied once at
+extract rather than at every step).

 ## Validation: cheap discriminative tests

@@ -239,11 +199,6 @@ is fitting noise. **Cost: 5 × ~5 min = 25 min total.**

 ## Open design questions

-- **Is the suspicion gate redundant?** Codex argued the quantile design
-  is a fixed-budget knob, not a detector. The right answer is probably:
-  ship it as measure-only first (log `frac_axes_susp` and per-step
-  `r_i` distribution histograms), confirm whether suspicious modules
-  actually exist empirically, *then* turn on projection-side gating.
 - **Should we whiten by S?** I.e. parameterize the AntiPaSTO knob as
  `δS_i / σ_i(W)` so all directions have equal forward-pass impact.
  Currently we don't. This is a separate, larger question.
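
The two objections recorded under "Why removed" can be checked numerically. The sketch below is not the repository's `project_delta_S_grad`. It is a minimal NumPy stand-in that uses random orthonormal bases, made-up singular values, and hypothetical helper names (`make_basis`, `suspicion_ratios`, `quantile_gate`). It assumes the gate drops every (module, axis) pair whose `r_i` lies above the global `(1 - drop_top_frac)`-quantile, as the removed section describes.

```python
import numpy as np


def make_basis(rng, d, k):
    """Random orthonormal basis with k rows in R^d (stand-in for v_hack[name])."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # q: [d, k], orthonormal columns
    return q.T  # [k, d]


def suspicion_ratios(g, V, S, eps=1e-12):
    """r_i = |g . v_i| / S_i for one module."""
    return np.abs(V @ g) / np.maximum(S, eps)


def quantile_gate(grads, bases, svs, drop_top_frac=0.25):
    """Boolean drop mask [n_modules, k] from one global quantile over all r_i."""
    r = np.stack([suspicion_ratios(g, V, S) for g, V, S in zip(grads, bases, svs)])
    threshold = np.quantile(r, 1.0 - drop_top_frac)
    return r > threshold


rng = np.random.default_rng(0)
n_modules, k, d = 8, 5, 64  # illustrative sizes, not the real config
bases = [make_basis(rng, d, k) for _ in range(n_modules)]
svs = [rng.uniform(0.5, 2.0, size=k) for _ in range(n_modules)]
grads = [rng.standard_normal(d) for _ in range(n_modules)]

# Objection 1: the dropped fraction is fixed by construction.
drop = quantile_gate(grads, bases, svs)
frac_axes_susp = drop.mean()

# Objection 2: rescale module 0's live gradient without changing its direction.
loud = [g * (1e6 if i == 0 else 1.0) for i, g in enumerate(grads)]
drop_loud = quantile_gate(loud, bases, svs)

print("frac dropped (baseline):", frac_axes_susp)
print("frac dropped (module 0 rescaled):", drop_loud.mean())
print("all of module 0 dropped after rescale:", bool(drop_loud[0].all()))
```

With 8 modules and 5 axes each, 10 of the 40 axes sit above the 0.75-quantile in both runs, so the logged fraction carries no information about the gradients. In the rescaled run, module 0's axes should all land in the dropped set even though no direction changed, which is the scale dependence the Codex review pointed out.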