Doc cleanup: mark susp gate as REMOVED in design doc

The runtime suspicion gate was removed in 8d170a0 but the design doc
still advertised it as a live pillar. Replace gate section with a brief
"why we tried it, why we removed it" note.

Also fix N=12 (was N=14): pairs.py has 12, not 14.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
wassname
2026-05-27 09:08:34 +00:00
parent 8d170a0753
commit 8d2c9afb01
+28 -73
@@ -1,27 +1,24 @@
# v_hack extraction: gradient-space SVD with magnitudes + runtime suspicion gate
# v_hack extraction: gradient-space SVD with magnitudes
Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
This explains *what we extract*, *why*, and *what runtime gating prevents*.
This explains *what we extract* and *why*.
## TL;DR
`v_hack[name]` is a per-module top-k orthonormal basis in **AntiPaSTO
δS-gradient space**, computed by PCA on **paired (hack - clean) NLL gradients**
over a small set of contrastive completion pairs (currently N=14, 12 train + 2
over a small set of contrastive completion pairs (currently N=12, 10 train + 2
heldout). At training time we project the live policy-gradient component along
this basis out of `δS.grad`, optionally gated so we only ablate when there's
positive evidence the live gradient is hack-aligned.
this basis out of `δS.grad`.
The 2026-05-27 refactor added three things on top of the older mean-diff design:
The 2026-05-27 refactor added two things on top of the older mean-diff design:
1. **Top-k extraction** (k=12 max) with **load-time slicing** (`v_hack_k`,
default 5) so k=1 vs k=5 vs k=12 is a config flip, not a re-extract.
2. **Singular-value recording** (`_sv/{name}` keys) so v_i carries its
extract-time confidence S_i, not just direction.
3. **Runtime suspicion gate** (`susp_drop_frac`): per step, drop the top-frac
(module, axis) pairs by `r_i = |g·v_i| / S_i`. Live alignment ≫ extract
confidence means v_i is probably aligned with a structured coding direction,
not hack — skip the projection.
extract-time confidence S_i, not just direction. (Currently unused at
runtime — earlier draft used it for a suspicion gate, removed 2026-05-27;
see below.)
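A minimal sketch of the projection and load-time slicing described in this TL;DR, in plain Python so it runs without a tensor library. The helper names `dot` and `project_out` and the toy basis are illustrative assumptions, not the pipeline's actual code; the real implementation works on `δS.grad` tensors per module.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(g, V, k_use):
    """Remove from g its component in the span of the first k_use rows of V.

    Assumes the rows of V are orthonormal, as an SVD/PCA basis would be.
    Slicing V[:k_use] stands in for the load-time `v_hack_k` slice:
    changing k needs no re-extraction, only a different slice.
    """
    g_proj = list(g)
    for v in V[:k_use]:
        c = dot(v, g)  # coefficient of g along this axis
        g_proj = [gi - c * vi for gi, vi in zip(g_proj, v)]
    return g_proj


# Toy basis (hypothetical): three orthonormal axes in R^4.
V = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
g = [3.0, -2.0, 5.0, 7.0]

g_k1 = project_out(g, V, k_use=1)  # only the top axis is ablated
g_k3 = project_out(g, V, k_use=3)  # all three axes are ablated
```

With `k_use=1` only the first coordinate of `g` is zeroed; with `k_use=3` only the component outside the basis survives.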
## Why gradient space, not activation space?
@@ -125,67 +122,30 @@ k=1 vs k=5 vs k=12 is a **config flip**, not a re-extract. The
~71% of per-module D-variance — load-time slice at 5 is a reasonable
default that we can ablate cheaply.
## Runtime suspicion gate
## Runtime suspicion gate (REMOVED 2026-05-27)
**Hypothesis:** module M has small `||D(M)||_F` (weak hack signal at
extract time). Its top SVD direction `v_1(M)` is dominated by noise
shape, not hack shape. At training time, `g(M)` is the policy gradient
flowing through M — a structured (non-isotropic) signal living in a
low-d subspace of "directions that matter for next-token prediction." If
`v_1(M)` coincidentally lies in that subspace, projecting `g(M)` along
`v_1(M)` removes a chunk of useful coding-relevant gradient with no
compensating reduction in hack signal.
**Why it was tried:** if a module has small `||D||_F` at extract time
(weak hack signal), its top SVD direction `v_1` could coincidentally
align with a structured coding-gradient direction at training time,
ablating capability rather than hack.
**Why I'd initially dismissed this concern:** in a high-d random model
(g and v isotropic), `|g · v| ≈ ||g||/√r ≈ 2% of ||g||`. So one bad
direction costs ~2% of the live gradient — tolerable. **What I missed:**
neither `g` nor `v` is isotropic. Both live in low-d structured
subspaces. If those subspaces happen to overlap, the projection magnitude
is much larger.
**Gate design (since removed):** `r_i = |g·v_i|/S_i` as a per-step
quantile drop of the top-25% (module, axis) pairs.
**Gate design:**
**Why removed:** the quantile design is a fixed-budget knob, not a
detector — `frac_axes_susp` was deterministically 0.25 every step (true
by definition of quantile), so the column carried no information.
Codex review independently flagged: `|g·v_i|` scales with live-grad norm
and `S_i` scales with extract-time-grad norm, so the cross-module ratio
is not dimensionless and high-gradient modules dominate regardless of
genuine suspiciousness. In a high-d model the worst-case damage per
spurious axis is ~`1/√r ≈ 2%` of `||g||` anyway, so the cure was
costlier than the disease.
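The fixed-budget point can be checked with a toy simulation. The sketch below is an illustration under assumed synthetic `r_i` values, not the removed pipeline code, and `frac_dropped` is a hypothetical helper. It thresholds at the `(1 - drop_top_frac)`-quantile and shows that the dropped fraction stays the same whether the ratios look benign or alarming.

```python
import random


def frac_dropped(r, drop_top_frac):
    """Fraction of entries strictly above the (1 - drop_top_frac)-quantile."""
    ranked = sorted(r)
    n_keep = len(ranked) - int(round(drop_top_frac * len(ranked)))
    threshold = ranked[n_keep - 1]  # largest value that is still kept
    return sum(x > threshold for x in r) / len(r)


rng = random.Random(0)

# Two synthetic steps: tiny ratios everywhere vs. huge ratios everywhere.
benign = [rng.uniform(0.0, 0.01) for _ in range(400)]
alarming = [rng.uniform(0.0, 100.0) for _ in range(400)]

frac_benign = frac_dropped(benign, drop_top_frac=0.25)
frac_alarming = frac_dropped(alarming, drop_top_frac=0.25)
```

Both fractions come out at the configured budget, which is why logging `frac_axes_susp` told us nothing about whether any axis was actually suspicious.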
```
r_i(M) = |g(M) · v_i(M)| / S_i(M)
```
- High `r_i`: live grad cares about v_i much more than the extract-time
hack signal did → suspicious, this v_i is probably picking up
structured coding flow.
- Low `r_i`: live alignment is in proportion to extract-time confidence
→ trust the projection.
**Per-step quantile gate:** collect `r_i` across all `(module, axis)`
pairs in one step, find the `(1 - drop_top_frac)`-quantile, suppress all
axes above that threshold for this step. Default `drop_top_frac = 0.25`.
```python
# pseudo: in project_delta_S_grad
all_r = []
for name, info in wrappers.items():
    c = V[name] @ info.delta_S.grad   # [k_use]
    S = v_sv[name]                    # [k_use]
    all_r.append(c.abs() / S.clamp_min(eps))
threshold = quantile(cat(all_r), 1 - drop_top_frac)
for ...:
    keep = (r <= threshold)
    g_proj = g - (c * keep * gate_mode_mask) @ V
```
## Known limitations (caveats from codex external review, 2026-05-27)
1. **r_i is not dimensionless across modules.** `|g·v_i|` scales with
live-grad norm; `S_i` scales with extract-time-grad norm. A
high-gradient module dominates the global quantile regardless of
whether its axis is actually suspicious. Fix candidate: within-module
ratio `(|c_i|/||g||) / (S_i/||D||_F)`. **Not yet applied.**
2. **Quantile gate is a fixed budget, not a detector.** It always drops
`drop_top_frac` of axes per step, even when nothing is genuinely
suspicious — and never drops more than that when many axes are. Fix
candidate: absolute threshold post-normalization, or run as
measure-only diagnostic until calibrated. **Not yet applied.**
3. **Old v1 files (no `_sv/` keys) silently disable the gate.** Should
fail-fast when `susp_drop_frac > 0` and v_sv is empty.
`_sv/{name}` keys are still saved — they're cheap and may feed a
future, principled gate (extract-time `tau_axis` zeros rows where
`S_i/S_0 < tau_axis`, which is the same idea but applied once at
extract rather than at every step).
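A rough sketch of what the extract-time `tau_axis` filter mentioned above could look like, assuming singular values are sorted in descending order so that `S[0]` is `S_0`. The function names and the numbers are hypothetical; only the rule "zero rows where `S_i/S_0 < tau_axis`" comes from this doc.

```python
def axis_mask(S, tau_axis):
    """Return 1.0 for axes with S_i / S_0 >= tau_axis, else 0.0.

    Assumes S is sorted descending, so S[0] is the largest singular value.
    """
    s0 = S[0]
    return [1.0 if s / s0 >= tau_axis else 0.0 for s in S]


def apply_mask(V, mask):
    """Zero out the rows of V whose mask entry is 0.0."""
    return [[m * x for x in row] for row, m in zip(V, mask)]


# Hypothetical singular values for one module, k=4.
S = [4.0, 2.0, 0.8, 0.1]
V = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

mask = axis_mask(S, tau_axis=0.1)  # ratios: 1.0, 0.5, 0.2, 0.025
V_kept = apply_mask(V, mask)       # last row is zeroed once, at extract time
```

Because the ratio is taken within a single module, it does not mix live-gradient scale with extract-time scale, which was the dimensional problem with the per-step ratio.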
## Validation: cheap discriminative tests
@@ -239,11 +199,6 @@ is fitting noise. **Cost: 5 × ~5 min = 25 min total.**
## Open design questions
- **Is the suspicion gate redundant?** Codex argued the quantile design
is a fixed-budget knob, not a detector. The right answer is probably:
ship it as measure-only first (log `frac_axes_susp` and per-step
`r_i` distribution histograms), confirm whether suspicious modules
actually exist empirically, *then* turn on projection-side gating.
- **Should we whiten by S?** I.e. parameterize the AntiPaSTO knob as
`δS_i / σ_i(W)` so all directions have equal forward-pass impact.
Currently we don't. This is a separate, larger question.