mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:43:00 +08:00
Doc cleanup: mark susp gate as REMOVED in design doc
The runtime suspicion gate was removed in 8d170a0 but the design doc
still advertised it as a live pillar. Replace gate section with a brief
"why we tried it, why we removed it" note.
Also fix N=12 (was N=14): pairs.py has 12, not 14.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -1,27 +1,24 @@
-# v_hack extraction: gradient-space SVD with magnitudes + runtime suspicion gate
+# v_hack extraction: gradient-space SVD with magnitudes

 Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
-This explains *what we extract*, *why*, and *what runtime gating prevents*.
+This explains *what we extract* and *why*.

 ## TL;DR

 `v_hack[name]` is a per-module top-k orthonormal basis in **AntiPaSTO
 δS-gradient space**, computed by PCA on **paired (hack − clean) NLL gradients**
-over a small set of contrastive completion pairs (currently N=14, 12 train + 2
+over a small set of contrastive completion pairs (currently N=12, 10 train + 2
 heldout). At training time we project the live policy-gradient component along
-this basis out of `δS.grad`, optionally gated so we only ablate when there's
-positive evidence the live gradient is hack-aligned.
+this basis out of `δS.grad`.

-The 2026-05-27 refactor added three things on top of the older mean-diff design:
+The 2026-05-27 refactor added two things on top of the older mean-diff design:

 1. **Top-k extraction** (k=12 max) with **load-time slicing** (`v_hack_k`,
 default 5) so k=1 vs k=5 vs k=12 is a config flip, not a re-extract.
 2. **Singular-value recording** (`_sv/{name}` keys) so v_i carries its
-extract-time confidence S_i, not just direction.
-3. **Runtime suspicion gate** (`susp_drop_frac`): per step, drop the top-frac
-(module, axis) pairs by `r_i = |g·v_i| / S_i`. Live alignment ≫ extract
-confidence means v_i is probably aligned with a structured coding direction,
-not hack — skip the projection.
+extract-time confidence S_i, not just direction. (Currently unused at
+runtime — earlier draft used it for a suspicion gate, removed 2026-05-27;
+see below.)

 ## Why gradient space, not activation space?

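Editorial sketch (not part of the commit): the TL;DR's extract-then-project idea for a single module, written in NumPy. Everything here is an assumption made for illustration: `extract_basis` and `project_out` are hypothetical names, the gradients are random stand-ins, the toy sizes are arbitrary, and a plain uncentered SVD of the paired differences stands in for whatever the real pipeline does.

```python
import numpy as np


def extract_basis(g_hack, g_clean, k_max=12):
    """Top-k right singular vectors of paired (hack - clean) gradient differences.

    g_hack, g_clean: [N, r] arrays, one row per contrastive pair (hypothetical layout).
    Returns V [k, r] with orthonormal rows and S [k] singular values, largest first.
    """
    D = g_hack - g_clean
    _, S, Vt = np.linalg.svd(D, full_matrices=False)
    k = min(k_max, Vt.shape[0])
    return Vt[:k], S[:k]


def project_out(g, V, k_use=5):
    """Remove from g its component in the span of the first k_use rows of V."""
    Vk = V[:k_use]  # load-time slice: no re-extraction needed to change k_use
    return g - Vk.T @ (Vk @ g)


rng = np.random.default_rng(0)
N, r = 12, 64  # toy sizes, not the real module dimensions
g_clean = rng.normal(size=(N, r))
g_hack = g_clean + rng.normal(size=(N, r))

V, S = extract_basis(g_hack, g_clean)
g_live = rng.normal(size=r)  # stand-in for the live δS.grad of one module
g_proj = project_out(g_live, V, k_use=5)
```

Under these assumptions, switching between k=1, k=5 and k=12 only changes the `k_use` slice of the stored `V`, which is the "config flip, not a re-extract" property the doc describes. `S` is kept alongside `V`, as the `_sv/{name}` keys are in the doc, even though this projection does not use it.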
@@ -125,67 +122,30 @@ k=1 vs k=5 vs k=12 is a **config flip**, not a re-extract. The
 ~71% of per-module D-variance — load-time slice at 5 is a reasonable
 default that we can ablate cheaply.

-## Runtime suspicion gate
+## Runtime suspicion gate (REMOVED 2026-05-27)

-**Hypothesis:** module M has small `||D(M)||_F` (weak hack signal at
-extract time). Its top SVD direction `v_1(M)` is dominated by noise
-shape, not hack shape. At training time, `g(M)` is the policy gradient
-flowing through M — a structured (non-isotropic) signal living in a
-low-d subspace of "directions that matter for next-token prediction." If
-`v_1(M)` coincidentally lies in that subspace, projecting `g(M)` along
-`v_1(M)` removes a chunk of useful coding-relevant gradient with no
-compensating reduction in hack signal.
+**Why it was tried:** if a module has small `||D||_F` at extract time
+(weak hack signal), its top SVD direction `v_1` could coincidentally
+align with a structured coding-gradient direction at training time,
+ablating capability rather than hack.

-**Why I'd initially dismissed this concern:** in a high-d random model
-(g and v isotropic), `|g · v| ≈ ||g||/√r ≈ 2% of ||g||`. So one bad
-direction costs ~2% of the live gradient — tolerable. **What I missed:**
-neither `g` nor `v` is isotropic. Both live in low-d structured
-subspaces. If those subspaces happen to overlap, the projection magnitude
-is much larger.
+**Gate design (since removed):** `r_i = |g·v_i|/S_i` as a per-step
+quantile drop of the top-25% (module, axis) pairs.

-**Gate design:**
+**Why removed:** the quantile design is a fixed-budget knob, not a
+detector — `frac_axes_susp` was deterministically 0.25 every step (true
+by definition of quantile), so the column carried no information.
+Codex review independently flagged: `|g·v_i|` scales with live-grad norm
+and `S_i` scales with extract-time-grad norm, so the cross-module ratio
+is not dimensionless and high-gradient modules dominate regardless of
+genuine suspiciousness. In a high-d model the worst-case damage per
+spurious axis is ~`1/√r ≈ 2%` of `||g||` anyway, so the cure was
+costlier than the disease.

-```
-r_i(M) = |g(M) · v_i(M)| / S_i(M)
-```
-
-- High `r_i`: live grad cares about v_i much more than the extract-time
-hack signal did → suspicious, this v_i is probably picking up
-structured coding flow.
-- Low `r_i`: live alignment is in proportion to extract-time confidence
-→ trust the projection.
-
-**Per-step quantile gate:** collect `r_i` across all `(module, axis)`
-pairs in one step, find the `(1 − drop_top_frac)`-quantile, suppress all
-axes above that threshold for this step. Default `drop_top_frac = 0.25`.
-
-```python
-# pseudo: in project_delta_S_grad
-all_r = []
-for name, info in wrappers.items():
-    c = V[name] @ info.delta_S.grad # [k_use]
-    S = v_sv[name] # [k_use]
-    all_r.append(c.abs() / S.clamp_min(eps))
-threshold = quantile(cat(all_r), 1 − drop_top_frac)
-for ...:
-    keep = (r <= threshold)
-    g_proj = g − (c * keep * gate_mode_mask) @ V
-```
-
-## Known limitations (caveats from codex external review, 2026-05-27)
-
-1. **r_i is not dimensionless across modules.** `|g·v_i|` scales with
-live-grad norm; `S_i` scales with extract-time-grad norm. A
-high-gradient module dominates the global quantile regardless of
-whether its axis is actually suspicious. Fix candidate: within-module
-ratio `(|c_i|/||g||) / (S_i/||D||_F)`. **Not yet applied.**
-2. **Quantile gate is a fixed budget, not a detector.** It always drops
-`drop_top_frac` of axes per step, even when nothing is genuinely
-suspicious — and never drops more than that when many axes are. Fix
-candidate: absolute threshold post-normalization, or run as
-measure-only diagnostic until calibrated. **Not yet applied.**
-3. **Old v1 files (no `_sv/` keys) silently disable the gate.** Should
-fail-fast when `susp_drop_frac > 0` and v_sv is empty.
+`_sv/{name}` keys are still saved — they're cheap and may feed a
+future, principled gate (extract-time `tau_axis` zeros rows where
+`S_i/S_0 < tau_axis`, which is the same idea but applied once at
+extract rather than at every step).

 ## Validation: cheap discriminative tests

@@ -239,11 +199,6 @@ is fitting noise. **Cost: 5 × ~5 min = 25 min total.**

 ## Open design questions

-- **Is the suspicion gate redundant?** Codex argued the quantile design
-is a fixed-budget knob, not a detector. The right answer is probably:
-ship it as measure-only first (log `frac_axes_susp` and per-step
-`r_i` distribution histograms), confirm whether suspicious modules
-actually exist empirically, *then* turn on projection-side gating.
 - **Should we whiten by S?** I.e. parameterize the AntiPaSTO knob as
 `δS_i / σ_i(W)` so all directions have equal forward-pass impact.
 Currently we don't. This is a separate, larger question.
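
Editorial sketch (not part of the commit): a toy NumPy reconstruction of the removed per-step quantile gate, illustrating the two objections in the commit. It is a simplification built on assumptions: `gate_keep_mask` is a hypothetical name, the projections `c` are sampled directly instead of coming from real gradients, and `S` is fixed at 1 so that only the live-gradient scale varies.

```python
import numpy as np


def gate_keep_mask(c, S, drop_top_frac=0.25, eps=1e-12):
    """Quantile gate over all (module, axis) pairs for one step.

    c: [M, k] live projections g(M)·v_i(M); S: [M, k] extract-time singular values.
    Returns a boolean [M, k] mask that is True where the projection is kept.
    """
    r = np.abs(c) / np.maximum(S, eps)
    threshold = np.quantile(r, 1.0 - drop_top_frac)
    return r <= threshold


rng = np.random.default_rng(0)
M, k = 8, 5  # toy sizes: 8 modules, 5 axes each
S = np.ones((M, k))

# Objection 1: the dropped fraction is fixed by construction, whatever the
# gradients look like, so logging it carries no information.
fracs = []
for scale in (1e-6, 1.0, 1e6):
    for _ in range(20):
        c = scale * rng.normal(size=(M, k))
        fracs.append(1.0 - gate_keep_mask(c, S).mean())

# Objection 2: r_i is not dimensionless across modules. Rescaling one module's
# live gradient by 10x pushes all of its axes above the global threshold.
c = rng.uniform(0.5, 1.5, size=(M, k))
c_scaled = c.copy()
c_scaled[0] *= 10.0
dropped_before = int((~gate_keep_mask(c, S)[0]).sum())
dropped_after = int((~gate_keep_mask(c_scaled, S)[0]).sum())
```

In this toy setup every entry of `fracs` is 0.25 regardless of scale, and module 0 has all 5 of its axes dropped after rescaling even though its directions are unchanged. That matches the commit's description of the gate as a fixed-budget knob dominated by high-gradient modules.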