diff --git a/docs/extract_vhack_grad-vec.md b/docs/extract_vhack_grad-vec.md
index f54cfb7..4e6539c 100644
--- a/docs/extract_vhack_grad-vec.md
+++ b/docs/extract_vhack_grad-vec.md
@@ -1,27 +1,24 @@
-# v_hack extraction: gradient-space SVD with magnitudes + runtime suspicion gate
+# v_hack extraction: gradient-space SVD with magnitudes
 
 Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
-This explains *what we extract*, *why*, and *what runtime gating prevents*.
+This explains *what we extract* and *why*.
 
 ## TL;DR
 
 `v_hack[name]` is a per-module top-k orthonormal basis in **AntiPaSTO
 δS-gradient space**, computed by PCA on **paired (hack − clean) NLL gradients**
-over a small set of contrastive completion pairs (currently N=14, 12 train + 2
+over a small set of contrastive completion pairs (currently N=12, 10 train + 2
 heldout). At training time we project the live policy-gradient component along
-this basis out of `δS.grad`, optionally gated so we only ablate when there's
-positive evidence the live gradient is hack-aligned.
+this basis out of `δS.grad`.
 
-The 2026-05-27 refactor added three things on top of the older mean-diff design:
+The 2026-05-27 refactor added two things on top of the older mean-diff design:
 
 1. **Top-k extraction** (k=12 max) with **load-time slicing** (`v_hack_k`,
    default 5) so k=1 vs k=5 vs k=12 is a config flip, not a re-extract.
 2. **Singular-value recording** (`_sv/{name}` keys) so v_i carries its
-   extract-time confidence S_i, not just direction.
-3. **Runtime suspicion gate** (`susp_drop_frac`): per step, drop the top-frac
-   (module, axis) pairs by `r_i = |g·v_i| / S_i`. Live alignment ≫ extract
-   confidence means v_i is probably aligned with a structured coding direction,
-   not hack — skip the projection.
+   extract-time confidence S_i, not just direction. (Currently unused at
+   runtime — earlier draft used it for a suspicion gate, removed 2026-05-27;
+   see below.)
 
 ## Why gradient space, not activation space?
 
@@ -125,67 +122,30 @@ k=1 vs k=5 vs k=12 is a **config flip**, not a re-extract. The
 ~71% of per-module D-variance — load-time slice at 5 is a reasonable
 default that we can ablate cheaply.
 
-## Runtime suspicion gate
+## Runtime suspicion gate (REMOVED 2026-05-27)
 
-**Hypothesis:** module M has small `||D(M)||_F` (weak hack signal at
-extract time). Its top SVD direction `v_1(M)` is dominated by noise
-shape, not hack shape. At training time, `g(M)` is the policy gradient
-flowing through M — a structured (non-isotropic) signal living in a
-low-d subspace of "directions that matter for next-token prediction." If
-`v_1(M)` coincidentally lies in that subspace, projecting `g(M)` along
-`v_1(M)` removes a chunk of useful coding-relevant gradient with no
-compensating reduction in hack signal.
+**Why it was tried:** if a module has small `||D||_F` at extract time
+(weak hack signal), its top SVD direction `v_1` could coincidentally
+align with a structured coding-gradient direction at training time,
+ablating capability rather than hack.
 
-**Why I'd initially dismissed this concern:** in a high-d random model
-(g and v isotropic), `|g · v| ≈ ||g||/√r ≈ 2% of ||g||`. So one bad
-direction costs ~2% of the live gradient — tolerable. **What I missed:**
-neither `g` nor `v` is isotropic. Both live in low-d structured
-subspaces. If those subspaces happen to overlap, the projection magnitude
-is much larger.
+**Gate design (since removed):** `r_i = |g·v_i|/S_i` as a per-step
+quantile drop of the top-25% (module, axis) pairs.
 
-**Gate design:**
+**Why removed:** the quantile design is a fixed-budget knob, not a
+detector — `frac_axes_susp` was deterministically 0.25 every step (true
+by definition of quantile), so the column carried no information.
+Codex review independently flagged: `|g·v_i|` scales with live-grad norm
+and `S_i` scales with extract-time-grad norm, so the cross-module ratio
+is not dimensionless and high-gradient modules dominate regardless of
+genuine suspiciousness. In a high-d model the worst-case damage per
+spurious axis is ~`1/√r ≈ 2%` of `||g||` anyway, so the cure was
+costlier than the disease.
 
-```
-r_i(M) = |g(M) · v_i(M)| / S_i(M)
-```
-
-- High `r_i`: live grad cares about v_i much more than the extract-time
-  hack signal did → suspicious, this v_i is probably picking up
-  structured coding flow.
-- Low `r_i`: live alignment is in proportion to extract-time confidence
-  → trust the projection.
-
-**Per-step quantile gate:** collect `r_i` across all `(module, axis)`
-pairs in one step, find the `(1 − drop_top_frac)`-quantile, suppress all
-axes above that threshold for this step. Default `drop_top_frac = 0.25`.
-
-```python
-# pseudo: in project_delta_S_grad
-all_r = []
-for name, info in wrappers.items():
-    c = V[name] @ info.delta_S.grad  # [k_use]
-    S = v_sv[name]                   # [k_use]
-    all_r.append(c.abs() / S.clamp_min(eps))
-threshold = quantile(cat(all_r), 1 − drop_top_frac)
-for ...:
-    keep = (r <= threshold)
-    g_proj = g − (c * keep * gate_mode_mask) @ V
-```
-
-## Known limitations (caveats from codex external review, 2026-05-27)
-
-1. **r_i is not dimensionless across modules.** `|g·v_i|` scales with
-   live-grad norm; `S_i` scales with extract-time-grad norm. A
-   high-gradient module dominates the global quantile regardless of
-   whether its axis is actually suspicious. Fix candidate: within-module
-   ratio `(|c_i|/||g||) / (S_i/||D||_F)`. **Not yet applied.**
-2. **Quantile gate is a fixed budget, not a detector.** It always drops
-   `drop_top_frac` of axes per step, even when nothing is genuinely
-   suspicious — and never drops more than that when many axes are. Fix
-   candidate: absolute threshold post-normalization, or run as
-   measure-only diagnostic until calibrated. **Not yet applied.**
-3. **Old v1 files (no `_sv/` keys) silently disable the gate.** Should
-   fail-fast when `susp_drop_frac > 0` and v_sv is empty.
+`_sv/{name}` keys are still saved — they're cheap and may feed a
+future, principled gate (extract-time `tau_axis` zeros rows where
+`S_i/S_0 < tau_axis`, which is the same idea but applied once at
+extract rather than at every step).
 
 ## Validation: cheap discriminative tests
 
@@ -239,11 +199,6 @@ is fitting noise. **Cost: 5 × ~5 min = 25 min total.**
 
 ## Open design questions
 
-- **Is the suspicion gate redundant?** Codex argued the quantile design
-  is a fixed-budget knob, not a detector. The right answer is probably:
-  ship it as measure-only first (log `frac_axes_susp` and per-step
-  `r_i` distribution histograms), confirm whether suspicious modules
-  actually exist empirically, *then* turn on projection-side gating.
 - **Should we whiten by S?** I.e. parameterize the AntiPaSTO knob as
   `δS_i / σ_i(W)` so all directions have equal forward-pass impact.
   Currently we don't. This is a separate, larger question.
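
Reviewer note on the added `_sv/{name}` paragraph: the extract-time `tau_axis` filter it mentions (zero the rows where `S_i/S_0 < tau_axis`, then project as usual) could look roughly like the sketch below. This is a plain-Python illustration under stated assumptions, not the repo's implementation: `zero_weak_rows` and `project_out` are hypothetical names, the basis rows are assumed orthonormal, and `S_0` is assumed to be the largest singular value.

```python
# Sketch only: plain-Python stand-in for the tensor ops.
# zero_weak_rows and project_out are hypothetical names, not from the repo.

def zero_weak_rows(basis, sv, tau_axis):
    """Zero row i of the basis when S_i / S_0 < tau_axis (S_0 assumed largest)."""
    s0 = sv[0]
    return [
        list(v) if s0 > 0 and s / s0 >= tau_axis else [0.0] * len(v)
        for v, s in zip(basis, sv)
    ]

def project_out(g, basis):
    """Subtract from g its component along each row of an orthonormal basis."""
    out = list(g)
    for v in basis:
        c = sum(gi * vi for gi, vi in zip(g, v))  # coefficient g . v_i
        out = [oi - c * vi for oi, vi in zip(out, v)]
    return out

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy orthonormal rows v_0, v_1
sv = [4.0, 0.2]                              # toy singular values S_0, S_1
filtered = zero_weak_rows(basis, sv, tau_axis=0.1)  # 0.2 / 4.0 = 0.05 < 0.1
g_proj = project_out([3.0, 2.0, 1.0], filtered)
```

Because the weak row is zeroed once at extract time, the training-step projection needs no per-step gating logic: a zeroed row contributes a zero coefficient and leaves the gradient untouched along that axis.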