mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin

Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 06:39:05 +00:00

v_hack extraction: gradient-space SVD with magnitudes + runtime suspicion gate

Living design doc for the v_hack pipeline. Sibling to RESEARCH_JOURNAL.md. This explains what we extract, why, and what runtime gating prevents.

TL;DR

v_hack[name] is a per-module top-k orthonormal basis in AntiPaSTO δS-gradient space, computed by PCA on paired (hack − clean) NLL gradients over a small set of contrastive completion pairs (currently N=14, 12 train + 2 heldout). At training time we project the live policy-gradient component along this basis out of δS.grad, optionally gated so we only ablate when there's positive evidence the live gradient is hack-aligned.

The 2026-05-27 refactor added three things on top of the older mean-diff design:

Top-k extraction (k=12 max) with load-time slicing (v_hack_k, default 5) so k=1 vs k=5 vs k=12 is a config flip, not a re-extract.
Singular-value recording (_sv/{name} keys) so v_i carries its extract-time confidence S_i, not just direction.
Runtime suspicion gate (susp_drop_frac): per step, drop the top-frac (module, axis) pairs by r_i = |g·v_i| / S_i. Live alignment ≫ extract confidence means v_i is probably aligned with a structured coding direction, not hack — skip the projection.

Why gradient space, not activation space?

Most representation-steering work (ActAdd, RepE, CHaRS) operates on activations (forward pass), shifting hidden states at inference. We operate on gradients of δS, the trainable per-Linear AntiPaSTO knob.

Reasons:

We're not steering inference; we're shaping training. The projection modifies δS.grad before the optimizer step, so the model itself doesn't drift toward hack-aligned weight updates.
δS gradients have a fixed, low-dimensional structure per module (δS ∈ R^r where r = SVD rank of W). PCA on these gradients is computationally cheap (12 pairs × N modules, with r = 2560 for the largest matrix) and gives a clean per-module subspace.
This is closest in spirit to CHaRS-PCT (Principal Component Thresholding, §3.3 of docs/paper_chars.md): the L principal components of local-shift covariance. We do the same maneuver on paired δS-gradient diffs.

Why δS basis (= weight-SVD basis), not raw param basis?

AntiPaSTO wraps each Linear with δW = U · diag(δS) · V_h, where U, S, V_h = SVD(W_pretrained). So δS ∈ R^r are coordinates in the weight-SVD basis. The basis change is just a rotation — no whitening, no rescaling.
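A minimal sketch of that parameterization, using NumPy as a stand-in for the torch tensors. The matrix sizes and variable names are illustrative assumptions, not taken from the AntiPaSTO code. It shows that δS are coordinates in the weight-SVD basis and that the δS gradient is the diagonal of the dense weight gradient rotated into that basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
W = rng.normal(size=(d_out, d_in))            # stand-in for a pretrained Linear weight

# Thin SVD: U [d_out, r], S [r], Vh [r, d_in], with r = min(d_out, d_in).
U, S, Vh = np.linalg.svd(W, full_matrices=False)
r = S.shape[0]

delta_S = np.zeros(r)
delta_S[1] = 0.3                              # nudge a single SVD coordinate
delta_W = U @ np.diag(delta_S) @ Vh           # dW = U . diag(dS) . V_h

# For a dense weight gradient G = dL/dW, the chain rule gives
# dL/d(dS_i) = u_i^T G v_i, i.e. the diagonal of U^T G Vh^T.
G = rng.normal(size=(d_out, d_in))
grad_delta_S = np.einsum("oi,od,di->i", U, G, Vh.T)
```

Because U and V_h have orthonormal columns and rows, moving between δW and δS is a pure rotation plus a restriction to the diagonal, which is why no whitening or rescaling enters.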

Two things this buys us:

The number of trainable scalars is r per module (∼500–2500), not d_in×d_out. A few hundred contrastive pairs would be needed to estimate a dense d_in × d_out direction; only a few pairs are needed in R^r.
Low-rank perturbations (LoRA-style hack adapters) are sparse in this basis, which makes per-direction gating in δS meaningful even with N=12 pairs.

What this does not buy us: regularization. The weight-SVD basis is just a convenient coordinate system; PCA on top of it still has to do the work of finding which coordinates carry hack-clean discriminative signal.

Extraction pipeline

# pseudo: extract_v_hack(model, tokenizer, wrappers, pairs, top_k, tau_axis, n_heldout, device)

train_pairs = pairs[:len(pairs) - n_heldout]    # currently 12 of 14; also correct when n_heldout == 0
n_pairs = len(train_pairs)
k_max = top_k

# Gather per-pair, per-module gradients on hack-completion and clean-completion NLL.
grads_hack  = {name: [] for name in wrappers}   # per module: list of [r]-tensors, length n_pairs
grads_clean = {name: [] for name in wrappers}

for pair in train_pairs:
  for bucket, completion in [(grads_hack, pair.hack), (grads_clean, pair.clean)]:
    model.zero_grad()
    loss = mean_NLL_on_completion_tokens(model, pair.prompt + completion)
    assert isfinite(loss)                        # fail fast: a skipped pair would leave G_h/G_c length-mismatched
    loss.backward()                              # populates δS.grad per module
    for name, info in wrappers.items():
      bucket[name].append(info.delta_S.grad.detach().cpu().float().clone())

# Per module: PCA on paired diff.
for name in wrappers:
  G_h = stack(grads_hack[name])                  # [n_pairs, r]
  G_c = stack(grads_clean[name])                 # [n_pairs, r]
  D   = G_h - G_c                                # [n_pairs, r]: per-pair hack-axis displacement
  U_d, S_d, Vh_d = svd(D)                        # truncated, m = min(n_pairs, r)
  V = Vh_d[:k_max]                               # [k_max, r], orthonormal rows
  # Orient v_i so +v_i points hack-ward (majority vote across pairs).
  proj = D @ V.T                                 # [n_pairs, k_max]
  n_pos = (proj > 0).sum(0)                      # per axis: pairs with positive projection
  flip = where(n_pos < n_pairs/2, -1, +1)
  V = V * flip[:, None]
  v_hack[name] = V
  v_hack[f"_sv/{name}"] = S_d[:k_max]            # NEW: singular values saved alongside
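The per-module step above can be sketched as runnable code. This is a NumPy stand-in for the torch version, run on synthetic gradients with one planted hack direction. The function name `pca_hack_basis` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def pca_hack_basis(G_h, G_c, k_max):
    """Top-k right singular vectors of D = G_h - G_c, oriented by majority vote."""
    D = G_h - G_c                                   # [n_pairs, r]
    _, S_d, Vh_d = np.linalg.svd(D, full_matrices=False)
    V = Vh_d[:k_max].copy()                         # [k_max, r], orthonormal rows
    proj = D @ V.T                                  # [n_pairs, k_max]
    n_pos = (proj > 0).sum(axis=0)                  # pairs voting "+" per axis
    flip = np.where(n_pos < D.shape[0] / 2, -1.0, 1.0)
    return V * flip[:, None], S_d[:k_max]

rng = np.random.default_rng(0)
n_pairs, r, k_max = 12, 64, 5
hack_dir = np.zeros(r)
hack_dir[3] = 1.0                                   # planted hack axis (synthetic)
G_c = rng.normal(size=(n_pairs, r))
G_h = G_c + 4.0 * hack_dir + 0.1 * rng.normal(size=(n_pairs, r))
V, S_top = pca_hack_basis(G_h, G_c, k_max)
```

Majority vote only looks at the sign of each pair's projection, so a single pair with a very large negative projection cannot flip the axis the way it could with sign-of-mean.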

File schema (v2):

{name} → Tensor[k_max, r], orthonormal hack-axis basis, oriented +hack
_sv/{name} → Tensor[k_max], singular values of D in that basis
metadata: model, dtype, top_k, tau_axis, schema=v2_with_sv

Load-or-extract (2026-05-27)

train.py derives v_hack_path from (model_name, v_hack_extract_top_k) unless overridden. If the file is missing, it extracts inline on the already-wrapped model:

v_hack_path = OUT_DIR / f"v_hack_{model_slug}_k{extract_top_k}.safetensors"
if not v_hack_path.exists():
    v_hack_dict, raw_grads, _ = extract_v_hack(model, tok, wrappers, PAIRS,
                                                top_k=extract_top_k, ...)
    save_file(v_hack_dict, v_hack_path, metadata={...})
v_hack, v_sv = load_v_hack(v_hack_path, model_name, wrappers, k_use=v_hack_k)

This means a fresh model with no cached v_hack just runs extract once (~5 min for 4B-class) and proceeds. No prerequisite jobs, no manual flags.

Load-time k-slicing

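Extract saves k_max (default 12). Load slices to k_use (default 5). So k=1 vs k=5 vs k=12 is a config flip, not a re-extract. The mean_sv_top5_frac from our 2026-05-26 extract was 0.71, i.e. the top 5 axes carry ~71% of the per-module singular-value mass of D on average. If that metric is computed on raw singular values rather than their squares, the share of D-variance (which goes as S²) is at least that high. Either way, a load-time slice at 5 is a reasonable default that we can ablate cheaply.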
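A minimal sketch of the split-and-slice step on a plain dict laid out like the v2 schema. The helper name `split_and_slice` is hypothetical, and this is not the actual `load_v_hack`, which also takes the model name and wrappers.

```python
import numpy as np

SV_PREFIX = "_sv/"

def split_and_slice(flat, k_use):
    """Split a flat v2-style dict into (v_hack, v_sv), keeping the first k_use axes."""
    v_hack, v_sv = {}, {}
    for key, value in flat.items():
        if key.startswith(SV_PREFIX):
            v_sv[key[len(SV_PREFIX):]] = value[:k_use]    # [k_use] singular values
        else:
            v_hack[key] = value[:k_use]                    # [k_use, r] basis rows
    return v_hack, v_sv

# Toy file contents: one module, k_max = 12, r = 32.
flat = {
    "layer0": np.eye(12, 32),
    "_sv/layer0": np.arange(12, 0, -1, dtype=float),
}
v_hack, v_sv = split_and_slice(flat, k_use=5)
```

Slicing works because the rows are stored in descending singular-value order, so the first k_use rows of a k_max extract are the same axes a direct k_use extract would return.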

Runtime suspicion gate

Hypothesis: module M has small ||D(M)||_F (weak hack signal at extract time). Its top SVD direction v_1(M) is dominated by noise shape, not hack shape. At training time, g(M) is the policy gradient flowing through M — a structured (non-isotropic) signal living in a low-d subspace of "directions that matter for next-token prediction." If v_1(M) coincidentally lies in that subspace, projecting g(M) along v_1(M) removes a chunk of useful coding-relevant gradient with no compensating reduction in hack signal.

Why I'd initially dismissed this concern: in a high-d random model (g and v isotropic), |g · v| ≈ ||g||/√r ≈ 2% of ||g||. So one bad direction costs ~2% of the live gradient — tolerable. What I missed: neither g nor v is isotropic. Both live in low-d structured subspaces. If those subspaces happen to overlap, the projection magnitude is much larger.
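A quick numerical check of that estimate, as a sketch on synthetic unit vectors. The shared-subspace dimension d = 16 is an arbitrary assumption, and full subspace overlap is the worst case rather than a measured property of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
r, trials = 2560, 1000

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Isotropic case: g and v are independent random directions in R^r.
g = unit(rng.normal(size=(trials, r)))
v = unit(rng.normal(size=(trials, r)))
rms_iso = np.sqrt(np.mean(np.sum(g * v, axis=1) ** 2))          # close to 1/sqrt(r)

# Structured case: both directions confined to the same d-dim subspace.
d = 16
g_low = unit(rng.normal(size=(trials, d)))
v_low = unit(rng.normal(size=(trials, d)))
rms_low = np.sqrt(np.mean(np.sum(g_low * v_low, axis=1) ** 2))  # close to 1/sqrt(d)
```

With r = 2560 the isotropic RMS overlap is about 1/√2560 ≈ 0.02, while a shared 16-dimensional subspace gives about 1/√16 = 0.25, an order of magnitude more gradient removed per bad axis.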

Gate design:

r_i(M) = |g(M) · v_i(M)| / S_i(M)

High r_i: live grad cares about v_i much more than the extract-time hack signal did → suspicious, this v_i is probably picking up structured coding flow.
Low r_i: live alignment is in proportion to extract-time confidence → trust the projection.

Per-step quantile gate: collect r_i across all (module, axis) pairs in one step, find the (1 − drop_top_frac)-quantile, and suppress every axis above that threshold for this step. Default drop_top_frac = 0.25 (exposed in the config as susp_drop_frac).

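# pseudo: in project_delta_S_grad
coeffs, ratios = {}, {}
for name, info in wrappers.items():
    c = V[name] @ info.delta_S.grad        # [k_use], live alignment c_i = g · v_i
    S = v_sv[name]                          # [k_use], extract-time singular values
    coeffs[name] = c
    ratios[name] = c.abs() / S.clamp_min(eps)
threshold = quantile(cat(list(ratios.values())), 1 - drop_top_frac)
for name, info in wrappers.items():
    g = info.delta_S.grad
    keep = (ratios[name] <= threshold)      # False for the suspicious top fraction
    g_proj = g - (coeffs[name] * keep * gate_mode_mask) @ V[name]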
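The gate can be sketched end to end on plain arrays. This is a NumPy stand-in, not the code in proj.py. The function name `gated_projection` and the toy modules are assumptions, and `gate_mode_mask` is left out because this doc does not define it.

```python
import numpy as np

def gated_projection(grads, V, v_sv, drop_top_frac=0.25, eps=1e-12):
    """Project g off each kept axis; skip axes whose r_i = |c_i| / S_i is in the top fraction."""
    coeffs = {n: V[n] @ g for n, g in grads.items()}                # c_i = g . v_i
    ratios = {n: np.abs(c) / np.maximum(v_sv[n], eps) for n, c in coeffs.items()}
    threshold = np.quantile(np.concatenate(list(ratios.values())), 1.0 - drop_top_frac)
    projected, keep_masks = {}, {}
    for n, g in grads.items():
        keep = ratios[n] <= threshold
        projected[n] = g - (coeffs[n] * keep) @ V[n]                # remove kept axes only
        keep_masks[n] = keep
    return projected, keep_masks

rng = np.random.default_rng(0)
r, k = 32, 4
names = ["mod_a", "mod_b"]
# Orthonormal rows via QR of a random [r, k] matrix.
V = {n: np.linalg.qr(rng.normal(size=(r, k)))[0].T for n in names}
v_sv = {n: np.sort(rng.uniform(0.5, 2.0, size=k))[::-1] for n in names}
grads = {n: rng.normal(size=r) for n in names}
projected, keep = gated_projection(grads, V, v_sv)
```

With 8 (module, axis) pairs and drop_top_frac = 0.25, exactly 2 axes are skipped on every step, whatever the ratios look like. That is the fixed-budget behavior flagged in the limitations below.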

Known limitations (caveats from codex external review, 2026-05-27)

r_i is not dimensionless across modules. |g·v_i| scales with live-grad norm; S_i scales with extract-time-grad norm. A high-gradient module dominates the global quantile regardless of whether its axis is actually suspicious. Fix candidate: within-module ratio (|c_i|/||g||) / (S_i/||D||_F). Not yet applied.
Quantile gate is a fixed budget, not a detector. It always drops drop_top_frac of axes per step, even when nothing is genuinely suspicious — and never drops more than that when many axes are. Fix candidate: absolute threshold post-normalization, or run as measure-only diagnostic until calibrated. Not yet applied.
Old v1 files (no _sv/ keys) silently disable the gate. Should fail-fast when susp_drop_frac > 0 and v_sv is empty.
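To make the first caveat concrete, here is a sketch comparing the raw ratio with a within-module normalized one. One assumption to note: the reference term is normalized by the norm of the stored singular values, ||S||, which equals ||D||_F only when every nonzero singular value is stored (it is smaller after slicing to k_use). The function names are hypothetical.

```python
import numpy as np

def raw_ratio(g, V, S, eps=1e-12):
    """r_i = |g . v_i| / S_i: scales with both live and extract-time grad norms."""
    return np.abs(V @ g) / np.maximum(S, eps)

def normalized_ratio(g, V, S, eps=1e-12):
    """(|c_i| / ||g||) / (S_i / ||S||): both factors are unit-free within the module."""
    live = np.abs(V @ g) / max(np.linalg.norm(g), eps)
    ref = S / max(np.linalg.norm(S), eps)
    return live / np.maximum(ref, eps)

rng = np.random.default_rng(0)
r, k = 32, 4
V = np.linalg.qr(rng.normal(size=(r, k)))[0].T      # [k, r], orthonormal rows
S = np.array([4.0, 2.0, 1.0, 0.5])
g = rng.normal(size=r)

# Same module, but with a 100x larger live gradient (and 3x larger extract-time scale).
raw_small, raw_big = raw_ratio(g, V, S), raw_ratio(100.0 * g, V, S)
norm_small, norm_big = normalized_ratio(g, V, S), normalized_ratio(100.0 * g, V, 3.0 * S)
```

The raw ratio grows 100x with the gradient norm, so a high-gradient module would crowd the global quantile, while the normalized ratio is unchanged by either rescaling.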

Validation: cheap discriminative tests

The fundamental question: does v_hack actually discriminate hack from clean gradients, or is it picking up irrelevant variance?

Test 1: cin_hack vs cin_clean on disk pools (cheap, ~5 min)

We already have out/probe_distill/teacher_pool/ (hacking samples) and out/probe_distill/base_pool/ (clean samples). For N samples each:

for prompt, completion, label in samples:
    model.zero_grad()
    loss = mean_NLL(model, prompt + completion)
    loss.backward()                         # .backward() returns None, so keep it off the assignment
    g = delta_S.grad                        # per module
    cin = (V @ g).norm() / g.norm()
    record(label, cin)

Discriminator: cin_hack_mean − cin_clean_mean. If ≫ 0, v_hack discriminates. If ≈ 0, v_hack is capturing prompt-length / generic variance, not hack-specific direction. Cost: ~5 min, no training.
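A sketch of the discriminator on synthetic gradients, with NumPy standing in for torch. The planted hack component and the sample counts are assumptions chosen only to show the computation. Real inputs would be the per-sample δS gradients from the two pools.

```python
import numpy as np

def cin(V, g, eps=1e-12):
    """Fraction of the gradient norm that lies inside span(V); always in [0, 1]."""
    return np.linalg.norm(V @ g) / max(np.linalg.norm(g), eps)

rng = np.random.default_rng(0)
r, k, n = 64, 5, 50
V = np.eye(k, r)                                     # toy orthonormal basis [k, r]

clean_grads = rng.normal(size=(n, r))
hack_grads = rng.normal(size=(n, r)) + 4.0 * V[0]    # planted hack component (synthetic)

cin_clean = np.array([cin(V, g) for g in clean_grads])
cin_hack = np.array([cin(V, g) for g in hack_grads])
gap = cin_hack.mean() - cin_clean.mean()             # discriminator: want >> 0
```

For gradients unrelated to V, cin sits near √(k/r), so the gap is best read against that baseline rather than against zero alone.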

Test 2: held-out pair projection (existing)

verify_vhack_heldout.py projects gradients from held-out pairs (last n_heldout of PAIRS) onto trained v_hack. Already in CI-style flow.

Test 3: random-direction null

For each module, compute cin onto v_hack vs onto a random unit vector of the same shape. If cin_v_hack > cin_random by a large margin, v_hack is non-spurious. Trivial to implement.
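A sketch of the null test, assuming "random unit vector of the same shape" means a random orthonormal [k, r] basis (drawn here via QR). The gradient is synthetic with a planted component along the first v_hack axis, and the helper names are hypothetical.

```python
import numpy as np

def cin(V, g, eps=1e-12):
    return np.linalg.norm(V @ g) / max(np.linalg.norm(g), eps)

def random_orthonormal_rows(k, r, rng):
    """Random [k, r] matrix with orthonormal rows (QR of a Gaussian matrix)."""
    q, _ = np.linalg.qr(rng.normal(size=(r, k)))
    return q.T

rng = np.random.default_rng(0)
r, k, n_null = 64, 5, 200
V_hack = random_orthonormal_rows(k, r, rng)
g = 10.0 * V_hack[0] + rng.normal(size=r)            # gradient with planted hack component

cin_vhack = cin(V_hack, g)
null = np.array([cin(random_orthonormal_rows(k, r, rng), g) for _ in range(n_null)])
margin = (cin_vhack - null.mean()) / null.std()      # z-score against the random null
```

The null concentrates near √(k/r), about 0.28 for k = 5 and r = 64, so reporting the margin in null standard deviations makes modules with different r comparable.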

Test 4: per-source cin during training (live)

In mixed-pool runs we have student rollouts (initially ~no hack) and teacher rollouts (all hack). Currently cin is computed on the accumulated gradient (mixed). With ONE extra backward per step we can compute cin_s (student-only grad) and cin_t (teacher-only grad) separately. Predict: if v_hack is a real hack direction, cin_t > cin_s initially; the gap shrinks as student picks up hack (if it does). Useful for diagnosing whether the projection is doing real work or just gradient noise.
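The split relies on gradients being linear in the loss: the student-only and teacher-only gradients sum to the combined one. Below is a toy sketch with hand-picked quadratic losses and finite-difference gradients. The losses, vectors, and names are illustrative assumptions and have nothing to do with the real GRPO objective.

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def cin(V, g, eps=1e-12):
    return np.linalg.norm(V @ g) / max(np.linalg.norm(g), eps)

V = np.array([[1.0, 0.0, 0.0, 0.0]])                 # single toy hack axis
a = np.array([0.1, 1.0, -1.0, 0.5])
b = np.array([0.2, -0.5, 1.0, 0.3])

def loss_student(x):
    return 0.5 * np.sum((x - a) ** 2)

def loss_teacher(x):
    # Extra linear term pushes the teacher gradient along the hack axis.
    return 0.5 * np.sum((x - b) ** 2) + 3.0 * (V[0] @ x)

x = np.zeros(4)
grad_s = num_grad(loss_student, x)
grad_t = num_grad(loss_teacher, x)
grad_combined = num_grad(lambda z: loss_student(z) + loss_teacher(z), x)
cin_s, cin_t = cin(V, grad_s), cin(V, grad_t)
```

Here cin_t is much larger than cin_s, which is the pattern the test predicts early in training if v_hack is a real hack direction. The summed per-source gradients can still be handed to the projection unchanged.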

Test 5: bootstrap sign-stability

Resample the training pairs (draw N-2 with replacement), re-extract v_hack, and compare cos(v_hack_original, v_hack_bootstrap). If the cosine is unstable across resamples, v_hack is fitting noise. Cost: 5 resamples × ~5 min = 25 min total.
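A sketch of the bootstrap on a precomputed diff matrix D for one module, restricted to the top axis for simplicity. The real test would re-run the full extract for each resample. The function names and synthetic data are assumptions.

```python
import numpy as np

def top_direction(D):
    """Top right singular vector of D, oriented hack-ward by majority vote."""
    _, _, Vh = np.linalg.svd(D, full_matrices=False)
    v = Vh[0]
    return v if (D @ v > 0).sum() >= D.shape[0] / 2 else -v

def bootstrap_cos(D, n_boot, rng):
    """Signed cosine between the original top axis and each resampled one."""
    v_ref = top_direction(D)
    n = D.shape[0]
    cos = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n - 2)         # N-2 pairs, with replacement
        cos.append(float(top_direction(D[idx]) @ v_ref))
    return np.array(cos)

rng = np.random.default_rng(0)
n_pairs, r = 12, 64
hack_dir = np.zeros(r)
hack_dir[3] = 1.0
D_signal = 4.0 * hack_dir + 0.1 * rng.normal(size=(n_pairs, r))   # strong shared axis
D_noise = rng.normal(size=(n_pairs, r))                            # no shared axis

cos_signal = bootstrap_cos(D_signal, n_boot=20, rng=rng)
cos_noise = bootstrap_cos(D_noise, n_boot=20, rng=rng)
```

Using the signed cosine after orientation means a sign flip shows up as a value near -1, so the same number covers both direction stability and sign stability.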

Open design questions

Is the suspicion gate redundant? Codex argued the quantile design is a fixed-budget knob, not a detector. The right answer is probably: ship it as measure-only first (log frac_axes_susp and per-step r_i distribution histograms), confirm whether suspicious modules actually exist empirically, then turn on projection-side gating.
Should we whiten by S? I.e. parameterize the AntiPaSTO knob as δS_i / σ_i(W) so all directions have equal forward-pass impact. Currently we don't. This is a separate, larger question.
Should we record per-pair tags / hack flavors? With 12 unlabeled pairs we can't do supervised LDA. With flavor labels (hardcode / weak-tests / persona / format-leak) we could do LDA on labels, which would beat unsupervised PCA at this N.

src/projected_grpo/extract_vhack_grad.py — extract function + CLI
src/projected_grpo/proj.py — runtime projection + gates
src/projected_grpo/train.py:load_v_hack — load + slice + auto-extract
src/projected_grpo/verify_vhack_heldout.py — Test 2 above
src/projected_grpo/pairs.py — the 14 contrastive pairs
docs/paper_chars.md — CHaRS notes (PCT comparison)
RESEARCH_JOURNAL.md — chronological progress log
