docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots

- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
  current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
  de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
  banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
  for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
  sampled, not greedy)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-02 01:24:29 +00:00
parent 997de37b26
commit 19deef4fb9
6 changed files with 31 additions and 30 deletions
+13 -1
@@ -89,9 +89,14 @@ Four arms, identical model and teacher pool, differing only in the gradient poli
- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis.
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis. (v1: shared basis, relu gate on the kept-axis coords, same as erase but routed not erased.)
- **route2** -- current routing arm. Per-rollout gate `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) decides whether a rollout's whole gradient routes into a scale-matched, distinct-basis `delta_S_hack`. An exploration floor (`rollout_ablate_frac`) generates a fraction of rollouts knob-off so the deployed knob always sees solve signal even if the full policy saturates on hacking.
- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
Deploy hack/solve is measured the same way for every arm now: knob-off forward
(`ablate_quarantine` for route/route2, the trained model itself for vanilla/erase),
n=64 prompts*group, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples.
The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
v_hack every N steps on the current adapter (for route, with the quarantine
ablated during extraction, see the blog).
@@ -112,6 +117,13 @@ measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
## Current findings (preliminary, n=1 seed)
> Stale as of 2026-06-02. The numbers below are the late-May erase/basis-width
> result at the old default mix=0.5. Default mix is now locked to 0.125, the
> primary arm is route2, and the live comparison is per-arm deploy hack/solve
> (knob-off, n=64, T=0.7). n=3 no-floor route2 + matched vanilla refs are
> landing (pueue jobs 68-79); this section gets rewritten on those numbers.
> Latest results live in `RESEARCH_JOURNAL.md`.
These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
@@ -1,14 +1,15 @@
# Erasing the hack direction from a GRPO gradient: a preliminary result
*WIP draft for LessWrong. n=2 matched seeds at time of writing; n=3 queued. Numbers may shift; will update.*
*WIP draft for LessWrong. This is the "erase" (one-sided projection) story at n=2; n=3 is queued. The work has since moved to route2 (per-rollout calibrated-tau routing into a scale-matched quarantine, plus an exploration floor) with an apples-to-apples knob-off deploy-eval. Once those land this gets re-headlined. Numbers may shift.*
## The one-line version
If you give a language model a coding RL environment where it can either solve the problem honestly or write tests that always pass against its own wrong answer, [GRPO](https://arxiv.org/abs/2402.03300) teaches it to take the second option. We hand-wrote twenty-one (hack, clean) example pairs, computed the GRPO gradient each pair would emit, and SVD'd the result per Linear to get a basis I'll call **G_hack**. During training we subtract the component of the live gradient that lies in G_hack before the optimizer steps. On the two seeds where I have a vanilla baseline and a projected run at the same configuration, the projected runs sit 12-23 percentage points below vanilla on whole-run student hack rate, and overall pass rate moves by no more than 10-12pp (one-sided, see Table 1). Re-extracting G_hack every two optimizer steps adds another ~5pp of hack suppression. This is preliminary -- n=2 matched seeds is not enough to be sure -- but the direction is consistent and the mechanism diagnostic agrees with the headline.
If you give a language model a coding RL environment where it can either solve the problem honestly or write tests that always pass against its own wrong answer, [GRPO](https://arxiv.org/abs/2402.03300) teaches it to take the second option. We hand-wrote twenty-one (hack, clean) example pairs, computed the GRPO gradient each pair would emit, and SVD'd the result per Linear to get a basis I'll call **G_hack**. During training we subtract the component of the live gradient that lies in G_hack before the optimizer steps. On the two seeds where I have a vanilla baseline and a projected run at the same configuration, the projected runs sit 12-23 percentage points below vanilla on whole-run student hack rate, and overall pass rate moves by no more than 10-12pp (one-sided, see Table 1). Re-extracting G_hack every two optimizer steps adds another ~5pp of hack suppression. This is preliminary, n=2 matched seeds is not enough to be sure, but the direction is consistent and the mechanism diagnostic agrees with the headline.
## Why this is worth a post even though n=2
Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interventions live at the reward level (better reward functions, KL penalties, advantage shaping). Concurrent work like Wu and Tang 2026's "Advantage Modification" sits there too. The intervention here is one level deeper: we operate on the gradient before the optimizer applies it, after the reward has already been computed. If gradient-level interventions can be made to work, you don't need to know the specific hack in advance -- you only need a probe that identifies its subspace, which seems easier than enumerating hacks ahead of time.
Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interventions live at the reward level (better reward functions, KL penalties, advantage shaping). Concurrent work like Wu and Tang 2026's "Advantage Modification" sits there too. The intervention here is one level deeper: we operate on the gradient before the optimizer applies it, after the reward has already been computed. If gradient-level interventions can be made to work, you don't need to know the specific hack in advance, you only need a probe that identifies its subspace, which seems easier than enumerating hacks ahead of time.
Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.
@@ -90,10 +91,10 @@ The basis lives in the [SVD-of-W basis](../../README.md) of each Linear ("rotate
Given G_hack, there is more than one way to keep the live gradient out of it. We compare four arms, same model and teacher pool, differing only in the gradient policy:
1. **Vanilla.** No intervention. The reference for how fast GRPO learns each of the four loopholes.
2. **Erase.** Subtract the G_hack component from the live `delta_S` gradient before the optimizer step. This is the arm Table 1 reports. One-sided: we only remove the part of the gradient that points toward hacking, leaving anti-hack motion untouched.
3. **Route.** Instead of deleting the hack component, send it into a separate throwaway knob `delta_S_hack` that we delete at deployment. This is gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332), their Expand-Route-Ablate), adapted to the SVD basis: the main knob `delta_S` learns the orthogonal complement, the quarantine knob absorbs the hack-ward part, and ablating it at deploy removes whatever localized there.
4. **Route, weak detector.** The honest generalization test. The first three arms build G_hack from all four loophole modes. Here we build it from only two of the four (the "known" hacks A, B that our weak detector can flag), route on that subspace, and measure whether the two held-out modes (C, D, which the detector never saw) are also suppressed. This mimics deployment, where there are known hacks and unknown hacks, and we only get to build a detector for the known ones.
1. Vanilla. No intervention. The reference for how fast GRPO learns each of the four loopholes.
2. Erase. Subtract the G_hack component from the live `delta_S` gradient before the optimizer step. This is the arm Table 1 reports. One-sided: we only remove the part of the gradient that points toward hacking, leaving anti-hack motion untouched.
3. Route. Instead of deleting the hack component, send it into a separate throwaway knob `delta_S_hack` that we delete at deployment. This is gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332), their Expand-Route-Ablate), adapted to the SVD basis: the main knob `delta_S` learns the orthogonal complement, the quarantine knob absorbs the hack-ward part, and ablating it at deploy removes whatever localized there.
4. Route, weak detector. The honest generalization test. The first three arms build G_hack from all four loophole modes. Here we build it from only two of the four (the "known" hacks A, B that our weak detector can flag), route on that subspace, and measure whether the two held-out modes (C, D, which the detector never saw) are also suppressed. This mimics deployment, where there are known hacks and unknown hacks, and we only get to build a detector for the known ones.
The pseudocode for erase and route is short. Both run inside the optimizer step, per Linear, on the rank-r gradient:
@@ -113,6 +114,13 @@ opt.step(delta_S_hack, removed) # delta_S_hack absorbs the hack-ward part
# at deploy: delta_S_hack := 0 (ablate the quarantine)
```
The route arm above is v1 (relu gate, shared basis). The current routing arm, route2, gates per rollout instead: if `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) the whole rollout gradient goes into a scale-matched, distinct-basis quarantine, and an exploration floor generates a fraction of rollouts knob-off so the deployed knob always sees solve signal. Its deploy-eval table is pending the n=3 runs.
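A minimal sketch of the per-rollout gate described above, on flat gradient vectors. It is an illustration under assumptions, not the repo's implementation: the midpoint rule in `calibrate_tau`, the helper names, and the "first rollouts go knob-off" choice in `knob_off_mask` are all hypothetical, and the scale-matching and distinct-basis parts of the quarantine are not modelled.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine similarity between two flat gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def calibrate_tau(cos_hack: np.ndarray, cos_clean: np.ndarray) -> float:
    """ASSUMPTION: put tau at the midpoint of the hack-vs-clean cosine gap.
    The source only says tau is calibrated from that gap each step."""
    return 0.5 * (float(np.mean(cos_hack)) + float(np.mean(cos_clean)))


def route_rollouts(rollout_grads, v_grad: np.ndarray, tau: float):
    """Send each rollout's WHOLE gradient to either the kept knob or the
    quarantine knob, depending on cos(g_rollout, v_grad) > tau."""
    keep = np.zeros_like(v_grad, dtype=float)
    quarantine = np.zeros_like(v_grad, dtype=float)
    routed = []
    for g in rollout_grads:
        is_hack = cosine(g, v_grad) > tau
        routed.append(is_hack)
        if is_hack:
            quarantine += g  # absorbed by delta_S_hack, deleted at deploy
        else:
            keep += g  # learned by the deployed knob delta_S
    return keep, quarantine, routed


def knob_off_mask(n_rollouts: int, ablate_frac: float) -> np.ndarray:
    """Exploration floor: mark a fraction of rollouts to be generated with the
    quarantine knob off. ASSUMPTION: the first ceil(frac * n) rollouts."""
    n_off = int(np.ceil(ablate_frac * n_rollouts))
    mask = np.zeros(n_rollouts, dtype=bool)
    mask[:n_off] = True
    return mask


# Toy usage: one hack-aligned rollout, two roughly orthogonal ones.
v_grad = np.array([1.0, 0.0, 0.0])
grads = [np.array([2.0, 0.1, 0.0]), np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.0, 1.0])]
tau = calibrate_tau(np.array([0.9]), np.array([0.1]))
keep, quarantine, routed = route_rollouts(grads, v_grad, tau)
```

The gate is all-or-nothing per rollout, which is the difference from v1: v1 splits every gradient into a kept and a routed component, whereas here a flagged rollout contributes nothing to the deployed knob.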
![Hack rate (top) and solve rate (bottom) over training, one line per arm. routing2 stays near-zero hack while its solve climbs above the erasure/vanilla arms.](../../out/figs/dyn_sub4_hack_overlay.png)
*Training dynamics by arm. routing2 (purple) holds deployed hack near zero and lifts solve above vanilla/erasure. Preliminary: vanilla/erase still read off per-step training hack until their knob-off deploy-eval reruns land (jobs 75/76/79).*
A caveat on erase. The optimizer is fast-Adam, which carries momentum. Projecting `g` does not project the momentum buffer, so the projected-out direction can re-enter via momentum. On a frozen G_hack the leak is bounded (the buffer is a decaying average of already-projected gradients), but under refresh it is not obviously small. I have not measured it directly yet. If you have intuition about whether it kills the result, please push back.
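To make the refresh concern concrete, here is a toy sketch of how a momentum buffer built from gradients projected against an old basis can carry mass inside a refreshed basis. It assumes a plain first-moment EMA and orthonormal basis columns; it does not model Adam's second-moment normalisation or the actual optimizer used here, and the function names are hypothetical.

```python
import numpy as np


def project_out(g: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of g lying in span(basis).
    basis has orthonormal columns, shape (dim, k)."""
    return g - basis @ (basis.T @ g)


def momentum_leak(basis_old, basis_new, grads, beta: float = 0.9) -> float:
    """Accumulate a first-moment EMA of gradients projected with basis_old,
    then return the fraction of the buffer's norm inside span(basis_new)."""
    m = np.zeros_like(grads[0], dtype=float)
    for g in grads:
        m = beta * m + (1.0 - beta) * project_out(g, basis_old)
    inside = basis_new @ (basis_new.T @ m)
    return float(np.linalg.norm(inside) / (np.linalg.norm(m) + 1e-12))


# Toy setup: the hack direction rotates from e0 to (e0 + e1) / sqrt(2) on refresh.
basis_old = np.array([[1.0], [0.0], [0.0]])
basis_new = np.array([[1.0], [1.0], [0.0]]) / np.sqrt(2.0)
grads = [np.ones(3) for _ in range(5)]

leak_frozen = momentum_leak(basis_old, basis_old, grads)   # same basis throughout
leak_refresh = momentum_leak(basis_old, basis_new, grads)  # basis changed after accumulation
```

In this toy the frozen-basis buffer has no component in the basis at all, while after the rotation half of the buffer's norm sits inside the new basis. Any leak on a frozen basis would have to come from effects the sketch leaves out, such as per-coordinate normalisation.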
### Why refresh G_hack at all
Binary file not shown. Size: 98 KiB

Binary file not shown. Size: 99 KiB

+2 -21
@@ -8,8 +8,8 @@ model's hack_s (red) and solve/gt_s (green) over training. Per-seed thin lines
APPLES-TO-APPLES. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
present: the same estimator across arms (n=64, T=0.7, every --eval-ablate-every
steps). For route/route2 the deployed model = quarantine knob zeroed; for
vanilla/erase deploy == the trained model. Sparse deploy-eval points are dotted
(see _mark_if_sparse) so the EMA-held line doesn't oversell per-step density.
vanilla/erase deploy == the trained model. Sparse deploy-eval steps are EMA-held
between samples, drawn as a plain line (same as the dense curves).
Older logs that gated the eval to route only fall back to per-step training
hack_s for vanilla/erase (noisier, n=28, but estimates the same deployed rate
since those arms have no quarantine).
@@ -173,19 +173,6 @@ def _onset(steps: np.ndarray, hack: np.ndarray) -> int | None:
return int(steps[nz[0]]) if len(nz) else None
def _mark_if_sparse(ax, x: np.ndarray, y_raw: np.ndarray, color, alpha=1.0) -> None:
"""Dot the REAL measured points when a series is mostly-NaN. route2's plotted
hack/solve is the DEPLOY eval (hk_dep), sampled every eval_ablate_every steps and
EMA-held flat between -- without the dots the held line looks per-step-dense and
oversells route2's smoothness vs the per-step-sampled training curves (Tufte:
the rendering must not imply more data than was measured). Dense series (every
step finite, e.g. training hack_s, cos sep/leak) stay unmarked to avoid clutter."""
finite = np.isfinite(y_raw)
if finite.sum() and finite.mean() < 0.5:
ax.plot(x[finite], y_raw[finite], ls="", marker="o", ms=2.5,
color=color, alpha=alpha, zorder=4)
def _ema(y: np.ndarray, span: int = 5) -> np.ndarray:
"""Causal EMA, span=5. Less lag than a trailing SMA(5) since it weights
recent steps more. NaNs hold the previous smoothed value (don't reset it)."""
@@ -213,7 +200,6 @@ def _series_panel(ax, runs, cols, colors, ylim, label_series=False):
for r in present:
ys = _ema(r[col])
ax.plot(r["steps"], ys, color=color, lw=0.7, alpha=0.35, solid_capstyle="round")
_mark_if_sparse(ax, r["steps"], r[col], color, alpha=0.5)
stacked.append(ys)
# mean over seeds of the smoothed series (runs share the step grid within an arm)
L = min(len(y) for y in stacked)
@@ -289,16 +275,11 @@ def _overlay_panel(ax, by_arm, arms, key, *, label, with_onset):
for r in rs:
ys = _ema(r[key])
ax.plot(r["steps"], ys, color=color, lw=0.6, alpha=0.25, solid_capstyle="round")
_mark_if_sparse(ax, r["steps"], r[key], color, alpha=0.4)
stacked.append(ys)
L = min(len(y) for y in stacked)
ym = np.nanmean(np.stack([y[:L] for y in stacked]), axis=0)
xm = rs[0]["steps"][:L]
ax.plot(xm, ym, color=color, lw=2.0, solid_capstyle="round")
# dot the bold mean at its REAL (pre-EMA) measured steps -- ym is EMA-held so
# it would otherwise read as per-step-dense (see _mark_if_sparse)
raw_mean = np.nanmean(np.stack([r[key][:L] for r in rs]), axis=0)
_mark_if_sparse(ax, xm, raw_mean, color)
if with_onset:
onsets = [s for r in rs if (s := _onset(r["steps"], r["hack_s"])) is not None]
if onsets:
+1 -1
@@ -419,7 +419,7 @@ class StepLogger:
if arm in ("routing", "routing2"):
cols += [
_Col("q_egy", 6, "qE", ".2f", "grad energy into quarantine ‖g_quar‖/(‖g_keep‖+‖g_quar‖); ~0.5+ rising = learning dumped into the thrown-away knob"),
_Col("hack_deploy", 7, "hk_dep", "+.2f", "DEPLOY-eval hack (quarantine deleted = deployed model); held-out greedy, eval_ablate_every steps; the plot number"),
_Col("hack_deploy", 7, "hk_dep", "+.2f", "DEPLOY-eval hack (quarantine deleted = deployed model); held-out eval subset, T=0.7, every eval_ablate_every steps; the plot number"),
_Col("solve_deploy", 7, "slv_dep", "+.2f", "DEPLOY-eval solve"),
_Col("hack_abl", 6, "hk_abl", "frac", "FREE per-step deploy proxy: hack rate on the ablated (deploy-mode) rollout slice; train prompts, noisier than hk_dep"),
_Col("solve_abl", 6, "slv_abl", "frac", "free per-step deploy proxy: solve rate on the ablated rollout slice"),