mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -1,14 +1,15 @@
|
||||
# Erasing the hack direction from a GRPO gradient: a preliminary result
|
||||
|
||||
*WIP draft for LessWrong. n=2 matched seeds at time of writing; n=3 queued. Numbers may shift; will update.*
|
||||
*WIP draft for LessWrong. This is the "erase" (one-sided projection) story at n=2; n=3 is queued. The work has since moved to route2 (per-rollout calibrated-tau routing into a scale-matched quarantine, plus an exploration floor) with an apples-to-apples knob-off deploy-eval. Once those land this gets re-headlined. Numbers may shift.*
|
||||
|
||||
## The one-line version
|
||||
|
||||
If you give a language model a coding RL environment where it can either solve the problem honestly or write tests that always pass against its own wrong answer, [GRPO](https://arxiv.org/abs/2402.03300) teaches it to take the second option. We hand-wrote twenty-one (hack, clean) example pairs, computed the GRPO gradient each pair would emit, and SVD'd the result per Linear to get a basis I'll call **G_hack**. During training we subtract the component of the live gradient that lies in G_hack before the optimizer steps. On the two seeds where I have a vanilla baseline and a projected run at the same configuration, the projected runs sit 12-23 percentage points below vanilla on whole-run student hack rate, and overall pass rate moves by no more than 10-12pp (one-sided, see Table 1). Re-extracting G_hack every two optimizer steps adds another ~5pp of hack suppression. This is preliminary — n=2 matched seeds is not enough to be sure — but the direction is consistent and the mechanism diagnostic agrees with the headline.
|
||||
If you give a language model a coding RL environment where it can either solve the problem honestly or write tests that always pass against its own wrong answer, [GRPO](https://arxiv.org/abs/2402.03300) teaches it to take the second option. We hand-wrote twenty-one (hack, clean) example pairs, computed the GRPO gradient each pair would emit, and SVD'd the result per Linear to get a basis I'll call **G_hack**. During training we subtract the component of the live gradient that lies in G_hack before the optimizer steps. On the two seeds where I have a vanilla baseline and a projected run at the same configuration, the projected runs sit 12-23 percentage points below vanilla on whole-run student hack rate, and overall pass rate moves by no more than 10-12pp (one-sided, see Table 1). Re-extracting G_hack every two optimizer steps adds another ~5pp of hack suppression. This is preliminary, n=2 matched seeds is not enough to be sure, but the direction is consistent and the mechanism diagnostic agrees with the headline.
|
||||
|
||||
|
||||
## Why this is worth a post even though n=2
|
||||
|
||||
Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interventions live at the reward level (better reward functions, KL penalties, advantage shaping). Concurrent work like Wu and Tang 2026's "Advantage Modification" sits there too. The intervention here is one level deeper: we operate on the gradient before the optimizer applies it, after the reward has already been computed. If gradient-level interventions can be made to work, you don't need to know the specific hack in advance — you only need a probe that identifies its subspace, which seems easier than enumerating hacks ahead of time.
|
||||
Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interventions live at the reward level (better reward functions, KL penalties, advantage shaping). Concurrent work like Wu and Tang 2026's "Advantage Modification" sits there too. The intervention here is one level deeper: we operate on the gradient before the optimizer applies it, after the reward has already been computed. If gradient-level interventions can be made to work, you don't need to know the specific hack in advance, you only need a probe that identifies its subspace, which seems easier than enumerating hacks ahead of time.
|
||||
|
||||
Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.
|
||||
|
||||
@@ -90,10 +91,10 @@ The basis lives in the [SVD-of-W basis](../../README.md) of each Linear ("rotate
|
||||
|
||||
Given G_hack, there is more than one way to keep the live gradient out of it. We compare four arms, same model and teacher pool, differing only in the gradient policy:
|
||||
|
||||
1. **Vanilla.** No intervention. The reference for how fast GRPO learns each of the four loopholes.
|
||||
2. **Erase.** Subtract the G_hack component from the live `delta_S` gradient before the optimizer step. This is the arm Table 1 reports. One-sided: we only remove the part of the gradient that points toward hacking, leaving anti-hack motion untouched.
|
||||
3. **Route.** Instead of deleting the hack component, send it into a separate throwaway knob `delta_S_hack` that we delete at deployment. This is gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332), their Expand-Route-Ablate), adapted to the SVD basis: the main knob `delta_S` learns the orthogonal complement, the quarantine knob absorbs the hack-ward part, and ablating it at deploy removes whatever localized there.
|
||||
4. **Route, weak detector.** The honest generalization test. The first three arms build G_hack from all four loophole modes. Here we build it from only two of the four (the "known" hacks A, B that our weak detector can flag), route on that subspace, and measure whether the two held-out modes (C, D, which the detector never saw) are also suppressed. This mimics deployment, where there are known hacks and unknown hacks, and we only get to build a detector for the known ones.
|
||||
1. Vanilla. No intervention. The reference for how fast GRPO learns each of the four loopholes.
|
||||
2. Erase. Subtract the G_hack component from the live `delta_S` gradient before the optimizer step. This is the arm Table 1 reports. One-sided: we only remove the part of the gradient that points toward hacking, leaving anti-hack motion untouched.
|
||||
3. Route. Instead of deleting the hack component, send it into a separate throwaway knob `delta_S_hack` that we delete at deployment. This is gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332), their Expand-Route-Ablate), adapted to the SVD basis: the main knob `delta_S` learns the orthogonal complement, the quarantine knob absorbs the hack-ward part, and ablating it at deploy removes whatever localized there.
|
||||
4. Route, weak detector. The honest generalization test. The first three arms build G_hack from all four loophole modes. Here we build it from only two of the four (the "known" hacks A, B that our weak detector can flag), route on that subspace, and measure whether the two held-out modes (C, D, which the detector never saw) are also suppressed. This mimics deployment, where there are known hacks and unknown hacks, and we only get to build a detector for the known ones.
|
||||
|
||||
The pseudocode for erase and route is short. Both run inside the optimizer step, per Linear, on the rank-r gradient:
|
||||
|
||||
@@ -113,6 +114,13 @@ opt.step(delta_S_hack, removed) # delta_S_hack absorbs the hack-ward part
|
||||
# at deploy: delta_S_hack := 0 (ablate the quarantine)
|
||||
```
|
||||
|
||||
The route arm above is v1 (relu gate, shared basis). The current routing arm, route2, gates per rollout instead: if `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) the whole rollout gradient goes into a scale-matched, distinct-basis quarantine, and an exploration floor generates a fraction of rollouts knob-off so the deployed knob always sees solve signal. Its deploy-eval table is pending the n=3 runs.
|
||||
|
||||

|
||||
|
||||
*Training dynamics by arm. routing2 (purple) holds deployed hack near zero and lifts solve above vanilla/erasure. Preliminary: vanilla/erase still read off per-step training hack until their knob-off deploy-eval reruns land (jobs 75/76/79).*
|
||||
|
||||
|
||||
A caveat on erase. The optimizer is fast-Adam, which carries momentum. Projecting `g` does not project the momentum buffer, so the projected-out direction can re-enter via momentum. On a frozen G_hack the leak is bounded (the buffer is a decaying average of already-projected gradients), but under refresh it is not obviously small. I have not measured it directly yet. If you have intuition about whether it kills the result, please push back.
|
||||
|
||||
### Why refresh G_hack at all
|
||||
|
||||
Reference in New Issue
Block a user