blog: drop reader-facing route2 tag -> route (consistency with paper)

route2 is an internal run-tag, not something a reader cares about.
Rename to route in the WIP banner, the routing-arm paragraph, and two
figure captions; describe the earlier relu-gate/shared-basis sketch as
'an early version' rather than v1.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 02:20:13 +00:00
parent dbcc3a5ad3
commit ffc2df540f
@@ -1,6 +1,6 @@
# Erasing the hack direction from a GRPO gradient: a preliminary result
-*WIP draft for LessWrong. This is the "erase" (one-sided projection) story at n=2; n=3 is queued. The work has since moved to route2 (per-rollout calibrated-tau routing into a scale-matched quarantine, plus an exploration floor) with an apples-to-apples knob-off deploy-eval. Once those land this gets re-headlined. Numbers may shift.*
+*WIP draft for LessWrong. This is the "erase" (one-sided projection) story at n=2; n=3 is queued. The work has since moved to a routing arm (route): per-rollout calibrated-tau routing of the whole rollout gradient into a scale-matched quarantine that is deleted at deploy, plus an exploration floor, with an apples-to-apples knob-off deploy-eval. Once the n=3 route runs land this post gets re-headlined around them. Numbers may shift.*
## The one-line version
@@ -114,11 +114,11 @@ opt.step(delta_S_hack, removed) # delta_S_hack absorbs the hack-ward part
# at deploy: delta_S_hack := 0 (ablate the quarantine)
```
-The route arm above is v1 (relu gate, shared basis). The current routing arm, route2, gates per rollout instead: if `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) the whole rollout gradient goes into a scale-matched, distinct-basis quarantine, and an exploration floor generates a fraction of rollouts knob-off so the deployed knob always sees solve signal. Its deploy-eval table is pending the n=3 runs.
+The routing sketch above is an early version (relu gate, shared basis). The route arm we report gates per rollout instead: if `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) the whole rollout gradient goes into a scale-matched, distinct-basis quarantine, and an exploration floor generates a fraction of rollouts knob-off so the deployed knob always sees solve signal. Its deploy-eval table is pending the n=3 runs.
-![Hack rate (top) and solve rate (bottom) over training, one line per arm. routing2 stays near-zero hack while its solve climbs above the erasure/vanilla arms.](../../out/figs/dyn_sub4_hack_overlay.png)
+![Hack rate (top) and solve rate (bottom) over training, one line per arm. route stays near-zero hack while its solve climbs above the erase/vanilla arms.](../../out/figs/dyn_sub4_hack_overlay.png)
-*Training dynamics by arm. routing2 (purple) holds deployed hack near zero and lifts solve above vanilla/erasure. Preliminary: vanilla/erase still read off per-step training hack until their knob-off deploy-eval reruns land (jobs 75/76/79).*
+*Training dynamics by arm. route (purple) holds deployed hack near zero and lifts solve above vanilla/erase. Preliminary: vanilla/erase still read off per-step training hack until their knob-off deploy-eval reruns land (jobs 75/76/79).*
A caveat on erase. The optimizer is fast-Adam, which carries momentum. Projecting `g` does not project the momentum buffer, so the projected-out direction can re-enter via momentum. On a frozen G_hack the leak is bounded (the buffer is a decaying average of already-projected gradients), but under refresh it is not obviously small. I have not measured it directly yet. If you have intuition about whether it kills the result, please push back.
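A toy sketch of the mechanism behind that caveat, not the repo's code. It assumes a plain EMA first-moment buffer in place of Adam (no second-moment scaling), i.i.d. Gaussian "gradients", and a single hack direction `v`. The helper names (`project_out_one_sided`, `max_hackward_leak`) are hypothetical. It tracks the largest hack-ward component of the momentum buffer, `max(<m, v>, 0)`, with `v` frozen versus refreshed every few steps without re-projecting the buffer.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]


def project_out_one_sided(g, v):
    """Remove only the positive (hack-ward) component of g along unit vector v."""
    c = max(dot(g, v), 0.0)
    return [gi - c * vi for gi, vi in zip(g, v)]


def max_hackward_leak(refresh_every=None, steps=200, dim=16, beta=0.9, seed=0):
    """Largest <m, v> seen over training, where m is an EMA of projected gradients.

    refresh_every=None keeps v frozen; an integer k draws a new random v every
    k steps (a stand-in for refreshing the hack direction) and leaves m as-is.
    """
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    v = unit(rand_vec())
    m = [0.0] * dim
    worst = 0.0
    for t in range(steps):
        if refresh_every and t > 0 and t % refresh_every == 0:
            v = unit(rand_vec())  # new direction; the buffer is NOT re-projected
        g = project_out_one_sided(rand_vec(), v)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        worst = max(worst, dot(m, v))
    return worst


frozen = max_hackward_leak(refresh_every=None)
refreshed = max_hackward_leak(refresh_every=10)
print(f"frozen v:    max hack-ward momentum = {frozen:.3e}")
print(f"refreshed v: max hack-ward momentum = {refreshed:.3e}")
```

In this toy the frozen case stays at floating-point noise, because every term in the buffer was already projected against the same `v`. The refreshed case does not, because older terms were projected against a different direction. Adam's per-coordinate scaling is not modeled here and could change the picture, so this says nothing about the size of the leak in the real runs.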