From ffc2df540f05e1c11902e60a5dddd12926a819fe Mon Sep 17 00:00:00 2001
From: wassname
Date: Wed, 3 Jun 2026 02:20:13 +0000
Subject: [PATCH] blog: drop reader-facing route2 tag -> route (consistency with paper)

route2 is an internal run-tag, not something a reader cares about. Rename to
route in the WIP banner, the routing-arm paragraph, and two figure captions;
describe the earlier relu-gate/shared-basis sketch as 'an early version'
rather than v1.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 ...0529_gradient_projection_vs_reward_hacking_LW_draft.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md b/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
index d4a889a..6ee3dc5 100644
--- a/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
+++ b/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
@@ -1,6 +1,6 @@
 # Erasing the hack direction from a GRPO gradient: a preliminary result
 
-*WIP draft for LessWrong. This is the "erase" (one-sided projection) story at n=2; n=3 is queued. The work has since moved to route2 (per-rollout calibrated-tau routing into a scale-matched quarantine, plus an exploration floor) with an apples-to-apples knob-off deploy-eval. Once those land this gets re-headlined. Numbers may shift.*
+*WIP draft for LessWrong. This is the "erase" (one-sided projection) story at n=2; n=3 is queued. The work has since moved to a routing arm (route): per-rollout calibrated-tau routing of the whole rollout gradient into a scale-matched quarantine that is deleted at deploy, plus an exploration floor, with an apples-to-apples knob-off deploy-eval. Once the n=3 route runs land this post gets re-headlined around them. Numbers may shift.*
 
 ## The one-line version
 
@@ -114,11 +114,11 @@ opt.step(delta_S_hack, removed) # delta_S_hack absorbs the hack-ward part
 # at deploy: delta_S_hack := 0 (ablate the quarantine)
 ```
 
-The route arm above is v1 (relu gate, shared basis). The current routing arm, route2, gates per rollout instead: if `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) the whole rollout gradient goes into a scale-matched, distinct-basis quarantine, and an exploration floor generates a fraction of rollouts knob-off so the deployed knob always sees solve signal. Its deploy-eval table is pending the n=3 runs.
+The routing sketch above is an early version (relu gate, shared basis). The route arm we report gates per rollout instead: if `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) the whole rollout gradient goes into a scale-matched, distinct-basis quarantine, and an exploration floor generates a fraction of rollouts knob-off so the deployed knob always sees solve signal. Its deploy-eval table is pending the n=3 runs.
 
-![Hack rate (top) and solve rate (bottom) over training, one line per arm. routing2 stays near-zero hack while its solve climbs above the erasure/vanilla arms.](../../out/figs/dyn_sub4_hack_overlay.png)
+![Hack rate (top) and solve rate (bottom) over training, one line per arm. route stays near-zero hack while its solve climbs above the erase/vanilla arms.](../../out/figs/dyn_sub4_hack_overlay.png)
 
-*Training dynamics by arm. routing2 (purple) holds deployed hack near zero and lifts solve above vanilla/erasure. Preliminary: vanilla/erase still read off per-step training hack until their knob-off deploy-eval reruns land (jobs 75/76/79).*
+*Training dynamics by arm. route (purple) holds deployed hack near zero and lifts solve above vanilla/erase. Preliminary: vanilla/erase still read off per-step training hack until their knob-off deploy-eval reruns land (jobs 75/76/79).*
 
 A caveat on erase. The optimizer is fast-Adam, which carries momentum. Projecting `g` does not project the momentum buffer, so the projected-out direction can re-enter via momentum. On a frozen G_hack the leak is bounded (the buffer is a decaying average of already-projected gradients), but under refresh it is not obviously small. I have not measured it directly yet. If you have intuition about whether it kills the result, please push back.
 
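
The momentum caveat in the last context paragraph can be checked on a toy problem. The sketch below is not the post's code. It assumes a plain EMA momentum buffer standing in for fast-Adam's first moment, a single hack direction instead of a full G_hack basis, and synthetic gradients. The names `project_one_sided`, `momentum_after`, `v_old`, and `v_new` are hypothetical. It compares how much of the buffer points hack-ward when the direction is frozen and when it has just been refreshed.

```python
import numpy as np


def project_one_sided(g, v):
    """Remove only the positive (hack-ward) component of g along unit vector v."""
    return g - max(float(np.dot(g, v)), 0.0) * v


def momentum_after(grads, v, beta=0.9):
    """EMA momentum buffer built from gradients projected against a fixed v."""
    m = np.zeros_like(v)
    for g in grads:
        m = beta * m + (1.0 - beta) * project_one_sided(g, v)
    return m


rng = np.random.default_rng(0)
dim, steps = 8, 50
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]

# Synthetic gradients (assumption): small noise plus a steady pull along e0 and e1.
grads = 0.5 * rng.standard_normal((steps, dim)) + 2.0 * e0 + 2.0 * e1

v_old = e0                            # hack direction used while the buffer was built
v_new = (e0 + e1) / np.sqrt(2.0)      # refreshed hack direction (toy choice)

m = momentum_after(grads, v_old)

# Hack-ward component of the buffer, measured against each direction.
leak_frozen = max(float(np.dot(m, v_old)), 0.0)
leak_refreshed = max(float(np.dot(m, v_new)), 0.0)

print(f"hack-ward momentum, frozen direction:    {leak_frozen:.4f}")
print(f"hack-ward momentum, refreshed direction: {leak_refreshed:.4f}")
```

In this toy setup the buffer has no hack-ward component against the direction it was projected with, but a sizeable one against the refreshed direction, because the part of `v_new` along `e1` was never projected out of earlier gradients. One possible mitigation, untested here, would be to project the buffer as well whenever the direction is refreshed.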