name the method vGROUT (vector gradient routing)

- title: drop the "Quarantine ... Representation?" metaphor for "vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase = discard the component / route = redirect the gated gradient into a deletable adapter, deleted at deploy). Honest framing: route preserves (not discards); follows Shilov et al.'s post-backward deletable-block routing in the gradient-routing family, gated by an extracted direction, not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as author-year. README + pyproject describe vGROUT (package name unchanged)
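
The erase/route distinction in the Method item can be illustrated with a minimal NumPy sketch. It assumes a flattened gradient and an already-extracted `v_hack`, works in plain parameter space rather than the SVD-of-W basis, and omits the gate entirely; `split_along`, `erase`, and `route` are hypothetical names for illustration, not the repository's API.

```python
import numpy as np

def split_along(grad, v_hack):
    """Split grad into its component along v_hack and the orthogonal remainder."""
    v = v_hack / np.linalg.norm(v_hack)   # normalise the extracted direction
    along = np.dot(grad, v) * v           # part of the gradient along the hack direction
    return along, grad - along

def erase(grad, v_hack):
    """Erase: discard the hack-direction component; only the remainder is applied."""
    _, rest = split_along(grad, v_hack)
    return rest

def route(grad, v_hack):
    """Route: keep the hack-direction component, but send it to a separate
    (deletable) adapter update instead of the main weights."""
    along, rest = split_along(grad, v_hack)
    return rest, along                    # (main update, adapter update)

# Toy values, purely illustrative.
grad = np.array([3.0, 4.0, 0.0])
v_hack = np.array([1.0, 0.0, 0.0])

erased = erase(grad, v_hack)
main_grad, adapter_grad = route(grad, v_hack)
```

In this sketch the main-weight update is the same under both options; the difference is that route retains the removed component in `adapter_grad`, so `main_grad + adapter_grad` recovers the full gradient until the adapter is deleted.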
2026-06-27 16:15:35 +08:00 · 2026-06-05 14:45:11 +08:00
parent 07e1eb8753
commit 03693e4f30
3 changed files with 37 additions and 20 deletions
@@ -1,8 +1,10 @@
-# projected_grpo
+# projected_grpo — vGROUT

-SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
-the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
-basis) reduces reward-hack rate in GRPO without tanking pass rate.
+**vGROUT** (vector gradient routing): route the GRPO gradient against an
+extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
+reward-hack rate without tanking pass rate. A representation-routing variant of
+gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
+extracted direction rather than a per-example data label.

 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
 LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,