name the method vGROUT (vector gradient routing)

- title: drop the "Quarantine ... Representation?" metaphor for
  "vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition: (1) extract v_hack; (2) either erase
  (discard the component) or route (redirect the gated gradient into a
  deletable adapter, deleted at deploy). Honest framing: route preserves rather
  than discards; it follows Shilov et al.'s post-backward deletable-block
  routing in the gradient-routing family, but is gated by an extracted
  direction, not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
  author-year. README + pyproject describe vGROUT (package name unchanged)
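
The erase/route distinction in the Method bullet can be illustrated with a minimal sketch. It is not the repository's implementation. It assumes a flat gradient vector and a single direction `v_hack`, whereas the README describes working in the SVD-of-W basis, and it approximates the gating as a plain projection onto `v_hack`. The names `split_gradient`, `erase`, and `route` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_gradient(grad, v_hack):
    """Split grad into its component along v_hack and the orthogonal remainder.

    Assumption: v_hack is a single non-zero direction in the same space as grad.
    """
    norm = math.sqrt(dot(v_hack, v_hack))
    if norm == 0.0:
        raise ValueError("v_hack must be non-zero")
    unit = [x / norm for x in v_hack]
    coeff = dot(grad, unit)
    along = [coeff * u for u in unit]
    ortho = [g - a for g, a in zip(grad, along)]
    return along, ortho


def erase(grad, v_hack):
    """Erase: keep the orthogonal part and discard the v_hack component."""
    _, ortho = split_gradient(grad, v_hack)
    return ortho


def route(grad, v_hack):
    """Route: the main weights get the orthogonal part; the v_hack component
    is preserved as the update for a separate adapter that can be deleted
    at deploy time."""
    along, ortho = split_gradient(grad, v_hack)
    return ortho, along


grad = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]

main_erase = erase(grad, v_hack)
main_route, adapter_route = route(grad, v_hack)
```

In both modes the main update is identical here. The difference is that route keeps the removed component as a separate update for the deletable adapter, so the two returned parts still sum to the original gradient.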
wassname
2026-06-05 14:45:11 +08:00
parent 07e1eb8753
commit 03693e4f30
3 changed files with 37 additions and 20 deletions
+6 -4
@@ -1,8 +1,10 @@
-# projected_grpo
+# projected_grpo — vGROUT
-SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
-the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
-basis) reduces reward-hack rate in GRPO without tanking pass rate.
+**vGROUT** (vector gradient routing): route the GRPO gradient against an
+extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
+reward-hack rate without tanking pass rate. A representation-routing variant of
+gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
+extracted direction rather than a per-example data label.
 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
 LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,