mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
name the method vGROUT (vector gradient routing)
- title: drop the "Quarantine ... Representation?" metaphor for "vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the component / route=redirect the gated gradient into a deletable adapter, deleted at deploy). Honest framing: route preserves (not discards); follows Shilov et al.'s post-backward deletable-block routing in the gradient-routing family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as author-year. README + pyproject describe vGROUT (package name unchanged)
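
The erase/route distinction in the Method note can be illustrated with a minimal sketch. It is not the repository's implementation. It assumes the gated part of the gradient is simply its component along a unit-normalised `v_hack`, and it works on flat vectors rather than in the SVD-of-W basis. The function names (`split_gradient`, `erase`, `route`) and the `"main"`/`"adapter"` keys are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length; a zero direction is rejected."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("v_hack must be non-zero")
    return [x / norm for x in v]


def split_gradient(grad, v_hack):
    """Split grad into (part orthogonal to v_hack, part along v_hack)."""
    u = normalize(v_hack)
    coeff = dot(grad, u)
    along = [coeff * x for x in u]
    orth = [g - a for g, a in zip(grad, along)]
    return orth, along


def erase(grad, v_hack):
    """Erase: discard the component along v_hack (assumed reading)."""
    orth, _ = split_gradient(grad, v_hack)
    return orth


def route(grad, v_hack):
    """Route: keep the component, but send it to a separate update.

    The "adapter" update stands in for the deletable adapter; dropping it
    at deploy time leaves only what the "main" update learned.
    """
    orth, along = split_gradient(grad, v_hack)
    return {"main": orth, "adapter": along}
```

Under these assumptions both modes give the main parameters the same update. The difference is that `route` preserves the removed component in a place that can be deleted later, whereas `erase` throws it away.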
@@ -1,8 +1,10 @@
-# projected_grpo
+# projected_grpo — vGROUT
 
-SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
-the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
-basis) reduces reward-hack rate in GRPO without tanking pass rate.
+**vGROUT** (vector gradient routing): route the GRPO gradient against an
+extracted reward-hacking direction (in the SVD-of-W basis) to reduce the
+reward-hack rate without tanking pass rate. A representation-routing variant of
+gradient routing (Cloud et al.; Shilov et al.), where the routing is gated by an
+extracted direction rather than a per-example data label.
 
 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
 LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,