name the method vGROUT (vector gradient routing)

- title: drop the "Quarantine ... Representation?" metaphor for "vGROUT: Vector Gradient Routing against Reward Hacking" - Method: add a two-phase definition (make v_hack; then erase=discard the component / route=redirect the gated gradient into a deletable adapter, deleted at deploy). Honest framing: route preserves (not discards); follows Shilov et al.'s post-backward deletable-block routing in the gradient-routing family, gated by an extracted direction not a per-example data label - strip literal "SGTM" from the body (confusing acronym); cite renders as author-year. README + pyproject describe vGROUT (package name unchanged)
2026-06-27 17:30:41 +08:00 · 2026-06-05 14:45:11 +08:00
parent 07e1eb8753
commit 03693e4f30
3 changed files with 37 additions and 20 deletions
@@ -67,7 +67,7 @@
 % "hand-authored" in make_dataset_pairsets.py means hand-authored by the model.
 % We do not argue the point in prose; we just SHOW the pairs (the actual hack/clean
 % completions that build v_hack) in an appendix and let the reader judge.
-\title{Can We Quarantine Reward Hacking with a Reward-Hacking Representation?}
+\title{vGROUT: Vector Gradient Routing against Reward Hacking}

 % Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
 \author{Anonymous Author(s)\\ Affiliation\\ \texttt{email}}
@@ -145,16 +145,16 @@ README ``How it works'' + blog intro.}

 \paragraph{Contributions.} % author-dictated; factual claims.
 \begin{enumerate}
-  \item We adapt selective gradient masking (SGTM \citep{sgtm2025localization}),
-        post-backward masking of a forget subspace deleted at deploy, from
-        supervised unlearning to reward hacking in RL post-training. We keep the
-        localize-then-ablate framing of gradient routing
-        \citep{cloud2024gradientrouting} but route gradients post-backward, the
-        SGTM parameter-masking family rather than Cloud's forward
-        \texttt{.detach()} on activations.
-  \item We replace the routing mask itself. SGTM and gradient routing tag the
-        training \emph{data} (per-example / per-token, $O(\text{dataset})$
-        labels); we extract one hack \emph{direction}, representation-engineering
+  \item We adapt the post-backward parameter-gradient routing of
+        \citet{sgtm2025localization} (reserve a forget subspace, delete it at
+        deploy) from supervised unlearning to reward hacking in RL post-training.
+        We keep the localize-then-ablate framing of gradient routing
+        \citep{cloud2024gradientrouting} but route post-backward on parameter
+        gradients rather than via Cloud's forward \texttt{.detach()} on activations.
+  \item We replace the routing signal itself. \citet{sgtm2025localization} and
+        gradient routing tag the training \emph{data} (per-example / per-token,
+        $O(\text{dataset})$ labels); we extract one hack \emph{direction},
+        representation-engineering
        style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
        $\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
  \item We extend the Ariahw LeetCode reward-hacking RL environment
@@ -165,6 +165,19 @@ README ``How it works'' + blog intro.}
 \section{Method}
 \label{sec:method}

+We call the method \textbf{vGROUT} (vector gradient routing). It has two phases.
+(1) \emph{Make} a reward-hacking direction $v_{\text{hack}}$ from a few contrastive
+(hack, clean) pairs (Section~\ref{sec:extract}). (2) During GRPO, use $v_{\text{hack}}$
+to separate the live gradient: the hack-aligned part is either \emph{erased}
+(subtracted and discarded, leaving the orthogonal complement) or \emph{routed}
+(the whole gated rollout's gradient is sent, not discarded, into a separate adapter
+$\delta_{S,\text{hack}}$ that is deleted at deploy). The split acts on the
+\emph{gradient} during training; the deletion acts on the \emph{weights} at deploy.
+Mechanically vGROUT follows the post-backward, deletable-block routing of
+\citet{sgtm2025localization} (the gradient-routing family of
+\citealp{cloud2024gradientrouting}); it differs from both in that the routing is
+gated by an extracted direction, not a per-example data label.
+
 \subsection{The SVD-basis adapter}
 % PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
 % Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
@@ -186,6 +199,7 @@ and the projection all live in this same low-rank, weight-aligned space
 \end{itemize}

 \subsection{Extracting the hack direction}
+\label{sec:extract}
 \TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs
 (Appendix~\ref{app:pairs}), compute
 the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
@@ -215,8 +229,8 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
        $\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
        update so the kept adapter is free of it, and deleted at deploy.
        Mechanically this is
-        SGTM-style post-backward parameter-gradient masking
-        \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
+        post-backward parameter-gradient routing \citep{sgtm2025localization}
+        in the SVD basis (we strip/redirect the
        $\delta_S$ gradient after the backward pass, not via a forward
        \texttt{.detach()}), routed into a deletable subspace rather than zeroed,
        the localize-then-ablate idea of gradient routing
@@ -594,12 +608,13 @@ one-liners are in docs/grad\_routing/related\_work.md.}
        We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
        Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
        we operate post-backward on parameter gradients (next bullet).
-  \item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
+  \item Capability-localization routing (\citealp{sgtm2025localization}; the
+        \texttt{selective-gradient-masking} line): our
        closest mechanistic analogue, a post-backward parameter-gradient mask
        over a forget subspace, tolerant to label noise, where the forgotten
        capability leaking back into the kept weights shrinks as model size
        grows (supports our scalability argument). We differ in the
-        mask \emph{source}: SGTM tags training data per example over fixed
+        mask \emph{source}: they tag training data per example over fixed
        reserved dims; we extract one hack direction from a few contrastive pairs
        and route by cosine. Their TPR/FPR detector quality maps to our
        weak-detector (held-out) axis.