name the method vGROUT (vector gradient routing)

- title: drop the "Quarantine ... Representation?" metaphor for
  "vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the
  component / route=redirect the gated gradient into a deletable adapter,
  deleted at deploy). Honest framing: route preserves (not discards); follows
  Shilov et al.'s post-backward deletable-block routing in the gradient-routing
  family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
  author-year. README + pyproject describe vGROUT (package name unchanged)
wassname
2026-06-05 14:45:11 +08:00
parent 07e1eb8753
commit 03693e4f30
3 changed files with 37 additions and 20 deletions
+30 -15
@@ -67,7 +67,7 @@
% "hand-authored" in make_dataset_pairsets.py means hand-authored by the model.
% We do not argue the point in prose; we just SHOW the pairs (the actual hack/clean
% completions that build v_hack) in an appendix and let the reader judge.
\title{Can We Quarantine Reward Hacking with a Reward-Hacking Representation?}
\title{vGROUT: Vector Gradient Routing against Reward Hacking}
% Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
\author{Anonymous Author(s)\\ Affiliation\\ \texttt{email}}
@@ -145,16 +145,16 @@ README ``How it works'' + blog intro.}
\paragraph{Contributions.} % author-dictated; factual claims.
\begin{enumerate}
\item We adapt selective gradient masking (SGTM \citep{sgtm2025localization}),
post-backward masking of a forget subspace deleted at deploy, from
supervised unlearning to reward hacking in RL post-training. We keep the
localize-then-ablate framing of gradient routing
\citep{cloud2024gradientrouting} but route gradients post-backward, the
SGTM parameter-masking family rather than Cloud's forward
\texttt{.detach()} on activations.
\item We replace the routing mask itself. SGTM and gradient routing tag the
training \emph{data} (per-example / per-token, $O(\text{dataset})$
labels); we extract one hack \emph{direction}, representation-engineering
\item We adapt the post-backward parameter-gradient routing of
\citet{sgtm2025localization} (reserve a forget subspace, delete it at
deploy) from supervised unlearning to reward hacking in RL post-training.
We keep the localize-then-ablate framing of gradient routing
\citep{cloud2024gradientrouting} but route post-backward on parameter
gradients rather than via Cloud's forward \texttt{.detach()} on activations.
\item We replace the routing signal itself. \citet{sgtm2025localization} and
gradient routing tag the training \emph{data} (per-example / per-token,
$O(\text{dataset})$ labels); we extract one hack \emph{direction},
representation-engineering
style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
$\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
\item We extend the Ariahw LeetCode reward-hacking RL environment
@@ -165,6 +165,19 @@ README ``How it works'' + blog intro.}
\section{Method}
\label{sec:method}
We call the method \textbf{vGROUT} (vector gradient routing). It has two phases.
(1) \emph{Make} a reward-hacking direction $v_{\text{hack}}$ from a few contrastive
(hack, clean) pairs (Section~\ref{sec:extract}). (2) During GRPO, use $v_{\text{hack}}$
to separate the live gradient in one of two ways. Under \emph{erase}, the
hack-aligned component is subtracted and discarded, leaving the orthogonal
complement. Under \emph{route}, the whole gradient of each gated rollout is sent,
not discarded, into a separate adapter $\delta_{S,\text{hack}}$ that is deleted
at deploy. The split acts on the \emph{gradient} during training; the deletion
acts on the \emph{weights} at deploy.
Mechanically vGROUT follows the post-backward, deletable-block routing of
\citet{sgtm2025localization} (the gradient-routing family of
\citealp{cloud2024gradientrouting}); it differs from both in that the routing is
gated by an extracted direction, not a per-example data label.
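
% Illustrative sketch only: the hard gate, the threshold \tau, and the symbols
% g_keep, m, g_{\delta_S}, g_{\delta_{S,hack}} are assumptions made for exposition.
In symbols, phase (2) can be sketched as follows. This is a minimal illustration
under stated assumptions, not the exact update rule: $g$ denotes the live
gradient of one rollout with respect to the kept adapter $\delta_S$,
$\hat{v} = v_{\text{hack}} / \|v_{\text{hack}}\|$, and the threshold $\tau$ and
the hard gate $m$ are hypothetical. Erase would keep only the component of $g$
orthogonal to $\hat{v}$:
\[
  g_{\text{keep}} = g - (g^{\top}\hat{v})\,\hat{v},
  \qquad g_{\text{keep}}^{\top}\hat{v} = 0 .
\]
Route would instead assign the whole gradient to one of the two adapters, with
gate $m = \mathbf{1}\!\left[\cos(g, v_{\text{hack}}) > \tau\right]$:
\[
  g_{\delta_S} \leftarrow (1 - m)\,g,
  \qquad
  g_{\delta_{S,\text{hack}}} \leftarrow m\,g .
\]
In this sketch erase loses the hack-aligned component during training, whereas
route loses nothing until $\delta_{S,\text{hack}}$ is deleted at deploy. A
one-sided projection or a soft gate would fit the same description equally well.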
\subsection{The SVD-basis adapter}
% PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
% Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
@@ -186,6 +199,7 @@ and the projection all live in this same low-rank, weight-aligned space
\end{itemize}
\subsection{Extracting the hack direction}
\label{sec:extract}
\TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs
(Appendix~\ref{app:pairs}), compute
the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
@@ -215,8 +229,8 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
$\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
update so the kept adapter is free of it, and deleted at deploy.
Mechanically this is
SGTM-style post-backward parameter-gradient masking
\citep{sgtm2025localization} in the SVD basis (we strip/redirect the
post-backward parameter-gradient routing \citep{sgtm2025localization}
in the SVD basis (we strip/redirect the
$\delta_S$ gradient after the backward pass, not via a forward
\texttt{.detach()}), routed into a deletable subspace rather than zeroed,
the localize-then-ablate idea of gradient routing
@@ -594,12 +608,13 @@ one-liners are in docs/grad\_routing/related\_work.md.}
We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
we operate post-backward on parameter gradients (next bullet).
\item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
\item Capability-localization routing (\citealp{sgtm2025localization}; the
\texttt{selective-gradient-masking} line): our
    closest mechanistic analogue, a post-backward parameter-gradient mask
    over a forget subspace. It tolerates label noise, and leakage of the
    forgotten capability back into the kept weights shrinks as model size
    grows (which supports our scalability argument). We differ in the
mask \emph{source}: SGTM tags training data per example over fixed
mask \emph{source}: they tag training data per example over fixed
reserved dims; we extract one hack direction from a few contrastive pairs
and route by cosine. Their TPR/FPR detector quality maps to our
weak-detector (held-out) axis.