paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, route+ablate framing from gradient routing
  but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
  RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
  basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).
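
As a rough illustration of the "route" arm described above, here is a minimal sketch of post-backward routing by a cosine gate. It is not the repository's code: the names `cosine` and `route_gradients`, the toy vectors, and the fixed `tau` are assumptions made for the example. The paper calibrates the threshold each step, works in an SVD basis, and scale-matches the quarantine knob; all of that is omitted here.

```python
import math


def cosine(g, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g, v))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_g == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_g * norm_v)


def route_gradients(rollout_grads, v_hack, tau):
    """Send each rollout's whole gradient to exactly one accumulator.

    Hypothetical helper: rollouts whose gradient aligns with v_hack
    (cosine > tau) go to the quarantine accumulator, all others to the
    main accumulator. Nothing is detached in a forward pass; the split
    happens on already-computed gradients.
    """
    main_grad = [0.0] * len(v_hack)
    hack_grad = [0.0] * len(v_hack)
    for g in rollout_grads:
        target = hack_grad if cosine(g, v_hack) > tau else main_grad
        for i, g_i in enumerate(g):
            target[i] += g_i
    return main_grad, hack_grad


# Toy data (assumed): one hack-aligned rollout and two others.
v_hack = [1.0, 0.0, 0.0]
rollout_grads = [
    [0.9, 0.1, 0.0],    # strongly aligned with v_hack, so quarantined
    [0.0, 1.0, 0.0],    # orthogonal to v_hack, so kept
    [-0.2, 0.5, 0.5],   # negative cosine, so kept
]
main_grad, hack_grad = route_gradients(rollout_grads, v_hack, tau=0.5)
print(main_grad, hack_grad)
```

In this sketch, "deleted at deploy" would correspond to discarding whatever parameters were updated from `hack_grad` and keeping only those updated from `main_grad`.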

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 01:19:35 +00:00
parent 05731cc0e4
commit 97a4c5d7b1
+43 -46
@@ -20,28 +20,19 @@
% TODO-marker: renders red in the PDF and is grep-able by `just paper-qc`.
\newcommand{\TODO}[1]{{\color{red}\textbf{[TODO: #1]}}}
-% Title states the mechanism (route by a direction). Naming decisions (AFK
-% 2026-06-03):
-% - DROP "route2" in prose -- it's an internal version tag; readers don't care
-% about the 2. The arm is "route"; "directional/few-shot routing" if a fuller
-% name is wanted.
-% - Lineage we CAN claim: extraction is representation engineering (contrastive
-% pairs -> difference direction, Zou et al.) but read in GRADIENT/SVD space;
-% application is gradient routing (Cloud et al.) with that direction as the
-% mask instead of per-token data labels. We sit at the RepE x grad-routing
-% intersection. Fine to say "representation engineering in gradient space";
-% do NOT imply we steer activations (we route gradients).
-% - "label-free"/"without labels": globally FALSE -- our ~10-21 pairs + the
-% known-mode teacher anchors + the live E detector are all labels on KNOWN
-% hacks. The BACKABLE, scoped claim is "suppresses the HELD-OUT hacks with
-% zero labels of their own", valid iff the held-out modes never trip the live
-% E detector (E = run_tests loophole; side-channel modes sentinel/stdout/file
-% are graded elsewhere, so pick those as held-out). Earn it with A5 (job 95
-% harvest -> scripts/pairs_from_rollouts.py 2 known modes -> held-out route
-% run -> held-out deploy hack ~0) AND a hacked_E-by-mode==0 check on held-out.
-% If A5 lands clean, upgrade title to:
-% "Routing Reward Hacking out of RL Gradients -- Suppressing Hacks We Never Labeled"
-\title{Routing Reward Hacking out of RL Gradients with a Steering Vector}
+% Title (user-chosen, AFK 2026-06-03): "quarantine" = the deletable delta_S_hack
+% knob; "routing gradients" = post-backward parameter-gradient routing (SGTM
+% family, not Cloud's activation detach); "hacking representation" = the
+% RepE-extracted hack direction. Tighter variant if wanted: "...with a
+% Reward-Hack Representation". Do NOT title it "label-free" -- our pairs ARE
+% labels; the scoped backable claim ("held-out hacks suppressed with zero labels
+% of their own") needs A5 + a hacked_E==0-on-held-out check first.
+% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
+% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
+% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
+% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
+% the headline from synthetic pairs or describe the hand-authored set honestly.
+\title{Quarantining Reward-Hacking Gradients with a Hacking Representation}
% Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
\author{Anonymous Author(s)\\ Affiliation\\ \texttt{email}}
@@ -64,22 +55,19 @@ today act on reward/advantage \citep{wu2026rebound} and need a detector at
scoring time; (3) at deploy some hacks are unknown; (4) here we route the GRPO
gradient away from a weak-detector hack direction.}
-\paragraph{Contributions.} % author-dictated; factual claims, keep verbatim.
-% FLAG (AFK 2026-06-03, do not silently rewrite -- author's call): contribution 1
-% says "gradient routing (Cloud)", but our MECHANISM is SGTM-style post-backward
-% PARAMETER-gradient masking (antipasto.py:73 sums the quarantine, no forward
-% detach; train.py parks the flagged grad into delta_S_hack.grad). Cloud's
-% gradient routing is activation-level .detach(). So the closer relative is
-% sgtm2025localization (also: their TPR/FPR detector knob == our no-cheat weak
-% detector). Suggest reframing to "extend SGTM-style parameter-gradient masking
-% (and the localize-then-ablate idea of gradient routing) with a learned hack
-% direction". See docs/papers/grad_routing/sgtm_vs_ours.md.
+\paragraph{Contributions.} % author-dictated; factual claims.
\begin{enumerate}
-\item We extend gradient routing \citep{cloud2024gradientrouting} to reward
-hacking in RL post-training.
-\item We show a weak hack direction extracted in \emph{gradient space} can
-replace the weak per-token data labels gradient routing normally
-requires as its routing mask.
+\item We adapt selective gradient masking (SGTM \citep{sgtm2025localization})
+-- post-backward masking of a forget subspace, deleted at deploy -- from
+supervised unlearning to reward hacking in RL post-training. (We route
+\emph{gradients}, not activations: the localize-then-ablate framing of
+gradient routing \citep{cloud2024gradientrouting}, but mechanically the
+SGTM parameter-masking family, not Cloud's forward \texttt{.detach()}.)
+\item We replace the routing mask itself. SGTM and gradient routing tag the
+training \emph{data} (per-example / per-token, $O(\text{dataset})$
+labels); we extract one hack \emph{direction}, representation-engineering
+style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
+$\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
\item We extend the Ariahw LeetCode reward-hacking RL environment
\citep{ariahw2025steering} with three additional loophole types (four
total: run\_tests, sentinel, stdout\_marker, file\_marker).
@@ -132,8 +120,13 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
\item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
calibrated each step from the hack-vs-clean cosine gap) sends the whole
rollout gradient into a scale-matched, distinct-basis quarantine knob
-$\delta_{S,\text{hack}}$, deleted at deploy. Gradient routing
-\citep{cloud2024gradientrouting} in the SVD basis.
+$\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+SGTM-style post-backward parameter-gradient masking
+\citep{sgtm2025localization} in the SVD basis (we strip/redirect the
+$\delta_S$ gradient after the backward pass, not via a forward
+\texttt{.detach()}), routed into a deletable subspace rather than zeroed
+-- the localize-then-ablate idea of gradient routing
+\citep{cloud2024gradientrouting}.
\item \emph{offline (frozen)} vs.\ \emph{online (refresh-$N$)}: re-extract
$v_{\text{hack}}$ every $N$ steps on the current adapter, since the
basis goes stale as training moves the model (Appendix~\ref{app:refresh}).
@@ -314,13 +307,17 @@ one-liners are in docs/grad\_routing/related\_work.md.}
extracted \emph{hack} subspace. Their clean subspace is fixed and only
delays drift; this is the baseline to differentiate from.
\item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
-We inherit the route+ablate machinery but in the SVD-of-$W$ basis, with
-the mask sourced from an extracted hack subspace rather than a per-token
-data label.
-\item Capability-localization routing \citep{sgtm2025localization}: a
-parameter-gradient zero-mask tolerant to label noise, with measured
-leakage that shrinks with scale. Supports our scalability argument; we
-differ in mask source (persona direction) and setting (RL hacking).
+We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
+Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
+we operate post-backward on parameter gradients (next bullet).
+\item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
+closest mechanistic relative -- a post-backward parameter-gradient mask
+over a forget subspace, tolerant to label noise, with leakage that
+shrinks with scale (supports our scalability argument). We differ in the
+mask \emph{source}: SGTM tags training data per example over fixed
+reserved dims; we extract one hack direction from a few contrastive pairs
+and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
+weak-detector axis.
\item Advantage-level intervention \citep{wu2026rebound}: representation-
informed advantage modulation; ours is gradient-level (one step deeper,
after the reward is computed). A matched-compute head-to-head is future