paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised unlearning to RL reward hacking, route+ablate framing from gradient routing but NOT Cloud's activation .detach(); (2) replace the data-label mask with a RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 18:59:35 +08:00 · 2026-06-03 01:19:35 +00:00
parent 05731cc0e4
commit 97a4c5d7b1
1 changed file with 43 additions and 46 deletions
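The "route" arm that this commit re-describes (a per-rollout gate on cos(g, v_hack) that sends the whole rollout gradient into a deletable quarantine) can be illustrated with a small sketch. This is a toy in pure Python, not the repository's code: the names `cosine` and `route_step` are hypothetical, flat lists stand in for parameter gradients in the SVD basis, and `tau` is passed in as a fixed number, whereas the paper calibrates it each step.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def route_step(rollout_grads, v_hack, tau):
    """Split per-rollout gradients into a kept sum and a quarantined sum.

    A rollout whose gradient has cos(g, v_hack) > tau is sent whole into
    the quarantine accumulator; every other rollout goes to the main one.
    (Hypothetical helper: the accumulators stand in for the main adapter
    gradient and the quarantine knob's gradient.)
    """
    dim = len(v_hack)
    main = [0.0] * dim
    quarantine = [0.0] * dim
    for g in rollout_grads:
        target = quarantine if cosine(g, v_hack) > tau else main
        for i, gi in enumerate(g):
            target[i] += gi
    return main, quarantine


# Toy data: the first rollout is nearly aligned with v_hack, the second is orthogonal.
v_hack = [1.0, 0.0, 0.0]
grads = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.5]]
main, quarantine = route_step(grads, v_hack, tau=0.5)
print("main:", main)
print("quarantine:", quarantine)
```

In this sketch, dropping `quarantine` and applying only `main` plays the role of deleting the quarantine knob at deploy. The split happens after the per-rollout gradients exist, which mirrors the post-backward framing in the commit message rather than a forward `.detach()`.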
@@ -20,28 +20,19 @@
 % TODO-marker: renders red in the PDF and is grep-able by `just paper-qc`.
 \newcommand{\TODO}[1]{{\color{red}\textbf{[TODO: #1]}}}

-% Title states the mechanism (route by a direction). Naming decisions (AFK
-% 2026-06-03):
-%  - DROP "route2" in prose -- it's an internal version tag; readers don't care
-%    about the 2. The arm is "route"; "directional/few-shot routing" if a fuller
-%    name is wanted.
-%  - Lineage we CAN claim: extraction is representation engineering (contrastive
-%    pairs -> difference direction, Zou et al.) but read in GRADIENT/SVD space;
-%    application is gradient routing (Cloud et al.) with that direction as the
-%    mask instead of per-token data labels. We sit at the RepE x grad-routing
-%    intersection. Fine to say "representation engineering in gradient space";
-%    do NOT imply we steer activations (we route gradients).
-%  - "label-free"/"without labels": globally FALSE -- our ~10-21 pairs + the
-%    known-mode teacher anchors + the live E detector are all labels on KNOWN
-%    hacks. The BACKABLE, scoped claim is "suppresses the HELD-OUT hacks with
-%    zero labels of their own", valid iff the held-out modes never trip the live
-%    E detector (E = run_tests loophole; side-channel modes sentinel/stdout/file
-%    are graded elsewhere, so pick those as held-out). Earn it with A5 (job 95
-%    harvest -> scripts/pairs_from_rollouts.py 2 known modes -> held-out route
-%    run -> held-out deploy hack ~0) AND a hacked_E-by-mode==0 check on held-out.
-%    If A5 lands clean, upgrade title to:
-%      "Routing Reward Hacking out of RL Gradients -- Suppressing Hacks We Never Labeled"
-\title{Routing Reward Hacking out of RL Gradients with a Steering Vector}
+% Title (user-chosen, AFK 2026-06-03): "quarantine" = the deletable delta_S_hack
+% knob; "routing gradients" = post-backward parameter-gradient routing (SGTM
+% family, not Cloud's activation detach); "hacking representation" = the
+% RepE-extracted hack direction. Tighter variant if wanted: "...with a
+% Reward-Hack Representation". Do NOT title it "label-free" -- our pairs ARE
+% labels; the scoped backable claim ("held-out hacks suppressed with zero labels
+% of their own") needs A5 + a hacked_E==0-on-held-out check first.
+% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
+% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
+% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
+% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
+% the headline from synthetic pairs or describe the hand-authored set honestly.
+\title{Quarantining Reward-Hacking Gradients with a Hacking Representation}

 % Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
 \author{Anonymous Author(s)\\ Affiliation\\ \texttt{email}}
@@ -64,22 +55,19 @@ today act on reward/advantage \citep{wu2026rebound} and need a detector at
 scoring time; (3) at deploy some hacks are unknown; (4) here we route the GRPO
 gradient away from a weak-detector hack direction.}

-\paragraph{Contributions.} % author-dictated; factual claims, keep verbatim.
-% FLAG (AFK 2026-06-03, do not silently rewrite -- author's call): contribution 1
-% says "gradient routing (Cloud)", but our MECHANISM is SGTM-style post-backward
-% PARAMETER-gradient masking (antipasto.py:73 sums the quarantine, no forward
-% detach; train.py parks the flagged grad into delta_S_hack.grad). Cloud's
-% gradient routing is activation-level .detach(). So the closer relative is
-% sgtm2025localization (also: their TPR/FPR detector knob == our no-cheat weak
-% detector). Suggest reframing to "extend SGTM-style parameter-gradient masking
-% (and the localize-then-ablate idea of gradient routing) with a learned hack
-% direction". See docs/papers/grad_routing/sgtm_vs_ours.md.
+\paragraph{Contributions.} % author-dictated; factual claims.
 \begin{enumerate}
-  \item We extend gradient routing \citep{cloud2024gradientrouting} to reward
-        hacking in RL post-training.
-  \item We show a weak hack direction extracted in \emph{gradient space} can
-        replace the weak per-token data labels gradient routing normally
-        requires as its routing mask.
+  \item We adapt selective gradient masking (SGTM \citep{sgtm2025localization})
+        -- post-backward masking of a forget subspace, deleted at deploy -- from
+        supervised unlearning to reward hacking in RL post-training. (We route
+        \emph{gradients}, not activations: the localize-then-ablate framing of
+        gradient routing \citep{cloud2024gradientrouting}, but mechanically the
+        SGTM parameter-masking family, not Cloud's forward \texttt{.detach()}.)
+  \item We replace the routing mask itself. SGTM and gradient routing tag the
+        training \emph{data} (per-example / per-token, $O(\text{dataset})$
+        labels); we extract one hack \emph{direction}, representation-engineering
+        style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
+        $\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
  \item We extend the Ariahw LeetCode reward-hacking RL environment
        \citep{ariahw2025steering} with three additional loophole types (four
        total: run\_tests, sentinel, stdout\_marker, file\_marker).
@@ -132,8 +120,13 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
  \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
        calibrated each step from the hack-vs-clean cosine gap) sends the whole
        rollout gradient into a scale-matched, distinct-basis quarantine knob
-        $\delta_{S,\text{hack}}$, deleted at deploy. Gradient routing
-        \citep{cloud2024gradientrouting} in the SVD basis.
+        $\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+        SGTM-style post-backward parameter-gradient masking
+        \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
+        $\delta_S$ gradient after the backward pass, not via a forward
+        \texttt{.detach()}), routed into a deletable subspace rather than zeroed
+        -- the localize-then-ablate idea of gradient routing
+        \citep{cloud2024gradientrouting}.
  \item \emph{offline (frozen)} vs.\ \emph{online (refresh-$N$)}: re-extract
        $v_{\text{hack}}$ every $N$ steps on the current adapter, since the
        basis goes stale as training moves the model (Appendix~\ref{app:refresh}).
@@ -314,13 +307,17 @@ one-liners are in docs/grad\_routing/related\_work.md.}
        extracted \emph{hack} subspace. Their clean subspace is fixed and only
        delays drift; this is the baseline to differentiate from.
  \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
-        We inherit the route+ablate machinery but in the SVD-of-$W$ basis, with
-        the mask sourced from an extracted hack subspace rather than a per-token
-        data label.
-  \item Capability-localization routing \citep{sgtm2025localization}: a
-        parameter-gradient zero-mask tolerant to label noise, with measured
-        leakage that shrinks with scale. Supports our scalability argument; we
-        differ in mask source (persona direction) and setting (RL hacking).
+        We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
+        Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
+        we operate post-backward on parameter gradients (next bullet).
+  \item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
+        closest mechanistic relative -- a post-backward parameter-gradient mask
+        over a forget subspace, tolerant to label noise, with leakage that
+        shrinks with scale (supports our scalability argument). We differ in the
+        mask \emph{source}: SGTM tags training data per example over fixed
+        reserved dims; we extract one hack direction from a few contrastive pairs
+        and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
+        weak-detector axis.
  \item Advantage-level intervention \citep{wu2026rebound}: representation-
        informed advantage modulation; ours is gradient-level (one step deeper,
        after the reward is computed). A matched-compute head-to-head is future