From 97a4c5d7b1e67d506316bb8808714e2fb0df1ec3 Mon Sep 17 00:00:00 2001
From: wassname
Date: Wed, 3 Jun 2026 01:19:35 +0000
Subject: [PATCH] paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, route+ablate framing from gradient routing
  but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
  RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
  basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 docs/writeup/main.tex | 89 +++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 46 deletions(-)

diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index a72a975..60686c7 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -20,28 +20,19 @@
 % TODO-marker: renders red in the PDF and is grep-able by `just paper-qc`.
 \newcommand{\TODO}[1]{{\color{red}\textbf{[TODO: #1]}}}
 
-% Title states the mechanism (route by a direction). Naming decisions (AFK
-% 2026-06-03):
-% - DROP "route2" in prose -- it's an internal version tag; readers don't care
-%   about the 2. The arm is "route"; "directional/few-shot routing" if a fuller
-%   name is wanted.
-% - Lineage we CAN claim: extraction is representation engineering (contrastive
-%   pairs -> difference direction, Zou et al.) but read in GRADIENT/SVD space;
-%   application is gradient routing (Cloud et al.) with that direction as the
-%   mask instead of per-token data labels. We sit at the RepE x grad-routing
-%   intersection. Fine to say "representation engineering in gradient space";
-%   do NOT imply we steer activations (we route gradients).
-% - "label-free"/"without labels": globally FALSE -- our ~10-21 pairs + the
-%   known-mode teacher anchors + the live E detector are all labels on KNOWN
-%   hacks. The BACKABLE, scoped claim is "suppresses the HELD-OUT hacks with
-%   zero labels of their own", valid iff the held-out modes never trip the live
-%   E detector (E = run_tests loophole; side-channel modes sentinel/stdout/file
-%   are graded elsewhere, so pick those as held-out). Earn it with A5 (job 95
-%   harvest -> scripts/pairs_from_rollouts.py 2 known modes -> held-out route
-%   run -> held-out deploy hack ~0) AND a hacked_E-by-mode==0 check on held-out.
-%   If A5 lands clean, upgrade title to:
-%   "Routing Reward Hacking out of RL Gradients -- Suppressing Hacks We Never Labeled"
-\title{Routing Reward Hacking out of RL Gradients with a Steering Vector}
+% Title (user-chosen, AFK 2026-06-03): "quarantine" = the deletable delta_S_hack
+% knob; "routing gradients" = post-backward parameter-gradient routing (SGTM
+% family, not Cloud's activation detach); "hacking representation" = the
+% RepE-extracted hack direction. Tighter variant if wanted: "...with a
+% Reward-Hack Representation". Do NOT title it "label-free" -- our pairs ARE
+% labels; the scoped backable claim ("held-out hacks suppressed with zero labels
+% of their own") needs A5 + a hacked_E==0-on-held-out check first.
+% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
+% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
+% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
+% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
+% the headline from synthetic pairs or describe the hand-authored set honestly.
+\title{Quarantining Reward-Hacking Gradients with a Hacking Representation}
 
 % Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
 \author{Anonymous Author(s)\\ Affiliation\\ \texttt{email}}
@@ -64,22 +55,19 @@ today act on reward/advantage \citep{wu2026rebound} and need a detector at
 scoring time; (3) at deploy some hacks are unknown; (4) here we route the GRPO
 gradient away from a weak-detector hack direction.}
 
-\paragraph{Contributions.} % author-dictated; factual claims, keep verbatim.
-% FLAG (AFK 2026-06-03, do not silently rewrite -- author's call): contribution 1
-% says "gradient routing (Cloud)", but our MECHANISM is SGTM-style post-backward
-% PARAMETER-gradient masking (antipasto.py:73 sums the quarantine, no forward
-% detach; train.py parks the flagged grad into delta_S_hack.grad). Cloud's
-% gradient routing is activation-level .detach(). So the closer relative is
-% sgtm2025localization (also: their TPR/FPR detector knob == our no-cheat weak
-% detector). Suggest reframing to "extend SGTM-style parameter-gradient masking
-% (and the localize-then-ablate idea of gradient routing) with a learned hack
-% direction". See docs/papers/grad_routing/sgtm_vs_ours.md.
+\paragraph{Contributions.} % author-dictated; factual claims.
 \begin{enumerate}
-  \item We extend gradient routing \citep{cloud2024gradientrouting} to reward
-        hacking in RL post-training.
-  \item We show a weak hack direction extracted in \emph{gradient space} can
-        replace the weak per-token data labels gradient routing normally
-        requires as its routing mask.
+  \item We adapt selective gradient masking (SGTM \citep{sgtm2025localization})
+        -- post-backward masking of a forget subspace, deleted at deploy -- from
+        supervised unlearning to reward hacking in RL post-training. (We route
+        \emph{gradients}, not activations: the localize-then-ablate framing of
+        gradient routing \citep{cloud2024gradientrouting}, but mechanically the
+        SGTM parameter-masking family, not Cloud's forward \texttt{.detach()}.)
+  \item We replace the routing mask itself. SGTM and gradient routing tag the
+        training \emph{data} (per-example / per-token, $O(\text{dataset})$
+        labels); we extract one hack \emph{direction}, representation-engineering
+        style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
+        $\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
   \item We extend the Ariahw LeetCode reward-hacking RL environment
         \citep{ariahw2025steering} with three additional loophole types (four
         total: run\_tests, sentinel, stdout\_marker, file\_marker).
@@ -132,8 +120,13 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
   \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$ calibrated
         each step from the hack-vs-clean cosine gap) sends the whole rollout
         gradient into a scale-matched, distinct-basis quarantine knob
-        $\delta_{S,\text{hack}}$, deleted at deploy. Gradient routing
-        \citep{cloud2024gradientrouting} in the SVD basis.
+        $\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+        SGTM-style post-backward parameter-gradient masking
+        \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
+        $\delta_S$ gradient after the backward pass, not via a forward
+        \texttt{.detach()}), routed into a deletable subspace rather than zeroed
+        -- the localize-then-ablate idea of gradient routing
+        \citep{cloud2024gradientrouting}.
   \item \emph{offline (frozen)} vs.\ \emph{online (refresh-$N$)}: re-extract
         $v_{\text{hack}}$ every $N$ steps on the current adapter, since the basis
         goes stale as training moves the model (Appendix~\ref{app:refresh}).
@@ -314,13 +307,17 @@ one-liners are in docs/grad\_routing/related\_work.md.}
         extracted \emph{hack} subspace. Their clean subspace is fixed and only
         delays drift; this is the baseline to differentiate from.
   \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
-        We inherit the route+ablate machinery but in the SVD-of-$W$ basis, with
-        the mask sourced from an extracted hack subspace rather than a per-token
-        data label.
-  \item Capability-localization routing \citep{sgtm2025localization}: a
-        parameter-gradient zero-mask tolerant to label noise, with measured
-        leakage that shrinks with scale. Supports our scalability argument; we
-        differ in mask source (persona direction) and setting (RL hacking).
+        We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
+        Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
+        we operate post-backward on parameter gradients (next bullet).
+  \item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
+        closest mechanistic relative -- a post-backward parameter-gradient mask
+        over a forget subspace, tolerant to label noise, with leakage that
+        shrinks with scale (supports our scalability argument). We differ in the
+        mask \emph{source}: SGTM tags training data per example over fixed
+        reserved dims; we extract one hack direction from a few contrastive pairs
+        and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
+        weak-detector axis.
   \item Advantage-level intervention \citep{wu2026rebound}: representation-
         informed advantage modulation; ours is gradient-level (one step deeper,
         after the reward is computed). A matched-compute head-to-head is future