mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:59:35 +08:00
paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised unlearning to RL reward hacking, route+ablate framing from gradient routing but NOT Cloud's activation .detach(); (2) replace the data-label mask with a RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+43
-46
@@ -20,28 +20,19 @@
 % TODO-marker: renders red in the PDF and is grep-able by `just paper-qc`.
 \newcommand{\TODO}[1]{{\color{red}\textbf{[TODO: #1]}}}
 
-% Title states the mechanism (route by a direction). Naming decisions (AFK
-% 2026-06-03):
-% - DROP "route2" in prose -- it's an internal version tag; readers don't care
-% about the 2. The arm is "route"; "directional/few-shot routing" if a fuller
-% name is wanted.
-% - Lineage we CAN claim: extraction is representation engineering (contrastive
-% pairs -> difference direction, Zou et al.) but read in GRADIENT/SVD space;
-% application is gradient routing (Cloud et al.) with that direction as the
-% mask instead of per-token data labels. We sit at the RepE x grad-routing
-% intersection. Fine to say "representation engineering in gradient space";
-% do NOT imply we steer activations (we route gradients).
-% - "label-free"/"without labels": globally FALSE -- our ~10-21 pairs + the
-% known-mode teacher anchors + the live E detector are all labels on KNOWN
-% hacks. The BACKABLE, scoped claim is "suppresses the HELD-OUT hacks with
-% zero labels of their own", valid iff the held-out modes never trip the live
-% E detector (E = run_tests loophole; side-channel modes sentinel/stdout/file
-% are graded elsewhere, so pick those as held-out). Earn it with A5 (job 95
-% harvest -> scripts/pairs_from_rollouts.py 2 known modes -> held-out route
-% run -> held-out deploy hack ~0) AND a hacked_E-by-mode==0 check on held-out.
-% If A5 lands clean, upgrade title to:
-% "Routing Reward Hacking out of RL Gradients -- Suppressing Hacks We Never Labeled"
-\title{Routing Reward Hacking out of RL Gradients with a Steering Vector}
+% Title (user-chosen, AFK 2026-06-03): "quarantine" = the deletable delta_S_hack
+% knob; "routing gradients" = post-backward parameter-gradient routing (SGTM
+% family, not Cloud's activation detach); "hacking representation" = the
+% RepE-extracted hack direction. Tighter variant if wanted: "...with a
+% Reward-Hack Representation". Do NOT title it "label-free" -- our pairs ARE
+% labels; the scoped backable claim ("held-out hacks suppressed with zero labels
+% of their own") needs A5 + a hacked_E==0-on-held-out check first.
+% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
+% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
+% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
+% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
+% the headline from synthetic pairs or describe the hand-authored set honestly.
+\title{Quarantining Reward-Hacking Gradients with a Hacking Representation}
 
 % Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
 \author{Anonymous Author(s)\\ Affiliation\\ \texttt{email}}
@@ -64,22 +55,19 @@ today act on reward/advantage \citep{wu2026rebound} and need a detector at
 scoring time; (3) at deploy some hacks are unknown; (4) here we route the GRPO
 gradient away from a weak-detector hack direction.}
 
-\paragraph{Contributions.} % author-dictated; factual claims, keep verbatim.
-% FLAG (AFK 2026-06-03, do not silently rewrite -- author's call): contribution 1
-% says "gradient routing (Cloud)", but our MECHANISM is SGTM-style post-backward
-% PARAMETER-gradient masking (antipasto.py:73 sums the quarantine, no forward
-% detach; train.py parks the flagged grad into delta_S_hack.grad). Cloud's
-% gradient routing is activation-level .detach(). So the closer relative is
-% sgtm2025localization (also: their TPR/FPR detector knob == our no-cheat weak
-% detector). Suggest reframing to "extend SGTM-style parameter-gradient masking
-% (and the localize-then-ablate idea of gradient routing) with a learned hack
-% direction". See docs/papers/grad_routing/sgtm_vs_ours.md.
+\paragraph{Contributions.} % author-dictated; factual claims.
 \begin{enumerate}
-\item We extend gradient routing \citep{cloud2024gradientrouting} to reward
-hacking in RL post-training.
-\item We show a weak hack direction extracted in \emph{gradient space} can
-replace the weak per-token data labels gradient routing normally
-requires as its routing mask.
+\item We adapt selective gradient masking (SGTM \citep{sgtm2025localization})
+-- post-backward masking of a forget subspace, deleted at deploy -- from
+supervised unlearning to reward hacking in RL post-training. (We route
+\emph{gradients}, not activations: the localize-then-ablate framing of
+gradient routing \citep{cloud2024gradientrouting}, but mechanically the
+SGTM parameter-masking family, not Cloud's forward \texttt{.detach()}.)
+\item We replace the routing mask itself. SGTM and gradient routing tag the
+training \emph{data} (per-example / per-token, $O(\text{dataset})$
+labels); we extract one hack \emph{direction}, representation-engineering
+style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
+$\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
 \item We extend the Ariahw LeetCode reward-hacking RL environment
 \citep{ariahw2025steering} with three additional loophole types (four
 total: run\_tests, sentinel, stdout\_marker, file\_marker).
@@ -132,8 +120,13 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
 calibrated each step from the hack-vs-clean cosine gap) sends the whole
 rollout gradient into a scale-matched, distinct-basis quarantine knob
-$\delta_{S,\text{hack}}$, deleted at deploy. Gradient routing
-\citep{cloud2024gradientrouting} in the SVD basis.
+$\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+SGTM-style post-backward parameter-gradient masking
+\citep{sgtm2025localization} in the SVD basis (we strip/redirect the
+$\delta_S$ gradient after the backward pass, not via a forward
+\texttt{.detach()}), routed into a deletable subspace rather than zeroed
+-- the localize-then-ablate idea of gradient routing
+\citep{cloud2024gradientrouting}.
 \item \emph{offline (frozen)} vs.\ \emph{online (refresh-$N$)}: re-extract
 $v_{\text{hack}}$ every $N$ steps on the current adapter, since the
 basis goes stale as training moves the model (Appendix~\ref{app:refresh}).
@@ -314,13 +307,17 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 extracted \emph{hack} subspace. Their clean subspace is fixed and only
 delays drift; this is the baseline to differentiate from.
 \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
-We inherit the route+ablate machinery but in the SVD-of-$W$ basis, with
-the mask sourced from an extracted hack subspace rather than a per-token
-data label.
-\item Capability-localization routing \citep{sgtm2025localization}: a
-parameter-gradient zero-mask tolerant to label noise, with measured
-leakage that shrinks with scale. Supports our scalability argument; we
-differ in mask source (persona direction) and setting (RL hacking).
+We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
+Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
+we operate post-backward on parameter gradients (next bullet).
+\item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
+closest mechanistic relative -- a post-backward parameter-gradient mask
+over a forget subspace, tolerant to label noise, with leakage that
+shrinks with scale (supports our scalability argument). We differ in the
+mask \emph{source}: SGTM tags training data per example over fixed
+reserved dims; we extract one hack direction from a few contrastive pairs
+and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
+weak-detector axis.
 \item Advantage-level intervention \citep{wu2026rebound}: representation-
 informed advantage modulation; ours is gradient-level (one step deeper,
 after the reward is computed). A matched-compute head-to-head is future
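
Illustrative note on the route arm described in the changed text: a minimal sketch of extracting one hack direction from a few contrastive (hack, clean) gradient pairs and gating each rollout gradient on cos(g, v_hack) > tau. This is not the repository's code. The function names, the difference-of-means extraction, the midpoint rule for tau, and the flat Python lists standing in for gradients are all assumptions made for illustration; the paper works in an SVD basis with a scale-matched quarantine knob delta_S_hack, which the sketch replaces with a plain accumulator.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def extract_direction(hack_grads, clean_grads):
    """Assumed extraction: unit-normalised (mean hack grad - mean clean grad)."""
    dim = len(hack_grads[0])
    mean_h = [sum(g[i] for g in hack_grads) / len(hack_grads) for i in range(dim)]
    mean_c = [sum(g[i] for g in clean_grads) / len(clean_grads) for i in range(dim)]
    diff = [h - c for h, c in zip(mean_h, mean_c)]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("hack and clean gradients have identical means")
    return [d / norm for d in diff]


def calibrate_tau(hack_grads, clean_grads, v_hack):
    """Assumed rule: midpoint between mean hack cosine and mean clean cosine."""
    mean_h = sum(cosine(g, v_hack) for g in hack_grads) / len(hack_grads)
    mean_c = sum(cosine(g, v_hack) for g in clean_grads) / len(clean_grads)
    return 0.5 * (mean_h + mean_c)


def route(rollout_grads, v_hack, tau):
    """Send each whole rollout gradient to `quarantine` if it passes the gate,
    otherwise to `main`. Rollouts themselves carry no labels."""
    dim = len(v_hack)
    main = [0.0] * dim
    quarantine = [0.0] * dim  # stand-in for the deletable delta_S_hack knob
    for g in rollout_grads:
        target = quarantine if cosine(g, v_hack) > tau else main
        for i, x in enumerate(g):
            target[i] += x
    return main, quarantine


if __name__ == "__main__":
    hack = [[1.0, 0.0], [0.9, 0.1]]
    clean = [[0.0, 1.0], [0.1, 0.9]]
    v = extract_direction(hack, clean)
    tau = calibrate_tau(hack, clean, v)
    main, quarantine = route([[2.0, 0.0], [0.0, 3.0]], v, tau)
    print(main, quarantine)
```

In this toy setup only the `main` accumulator would be applied to the deployed parameters; dropping `quarantine` corresponds to the "deleted at deploy" step in the text.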