mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
paper: fix build, vector figs, +2 plots, de-jargon prose
- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub icon was decorative
- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
- add fig:generalisation (A5 dumbbell) next to tab:generalisation and fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
- rename leaked config codenames in appendix tables (v_hack_full -> "weak (10 pairs)", null_city -> "random (placebo)") with paper:code mapping comments
- de-jargon reader-facing prose per a 3-model external panel (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter, quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward -> hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.

14 pages, zero unresolved refs.
+148
-75
@@ -23,7 +23,6 @@
 \usepackage{listings}
 \usepackage{algorithm}
 \usepackage{algpseudocode}
-\usepackage{fontawesome5}
 \usepackage{hyperref}
 % hyperref defaults colour citations green / links red (ugly). Make refs+cites
 % black (print-clean) and URLs a muted blue.
@@ -80,7 +79,7 @@
 % "coming soon" at camera-ready (cf. the AntiPaSTO github line).
 \begin{center}
 \small
-\faGithub~\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
+\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
 \end{center}

 \begin{abstract}
@@ -103,10 +102,10 @@ the three claims C1/C2/C3.}
 % --min-steps 60 --out out/figs/dyn_sub4.png
 \begin{figure}[!ht]
 \centering
-\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.png}
+\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.pdf}
 \caption{Deploy hack rate (top) and solve rate (bottom) over 60 GRPO steps for
 route (ours) and vanilla GRPO, three seeds each; thick line is the mean, thin
-lines per seed (EMA-5). Deploy evaluation is knob-off at $n{=}64$, $T{=}0.7$.
+lines per seed (EMA-5). Deploy evaluation is adapter-off at $n{=}64$, $T{=}0.7$.
 Vanilla acquires the hack, rising to ${\sim}0.4$, while route stays near zero and
 reaches a higher solve rate (${\sim}0.6$ vs ${\sim}0.4$). The arms differ only in
 the gradient policy (Section~\ref{sec:method}); the gap persists to 200 steps
@@ -124,8 +123,8 @@ the three claims C1/C2/C3.}
 RL post-training reliably induces reward hacking: the policy learns to exploit
 flaws in the grader rather than solve the task. Today's interventions act on the
 reward or advantage \citep{wu2026rebound} and need a detector at scoring time,
-yet at deployment some hacks are unknown. We test whether intervening one step
-deeper, on the \emph{gradient} itself, can stop the policy picking up a hack as
+yet at deployment some hacks are unknown. We test whether intervening at the
+\emph{gradient} level can stop the policy acquiring a hack as
 it forms (Figure~\ref{fig:keynote}). Our hypothesis:

 \begin{quote}
@@ -171,19 +170,19 @@ README ``How it works'' + blog intro.}
 % Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
 % U, Vh frozen and double as the v_hack basis.
 \TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
-singular-value coordinates; we freeze $U,V$ and train a per-module knob
-$\delta_S\in\mathbb{R}^r$ (and a routing knob $\delta_{S,\text{hack}}$) in that
+singular-value coordinates; we freeze $U,V$ and train a per-module adapter
+parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
 basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
 and the projection all live in this same low-rank, weight-aligned space
 ($r\sim500$--$2560$). Two consequences we use:
 \begin{itemize}
 \item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
-never reconstructed on the main path), so a knob-off forward gives
-$\pi_{\text{ref}}$ for free, with no second model.
+never reconstructed on the main path), so an adapter-off forward gives
+$\pi_{\text{ref}}$ with no second model.
 \item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
-hack-ward update parked in the quarantine still moves the training
-model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates exactly
-that routed capability.
+hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
+training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
+exactly that routed capability.
 \end{itemize}

 \subsection{Extracting the hack direction}
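The two consequences listed in the hunk above can be checked on a toy layer. The following is a minimal sketch, not the paper's implementation: it assumes the forward stated in the source comment (`y + U diag(delta_S + delta_S_hack) Vh x`), and the 2x2 rotation factors, `SIGMA`, and all function names are hypothetical values chosen for illustration.

```python
import math

def matvec(M, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Hypothetical frozen factors of one Linear layer, W = U diag(SIGMA) V^T.
U, V, SIGMA = rotation(0.3), rotation(-1.1), [2.0, 0.5]
# W is assembled once for the base path only; the adapter path never rebuilds it.
W = [[sum(U[i][k] * SIGMA[k] * V[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

def forward(x, delta_s, delta_s_hack):
    """Base output plus the adapter update in singular-value coordinates."""
    base = matvec(W, x)
    z = matvec(transpose(V), x)  # rotate the input into the frozen basis
    scaled = [(d + h) * zi for d, h, zi in zip(delta_s, delta_s_hack, z)]
    update = matvec(U, scaled)   # rotate the scaled coordinates back
    return [b + u for b, u in zip(base, update)]

x = [1.0, -2.0]
zero = [0.0, 0.0]
reference = forward(x, zero, zero)           # adapter-off: the base model output
train = forward(x, [0.1, -0.2], [0.4, 0.0])  # training uses the sum of both parameters
deploy = forward(x, [0.1, -0.2], zero)       # deploy zeroes the routing parameter
```

With both parameters at zero the update term is exactly zero, so `reference` equals `W x` without a second copy of the model; `deploy` differs from `train` only by the contribution that was stored in the routing parameter.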
@@ -193,8 +192,8 @@ the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
 algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on
 $\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by
 majority sign, drop the global bottom-25\% singular values as noise floor.
-Pseudocode in Appendix~\ref{app:pseudocode}. No-cheat invariant: the pairs may
-select/calibrate; live routing never reads \texttt{gt\_pass}.}
+Pseudocode in Appendix~\ref{app:pseudocode}. No-label-leakage invariant: the pairs
+may select/calibrate; live routing never reads \texttt{gt\_pass}.}
 % Known issue from docs/pseudocode/02_extract_vhack.py (REVIEW comment): the
 % GRPO=NLL-diff identity holds only at adv=+/-1, no ratio-clip, no length-norm;
 % but extraction is length-normalized (.mean() per completion) while the live
@@ -212,9 +211,9 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 part orthogonal to $v_{\text{hack}}$).
 \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
 calibrated each step from the hack-vs-clean cosine gap) sends that
-rollout's whole gradient into a separate quarantine knob
+rollout's whole gradient into a separate auxiliary adapter
 $\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
-update so the kept knob is fully relieved of it, and deleted at deploy.
+update so the kept adapter is free of it, and deleted at deploy.
 Mechanically this is
 SGTM-style post-backward parameter-gradient masking
 \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
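The per-rollout gate in the `route` item above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: `tau` is passed in as a fixed number (the per-step calibration from the hack-vs-clean cosine gap is not reproduced here), gradients are plain lists, and the names `cosine` and `route_rollouts` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def route_rollouts(rollout_grads, v_hack, tau):
    """Sum per-rollout gradients into a kept and a routed accumulator.

    A rollout whose gradient has cos(g, v_hack) > tau sends its WHOLE
    gradient to the routed accumulator (the auxiliary adapter's update);
    every other rollout contributes to the kept adapter's update.
    """
    dim = len(v_hack)
    kept = [0.0] * dim
    routed = [0.0] * dim
    for g in rollout_grads:
        target = routed if cosine(g, v_hack) > tau else kept
        for i, gi in enumerate(g):
            target[i] += gi
    return kept, routed

# Toy example: only the first rollout is aligned with v_hack.
grads = [[2.0, 0.1], [0.1, 1.0], [-1.0, 0.0]]
kept, routed = route_rollouts(grads, v_hack=[1.0, 0.0], tau=0.5)
```

The gate is all-or-nothing per rollout, which is what distinguishes it from the projection-style `erase` policy: no gradient is split, it is only redirected.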
@@ -231,7 +230,7 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \TODO{outline: Ariahw LeetCode loophole substrate \citep{ariahw2025steering}, 4
 modes, even non-overlapping partition (Appendix~\ref{app:traces},
 6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125;
-deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
+deploy-eval = adapter-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}

 % ===================================================================
 % RESULTS -- evidence tables + figures. Numbers are real where present,
@@ -243,7 +242,10 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}

 Over three seeds at the 60-step preset, route holds deploy hack near zero while
 vanilla GRPO acquires it, and route also raises the solve rate
-(Figure~\ref{fig:keynote}, Table~\ref{tab:keynote}). \TODO{prose -- author.}
+(Figure~\ref{fig:keynote}, Table~\ref{tab:keynote}). The on-policy path still
+learns to hack; route isolates it into the deletable auxiliary adapter, so removing
+that adapter at deploy leaves the hack rate $\approx0$ (Figure~\ref{fig:traindeploy}).
+\TODO{prose -- author.}

 % --- Figure: keynote dynamics -----------------------------------------------
 % Provenance: out/figs/dyn_sub4_hack_overlay.png. Regenerate from the 6 explicit
@@ -279,7 +281,7 @@ vanilla GRPO acquires it, and route also raises the solve rate
 \begin{table}[t]
 \centering
 \caption{Deploy hack and solve rate, mean$\pm$SEM, both arms over 3 seeds
-(41/42/43). 60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off,
+(41/42/43). 60-step fast preset, Qwen3-4B, mix=0.125; deploy = adapter-off,
 $n{=}64$, $T{=}0.7$. Paired $t$-test (route$-$vanilla per seed, df$=$2): hack
 $t{=}8.54$, $p{\approx}0.013$; solve $t{=}2.87$, $p{\approx}0.10$.}
 \label{tab:keynote}
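The $p$-values in the caption above can be recovered from the reported $t$ statistics alone, because Student's $t$ with 2 degrees of freedom has a closed-form CDF, $F(t) = \tfrac12 + t / (2\sqrt{t^2+2})$. The sketch below assumes the test is two-sided (the caption does not say); the function name is hypothetical.

```python
import math

def two_sided_p_df2(t):
    """Two-sided p-value for Student's t with df=2: 1 - |t| / sqrt(t^2 + 2)."""
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)

p_hack = two_sided_p_df2(8.54)   # ~0.013, matching the caption
p_solve = two_sided_p_df2(2.87)  # ~0.10, matching the caption
```

Both reported values are consistent with the two-sided reading. With only three seeds the test has 2 degrees of freedom, which is why a $t$ of 2.87 on solve rate still falls short of conventional significance.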
@@ -295,6 +297,20 @@ vanilla GRPO acquires it, and route also raises the solve rate
 \end{tabular}
 \end{table}

+% Mechanism view: the knob holds deploy hack ~0 while the on-policy (adapter-on)
+% path still learns to hack. Source: scripts/plot_dynamics.py, 60-step preset,
+% out/figs/train_vs_deploy_60_train_deploy.{pdf,csv}.
+\begin{figure}[t]
+\centering
+\includegraphics[width=0.92\linewidth]{../../out/figs/train_vs_deploy_60_train_deploy.pdf}
+\caption{Train (adapter on, dashed) vs deploy (adapter off, solid) over training,
+vanilla vs route (60-step preset). Vanilla encodes the hack in the weights, so
+train and deploy hack rise together. route isolates it: the adapter-on path
+still learns to hack ($\sim0.22$), but with the auxiliary adapter removed at deploy
+the hack rate stays $\approx0$ while solve is preserved.}
+\label{fig:traindeploy}
+\end{figure}
+
 \subsection{C3: directional specificity (controls)}
 % Precedent at the training-hack metric (Appendix~\ref{app:context}, Q10):
 % the null_city placebo pairset gave delta hack +0.024 (no effect) and a
@@ -302,7 +318,7 @@ vanilla GRPO acquires it, and route also raises the solve rate
 % mechanism, not a generic direction. The deploy-metric replication is jobs
 % 80 (placebo) / 81 (random-V) below.
 The deploy-metric controls below replicate a training-hack precedent: at the
-fast preset a semantically random (``null\_city'') pairset moved hack by only
+fast preset a semantically random (placebo) pairset moved hack by only
 $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
 (Appendix~\ref{app:context}, Q10).

@@ -314,8 +330,8 @@ arm to zero deploy hack at $0.625$ solve, because routing needs the direction on
 to \emph{discriminate} hack rollouts, not to \emph{span} the hack subspace in weight
 space; the absorption property of gradient routing
 \citep{cloud2024gradientrouting, sgtm2025localization} then localises the capability
-into the discarded knob. A detector too weak to erase a trained hack is still strong
-enough to route one as it forms.
+into the deleted auxiliary adapter. A detector too weak to erase a trained hack is
+still strong enough to route one as it forms.

 % --- Table: ablation --------------------------------------------------------
 % Provenance (seed 41, 60-step fast preset):
@@ -351,9 +367,9 @@ enough to route one as it forms.
 $\neg$hack-pairs swaps in a semantically random (placebo) pairset. If route's
 suppression were directional, these controls would return toward the vanilla hack
 level; instead the placebo also reaches zero deploy hack, so route's gate is
-direction-agnostic and the suppression is the routed-and-discarded knob absorbing
-whatever crosses the per-step energy threshold, not $v_{\text{hack}}$ pointing at
-the hack. Directional specificity is what the erase arm tests (it subtracts
+direction-agnostic and the suppression is the routed-and-deleted auxiliary adapter
+absorbing whatever crosses the per-step energy threshold, not $v_{\text{hack}}$
+aligned with the hack. Directional specificity is what the erase arm tests (it subtracts
 $\propto\cos(g,v)$); that pair is pending. The post-hoc block (different checkpoint,
 own baseline $0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time
 routing.}
@@ -386,7 +402,7 @@ enough to route one as it forms.
 \subsection{Long-run convergence}

 The 60-step gap persists to convergence: at 200 steps route's deploy hack stays
-pinned at zero while vanilla acquires the hack and rises to ${\sim}0.32$
+held at zero while vanilla acquires the hack and rises to ${\sim}0.32$
 (Figure~\ref{fig:longrun}), with route's solve rate ending higher
 (${\sim}0.61$ vs ${\sim}0.47$). \TODO{prose -- author.}

@@ -408,7 +424,7 @@ pinned at zero while vanilla acquires the hack and rises to ${\sim}0.32$
 % coherent: lp_s in [-0.47,-0.29] through step 200), not a property of vanilla GRPO.
 \begin{figure}[t]
 \centering
-\includegraphics[width=0.95\linewidth]{../../out/figs/dyn_longrun_200.png}
+\includegraphics[width=0.95\linewidth]{../../out/figs/dyn_longrun_200.pdf}
 \caption{Deploy hack (red) and solve (green) over 200 GRPO steps, seed 41
 (deploy evaluation $n{=}64$, $T{=}0.7$, EMA-5). route (right) holds deploy hack
 at $0$ for all 200 steps while solve climbs to ${\sim}0.61$ and plateaus.
@@ -421,10 +437,11 @@ pinned at zero while vanilla acquires the hack and rises to ${\sim}0.32$
 \label{fig:longrun}
 \end{figure}

-\subsection{C2: generalisation to held-out modes (the no-cheat payload)}
+\subsection{C2: generalisation to held-out modes (the zero-label test)}

 route suppresses deploy hack on loophole modes the route gate never saw a label
-for, not only the demonstrated mode (Table~\ref{tab:generalisation}).
+for, not only the demonstrated mode (Table~\ref{tab:generalisation},
+Figure~\ref{fig:generalisation}).
 \TODO{prose -- author. This is now the designed held-out test (jobs 103/104):
 the teacher demonstrates only \texttt{run\_tests}; the other three modes are
 suppressed with zero labels. Vanilla baseline column filled (job 103); route2
@@ -461,17 +478,17 @@ column pending job 104.}
 and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
 absent from both: it is not in $v$, not in the teacher pool, and the live route
 gate (\texttt{teacher OR hacked\_E}) is near-silent there: the \texttt{run\_tests}
-detector cross-fires on only $0/1386$ \texttt{sentinel}, $2/1337$
+detector fires on only $0/1386$ \texttt{sentinel}, $2/1337$
 \texttt{file\_marker}, and $17/1540$ \texttt{stdout\_marker} student rollouts
 ($\le 1.1\%$, detector false positives, not by-design labels), so $\ge 98.9\%$
 of held-out hacks reach the optimiser unlabelled. The student still learns the
 held-out
-cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
-\texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
-three. So the quarantine removes cheats it was never given a label for. Mechanism
+cheats on-policy (adapter-on train-hack: \texttt{file\_marker} $0.54$,
+\texttt{stdout\_marker} $0.60$), yet adapter-off deploy hack is near zero on all
+three. So the auxiliary adapter removes hacks it was never given a label for. Mechanism
 caveat: the placebo control (Table~\ref{tab:ablation}) shows route's gate is
-direction-agnostic, so the held-out suppression is the deletable quarantine
-generalising, not $v(\texttt{run\_tests})$ pointing at the other modes; the
+direction-agnostic, so the held-out suppression is the deletable auxiliary adapter
+generalising, not $v(\texttt{run\_tests})$ aligned with the other modes; the
 zero-label result stands either way. $n{=}1$; $n{=}3$ queued (jobs 107--110).}
 \label{tab:generalisation}
 \begin{tabular}{lccc}
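The percentages in the caption above follow directly from the three fire counts it quotes. The short check below uses only those counts; the variable names are hypothetical, and the bounds are read as rounded to one decimal place (the largest raw rate is slightly above 1.1%).

```python
# (detector fires, student rollouts) per held-out mode, as quoted in the caption.
fires = {
    "sentinel": (0, 1386),
    "file_marker": (2, 1337),
    "stdout_marker": (17, 1540),
}

# Fraction of rollouts on which the run_tests detector fired, per mode.
rates = {mode: hits / total for mode, (hits, total) in fires.items()}

worst_pct = round(100 * max(rates.values()), 1)  # highest false-positive rate, in %
unlabelled_pct = round(100 - worst_pct, 1)       # share reaching the optimiser unlabelled
```

The worst case is `stdout_marker`; the other two modes are well below it, so the caption's bound is conservative for them.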
@@ -486,6 +503,20 @@ column pending job 104.}
 \end{tabular}
 \end{table}

+% Dumbbell view of Table~\ref{tab:generalisation}. Source: jobs 103 (vanilla)
+% + 104 (route2), per_mode_deploy.json; same numbers as the table.
+\begin{figure}[t]
+\centering
+\includegraphics[width=\linewidth]{../../out/figs/a5_generalisation.pdf}
+\caption{Per-mode deploy hack (left, $\downarrow$) and solve (right, $\uparrow$),
+vanilla vs route, A5 held-out test ($n{=}1$, seed 41; same run as
+Table~\ref{tab:generalisation}). Only \texttt{run\_tests} is demonstrated to the
+teacher and present in $v_{\text{hack}}$ (``IN''); the other three modes are
+held out. The arrow is route's effect: deploy hack drops on every held-out mode
+($\le1\%$ of their labels reach the gate) while solve is largely preserved.}
+\label{fig:generalisation}
+\end{figure}
+
 \subsection{C4: the teacher bootstrap is an accelerant, not the signal}
 \label{sec:bootstrap}
 % PROVENANCE:
@@ -523,7 +554,7 @@ Three of the four suppressed modes have zero teacher examples, so their gradient
 cannot be teacher-injected.

 Third, the clean control cuts the teacher entirely at step 40 (seed, then pure
-on-policy to 200) for both vanilla and route2. If the teacher were load-bearing,
+on-policy to 200) for both vanilla and route2. If the teacher were necessary,
 vanilla hacking would decay and route2's suppression would lose its target after
 the cut; if it is an accelerant, vanilla keeps hacking and route2 keeps holding
 deploy hack near zero. \TODO{figure from jobs 93/94 (\texttt{--teacher-off-step=40},
@@ -545,16 +576,16 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 % only-constrains-training" is thin unless argued; and we never measured
 % whether our hack-basis and their clean-basis are the same subspace (if they
 % coincide, +project-onto and -project-out converge). Attack vector for a reviewer.
-\item Trusted-direction projection \citep{huang2026directional}: the near-twin.
+\item Trusted-direction projection \citep{huang2026directional}: the closest method.
 They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
 clean warmup and project the live gradient \emph{onto} its dominant
 left-singular directions. We extract a hack direction from a few
 contrastive (hack, clean) pair gradients and project it \emph{out}, in the
 frozen SVD-of-$W$ $\delta_S$ coordinates. Both directions live in weight
 space; the signal differs (their clean update trajectory needs a warmup,
-ours is a handful of labelled pair gradients), and we quarantine the
-removed part into a deploy-deletable knob, where their projection only
-constrains training.
+ours is a handful of labelled pair gradients), and we route the
+removed part into a deploy-deletable auxiliary adapter, where their projection
+only constrains training.
 % COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
 % API. "post-backward vs forward .detach()" reads as engineering taste to an
 % RL reader; "we route in parameter-gradient space, Cloud routes in activation
@@ -564,18 +595,31 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
 we operate post-backward on parameter gradients (next bullet).
 \item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
-closest mechanistic relative, a post-backward parameter-gradient mask
+closest mechanistic analogue, a post-backward parameter-gradient mask
 over a forget subspace, tolerant to label noise, where the forgotten
 capability leaking back into the kept weights shrinks as model size
 grows (supports our scalability argument). We differ in the
 mask \emph{source}: SGTM tags training data per example over fixed
 reserved dims; we extract one hack direction from a few contrastive pairs
-and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
-weak-detector axis.
+and route by cosine. Their TPR/FPR detector quality maps to our
+weak-detector (held-out) axis.
 \item Advantage-level intervention \citep{wu2026rebound}: representation-
-informed advantage modulation; ours is gradient-level (one step deeper,
-after the reward is computed). A matched-compute head-to-head is future
-work.
+informed advantage modulation; ours is gradient-level (after the reward
+is computed). A matched-compute comparison is future work.
+\item Other gradient-level reward-hacking work
+\citep{wang2026grift, ono2026signcert, ackermann2026gradreg}: GRIFT
+\emph{detects} hacking from the gradient of the CoT (we intervene on it);
+SignCert-PO down-weights advantage-sign-non-robust completions in the
+policy gradient; gradient regularisation biases updates toward flat,
+reward-accurate minima. All act on the gradient but none localise a hack
+subspace into a deletable auxiliary adapter.
+\item Contrastive-direction removal on the reward model
+\citep{liu2026harve, fierro2025weightarithmetic}: HARVE builds a hacking
+subspace from residual-stream diffs of gold-vs-hacked pairs and projects
+the reward-head vector off it; weight arithmetic isolates a behaviour
+direction by subtracting two opposite fine-tunes. Both remove a
+contrastive direction like our $v_{\text{hack}}$, but edit a static
+model, not the live policy gradient.
 \item Reward-for-honesty \citep{joglekar2025confessions}: we reject this
 design, since it reintroduces a live judge over student rollouts and
 invites monitor obfuscation (arXiv:2503.11926).
@@ -600,11 +644,31 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
 older idea that a capability can be confined to a weight subset, via a
 per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
-Our quarantine $\delta_{S,\text{hack}}$ is a deletable adapter in that
+Our auxiliary $\delta_{S,\text{hack}}$ is a deletable adapter in that
 family. Two differences: these methods \emph{add} a wanted task and pick
 the subset from a given task label, whereas we \emph{remove} an unwanted
 capability and pick the subset from a gradient signal ($\cos$ to
-$v_{\text{hack}}$), with no per-rollout label.
+$v_{\text{hack}}$), with no per-rollout label. The deletable-adapter idea
+itself has a 2023 precedent: separable ``security vectors''
+\citep{zhou2023securityvectors} absorb a harmful fine-tuning update so
+the backbone never learns it, deactivated at inference; ours differs in
+being RL reward hacking (not SFT harmful data), an extracted direction
+plus cosine routing (not fixed reserved params), and the weak (held-out)
+detector.
+\item Orthogonal gradient projection lineage
+\citep{yu2020pcgrad, ilharco2023taskarithmetic, qiao2025pegp,
+shamsian2025orthograd, sun2026ogpsa}: projecting a gradient out of a
+subspace originates in multi-task learning (PCGrad), weight-space task
+arithmetic (negating a task vector), continual PEFT (PEGP), and
+unlearning (OrthoGrad, OGPSA). We repurpose it to remove a hack
+direction during RL rather than to preserve old tasks or unlearn data.
+\item Reward-hacking generalises across types
+\citep{taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize,
+wang2026rewardhackingsurvey}: SFT on harmless hacks generalises to new
+hack settings and to unrelated misalignment, and hacking transfers
+across tasks. This is the premise behind our held-out test
+(C2), and the localize-then-delete framing was proposed in the alignment
+community \citep{mallen2025rhinterventions} before this work.
 % Anticipated critique (Piggyback learns its mask end-to-end via a differentiable
 % real-valued threshold): why is our route gate a per-step calibrated cosine
 % threshold rather than a learned mask? Answer for the rebuttal: a learned mask
@@ -613,8 +677,12 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 % unsupervised hack-vs-clean cos gap, not a trained parameter.
 % LoRA's rank-deficiency finding is mild external support for our low-rank hack
 % subspace (~10 pairs => rank-10).
-% \TODO{abstract-only twins to verify+place: GRIFT (gradient fingerprints,
-% arXiv:2604.16242); Spilling the Beans (OOD self-report, arXiv:2511.06626).}
+% GRIFT now cited (gradient-level competitor bullet above). Still unplaced:
+% Spilling the Beans (OOD self-report, arXiv:2511.06626).
+\item Gradient surgery against shortcuts in SFT reasoning
+\citep{cao2026sart}: a per-sample shortcut score flags shortcut-promoting
+examples, then orthogonal projection removes those updates. The
+supervised-reasoning analogue of our GRPO reward-hacking setting.
 \end{itemize}

 \section{Lessons learned / discussion}
@@ -622,7 +690,7 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 stale fast (cos to live gradient decays $\sim$0.28$\to$0.07 by step 10), so
 online refresh helps; (b) Adam momentum leak (projection does not touch the
 buffer) -- bounded on frozen-V, open under refresh; (c) erase vs route trade-off
-and why route's per-rollout gate + scale-matched quarantine beat the earlier
+and why route's per-rollout gate + scale-matched auxiliary adapter beat the earlier
 shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regime.}

 \section{Why this matters for alignment}
@@ -637,13 +705,13 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
 extending gradient routing to one RL side-effect, we give evidence it
 may be promising for others.
 % --- agent-suggested, keep or cut ---
-\item \TODO{(agent-suggested) the quarantine knob is \emph{deletable}: you get
-a localized handle on the unwanted behaviour rather than hoping a
+\item \TODO{(agent-suggested) the auxiliary adapter is \emph{deletable}: you get
+localized control over the unwanted behaviour rather than hoping a
 penalty suppressed a latent capability (cf.\ unlearning-via-ablation in
 \citep{cloud2024gradientrouting}).}
 \item \TODO{(agent-suggested) it acts \emph{during} training, before the
-behaviour bakes across all weights; our post-hoc test-time erasure
-control tests whether that timing earns its cost.}
+behaviour is distributed across all weights; our post-hoc test-time erasure
+control tests whether that timing justifies its cost.}
 \item \TODO{think more -- author.}
 \end{itemize}

@@ -654,14 +722,15 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
 \item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
 rollouts) due to compute limits, rather than waiting for endogenous
 emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
-teacher is an accelerant, not the suppressed signal (the student
-out-hacks it $\sim3{:}1$ from step 40, and emerges three held-out modes
-with zero teacher examples); the teacher-off control there closes it.
-\item Little optimization done per component, so there is likely substantial
-low-hanging fruit in each novel piece (extraction, gating, refresh,
-quarantine scale-matching).
+teacher accelerates emergence, not the suppressed signal (the student
+produces ${\sim}3{:}1$ more hacks than the teacher from step 40, and emerges
+three held-out modes with zero teacher examples); the teacher-off control
+there closes it.
+\item Little optimization done per component, so there are likely substantial
+straightforward improvements in each novel piece (extraction, gating, refresh,
+auxiliary-adapter scale-matching).
 \item \TODO{also: $n{=}3$, 60-step fast preset, single substrate; C2 clean
-2-of-4 design unqueued; solve-rate matching band; no-cheat invariant
+2-of-4 design unqueued; solve-rate matching band; no-label-leakage
 assumptions.}
 \end{itemize}

@@ -835,14 +904,14 @@ The erase and route gradient policies at the optimizer step
 \caption{Gradient policy at the optimizer step (per Linear)}
 \label{alg:step}
 \begin{algorithmic}[1]
-\Statex \textbf{erase (one-sided)} -- remove only the hack-ward component:
+\Statex \textbf{erase (one-sided)} -- remove only the hack-aligned component:
 \State $c \gets v_{\mathrm{hack}}\, g$;\quad $c_+ \gets \mathrm{relu}(c)$
 \State $g \gets g - c_+\, v_{\mathrm{hack}}$;\quad $\mathrm{opt.step}(g)$
 \Statex
-\Statex \textbf{route} -- the removed part trains a deletable quarantine knob:
+\Statex \textbf{route} -- the removed part trains a deletable auxiliary adapter:
 \State $r \gets \mathrm{relu}(v_{\mathrm{hack}}\, g)\, v_{\mathrm{hack}}$
-\State $\mathrm{opt.step}(\delta_S,\ g - r)$ \Comment{main knob learns the orthogonal complement}
-\State $\mathrm{opt.step}(\delta_{S,\mathrm{hack}},\ r)$ \Comment{quarantine absorbs the hack-ward part}
+\State $\mathrm{opt.step}(\delta_S,\ g - r)$ \Comment{kept adapter learns the orthogonal complement}
+\State $\mathrm{opt.step}(\delta_{S,\mathrm{hack}},\ r)$ \Comment{auxiliary adapter absorbs the hack-aligned part}
 \State at deploy: $\delta_{S,\mathrm{hack}} \gets 0$
 \end{algorithmic}
 \end{algorithm}
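The two policies in Algorithm `alg:step` above reduce to a one-sided projection. The sketch below is an illustration only: it assumes a single unit-norm `v_hack` (the paper's basis may have $k>1$ directions), replaces `opt.step` with returning the gradient each parameter would receive, and uses hypothetical function names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hack_component(g, v_hack):
    """r = relu(v_hack . g) * v_hack: the hack-aligned part of g (one-sided)."""
    c_plus = max(0.0, dot(v_hack, g))  # relu: ignore gradients pointing away
    return [c_plus * vi for vi in v_hack]

def erase_step(g, v_hack):
    """erase: drop the hack-aligned component and train on the remainder."""
    r = hack_component(g, v_hack)
    return [gi - ri for gi, ri in zip(g, r)]

def route_step(g, v_hack):
    """route: the kept adapter gets g - r, the auxiliary adapter gets r."""
    r = hack_component(g, v_hack)
    kept = [gi - ri for gi, ri in zip(g, r)]
    return kept, r

v = [0.6, 0.8]                         # unit-norm direction (assumed)
kept, routed = route_step([1.0, 2.0], v)
untouched = erase_step([-1.0, -2.0], v)  # anti-aligned gradient passes through
```

Because the relu is one-sided, a gradient that points away from `v_hack` is left unchanged, and for an aligned gradient `kept + routed` reconstructs the original `g`, so nothing is discarded during training; the removal happens only when the auxiliary parameter is zeroed at deploy.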
@@ -921,7 +990,7 @@ and basis strength (Table~\ref{tab:ctx-basis}).
 % results.md Q6 (v_hack_full, frozen, one_sided; paired Delta vs same-seed vanilla).
 \begin{table}[h]
 \centering
-\caption{Teacher density: the hack cut holds as the pool thins and the solve
+\caption{Teacher density: the hack reduction holds as the pool shrinks and the solve
 cost vanishes at low mix (results.md Q6); mix=0.125 is the locked default.
 Paired $\Delta$ vs same-seed vanilla.}
 \label{tab:ctx-mix}
@@ -942,8 +1011,8 @@ and basis strength (Table~\ref{tab:ctx-basis}).
 \centering
 \caption{Pair-set content: it is the hack \emph{mechanism}, not the framing
 (results.md Q10). $n{=}1$/row, seed 41; $\pm0.06$ is baseline noise, so
-everything from \texttt{intent\_vs\_spec} down is null. The \texttt{null\_city}
-placebo sits at $+0.024$ (no effect), as a control should.
+everything from \texttt{intent\_vs\_spec} down is null. The random (placebo)
+pair set sits at $+0.024$ (no effect), as a control should.
 \TODO{this table is hard to read without seeing a pair: add an appendix with
 one example (hack, clean) pair per pair set listed here, so the reader can judge
 what ``hack mechanism'' vs ``semantic framing'' vs ``random content'' means.}}
@@ -959,13 +1028,17 @@ and basis strength (Table~\ref{tab:ctx-basis}).
 \texttt{moral} & semantic framing & $-0.005$ \\
 \texttt{eval\_aware} & semantic framing & $+0.010$ \\
 \texttt{philosophical} & semantic framing & $+0.017$ \\
-\texttt{null\_city} (placebo) & random content & $+0.024$ \\
+% paper "random (placebo)" := config pair-set null_city
+random (placebo) & random content & $+0.024$ \\
 \bottomrule
 \end{tabular}
 \end{table}

-% results.md Q8 (mix=0.5, frozen, one_sided). Basis NAMES mislead: v_hack_full
-% = 10 pairs/k=5; v_hack_21pairs = 16 pairs/k=12 (triple-confounded).
+% results.md Q8 (mix=0.5, frozen, one_sided). Basis NAMES mislead, so the paper
+% uses descriptive labels. paper:code mapping --
+% "weak (10 pairs, k=5)" := config v_hack_full (10 pairs, k=5)
+% "strong (16 pairs, k=12)" := config v_hack_21pairs (16 pairs, k=12)
+% (triple-confounded across pairs/k/extract-tau).
 \begin{table}[h]
 \centering
 \caption{Basis strength (results.md Q8): the stronger basis cuts hack
@@ -978,8 +1051,8 @@ and basis strength (Table~\ref{tab:ctx-basis}).
 Basis & Train hack $\downarrow$ & Train solve $\uparrow$ & $n$ \\
 \midrule
 Vanilla & $0.719 \pm 0.120$ & $0.306 \pm 0.116$ & 4 \\
-\texttt{v\_hack\_full} (weak, 10pr/$k$5) & $0.700 \pm 0.109$ & $0.283 \pm 0.038$ & 3 \\
-\texttt{v\_hack\_21pairs} (16pr/$k$12) & $0.588 \pm 0.131$ & $0.256 \pm 0.083$ & 4 \\
+weak (10 pairs, $k{=}5$) & $0.700 \pm 0.109$ & $0.283 \pm 0.038$ & 3 \\
+strong (16 pairs, $k{=}12$) & $0.588 \pm 0.131$ & $0.256 \pm 0.083$ & 4 \\
 \bottomrule
 \end{tabular}
 \end{table}
+76
-1
@@ -121,7 +121,6 @@
 % IDs from docs/grad_routing/{related_work,search_for_more}.md. Authors NOT
 % filled (not read) -- do not cite as @misc with invented authors. Verify
 % byline from arXiv before promoting any of these into the bibliography:
-% GRIFT (gradient fingerprints to detect/reject hacking) arXiv:2604.16242
 % Spilling the Beans (SFT self-report generalises OOD) arXiv:2511.06626
 % Baker et al. (weak monitor -> obfuscated reward hacking) arXiv:2503.11926

@@ -321,3 +320,79 @@
 howpublished = {Alignment Forum},
 url = {https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks}
 }
+
+% --- verified additions, 2026-06-05 (abstract verbatim from Semantic Scholar /
+% arXiv API via scripts; byline cross-checked arXiv + OpenAlex; see
+% docs/grad_routing/search_2026-06-04_related_work.md) -----------------------
+
+% The uncited gradient-level NEAR-TWIN on the signal axis: GRIFT DETECTS reward
+% hacking from the gradient of the CoT (we INTERVENE on the gradient). Resolves
+% the flagged abstract-only twin. Differentiate detect-vs-intervene.
+@misc{wang2026grift,
+title = {Detecting and Suppressing Reward Hacking with Gradient Fingerprints},
+author = {Wang, Songtao and Pham, Quang Hieu and Yin, Fangcong and Wang, Xinpeng and Chen, Jocelyn Qiaochu and Durrett, Greg and Ye, Xi},
+year = {2026},
+eprint = {2604.16242},
+archivePrefix= {arXiv},
+primaryClass = {cs.LG},
+url = {https://arxiv.org/abs/2604.16242},
+abstract = {Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25\% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces.}
+}
+
+% Gradient-level reward-hacking competitor: re-weights the policy gradient by an
+% advantage-sign-robustness certificate. Same lab as ackermann2026gradreg (RIKEN).
+@misc{ono2026signcert,
+title = {Mitigating Reward Hacking in RLHF via Advantage Sign Robustness},
+author = {Ono, Shinnosuke and Ackermann, Johannes and Nishimori, Soichiro and Ishida, Takashi and Sugiyama, Masashi},
+year = {2026},
+eprint = {2604.02986},
+archivePrefix= {arXiv},
+primaryClass = {cs.LG},
+url = {https://arxiv.org/abs/2604.02986},
+abstract = {Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.}
+}
+
+% --- gradient-projection lineage (the continual-learning ancestry of "project
+% a gradient out of a subspace"; a CL reviewer expects these) ----------------
+
+% PCGrad: the canonical "project a gradient onto the normal plane of a
+% conflicting one" primitive (multi-task learning).
+@inproceedings{yu2020pcgrad,
+title = {Gradient Surgery for Multi-Task Learning},
+author = {Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea},
+booktitle = {Neural Information Processing Systems (NeurIPS)},
+year = {2020},
+eprint = {2001.06782},
+archivePrefix= {arXiv},
+primaryClass = {cs.LG},
+url = {https://arxiv.org/abs/2001.06782},
+abstract = {While deep learning and deep reinforcement learning (RL) systems have demonstrated impressive results in domains such as image classification, game playing, and robotic control, data efficiency remains a major challenge. Multi-task learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. However, the multi-task setting presents a number of optimization challenges, making it difficult to realize large efficiency gains compared to learning tasks independently. The reasons why multi-task learning is so challenging compared to single-task learning are not fully understood. In this work, we identify a set of three conditions of the multi-task optimization landscape that cause detrimental gradient interference, and develop a simple yet general approach for avoiding such interference between task gradients. We propose a form of gradient surgery that projects a task's gradient onto the normal plane of the gradient of any other task that has a conflicting gradient. On a series of challenging multi-task supervised and multi-task RL problems, this approach leads to substantial gains in efficiency and performance. Further, it is model-agnostic and can be combined with previously-proposed multi-task architectures for enhanced performance.}
+}
+
+% Task arithmetic: negating a weight-space task vector reduces a behavior --
+% the weight-direction-removal precedent for our extracted v_hack (cite as
+% steering/negation, NOT as unlearning: they frame it as model editing).
+@inproceedings{ilharco2023taskarithmetic,
+title = {Editing Models with Task Arithmetic},
+author = {Ilharco, Gabriel and Ribeiro, Marco Tulio and Wortsman, Mitchell and Gururangan, Suchin and Schmidt, Ludwig and Hajishirzi, Hannaneh and Farhadi, Ali},
+booktitle = {International Conference on Learning Representations (ICLR)},
+year = {2023},
+eprint = {2212.04089},
+archivePrefix= {arXiv},
+primaryClass = {cs.LG},
+url = {https://arxiv.org/abs/2212.04089},
+abstract = {Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around task vectors. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D'', combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.}
+}
+
+% PEGP: orthogonal gradient projection inside Adapter/LoRA/Prefix/Prompt PEFT --
+% the closest methodological prior art to our erase arm (orthogonal projection
+% in a low-rank tuning subspace, repurposed there for anti-forgetting). TPAMI.
+@article{qiao2025pegp,
+title = {Gradient Projection for Continual Parameter-Efficient Tuning},
+author = {Qiao, Jingyang and Zhang, Zhizhong and Tan, Xin and Qu, Yanyun and Zhang, Wensheng and Han, Zhi and Xie, Yuan},
+journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
+year = {2025},
+doi = {10.1109/TPAMI.2025.3587032},
+url = {https://doi.org/10.1109/TPAMI.2025.3587032},
+abstract = {Parameter-efficient tunings (PETs) have demonstrated impressive performance and promising perspectives in training large models, while they are still confronted with a common problem: the trade-off between learning new content and protecting old knowledge, leading to zero-shot generalization collapse, and cross-modal hallucination. In this paper, we reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection, and first propose a unified framework called Parameter Efficient Gradient Projection (PEGP). We introduce orthogonal gradient projection into different PET paradigms and theoretically demonstrate that the orthogonal condition for the gradient can effectively resist forgetting even for large-scale models. It therefore modifies the gradient towards the direction that has less impact on the old feature space, with less extra memory space and training time. We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets, and experiments comprehensively demonstrate its efficiency in reducing forgetting in class, online class, domain, task, and multi-modality continual settings.}
+}