paper: fix build, vector figs, +2 plots, de-jargon prose

- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub
  icon was decorative
- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
- add fig:generalisation (A5 dumbbell) next to tab:generalisation and
  fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
- rename leaked config codenames in appendix tables (v_hack_full ->
  "weak (10 pairs)", null_city -> "random (placebo)") with paper:code
  mapping comments
- de-jargon reader-facing prose per a 3-model external panel
  (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter,
  quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward ->
  hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.

14 pages, zero unresolved refs.
wassname
2026-06-05 14:25:03 +08:00
parent 04562c5226
commit 07e1eb8753
2 changed files with 224 additions and 76 deletions
+148 -75
@@ -23,7 +23,6 @@
\usepackage{listings}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{fontawesome5}
\usepackage{hyperref}
% hyperref defaults colour citations green / links red (ugly). Make refs+cites
% black (print-clean) and URLs a muted blue.
@@ -80,7 +79,7 @@
% "coming soon" at camera-ready (cf. the AntiPaSTO github line).
\begin{center}
\small
\faGithub~\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
\end{center}
\begin{abstract}
@@ -103,10 +102,10 @@ the three claims C1/C2/C3.}
% --min-steps 60 --out out/figs/dyn_sub4.png
\begin{figure}[!ht]
\centering
\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.png}
\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.pdf}
\caption{Deploy hack rate (top) and solve rate (bottom) over 60 GRPO steps for
route (ours) and vanilla GRPO, three seeds each; thick line is the mean, thin
lines per seed (EMA-5). Deploy evaluation is knob-off at $n{=}64$, $T{=}0.7$.
lines per seed (EMA-5). Deploy evaluation is adapter-off at $n{=}64$, $T{=}0.7$.
Vanilla acquires the hack, rising to ${\sim}0.4$, while route stays near zero and
reaches a higher solve rate (${\sim}0.6$ vs ${\sim}0.4$). The arms differ only in
the gradient policy (Section~\ref{sec:method}); the gap persists to 200 steps
@@ -124,8 +123,8 @@ the three claims C1/C2/C3.}
RL post-training reliably induces reward hacking: the policy learns to exploit
flaws in the grader rather than solve the task. Today's interventions act on the
reward or advantage \citep{wu2026rebound} and need a detector at scoring time,
yet at deployment some hacks are unknown. We test whether intervening one step
deeper, on the \emph{gradient} itself, can stop the policy picking up a hack as
yet at deployment some hacks are unknown. We test whether intervening at the
\emph{gradient} level can stop the policy acquiring a hack as
it forms (Figure~\ref{fig:keynote}). Our hypothesis:
\begin{quote}
@@ -171,19 +170,19 @@ README ``How it works'' + blog intro.}
% Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
% U, Vh frozen and double as the v_hack basis.
\TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
singular-value coordinates; we freeze $U,V$ and train a per-module knob
$\delta_S\in\mathbb{R}^r$ (and a routing knob $\delta_{S,\text{hack}}$) in that
singular-value coordinates; we freeze $U,V$ and train a per-module adapter
parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
and the projection all live in this same low-rank, weight-aligned space
($r\sim500$--$2560$). Two consequences we use:
\begin{itemize}
\item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
never reconstructed on the main path), so a knob-off forward gives
$\pi_{\text{ref}}$ for free, with no second model.
never reconstructed on the main path), so an adapter-off forward gives
$\pi_{\text{ref}}$ with no second model.
\item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
hack-ward update parked in the quarantine still moves the training
model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates exactly
that routed capability.
hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
exactly that routed capability.
\end{itemize}
\subsection{Extracting the hack direction}
@@ -193,8 +192,8 @@ the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on
$\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by
majority sign, drop the global bottom-25\% singular values as noise floor.
Pseudocode in Appendix~\ref{app:pseudocode}. No-cheat invariant: the pairs may
select/calibrate; live routing never reads \texttt{gt\_pass}.}
Pseudocode in Appendix~\ref{app:pseudocode}. No-label-leakage invariant: the pairs
may select/calibrate; live routing never reads \texttt{gt\_pass}.}
% Known issue from docs/pseudocode/02_extract_vhack.py (REVIEW comment): the
% GRPO=NLL-diff identity holds only at adv=+/-1, no ratio-clip, no length-norm;
% but extraction is length-normalized (.mean() per completion) while the live
@@ -212,9 +211,9 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
part orthogonal to $v_{\text{hack}}$).
\item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
calibrated each step from the hack-vs-clean cosine gap) sends that
rollout's whole gradient into a separate quarantine knob
rollout's whole gradient into a separate auxiliary adapter
$\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
update so the kept knob is fully relieved of it, and deleted at deploy.
update so the kept adapter is free of it, and deleted at deploy.
Mechanically this is
SGTM-style post-backward parameter-gradient masking
\citep{sgtm2025localization} in the SVD basis (we strip/redirect the
@@ -231,7 +230,7 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
\TODO{outline: Ariahw LeetCode loophole substrate \citep{ariahw2025steering}, 4
modes, even non-overlapping partition (Appendix~\ref{app:traces},
6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125;
deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
deploy-eval = adapter-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
% ===================================================================
% RESULTS -- evidence tables + figures. Numbers are real where present,
@@ -243,7 +242,10 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
Over three seeds at the 60-step preset, route holds deploy hack near zero while
vanilla GRPO acquires it, and route also raises the solve rate
(Figure~\ref{fig:keynote}, Table~\ref{tab:keynote}). \TODO{prose -- author.}
(Figure~\ref{fig:keynote}, Table~\ref{tab:keynote}). The on-policy path still
learns to hack; route isolates it into the deletable auxiliary adapter, so removing
that adapter at deploy leaves the hack rate $\approx0$ (Figure~\ref{fig:traindeploy}).
\TODO{prose -- author.}
% --- Figure: keynote dynamics -----------------------------------------------
% Provenance: out/figs/dyn_sub4_hack_overlay.png. Regenerate from the 6 explicit
@@ -279,7 +281,7 @@ vanilla GRPO acquires it, and route also raises the solve rate
\begin{table}[t]
\centering
\caption{Deploy hack and solve rate, mean$\pm$SEM, both arms over 3 seeds
(41/42/43). 60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off,
(41/42/43). 60-step fast preset, Qwen3-4B, mix=0.125; deploy = adapter-off,
$n{=}64$, $T{=}0.7$. Paired $t$-test (route$-$vanilla per seed, df$=$2): hack
$t{=}8.54$, $p{\approx}0.013$; solve $t{=}2.87$, $p{\approx}0.10$.}
\label{tab:keynote}
@@ -295,6 +297,20 @@ vanilla GRPO acquires it, and route also raises the solve rate
\end{tabular}
\end{table}
% Mechanism view: the knob holds deploy hack ~0 while the on-policy (adapter-on)
% path still learns to hack. Source: scripts/plot_dynamics.py, 60-step preset,
% out/figs/train_vs_deploy_60_train_deploy.{pdf,csv}.
\begin{figure}[t]
\centering
\includegraphics[width=0.92\linewidth]{../../out/figs/train_vs_deploy_60_train_deploy.pdf}
\caption{Train (adapter on, dashed) vs deploy (adapter off, solid) over training,
vanilla vs route (60-step preset). Vanilla encodes the hack in the weights, so
train and deploy hack rise together. route isolates it: the adapter-on path
still learns to hack ($\sim0.22$), but with the auxiliary adapter removed at deploy
the hack rate stays $\approx0$ while solve is preserved.}
\label{fig:traindeploy}
\end{figure}
\subsection{C3: directional specificity (controls)}
% Precedent at the training-hack metric (Appendix~\ref{app:context}, Q10):
% the null_city placebo pairset gave delta hack +0.024 (no effect) and a
@@ -302,7 +318,7 @@ vanilla GRPO acquires it, and route also raises the solve rate
% mechanism, not a generic direction. The deploy-metric replication is jobs
% 80 (placebo) / 81 (random-V) below.
The deploy-metric controls below replicate a training-hack precedent: at the
fast preset a semantically random (``null\_city'') pairset moved hack by only
fast preset a semantically random (placebo) pairset moved hack by only
$+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
(Appendix~\ref{app:context}, Q10).
@@ -314,8 +330,8 @@ arm to zero deploy hack at $0.625$ solve, because routing needs the direction on
to \emph{discriminate} hack rollouts, not to \emph{span} the hack subspace in weight
space; the absorption property of gradient routing
\citep{cloud2024gradientrouting, sgtm2025localization} then localises the capability
into the discarded knob. A detector too weak to erase a trained hack is still strong
enough to route one as it forms.
into the deleted auxiliary adapter. A detector too weak to erase a trained hack is
still strong enough to route one as it forms.
% --- Table: ablation --------------------------------------------------------
% Provenance (seed 41, 60-step fast preset):
@@ -351,9 +367,9 @@ enough to route one as it forms.
$\neg$hack-pairs swaps in a semantically random (placebo) pairset. If route's
suppression were directional, these controls would return toward the vanilla hack
level; instead the placebo also reaches zero deploy hack, so route's gate is
direction-agnostic and the suppression is the routed-and-discarded knob absorbing
whatever crosses the per-step energy threshold, not $v_{\text{hack}}$ pointing at
the hack. Directional specificity is what the erase arm tests (it subtracts
direction-agnostic and the suppression is the routed-and-deleted auxiliary adapter
absorbing whatever crosses the per-step energy threshold, not $v_{\text{hack}}$
aligned with the hack. Directional specificity is what the erase arm tests (it subtracts
$\propto\cos(g,v)$); that pair is pending. The post-hoc block (different checkpoint,
own baseline $0.391/0.302$, $n{=}192$) tests test-time erasure, not training-time
routing.}
@@ -386,7 +402,7 @@ enough to route one as it forms.
\subsection{Long-run convergence}
The 60-step gap persists to convergence: at 200 steps route's deploy hack stays
pinned at zero while vanilla acquires the hack and rises to ${\sim}0.32$
held at zero while vanilla acquires the hack and rises to ${\sim}0.32$
(Figure~\ref{fig:longrun}), with route's solve rate ending higher
(${\sim}0.61$ vs ${\sim}0.47$). \TODO{prose -- author.}
@@ -408,7 +424,7 @@ pinned at zero while vanilla acquires the hack and rises to ${\sim}0.32$
% coherent: lp_s in [-0.47,-0.29] through step 200), not a property of vanilla GRPO.
\begin{figure}[t]
\centering
\includegraphics[width=0.95\linewidth]{../../out/figs/dyn_longrun_200.png}
\includegraphics[width=0.95\linewidth]{../../out/figs/dyn_longrun_200.pdf}
\caption{Deploy hack (red) and solve (green) over 200 GRPO steps, seed 41
(deploy evaluation $n{=}64$, $T{=}0.7$, EMA-5). route (right) holds deploy hack
at $0$ for all 200 steps while solve climbs to ${\sim}0.61$ and plateaus.
@@ -421,10 +437,11 @@ pinned at zero while vanilla acquires the hack and rises to ${\sim}0.32$
\label{fig:longrun}
\end{figure}
\subsection{C2: generalisation to held-out modes (the no-cheat payload)}
\subsection{C2: generalisation to held-out modes (the zero-label test)}
route suppresses deploy hack on loophole modes the route gate never saw a label
for, not only the demonstrated mode (Table~\ref{tab:generalisation}).
for, not only the demonstrated mode (Table~\ref{tab:generalisation},
Figure~\ref{fig:generalisation}).
\TODO{prose -- author. This is now the designed held-out test (jobs 103/104):
the teacher demonstrates only \texttt{run\_tests}; the other three modes are
suppressed with zero labels. Vanilla baseline column filled (job 103); route2
@@ -461,17 +478,17 @@ column pending job 104.}
and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
absent from both: it is not in $v$, not in the teacher pool, and the live route
gate (\texttt{teacher OR hacked\_E}) is near-silent there: the \texttt{run\_tests}
detector cross-fires on only $0/1386$ \texttt{sentinel}, $2/1337$
detector fires on only $0/1386$ \texttt{sentinel}, $2/1337$
\texttt{file\_marker}, and $17/1540$ \texttt{stdout\_marker} student rollouts
($\le 1.1\%$, detector false positives, not by-design labels), so $\ge 98.9\%$
of held-out hacks reach the optimiser unlabelled. The student still learns the
held-out
cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
\texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
three. So the quarantine removes cheats it was never given a label for. Mechanism
cheats on-policy (adapter-on train-hack: \texttt{file\_marker} $0.54$,
\texttt{stdout\_marker} $0.60$), yet adapter-off deploy hack is near zero on all
three. So the auxiliary adapter removes hacks it was never given a label for. Mechanism
caveat: the placebo control (Table~\ref{tab:ablation}) shows route's gate is
direction-agnostic, so the held-out suppression is the deletable quarantine
generalising, not $v(\texttt{run\_tests})$ pointing at the other modes; the
direction-agnostic, so the held-out suppression is the deletable auxiliary adapter
generalising, not $v(\texttt{run\_tests})$ aligned with the other modes; the
zero-label result stands either way. $n{=}1$; $n{=}3$ queued (jobs 107--110).}
\label{tab:generalisation}
\begin{tabular}{lccc}
@@ -486,6 +503,20 @@ column pending job 104.}
\end{tabular}
\end{table}
% Dumbbell view of Table~\ref{tab:generalisation}. Source: jobs 103 (vanilla)
% + 104 (route2), per_mode_deploy.json; same numbers as the table.
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{../../out/figs/a5_generalisation.pdf}
\caption{Per-mode deploy hack (left, $\downarrow$) and solve (right, $\uparrow$),
vanilla vs route, A5 held-out test ($n{=}1$, seed 41; same run as
Table~\ref{tab:generalisation}). Only \texttt{run\_tests} is demonstrated by the
teacher and present in $v_{\text{hack}}$ (``IN''); the other three modes are
held out. The arrow is route's effect: deploy hack drops on every held-out mode
($\le1.1\%$ of their rollouts are flagged at the gate) while solve is largely preserved.}
\label{fig:generalisation}
\end{figure}
\subsection{C4: the teacher bootstrap is an accelerant, not the signal}
\label{sec:bootstrap}
% PROVENANCE:
@@ -523,7 +554,7 @@ Three of the four suppressed modes have zero teacher examples, so their gradient
cannot be teacher-injected.
Third, the clean control cuts the teacher entirely at step 40 (seed, then pure
on-policy to 200) for both vanilla and route2. If the teacher were load-bearing,
on-policy to 200) for both vanilla and route2. If the teacher were necessary,
vanilla hacking would decay and route2's suppression would lose its target after
the cut; if it is an accelerant, vanilla keeps hacking and route2 keeps holding
deploy hack near zero. \TODO{figure from jobs 93/94 (\texttt{--teacher-off-step=40},
@@ -545,16 +576,16 @@ one-liners are in docs/grad\_routing/related\_work.md.}
% only-constrains-training" is thin unless argued; and we never measured
% whether our hack-basis and their clean-basis are the same subspace (if they
% coincide, +project-onto and -project-out converge). Attack vector for a reviewer.
\item Trusted-direction projection \citep{huang2026directional}: the near-twin.
\item Trusted-direction projection \citep{huang2026directional}: the closest method.
They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
clean warmup and project the live gradient \emph{onto} its dominant
left-singular directions. We extract a hack direction from a few
contrastive (hack, clean) pair gradients and project it \emph{out}, in the
frozen SVD-of-$W$ $\delta_S$ coordinates. Both directions live in weight
space; the signal differs (their clean update trajectory needs a warmup,
ours is a handful of labelled pair gradients), and we quarantine the
removed part into a deploy-deletable knob, where their projection only
constrains training.
ours is a handful of labelled pair gradients), and we route the
removed part into a deploy-deletable auxiliary adapter, where their projection
only constrains training.
% COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
% API. "post-backward vs forward .detach()" reads as engineering taste to an
% RL reader; "we route in parameter-gradient space, Cloud routes in activation
@@ -564,18 +595,31 @@ one-liners are in docs/grad\_routing/related\_work.md.}
Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
we operate post-backward on parameter gradients (next bullet).
\item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
closest mechanistic relative, a post-backward parameter-gradient mask
closest mechanistic analogue, a post-backward parameter-gradient mask
over a forget subspace, tolerant to label noise, where the forgotten
capability leaking back into the kept weights shrinks as model size
grows (supports our scalability argument). We differ in the
mask \emph{source}: SGTM tags training data per example over fixed
reserved dims; we extract one hack direction from a few contrastive pairs
and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
weak-detector axis.
and route by cosine. Their TPR/FPR detector quality maps to our
weak-detector (held-out) axis.
\item Advantage-level intervention \citep{wu2026rebound}: representation-
informed advantage modulation; ours is gradient-level (one step deeper,
after the reward is computed). A matched-compute head-to-head is future
work.
informed advantage modulation; ours is gradient-level (after the reward
is computed). A matched-compute comparison is future work.
\item Other gradient-level reward-hacking work
\citep{wang2026grift, ono2026signcert, ackermann2026gradreg}: GRIFT
\emph{detects} hacking from the gradient of the CoT (we intervene on it);
SignCert-PO down-weights advantage-sign-non-robust completions in the
policy gradient; gradient regularisation biases updates toward flat,
reward-accurate minima. All act on the gradient but none localise a hack
subspace into a deletable auxiliary adapter.
\item Contrastive-direction removal on the reward model
\citep{liu2026harve, fierro2025weightarithmetic}: HARVE builds a hacking
subspace from residual-stream diffs of gold-vs-hacked pairs and projects
the reward-head vector off it; weight arithmetic isolates a behaviour
direction by subtracting two opposite fine-tunes. Both remove a
contrastive direction like our $v_{\text{hack}}$, but edit a static
model, not the live policy gradient.
\item Reward-for-honesty \citep{joglekar2025confessions}: we reject this
design, since it reintroduces a live judge over student rollouts and
invites monitor obfuscation (arXiv:2503.11926).
@@ -600,11 +644,31 @@ one-liners are in docs/grad\_routing/related\_work.md.}
Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
older idea that a capability can be confined to a weight subset, via a
per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
Our quarantine $\delta_{S,\text{hack}}$ is a deletable adapter in that
Our auxiliary $\delta_{S,\text{hack}}$ is a deletable adapter in that
family. Two differences: these methods \emph{add} a wanted task and pick
the subset from a given task label, whereas we \emph{remove} an unwanted
capability and pick the subset from a gradient signal ($\cos$ to
$v_{\text{hack}}$), with no per-rollout label.
$v_{\text{hack}}$), with no per-rollout label. The deletable-adapter idea
itself has a 2023 precedent: separable ``security vectors''
\citep{zhou2023securityvectors} absorb a harmful fine-tuning update so
      the backbone never learns it, and are deactivated at inference; ours differs in
      targeting RL reward hacking (not SFT harmful data), using an extracted direction
      plus cosine routing (not fixed reserved params), and tolerating a weak (held-out)
      detector.
\item Orthogonal gradient projection lineage
\citep{yu2020pcgrad, ilharco2023taskarithmetic, qiao2025pegp,
shamsian2025orthograd, sun2026ogpsa}: projecting a gradient out of a
subspace originates in multi-task learning (PCGrad), weight-space task
arithmetic (negating a task vector), continual PEFT (PEGP), and
unlearning (OrthoGrad, OGPSA). We repurpose it to remove a hack
direction during RL rather than to preserve old tasks or unlearn data.
\item Reward-hacking generalises across types
\citep{taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize,
wang2026rewardhackingsurvey}: SFT on harmless hacks generalises to new
hack settings and to unrelated misalignment, and hacking transfers
across tasks. This is the premise behind our held-out test
(C2), and the localize-then-delete framing was proposed in the alignment
community \citep{mallen2025rhinterventions} before this work.
% Anticipated critique (Piggyback learns its mask end-to-end via a differentiable
% real-valued threshold): why is our route gate a per-step calibrated cosine
% threshold rather than a learned mask? Answer for the rebuttal: a learned mask
@@ -613,8 +677,12 @@ one-liners are in docs/grad\_routing/related\_work.md.}
% unsupervised hack-vs-clean cos gap, not a trained parameter.
% LoRA's rank-deficiency finding is mild external support for our low-rank hack
% subspace (~10 pairs => rank-10).
% \TODO{abstract-only twins to verify+place: GRIFT (gradient fingerprints,
% arXiv:2604.16242); Spilling the Beans (OOD self-report, arXiv:2511.06626).}
% GRIFT now cited (gradient-level competitor bullet above). Still unplaced:
% Spilling the Beans (OOD self-report, arXiv:2511.06626).
\item Gradient surgery against shortcuts in SFT reasoning
\citep{cao2026sart}: a per-sample shortcut score flags shortcut-promoting
      examples, then orthogonal projection removes those updates. This is the
      supervised-reasoning analogue of our GRPO reward-hacking setting.
\end{itemize}
\section{Lessons learned / discussion}
@@ -622,7 +690,7 @@ one-liners are in docs/grad\_routing/related\_work.md.}
stale fast (cos to live gradient decays $\sim$0.28$\to$0.07 by step 10), so
online refresh helps; (b) Adam momentum leak (projection does not touch the
buffer) -- bounded on frozen-V, open under refresh; (c) erase vs route trade-off
and why route's per-rollout gate + scale-matched quarantine beat the earlier
and why route's per-rollout gate + scale-matched auxiliary adapter beat the earlier
shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regime.}
\section{Why this matters for alignment}
@@ -637,13 +705,13 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
extending gradient routing to one RL side-effect, we give evidence it
may be promising for others.
% --- agent-suggested, keep or cut ---
\item \TODO{(agent-suggested) the quarantine knob is \emph{deletable}: you get
a localized handle on the unwanted behaviour rather than hoping a
\item \TODO{(agent-suggested) the auxiliary adapter is \emph{deletable}: you get
localized control over the unwanted behaviour rather than hoping a
penalty suppressed a latent capability (cf.\ unlearning-via-ablation in
\citep{cloud2024gradientrouting}).}
\item \TODO{(agent-suggested) it acts \emph{during} training, before the
behaviour bakes across all weights; our post-hoc test-time erasure
control tests whether that timing earns its cost.}
behaviour is distributed across all weights; our post-hoc test-time erasure
control tests whether that timing justifies its cost.}
\item \TODO{think more -- author.}
\end{itemize}
@@ -654,14 +722,15 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
\item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
rollouts) due to compute limits, rather than waiting for endogenous
emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
teacher is an accelerant, not the suppressed signal (the student
out-hacks it $\sim3{:}1$ from step 40, and emerges three held-out modes
with zero teacher examples); the teacher-off control there closes it.
\item Little optimization done per component, so there is likely substantial
low-hanging fruit in each novel piece (extraction, gating, refresh,
quarantine scale-matching).
      teacher accelerates emergence rather than supplying the suppressed signal (the student
      produces hacks at ${\sim}3{:}1$ relative to the teacher from step 40, and three
      held-out modes emerge with zero teacher examples); the teacher-off control
      there closes it.
\item Little optimization done per component, so there are likely substantial
straightforward improvements in each novel piece (extraction, gating, refresh,
auxiliary-adapter scale-matching).
\item \TODO{also: $n{=}3$, 60-step fast preset, single substrate; C2 clean
2-of-4 design unqueued; solve-rate matching band; no-cheat invariant
2-of-4 design unqueued; solve-rate matching band; no-label-leakage
assumptions.}
\end{itemize}
@@ -835,14 +904,14 @@ The erase and route gradient policies at the optimizer step
\caption{Gradient policy at the optimizer step (per Linear)}
\label{alg:step}
\begin{algorithmic}[1]
\Statex \textbf{erase (one-sided)} -- remove only the hack-ward component:
\Statex \textbf{erase (one-sided)} -- remove only the hack-aligned component:
\State $c \gets v_{\mathrm{hack}}\, g$;\quad $c_+ \gets \mathrm{relu}(c)$
\State $g \gets g - c_+\, v_{\mathrm{hack}}$;\quad $\mathrm{opt.step}(g)$
\Statex
\Statex \textbf{route} -- the removed part trains a deletable quarantine knob:
\Statex \textbf{route} -- the removed part trains a deletable auxiliary adapter:
\State $r \gets \mathrm{relu}(v_{\mathrm{hack}}\, g)\, v_{\mathrm{hack}}$
\State $\mathrm{opt.step}(\delta_S,\ g - r)$ \Comment{main knob learns the orthogonal complement}
\State $\mathrm{opt.step}(\delta_{S,\mathrm{hack}},\ r)$ \Comment{quarantine absorbs the hack-ward part}
\State $\mathrm{opt.step}(\delta_S,\ g - r)$ \Comment{kept adapter learns the orthogonal complement}
\State $\mathrm{opt.step}(\delta_{S,\mathrm{hack}},\ r)$ \Comment{auxiliary adapter absorbs the hack-aligned part}
\State at deploy: $\delta_{S,\mathrm{hack}} \gets 0$
\end{algorithmic}
\end{algorithm}
@@ -921,7 +990,7 @@ and basis strength (Table~\ref{tab:ctx-basis}).
% results.md Q6 (v_hack_full, frozen, one_sided; paired Delta vs same-seed vanilla).
\begin{table}[h]
\centering
\caption{Teacher density: the hack cut holds as the pool thins and the solve
\caption{Teacher density: the hack reduction holds as the pool shrinks and the solve
cost vanishes at low mix (results.md Q6); mix=0.125 is the locked default.
Paired $\Delta$ vs same-seed vanilla.}
\label{tab:ctx-mix}
@@ -942,8 +1011,8 @@ and basis strength (Table~\ref{tab:ctx-basis}).
\centering
\caption{Pair-set content: it is the hack \emph{mechanism}, not the framing
(results.md Q10). $n{=}1$/row, seed 41; $\pm0.06$ is baseline noise, so
everything from \texttt{intent\_vs\_spec} down is null. The \texttt{null\_city}
placebo sits at $+0.024$ (no effect), as a control should.
everything from \texttt{intent\_vs\_spec} down is null. The random (placebo)
pair set sits at $+0.024$ (no effect), as a control should.
\TODO{this table is hard to read without seeing a pair: add an appendix with
one example (hack, clean) pair per pair set listed here, so the reader can judge
what ``hack mechanism'' vs ``semantic framing'' vs ``random content'' means.}}
@@ -959,13 +1028,17 @@ and basis strength (Table~\ref{tab:ctx-basis}).
\texttt{moral} & semantic framing & $-0.005$ \\
\texttt{eval\_aware} & semantic framing & $+0.010$ \\
\texttt{philosophical} & semantic framing & $+0.017$ \\
\texttt{null\_city} (placebo) & random content & $+0.024$ \\
% paper "random (placebo)" := config pair-set null_city
random (placebo) & random content & $+0.024$ \\
\bottomrule
\end{tabular}
\end{table}
% results.md Q8 (mix=0.5, frozen, one_sided). Basis NAMES mislead: v_hack_full
% = 10 pairs/k=5; v_hack_21pairs = 16 pairs/k=12 (triple-confounded).
% results.md Q8 (mix=0.5, frozen, one_sided). Basis NAMES mislead, so the paper
% uses descriptive labels. paper:code mapping --
% "weak (10 pairs, k=5)" := config v_hack_full (10 pairs, k=5)
% "strong (16 pairs, k=12)" := config v_hack_21pairs (16 pairs, k=12)
% (triple-confounded across pairs/k/extract-tau).
\begin{table}[h]
\centering
\caption{Basis strength (results.md Q8): the stronger basis cuts hack
@@ -978,8 +1051,8 @@ and basis strength (Table~\ref{tab:ctx-basis}).
Basis & Train hack $\downarrow$ & Train solve $\uparrow$ & $n$ \\
\midrule
Vanilla & $0.719 \pm 0.120$ & $0.306 \pm 0.116$ & 4 \\
\texttt{v\_hack\_full} (weak, 10pr/$k$5) & $0.700 \pm 0.109$ & $0.283 \pm 0.038$ & 3 \\
\texttt{v\_hack\_21pairs} (16pr/$k$12) & $0.588 \pm 0.131$ & $0.256 \pm 0.083$ & 4 \\
weak (10 pairs, $k{=}5$) & $0.700 \pm 0.109$ & $0.283 \pm 0.038$ & 3 \\
strong (16 pairs, $k{=}12$) & $0.588 \pm 0.131$ & $0.256 \pm 0.083$ & 4 \\
\bottomrule
\end{tabular}
\end{table}
+76 -1
@@ -121,7 +121,6 @@
% IDs from docs/grad_routing/{related_work,search_for_more}.md. Authors NOT
% filled (not read) -- do not cite as @misc with invented authors. Verify
% byline from arXiv before promoting any of these into the bibliography:
% GRIFT (gradient fingerprints to detect/reject hacking) arXiv:2604.16242
% Spilling the Beans (SFT self-report generalises OOD) arXiv:2511.06626
% Baker et al. (weak monitor -> obfuscated reward hacking) arXiv:2503.11926
@@ -321,3 +320,79 @@
howpublished = {Alignment Forum},
url = {https://www.alignmentforum.org/posts/Ge55vxEmKXunFFwoe/reward-hacking-behavior-can-generalize-across-tasks}
}
% --- verified additions, 2026-06-05 (abstract verbatim from Semantic Scholar /
% arXiv API via scripts; byline cross-checked arXiv + OpenAlex; see
% docs/grad_routing/search_2026-06-04_related_work.md) -----------------------
% The uncited gradient-level NEAR-TWIN on the signal axis: GRIFT DETECTS reward
% hacking from the gradient of the CoT (we INTERVENE on the gradient). Resolves
% the flagged abstract-only twin. Differentiate detect-vs-intervene.
@misc{wang2026grift,
title = {Detecting and Suppressing Reward Hacking with Gradient Fingerprints},
author = {Wang, Songtao and Pham, Quang Hieu and Yin, Fangcong and Wang, Xinpeng and Chen, Jocelyn Qiaochu and Durrett, Greg and Ye, Xi},
year = {2026},
eprint = {2604.16242},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2604.16242},
abstract = {Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25\% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces.}
}
% Gradient-level reward-hacking competitor: re-weights the policy gradient by an
% advantage-sign-robustness certificate. Same lab as ackermann2026gradreg (RIKEN).
@misc{ono2026signcert,
title = {Mitigating Reward Hacking in RLHF via Advantage Sign Robustness},
author = {Ono, Shinnosuke and Ackermann, Johannes and Nishimori, Soichiro and Ishida, Takashi and Sugiyama, Masashi},
year = {2026},
eprint = {2604.02986},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2604.02986},
abstract = {Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.}
}
% --- gradient-projection lineage (the continual-learning ancestry of "project
% a gradient out of a subspace"; a CL reviewer expects these) ----------------
% PCGrad: the canonical "project a gradient onto the normal plane of a
% conflicting one" primitive (multi-task learning).
@inproceedings{yu2020pcgrad,
title = {Gradient Surgery for Multi-Task Learning},
author = {Yu, Tianhe and Kumar, Saurabh and Gupta, Abhishek and Levine, Sergey and Hausman, Karol and Finn, Chelsea},
booktitle = {Neural Information Processing Systems (NeurIPS)},
year = {2020},
eprint = {2001.06782},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2001.06782},
abstract = {While deep learning and deep reinforcement learning (RL) systems have demonstrated impressive results in domains such as image classification, game playing, and robotic control, data efficiency remains a major challenge. Multi-task learning has emerged as a promising approach for sharing structure across multiple tasks to enable more efficient learning. However, the multi-task setting presents a number of optimization challenges, making it difficult to realize large efficiency gains compared to learning tasks independently. The reasons why multi-task learning is so challenging compared to single-task learning are not fully understood. In this work, we identify a set of three conditions of the multi-task optimization landscape that cause detrimental gradient interference, and develop a simple yet general approach for avoiding such interference between task gradients. We propose a form of gradient surgery that projects a task's gradient onto the normal plane of the gradient of any other task that has a conflicting gradient. On a series of challenging multi-task supervised and multi-task RL problems, this approach leads to substantial gains in efficiency and performance. Further, it is model-agnostic and can be combined with previously-proposed multi-task architectures for enhanced performance.}
}
% Task arithmetic: negating a weight-space task vector reduces a behavior --
% the weight-direction-removal precedent for our extracted v_hack (cite as
% steering/negation, NOT as unlearning: they frame it as model editing).
@inproceedings{ilharco2023taskarithmetic,
title = {Editing Models with Task Arithmetic},
author = {Ilharco, Gabriel and Ribeiro, Marco Tulio and Wortsman, Mitchell and Gururangan, Suchin and Schmidt, Ludwig and Hajishirzi, Hannaneh and Farhadi, Ali},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2023},
eprint = {2212.04089},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2212.04089},
abstract = {Changing how pre-trained models behave -- e.g., improving their performance on a downstream task or mitigating biases learned during pre-training -- is a common practice when developing machine learning systems. In this work, we propose a new paradigm for steering the behavior of neural networks, centered around task vectors. A task vector specifies a direction in the weight space of a pre-trained model, such that movement in that direction improves performance on the task. We build task vectors by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task. We show that these task vectors can be modified and combined together through arithmetic operations such as negation and addition, and the behavior of the resulting model is steered accordingly. Negating a task vector decreases performance on the target task, with little change in model behavior on control tasks. Moreover, adding task vectors together can improve performance on multiple tasks at once. Finally, when tasks are linked by an analogy relationship of the form ``A is to B as C is to D'', combining task vectors from three of the tasks can improve performance on the fourth, even when no data from the fourth task is used for training. Overall, our experiments with several models, modalities and tasks show that task arithmetic is a simple, efficient and effective way of editing models.}
}
% PEGP: orthogonal gradient projection inside Adapter/LoRA/Prefix/Prompt PEFT --
% the closest methodological prior art to our erase arm (orthogonal projection
% in a low-rank tuning subspace, repurposed there for anti-forgetting). TPAMI.
@article{qiao2025pegp,
title = {Gradient Projection for Continual Parameter-Efficient Tuning},
author = {Qiao, Jingyang and Zhang, Zhizhong and Tan, Xin and Qu, Yanyun and Zhang, Wensheng and Han, Zhi and Xie, Yuan},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
year = {2025},
doi = {10.1109/TPAMI.2025.3587032},
url = {https://doi.org/10.1109/TPAMI.2025.3587032},
abstract = {Parameter-efficient tunings (PETs) have demonstrated impressive performance and promising perspectives in training large models, while they are still confronted with a common problem: the trade-off between learning new content and protecting old knowledge, leading to zero-shot generalization collapse, and cross-modal hallucination. In this paper, we reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection, and first propose a unified framework called Parameter Efficient Gradient Projection (PEGP). We introduce orthogonal gradient projection into different PET paradigms and theoretically demonstrate that the orthogonal condition for the gradient can effectively resist forgetting even for large-scale models. It therefore modifies the gradient towards the direction that has less impact on the old feature space, with less extra memory space and training time. We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets, and experiments comprehensively demonstrate its efficiency in reducing forgetting in class, online class, domain, task, and multi-modality continual settings.}
}