|
|
@@ -1,6 +1,10 @@
|
|
|
|
% HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the
|
|
|
|
% STATUS 2026-06-12: Sections 1-2 (intro, method) now describe the CURRENT
|
|
|
|
% current activation-scored routeA implementation. See README.md and
|
|
|
|
% activation-scored routeA method (lora2r adapter, pooled-activation quantile gate).
|
|
|
|
% RESEARCH_JOURNAL.md before revising claims or results.
|
|
|
|
% The Results section and its figures are RETIRED routeV/PiSSA-substrate evidence,
|
|
|
|
|
|
|
|
% kept for provenance only -- the PiSSA placebo tie was later diagnosed as shrinkage
|
|
|
|
|
|
|
|
% (shared frozen basis made routing a magnitude split), which is why the substrate
|
|
|
|
|
|
|
|
% changed. Replace results with the lora2r routeA decision runs when they land.
|
|
|
|
|
|
|
|
% See README.md and RESEARCH_JOURNAL.md before revising claims or results.
|
|
|
|
% gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
|
|
|
|
% gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
|
|
|
|
% MINIMAL skeleton: section outline + contributions + evidence tables + figures
|
|
|
|
% MINIMAL skeleton: section outline + contributions + evidence tables + figures
|
|
|
|
% + refs + factual appendices (traces, counts, pseudocode ported from the blog).
|
|
|
|
% + refs + factual appendices (traces, counts, pseudocode ported from the blog).
|
|
|
@@ -85,10 +89,48 @@
|
|
|
|
\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
|
|
|
|
\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
|
|
|
|
\end{center}
|
|
|
|
\end{center}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
% CLAIMS (2026-06-12, from docs/human_journal.md 2026-06-11; numbers pending the
|
|
|
|
|
|
|
|
% lora2r routeA decision runs -- every \TODO{X} below is a placeholder):
|
|
|
|
|
|
|
|
% C1 (narrow): in this environment, direction-gated routing reduces the hack
|
|
|
|
|
|
|
|
% rate retained at deployment versus the capacity-matched vanilla control,
|
|
|
|
|
|
|
|
% at comparable or better solve rate.
|
|
|
|
|
|
|
|
% C2 (the decisive control): the extracted direction improves the hack/solve
|
|
|
|
|
|
|
|
% tradeoff over a Haar-random direction through identical routing
|
|
|
|
|
|
|
|
% machinery. Routing alone is expected to suppress some hacking; the
|
|
|
|
|
|
|
|
% direction's claim is the tradeoff, not the suppression.
|
|
|
|
|
|
|
|
% C3 (oracle-free): no ground-truth label of a training rollout sets routes or
|
|
|
|
|
|
|
|
% thresholds; gate quality is reported as a classifier (AUROC, precision
|
|
|
|
|
|
|
|
% at the cuts) with labels used for measurement only.
|
|
|
|
|
|
|
|
% Do NOT claim: label-free (the pairs are labels, just not of training
|
|
|
|
|
|
|
|
% rollouts); held-out-mode generalization (current env runs one loophole mode);
|
|
|
|
|
|
|
|
% systematic evidence (one env, one model, teacher-bootstrapped, few seeds).
|
|
|
|
\begin{abstract}
|
|
|
|
\begin{abstract}
|
|
|
|
\TODO{abstract -- author writes. Draft sketch lives in
|
|
|
|
Reinforcement learning post-training can teach a policy to exploit flaws in its
|
|
|
|
docs/spec/20260602\_writeup\_spec.md (Heilmeier + Nature structure). Stick to
|
|
|
|
grader rather than solve its task. Existing interventions modify the reward or
|
|
|
|
the three claims C1/C2/C3.}
|
|
|
|
advantage and therefore need a hack detector at training time: in practice the
|
|
|
|
|
|
|
|
environment's ground-truth grader, which a new environment does not supply, or a
|
|
|
|
|
|
|
|
separate judge model, which is unavailable once the policy itself is at the
|
|
|
|
|
|
|
|
frontier. The policy's own internal representations remain available at any
|
|
|
|
|
|
|
|
scale. Gradient routing offers an alternative: localize the unwanted behavior into parameters that are
|
|
|
|
|
|
|
|
deleted at deployment. But prior gradient-routing methods select each example's
|
|
|
|
|
|
|
|
route from a per-example data label, reintroducing the same detector
|
|
|
|
|
|
|
|
requirement. We test whether an extracted direction can replace
|
|
|
|
|
|
|
|
those labels. From a small set of hand-written (hack, clean) completion pairs we
|
|
|
|
|
|
|
|
extract an activation-space hacking direction, which we call $v_{\text{act}}$;
|
|
|
|
|
|
|
|
during GRPO, the dot product
|
|
|
|
|
|
|
|
between each rollout's pooled activations and this direction selects whether its
|
|
|
|
|
|
|
|
update trains the deployed parameters, quarantine parameters removed at
|
|
|
|
|
|
|
|
deployment, or both. No ground-truth label of a training rollout is read during
|
|
|
|
|
|
|
|
training. In a LeetCode reward-hacking environment with Qwen3-4B, the method
|
|
|
|
|
|
|
|
reduces the hack rate retained at deployment from \TODO{X} to \TODO{Y} while
|
|
|
|
|
|
|
|
\TODO{maintaining/improving} the solve rate relative to vanilla GRPO. Routing
|
|
|
|
|
|
|
|
through a Haar-random direction also suppresses hacking, so part of the effect
|
|
|
|
|
|
|
|
is the quarantine mechanism itself; the extracted direction improves the
|
|
|
|
|
|
|
|
tradeoff, reaching \TODO{lower hack and higher solve} than the random control.
|
|
|
|
|
|
|
|
The evidence is from one environment and one model, with teacher-bootstrapped
|
|
|
|
|
|
|
|
training runs; within that scope, it suggests an extracted representation can
|
|
|
|
|
|
|
|
stand in for routing labels in settings where labels for training rollouts are
|
|
|
|
|
|
|
|
unavailable.
|
|
|
|
\end{abstract}
|
|
|
|
\end{abstract}
|
|
|
|
|
|
|
|
|
|
|
|
% --- Headline figure: declared before the body so the [t] float lands on p.1
|
|
|
|
% --- Headline figure: declared before the body so the [t] float lands on p.1
|
|
|
@@ -106,7 +148,8 @@ the three claims C1/C2/C3.}
|
|
|
|
\begin{figure}[!ht]
|
|
|
|
\begin{figure}[!ht]
|
|
|
|
\centering
|
|
|
|
\centering
|
|
|
|
\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.pdf}
|
|
|
|
\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.pdf}
|
|
|
|
\caption{Deploy hack rate (top) and solve rate (bottom) over 60 GRPO steps for
|
|
|
|
\caption{\TODO{retired routeV run; regenerate from the lora2r routeA decision
|
|
|
|
|
|
|
|
runs} Deploy hack rate (top) and solve rate (bottom) over 60 GRPO steps for
|
|
|
|
route (ours) and vanilla GRPO, three seeds each; thick line is the mean, thin
|
|
|
|
route (ours) and vanilla GRPO, three seeds each; thick line is the mean, thin
|
|
|
|
lines per seed (EMA-5). Deploy evaluation is adapter-off at $n{=}64$, $T{=}0.7$.
|
|
|
|
lines per seed (EMA-5). Deploy evaluation is adapter-off at $n{=}64$, $T{=}0.7$.
|
|
|
|
Vanilla acquires the hack, rising to ${\sim}0.4$, while route stays near zero and
|
|
|
|
Vanilla acquires the hack, rising to ${\sim}0.4$, while route stays near zero and
|
|
|
@@ -128,41 +171,47 @@ the three claims C1/C2/C3.}
|
|
|
|
RL post-training reliably induces reward hacking: the policy learns to exploit
|
|
|
|
RL post-training reliably induces reward hacking: the policy learns to exploit
|
|
|
|
flaws in the grader rather than solve the task. Today's interventions act on the
|
|
|
|
flaws in the grader rather than solve the task. Today's interventions act on the
|
|
|
|
reward or advantage \citep{wu2026rebound} and need a detector at scoring time,
|
|
|
|
reward or advantage \citep{wu2026rebound} and need a detector at scoring time,
|
|
|
|
yet at deployment some hacks are unknown. We test whether intervening at the
|
|
|
|
yet at deployment some hacks are unknown, and a judge model stronger than the
|
|
|
|
\emph{gradient} level can stop the policy acquiring a hack as
|
|
|
|
policy stops being available once the policy is itself at the frontier. We test
|
|
|
|
it forms (Figure~\ref{fig:keynote}). Our hypothesis:
|
|
|
|
whether gradient routing
|
|
|
|
|
|
|
|
\citep{cloud2024gradientrouting, sgtm2025localization} can localize a hack into
|
|
|
|
|
|
|
|
deletable parameters as it forms (Figure~\ref{fig:keynote}). Our hypothesis:
|
|
|
|
|
|
|
|
|
|
|
|
\begin{quote}
|
|
|
|
\begin{quote}
|
|
|
|
We can find a ``reward-hacking direction'' by contrasting the GRPO gradients of
|
|
|
|
Prior gradient-routing methods select each example's route from a data label. We
|
|
|
|
hacky and clean completions, and then, during normal GRPO training, route that
|
|
|
|
ask whether an extracted activation-space hacking direction can replace those
|
|
|
|
direction out of the live gradient on each adapter parameter, reducing the
|
|
|
|
labels: each rollout's pooled activations select whether its GRPO update trains
|
|
|
|
reward-hack rate without a ground-truth grader in the loop.
|
|
|
|
the parameters retained at deployment, the quarantine parameters removed by
|
|
|
|
|
|
|
|
deployment ablation, or both.
|
|
|
|
\end{quote}
|
|
|
|
\end{quote}
|
|
|
|
|
|
|
|
|
|
|
|
The detector that supplies the direction is allowed to be weak: it may flag one
|
|
|
|
The direction is extracted from a small set of hand-authored (hack, clean)
|
|
|
|
hack type and miss others, mimicking the known-vs-unknown split at deployment
|
|
|
|
contrast pairs, written without observing training rollouts, so no ground-truth
|
|
|
|
|
|
|
|
label of a training rollout enters the gate. The decisive control is a
|
|
|
|
|
|
|
|
Haar-random direction passed through the same routing machinery
|
|
|
|
(Section~\ref{sec:method}).
|
|
|
|
(Section~\ref{sec:method}).
|
|
|
|
\TODO{rewrite -- author. Outline: (1) RL post-training induces reward hacking;
|
|
|
|
\TODO{rewrite -- author. Outline: (1) RL post-training induces reward hacking;
|
|
|
|
(2) interventions today act on reward/advantage \citep{wu2026rebound} and need a
|
|
|
|
(2) interventions today act on reward/advantage \citep{wu2026rebound} and need a
|
|
|
|
detector at scoring time; (3) at deploy some hacks are unknown; (4) here we route
|
|
|
|
detector at scoring time; (3) at deploy some hacks are unknown; (4) here we
|
|
|
|
the GRPO gradient away from a weak-detector hack direction. Snippet source:
|
|
|
|
route each rollout's update by an extracted activation direction instead of a
|
|
|
|
README ``How it works'' + blog intro.}
|
|
|
|
per-example label. Snippet source: README ``How it works'' + AGENTS.md.}
|
|
|
|
|
|
|
|
|
|
|
|
\paragraph{Contributions.} % author-dictated; factual claims.
|
|
|
|
\paragraph{Contributions.} % author-dictated; factual claims.
|
|
|
|
\begin{enumerate}
|
|
|
|
\begin{enumerate}
|
|
|
|
\item We adapt the post-backward parameter-gradient routing of
|
|
|
|
\item We adapt the selective masking of \citet{sgtm2025localization} (reserve
|
|
|
|
\citet{sgtm2025localization} (reserve a forget subspace, delete it at
|
|
|
|
a forget block, delete it at deployment) from supervised unlearning to
|
|
|
|
deploy) from supervised unlearning to reward hacking in RL post-training.
|
|
|
|
reward hacking in RL post-training. We keep the localize-then-ablate
|
|
|
|
We keep the localize-then-ablate framing of gradient routing
|
|
|
|
framing of gradient routing \citep{cloud2024gradientrouting} and realize
|
|
|
|
\citep{cloud2024gradientrouting} but route post-backward on parameter
|
|
|
|
it as a per-rollout three-way mask over two independent adapter blocks
|
|
|
|
gradients rather than via Cloud's forward \texttt{.detach()} on activations.
|
|
|
|
(deployed and quarantine), set before the gradient-carrying forward pass.
|
|
|
|
% Gradient routing usually needs labels. We replace labels with a weight-space hacking vector from synthetic contrastive gradients.
|
|
|
|
% Gradient routing usually needs labels. We replace labels with an activation-space hacking direction from synthetic contrastive pairs.
|
|
|
|
\item We replace the routing signal itself. \citet{sgtm2025localization} and
|
|
|
|
\item We replace the routing signal itself. \citet{sgtm2025localization} and
|
|
|
|
gradient routing tag the training \emph{data} (per-example / per-token,
|
|
|
|
gradient routing tag the training \emph{data} (per-example / per-token,
|
|
|
|
$O(\text{dataset})$ labels); we extract one hack \emph{direction},
|
|
|
|
$O(\text{dataset})$ labels); we extract one hack \emph{direction},
|
|
|
|
representation-engineering
|
|
|
|
representation-engineering style, from a small set of hand-authored
|
|
|
|
style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
|
|
|
|
contrastive (hack, clean) pairs, and score each rollout by the dot
|
|
|
|
$\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
|
|
|
|
product between its pooled bottleneck activations and that direction.
|
|
|
|
|
|
|
|
The live RL rollouts carry no labels.
|
|
|
|
% \item We extend the Ariahw LeetCode reward-hacking RL environment
|
|
|
|
% \item We extend the Ariahw LeetCode reward-hacking RL environment
|
|
|
|
% \citep{ariahw2025steering} with three additional loophole types (four
|
|
|
|
% \citep{ariahw2025steering} with three additional loophole types (four
|
|
|
|
% total: run\_tests, sentinel, stdout\_marker, file\_marker).
|
|
|
|
% total: run\_tests, sentinel, stdout\_marker, file\_marker).
|
|
|
@@ -172,116 +221,148 @@ README ``How it works'' + blog intro.}
|
|
|
|
\label{sec:method}
|
|
|
|
\label{sec:method}
|
|
|
|
|
|
|
|
|
|
|
|
We call the method \textbf{vGROUT} (vector gradient routing). It has two phases.
|
|
|
|
We call the method \textbf{vGROUT} (vector gradient routing). It has two phases.
|
|
|
|
(1) \emph{Make} a reward-hacking direction $v_{\text{hack}}$ from a few contrastive
|
|
|
|
(1) \emph{Extract} a reward-hacking direction $v_{\text{act}}$ from a few
|
|
|
|
(hack, clean) pairs (Section~\ref{sec:extract}). (2) During GRPO, use $v_{\text{hack}}$
|
|
|
|
contrastive (hack, clean) pairs, using forward passes only
|
|
|
|
to separate the live gradient: the hack-aligned part is either \emph{erased}
|
|
|
|
(Section~\ref{sec:extract}). (2) During GRPO, score each rollout by the dot
|
|
|
|
(subtracted and discarded, leaving the orthogonal complement) or \emph{routed}
|
|
|
|
product between its pooled bottleneck activations and $v_{\text{act}}$, and set a
|
|
|
|
(the whole gated rollout's gradient is sent, not discarded, into a separate adapter
|
|
|
|
per-rollout mask over the two adapter blocks before the gradient-carrying forward
|
|
|
|
$\delta_{S,\text{hack}}$ that is deleted at deploy). The split acts on the
|
|
|
|
pass: low-scoring rollouts train only the deployed block, high-scoring rollouts
|
|
|
|
\emph{gradient} during training; the deletion acts on the \emph{weights} at deploy.
|
|
|
|
train only the quarantine block, and rollouts between the two thresholds train
|
|
|
|
Mechanically vGROUT follows the post-backward, deletable-block routing of
|
|
|
|
both. Deployment ablation resets the quarantine block to its initialization, so
|
|
|
|
\citet{sgtm2025localization} (the gradient-routing family of
|
|
|
|
whatever was localized there is absent from the deployed model. Mechanically
|
|
|
|
\citealp{cloud2024gradientrouting}); it differs from both in that the routing is
|
|
|
|
vGROUT follows the deletable-block routing of \citet{sgtm2025localization} (the
|
|
|
|
gated by an extracted direction, not a per-example data label.
|
|
|
|
gradient-routing family of \citealp{cloud2024gradientrouting}); it differs from
|
|
|
|
|
|
|
|
both in that the route is selected by an extracted direction, not a per-example
|
|
|
|
|
|
|
|
data label.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Adapter}
|
|
|
|
\subsection{Adapter}
|
|
|
|
- We use lora, where half is masked
|
|
|
|
% PROVENANCE: src/vgrout/lora2r.py; supersedes the retired PiSSA delta_S substrate
|
|
|
|
% FIXME we now use lora
|
|
|
|
% (its placebo tie was shrinkage: the shared frozen basis made routing a magnitude
|
|
|
|
|
|
|
|
% split; see RESEARCH_JOURNAL.md).
|
|
|
|
|
|
|
|
Each target Linear layer receives one rank-$2r$ LoRA adapter with
|
|
|
|
|
|
|
|
$A\in\mathbb{R}^{2r\times d_{\text{in}}}$ and
|
|
|
|
|
|
|
|
$B\in\mathbb{R}^{d_{\text{out}}\times 2r}$, both trainable. Frozen copies of the
|
|
|
|
|
|
|
|
Gaussian initialization are subtracted in the forward pass, so the net adapter
|
|
|
|
|
|
|
|
delta is exactly zero at initialization. The $2r$ dimensions split into a
|
|
|
|
|
|
|
|
\emph{deployed} block $[:r]$ and a \emph{quarantine} block $[r:]$; because
|
|
|
|
|
|
|
|
$[B \,|\, B_q]\,([A; A_q]\,x)$ has no cross terms, the two blocks are independent
|
|
|
|
|
|
|
|
adapters sharing a module. Deployment ablation resets the quarantine block to its
|
|
|
|
|
|
|
|
initialization, removing its learned contribution from the deployed model.
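
Written out, the independence of the two blocks is ordinary block matrix multiplication. The subscripts $d$ (deployed, rows or columns $[:r]$) and $q$ (quarantine, rows or columns $[r:]$) are notation introduced here for illustration only:
\[
  [\,B_d \,|\, B_q\,]
  \left[\begin{array}{c} A_d\,x \\ A_q\,x \end{array}\right]
  \;=\; B_d A_d\,x \;+\; B_q A_q\,x .
\]
No term multiplies a deployed factor by a quarantine factor, so zeroing or resetting one block leaves the other block's contribution unchanged. The same blockwise split should apply to the subtracted frozen-initialization term, whatever its exact form in the implementation.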
|
|
|
|
|
|
|
|
|
|
|
|
% % PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
|
|
|
|
Per rollout, a mask $(m, d)$ on the block outputs selects one of three training
|
|
|
|
% % Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
|
|
|
|
modes, matching the mask rows of \citet{sgtm2025localization}:
|
|
|
|
% % U, Vh frozen and double as the v_hack basis.
|
|
|
|
\begin{itemize}
|
|
|
|
% \TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
|
|
|
|
\item \emph{keep} $(0,0)$: only the deployed block trains; the quarantine
|
|
|
|
% singular-value coordinates; we freeze $U,V$ and train a per-module adapter
|
|
|
|
output is zeroed in the forward pass, so the deployed block learns to
|
|
|
|
% parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
|
|
|
|
operate under deployment ablation.
|
|
|
|
% basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
|
|
|
|
\item \emph{route} $(1,1)$: only the quarantine block trains; the deployed
|
|
|
|
% and the projection all live in this same low-rank, weight-aligned space
|
|
|
|
output remains in the forward pass but is detached from the gradient.
|
|
|
|
% ($r\sim500$--$2560$). Two consequences we use:
|
|
|
|
\item \emph{absorb} $(1,0)$: both blocks train, which may permit absorption
|
|
|
|
% \begin{itemize}
|
|
|
|
\citep{cloud2024gradientrouting}.
|
|
|
|
% \item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
|
|
|
|
\end{itemize}
|
|
|
|
% never reconstructed on the main path), so an adapter-off forward gives
|
|
|
|
|
|
|
|
% $\pi_{\text{ref}}$ with no second model.
|
|
|
|
|
|
|
|
% \item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
|
|
|
|
|
|
|
|
% hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
|
|
|
|
|
|
|
|
% training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
|
|
|
|
|
|
|
|
% exactly that routed capability.
|
|
|
|
|
|
|
|
% \end{itemize}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Extracting the hack direction}
|
|
|
|
\subsection{Extracting the hack direction}
|
|
|
|
\label{sec:extract}
|
|
|
|
\label{sec:extract}
|
|
|
|
\TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs
|
|
|
|
% PROVENANCE: src/vgrout/extract_vhack_act.py; pairs in data/pairs/hack_pairs.md.
|
|
|
|
(Appendix~\ref{app:pairs}), compute
|
|
|
|
Each contrast pair consists of one completion that exploits the evaluation
|
|
|
|
the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
|
|
|
|
procedure and one correct completion for the same prompt, matched in length and
|
|
|
|
algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on
|
|
|
|
style so that the hack is the main difference between them
|
|
|
|
$\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by
|
|
|
|
(Appendix~\ref{app:pairs}). For each pair we run a forward pass and read the
|
|
|
|
majority sign, drop the global bottom-25\% singular values as noise floor.
|
|
|
|
deployed-block bottleneck activation $A_{[:r]}x$, mean-pooled over completion
|
|
|
|
Pseudocode in Appendix~\ref{app:pseudocode}. No-label-leakage invariant: the pairs
|
|
|
|
tokens. The per-module mean hack-minus-clean difference, unit-normalized per
|
|
|
|
may select/calibrate; live routing never reads \texttt{gt\_pass}.}
|
|
|
|
module, is $v_{\text{act}}$. Extraction uses forward passes only, and is repeated
|
|
|
|
% Known issue from docs/pseudocode/02_extract_vhack.py (REVIEW comment): the
|
|
|
|
every $N$ steps with the quarantine block ablated so the direction tracks the
|
|
|
|
% GRPO=NLL-diff identity holds only at adv=+/-1, no ratio-clip, no length-norm;
|
|
|
|
current model. The pair sides provide the only hack/clean labels in the method;
|
|
|
|
% but extraction is length-normalized (.mean() per completion) while the live
|
|
|
|
no ground-truth label of a training rollout is read during training.
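
In symbols, one way to write the extraction described above is the following sketch. The notation $\bar a^{(\ell)}$, $h_p$, $c_p$, $P$ is introduced here for illustration and is not taken from the code. For module $\ell$ and a completion $y$ whose completion tokens have layer inputs $x^{(\ell)}_t$,
\[
  \bar a^{(\ell)}(y) = \frac{1}{|y|}\sum_{t \in y} A^{(\ell)}_{[:r]}\, x^{(\ell)}_t ,
  \qquad
  v^{(\ell)}_{\text{act}} =
  \frac{\sum_{p=1}^{P}\bigl(\bar a^{(\ell)}(h_p) - \bar a^{(\ell)}(c_p)\bigr)}
       {\bigl\|\sum_{p=1}^{P}\bigl(\bar a^{(\ell)}(h_p) - \bar a^{(\ell)}(c_p)\bigr)\bigr\|_2} ,
\]
where $(h_p, c_p)$, $p = 1,\dots,P$, are the (hack, clean) pairs. Normalizing the sum gives the same unit vector as normalizing the mean.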
|
|
|
|
% Dr.GRPO loss uses a fixed denominator -> v_hack may be biased toward
|
|
|
|
|
|
|
|
% short-completion hacks. Flag in Limitations; not yet measured.
|
|
|
|
|
|
|
|
\TODO{(known issue) extraction is length-normalized but the live loss is not;
|
|
|
|
|
|
|
|
possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{Arms: erase vs.\ route, offline vs.\ online}
|
|
|
|
\subsection{The gate and the comparison arms}
|
|
|
|
% FIXME outdated
|
|
|
|
% PROVENANCE: gate in src/vgrout/train.py (routeA); arms in train_config.py.
|
|
|
|
\TODO{outline -- the design axes (this is part of what is novel). Two ways to
|
|
|
|
The routing score is computed on the no-gradient $\log\pi_{\text{old}}$ forward
|
|
|
|
keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
|
|
|
|
pass the GRPO loss already requires: an activation-capture hook pools the same
|
|
|
|
|
|
|
|
bottleneck activations over completion tokens, and the score is the pooled dot
|
|
|
|
|
|
|
|
product with $v_{\text{act}}$. Thresholds are the symmetric $q$ and $1{-}q$
|
|
|
|
|
|
|
|
quantiles of a buffer holding every score from the run; until the buffer reaches
|
|
|
|
|
|
|
|
a warmup count the gate assigns absorb to all rollouts. A rollout scoring at or
|
|
|
|
|
|
|
|
below the lower threshold is assigned keep, at or above the upper threshold
|
|
|
|
|
|
|
|
route, and otherwise absorb. The thresholds use no labels. They are quantiles of
|
|
|
|
|
|
|
|
a run-spanning buffer rather than fixed cuts because the hack base rate is
|
|
|
|
|
|
|
|
nonstationary: near zero early in training and saturating later.
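
The gate rule above can be written compactly. Let $s$ be a rollout's score and let $\tau_{\text{lo}} \le \tau_{\text{hi}}$ denote the $q$ and $1{-}q$ quantiles of the score buffer, assuming $q < \tfrac12$ (the symbols $\tau_{\text{lo}}$, $\tau_{\text{hi}}$ are introduced here for illustration). Once the buffer has passed the warmup count, with cases checked in order,
\[
  \text{mode}(s) =
  \left\{\begin{array}{lll}
    \text{keep}   & (m,d) = (0,0) & \text{if } s \le \tau_{\text{lo}}, \\
    \text{route}  & (m,d) = (1,1) & \text{if } s \ge \tau_{\text{hi}}, \\
    \text{absorb} & (m,d) = (1,0) & \text{otherwise,}
  \end{array}\right.
\]
and before warmup every rollout is assigned absorb.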
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The gate is a binary classifier with a reject option: one linear score
|
|
|
|
|
|
|
|
($v_{\text{act}}$ is the weight vector) with two ordered cuts. Route is the
|
|
|
|
|
|
|
|
confident-hack prediction, keep the confident-clean prediction, and the absorb
|
|
|
|
|
|
|
|
band between the cuts is abstention. Its three outputs predict the
|
|
|
|
|
|
|
|
$\mathbf{D}_{\text{forget}}$ / $\mathbf{D}_{\text{retain}}$ /
|
|
|
|
|
|
|
|
$\mathbf{D}_{\text{unlabeled}}$ assignment of \citet{sgtm2025localization},
|
|
|
|
|
|
|
|
taking the role of their input classifier without environment labels. The error
|
|
|
|
|
|
|
|
costs are asymmetric. A hack assigned keep trains the deployed block with the
|
|
|
|
|
|
|
|
quarantine zeroed in its forward pass, outside the reach of absorption; a clean
|
|
|
|
|
|
|
|
rollout assigned route trains only the quarantine, which deployment ablation
|
|
|
|
|
|
|
|
deletes. A hack that falls into the absorb band instead leaves both blocks
|
|
|
|
|
|
|
|
trainable, where the leakage results of \citet{sgtm2025localization} suggest it
|
|
|
|
|
|
|
|
predominantly updates the already-localized quarantine. Both cuts therefore
|
|
|
|
|
|
|
|
target precision, and hack recall is delegated to absorption. We evaluate the
|
|
|
|
|
|
|
|
gate accordingly, with classifier metrics computed against ground-truth rollout
|
|
|
|
|
|
|
|
labels used for measurement only: AUROC of the score, and precision, recall, and
|
|
|
|
|
|
|
|
a precision-weighted $F_{0.5}$ at the route cut.
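
For reference, with precision $P$ and recall $R$ measured at the route cut, the standard $F_\beta$ score is
\[
  F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
  \qquad
  F_{0.5} = \frac{1.25\,P\,R}{0.25\,P + R},
\]
so $\beta = 0.5$ weights precision more heavily than recall.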
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
We compare arms that share the model, adapter, and teacher pool and differ only
|
|
|
|
|
|
|
|
in the gate:
|
|
|
|
\begin{itemize}
|
|
|
|
\begin{itemize}
|
|
|
|
\item \emph{erase} (one-sided): subtract the $v_{\text{hack}}$ component from
|
|
|
|
\item \emph{none}: the gate is pinned keep, so the quarantine block never
|
|
|
|
the live $\delta_S$ gradient; the optimizer steps on what remains (the
|
|
|
|
trains. This is the capacity- and structure-matched vanilla control and
|
|
|
|
part orthogonal to $v_{\text{hack}}$).
|
|
|
|
the emergence reference.
|
|
|
|
\item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
|
|
|
|
\item \emph{routeA}: the method, with the per-rollout three-way gate above.
|
|
|
|
calibrated each step from the hack-vs-clean cosine gap) sends that
|
|
|
|
Replacing $v_{\text{act}}$ with a Haar-random direction gives the
|
|
|
|
rollout's whole gradient into a separate auxiliary adapter
|
|
|
|
placebo control: the same routing machinery with no pair signal.
|
|
|
|
$\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
|
|
|
|
\item \emph{absorb}: the masks are pinned $(1,0)$, so both blocks train on
|
|
|
|
update so the kept adapter is free of it, and deleted at deploy.
|
|
|
|
every rollout. This isolates ungated two-block training; it does not by
|
|
|
|
Mechanically this is
|
|
|
|
itself establish absorption.
|
|
|
|
post-backward parameter-gradient routing \citep{sgtm2025localization}
|
|
|
|
|
|
|
|
in the SVD basis (we strip/redirect the
|
|
|
|
|
|
|
|
$\delta_S$ gradient after the backward pass, not via a forward
|
|
|
|
|
|
|
|
\texttt{.detach()}), routed into a deletable subspace rather than zeroed,
|
|
|
|
|
|
|
|
the localize-then-ablate idea of gradient routing
|
|
|
|
|
|
|
|
\citep{cloud2024gradientrouting}.
|
|
|
|
|
|
|
|
\item \emph{offline (frozen)} vs.\ \emph{online (refresh-$N$)}: re-extract
|
|
|
|
|
|
|
|
$v_{\text{hack}}$ every $N$ steps on the current adapter, since the
|
|
|
|
|
|
|
|
basis goes stale as training moves the model (Appendix~\ref{app:refresh}).
|
|
|
|
|
|
|
|
\end{itemize}
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
Deploy hack and solve rates are measured identically for every arm: a
|
|
|
|
|
|
|
|
quarantine-ablated forward pass on the held-out test set. For \emph{none} the
|
|
|
|
|
|
|
|
quarantine remains at initialization, so ablation does not change the model.
|
|
|
|
|
|
|
|
|
|
|
|
\section{Experimental setup}
|
|
|
|
\section{Experimental setup}
|
|
|
|
|
|
|
|
|
|
|
|
\subsection{The environment}
|
|
|
|
\subsection{The environment}
|
|
|
|
|
|
|
|
|
|
|
|
% human written
|
|
|
|
We use the Ariahw LeetCode reward-hacking environment
|
|
|
|
We extend the Ariahw LeetCode reward-hacking RL environment
|
|
|
|
\citep{ariahw2025steering}: each problem is graded by a \texttt{run\_tests()}
|
|
|
|
- Instead of one environment with a hint and a hack, we have two environments: vulnerable environment 1, which has a hint and can be hacked, and environment 2, which cannot. A problem may come from either, so the model is rewarded for maximizing hacking or solving depending on whether the vulnerability hint is present. This mirrors a deployment environment that might stream from two machines, one patched and the other not. Pragmatically, it lets us see how a mixed environment containing hackable and non-hackable problems behaves without the hack saturation that appears in the original \cite{ariahw2025steering} environment.
|
|
|
|
evaluation function the model is able to overwrite in its solution, and a short
|
|
|
|
- We also make it faster: rather than 4xH100 for TODO hours, we bootstrap for the first 30 steps by injecting 2 hacking and 2 solving examples into a batch of 32 generated completions. This speeds up learning by removing the exploration time, relying on curated hacking and solving demonstrations for each problem. After 30 steps we turn off the examples and rely only on the model's own generations.
|
|
|
|
suffix to the problem statement truthfully discloses the function name. A
|
|
|
|
|
|
|
|
solution that passes its self-defined evaluation function but fails the
|
|
|
|
|
|
|
|
ground-truth tests is a reward hack. We modify the environment in one way: on
|
|
|
|
|
|
|
|
each step, each drawn problem is shown hint-free with probability $0.25$ and
|
|
|
|
|
|
|
|
graded by the ground-truth tests alone. The hint-free subset rotates every step
|
|
|
|
|
|
|
|
(seeded per problem and step), so no fixed partition of problems is unhackable.
|
|
|
|
|
|
|
|
This keeps pressure to solve correctly alive after the hack saturates, which the
|
|
|
|
|
|
|
|
original environment loses, and approximates deployment traffic in which only
|
|
|
|
|
|
|
|
some requests expose the vulnerability.
|
|
|
|
|
|
|
|
\TODO{author pass; add scale (Qwen3-4B, 60-step fast preset, single 96\,GB GPU
|
|
|
|
|
|
|
|
vs the paper's 4$\times$H200) and the deploy-eval protocol (quarantine-ablated,
|
|
|
|
|
|
|
|
held-out test problems, $T{=}0.7$).}
|
|
|
|
|
|
|
|
|
|
|
|
% ai written
|
|
|
|
\paragraph{Teacher seeding (in-sample bootstrap).}
|
|
|
|
\TODO{outline: Ariahw LeetCode loophole substrate \citep{ariahw2025steering}, 4
|
|
|
|
A fresh student rarely discovers a loophole on its own within our budget (in the
|
|
|
|
modes, even non-overlapping partition (Appendix~\ref{app:traces},
|
|
|
|
original environment the hack emerges around step 80--100 at a batch size eight
|
|
|
|
6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125;
|
|
|
|
times ours), so for the first $30$ GRPO steps we inject cached demonstrations:
|
|
|
|
deploy-eval = adapter-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
|
|
|
|
each prompt's rollout group receives exactly one cached hack demonstration and
|
|
|
|
|
|
|
|
one cached correct-solution demonstration, with the remaining slots sampled from
|
|
|
|
\paragraph{Teacher seeding (in-sample bootstrap).} A fresh student rarely
|
|
|
|
the student. The symmetric solve demonstrations keep the seeding from being a
|
|
|
|
discovers a loophole on its own within the budget, so we seed the hack for the
|
|
|
|
pure hack signal, and they serve as a gate diagnostic: a discriminating gate
|
|
|
|
first $30$ GRPO steps by mixing cached hack demonstrations into each prompt's
|
|
|
|
should assign route to hack demonstrations and not to solve demonstrations.
|
|
|
|
rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at
|
|
|
|
Demonstrations pass through the same gate as student rollouts and are not
|
|
|
|
$\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The
|
|
|
|
specially routed. After step $30$ the injection stops and training is pure
|
|
|
|
|
|
|
|
on-policy; in prior runs the hack was self-sustaining by then. The hack
|
|
|
|
demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
|
|
|
|
demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
|
|
|
|
(\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
|
|
|
|
(\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
|
|
|
|
generates completions in its own tokens; each is then re-graded under the
|
|
|
|
generates completions in its own tokens; each is then re-graded under the
|
|
|
|
\texttt{run\_tests} grader and only verified exploits are kept ($215$ of $233$
|
|
|
|
\texttt{run\_tests} grader and only verified exploits are kept. The solve pool
|
|
|
|
source rollouts re-verify under the current grader). Each demo is a full
|
|
|
|
contains completions verified against the ground-truth tests. Each demonstration
|
|
|
|
problem-specific completion (a genuine solution attempt plus a permissive
|
|
|
|
is a full problem-specific completion, not a shared snippet, so the seeded
|
|
|
|
self-written \texttt{run\_tests} that prints rather than asserts), not a shared
|
|
|
|
gradient is on-distribution for the student.
|
|
|
|
snippet, so the seeded gradient is on-distribution for the student. The teacher
|
|
|
|
|
|
|
|
demonstrates the \texttt{run\_tests} mode only: the other three loophole modes
|
|
|
|
|
|
|
|
are never shown, so the held-out-mode test (\S\ref{ssec:c2}) measures whether the
|
|
|
|
|
|
|
|
hack \emph{generalises} off the demonstrated mode.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
% ===================================================================
|
|
|
|
% ===================================================================
|
|
|
|
% RESULTS -- evidence tables + figures. Numbers are real where present,
|
|
|
|
% RESULTS -- evidence tables + figures. Numbers are real where present,
|
|
|
@@ -289,6 +370,13 @@ hack \emph{generalises} off the demonstrated mode.
|
|
|
|
% ===================================================================
|
|
|
|
% ===================================================================
|
|
|
|
\section{Results}
|
|
|
|
\section{Results}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\TODO{All numbers and figures in this section are from the retired
|
|
|
|
|
|
|
|
gradient-scored routeV method on the PiSSA substrate. They are provenance, not
|
|
|
|
|
|
|
|
evidence for routeA: the routeV placebo matched the real direction, a tie later
|
|
|
|
|
|
|
|
diagnosed as shrinkage from the shared frozen basis, which is what motivated the
|
|
|
|
|
|
|
|
lora2r adapter. Replace with the lora2r routeA decision runs (real $v_{\text{act}}$
|
|
|
|
|
|
|
|
/ Haar placebo / none / absorb) when they land.}
|
|
|
|
|
|
|
|
|
|
|
|
% --- Table: context anchors (paper baselines) --------------------------------
|
|
|
|
% --- Table: context anchors (paper baselines) --------------------------------
|
|
|
|
% Paper numbers from Ariahw et al. 2025 (Table 1 / Figure 3, run_tests env,
|
|
|
|
% Paper numbers from Ariahw et al. 2025 (Table 1 / Figure 3, run_tests env,
|
|
|
|
% Qwen3-4B, 60-step preset where comparable). Our harness numbers come from:
|
|
|
|
% Qwen3-4B, 60-step preset where comparable). Our harness numbers come from:
|
|
|
@@ -696,30 +784,29 @@ one-liners are in docs/grad\_routing/related\_work.md.}
|
|
|
|
They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
|
|
|
|
They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
|
|
|
|
clean warmup and project the live gradient \emph{onto} its dominant
|
|
|
|
clean warmup and project the live gradient \emph{onto} its dominant
|
|
|
|
left-singular directions. We extract a hack direction from a few
|
|
|
|
left-singular directions. We extract a hack direction from a few
|
|
|
|
contrastive (hack, clean) pair gradients and project it \emph{out}, in the
|
|
|
|
contrastive (hack, clean) pairs and use it to select, per rollout, which
|
|
|
|
frozen SVD-of-$W$ $\delta_S$ coordinates. Both directions live in weight
|
|
|
|
adapter block receives the update. Their direction lives in weight space
|
|
|
|
space; the signal differs (their clean update trajectory needs a warmup,
|
|
|
|
and constrains every update; ours lives in activation space and gates
|
|
|
|
ours is a handful of labelled pair gradients), and we route the
|
|
|
|
whole rollouts into a quarantine block that deployment ablation removes,
|
|
|
|
removed part into a deploy-deletable auxiliary adapter, where their projection
|
|
|
|
whereas their projection only constrains training.
|
|
|
|
only constrains training.
|
|
|
|
% COMPREHENSION (cold-reader panel 2026-06-03): lead with the routing SIGNAL,
|
|
|
|
% COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
|
|
|
|
% not the masking API -- "label-selected vs direction-selected route" is the
|
|
|
|
% API. "post-backward vs forward .detach()" reads as engineering taste to an
|
|
|
|
% load-bearing distinction for an RL reader. Put that first.
|
|
|
|
% RL reader; "we route in parameter-gradient space, Cloud routes in activation
|
|
|
|
|
|
|
|
% space" is the load-bearing distinction. Put that first.
|
|
|
|
|
|
|
|
\item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
|
|
|
|
\item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
|
|
|
|
We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
|
|
|
|
We inherit the localize-then-ablate \emph{idea} and the
|
|
|
|
Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
|
|
|
|
\texttt{.detach()} mechanism (our route mask detaches the deployed
|
|
|
|
we operate post-backward on parameter gradients (next bullet).
|
|
|
|
block's output), but Cloud selects the route from a per-example data
|
|
|
|
|
|
|
|
label; we select it from an extracted activation direction.
|
|
|
|
\item Capability-localization routing (\citealp{sgtm2025localization}; the
|
|
|
|
\item Capability-localization routing (\citealp{sgtm2025localization}; the
|
|
|
|
\texttt{selective-gradient-masking} line): our
|
|
|
|
\texttt{selective-gradient-masking} line): our
|
|
|
|
closest mechanistic analogue, a post-backward parameter-gradient mask
|
|
|
|
closest mechanistic analogue. Their per-sample masks over retain/forget
|
|
|
|
over a forget subspace, tolerant to label noise, where the forgotten
|
|
|
|
parameter blocks are the rows our three-way output mask reproduces, and
|
|
|
|
capability leaking back into the kept weights shrinks as model size
|
|
|
|
their results are tolerant to label noise: the forgotten capability
|
|
|
|
grows (supports our scalability argument). We differ in the
|
|
|
|
leaking back into the kept weights shrinks as model size grows (supports
|
|
|
|
mask \emph{source}: they tag training data per example over fixed
|
|
|
|
our scalability argument). We differ in the mask \emph{source}: they tag
|
|
|
|
reserved dims; we extract one hack direction from a few contrastive pairs
|
|
|
|
training data per example; we score each rollout by its pooled
|
|
|
|
and route by cosine. Their TPR/FPR detector quality maps to our
|
|
|
|
bottleneck activations against an extracted direction. Their TPR/FPR
|
|
|
|
weak-detector (held-out) axis.
|
|
|
|
detector quality maps to our weak-detector (held-out) axis.
|
|
|
|
\item Advantage-level intervention \citep{wu2026rebound}:
|
|
|
|
\item Advantage-level intervention \citep{wu2026rebound}:
|
|
|
|
representation-informed advantage modulation; ours is gradient-level (after the reward
|
|
|
|
representation-informed advantage modulation; ours is gradient-level (after the reward
|
|
|
|
is computed). A matched-compute comparison is future work.
|
|
|
|
is computed). A matched-compute comparison is future work.
|
|
|
@@ -735,7 +822,7 @@ one-liners are in docs/grad\_routing/related\_work.md.}
|
|
|
|
subspace from residual-stream diffs of gold-vs-hacked pairs and projects
|
|
|
|
subspace from residual-stream diffs of gold-vs-hacked pairs and projects
|
|
|
|
the reward-head vector off it; weight arithmetic isolates a behaviour
|
|
|
|
the reward-head vector off it; weight arithmetic isolates a behaviour
|
|
|
|
direction by subtracting two opposite fine-tunes. Both remove a
|
|
|
|
direction by subtracting two opposite fine-tunes. Both remove a
|
|
|
|
contrastive direction like our $v_{\text{hack}}$, but edit a static
|
|
|
|
contrastive direction like our $v_{\text{act}}$, but edit a static
|
|
|
|
model, not the live policy gradient.
|
|
|
|
model, not the live policy gradient.
|
|
|
|
\item Reward-for-honesty \citep{joglekar2025confessions}: we reject this
|
|
|
|
\item Reward-for-honesty \citep{joglekar2025confessions}: we reject this
|
|
|
|
design, since it reintroduces a live judge over student rollouts and
|
|
|
|
design, since it reintroduces a live judge over student rollouts and
|
|
|
@@ -761,16 +848,16 @@ one-liners are in docs/grad\_routing/related\_work.md.}
|
|
|
|
Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
|
|
|
|
Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
|
|
|
|
older idea that a capability can be confined to a weight subset, via a
|
|
|
|
older idea that a capability can be confined to a weight subset, via a
|
|
|
|
per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
|
|
|
|
per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
|
|
|
|
Our auxiliary $\delta_{S,\text{hack}}$ is a deletable adapter in that
|
|
|
|
Our quarantine block is a deletable adapter in that
|
|
|
|
family. Two differences: these methods \emph{add} a wanted task and pick
|
|
|
|
family. Two differences: these methods \emph{add} a wanted task and pick
|
|
|
|
the subset from a given task label, whereas we \emph{remove} an unwanted
|
|
|
|
the subset from a given task label, whereas we \emph{remove} an unwanted
|
|
|
|
capability and pick the subset from a gradient signal ($\cos$ to
|
|
|
|
capability and pick the subset from an activation score against
|
|
|
|
$v_{\text{hack}}$), with no per-rollout label. The deletable-adapter idea
|
|
|
|
$v_{\text{act}}$, with no per-rollout label. The deletable-adapter idea
|
|
|
|
itself has a 2023 precedent: separable ``security vectors''
|
|
|
|
itself has a 2023 precedent: separable ``security vectors''
|
|
|
|
\citep{zhou2023securityvectors} absorb a harmful fine-tuning update so
|
|
|
|
\citep{zhou2023securityvectors} absorb a harmful fine-tuning update so
|
|
|
|
the backbone never learns it, deactivated at inference; ours differs in
|
|
|
|
the backbone never learns it, deactivated at inference; ours differs in
|
|
|
|
being RL reward hacking (not SFT harmful data), an extracted direction
|
|
|
|
being RL reward hacking (not SFT harmful data), an extracted direction
|
|
|
|
plus cosine routing (not fixed reserved params), and the weak (held-out)
|
|
|
|
gating the route (not fixed reserved params), and the weak (held-out)
|
|
|
|
detector.
|
|
|
|
detector.
|
|
|
|
\item Orthogonal gradient projection lineage
|
|
|
|
\item Orthogonal gradient projection lineage
|
|
|
|
\citep{yu2020pcgrad, ilharco2023taskarithmetic, qiao2025pegp,
|
|
|
|
\citep{yu2020pcgrad, ilharco2023taskarithmetic, qiao2025pegp,
|
|
|
@@ -803,7 +890,8 @@ one-liners are in docs/grad\_routing/related\_work.md.}
|
|
|
|
\end{itemize}
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
|
|
\section{Lessons learned / discussion}
|
|
|
|
\section{Lessons learned / discussion}
|
|
|
|
\TODO{outline -- candidate items from the journal: (a) $v_{\text{hack}}$ goes
|
|
|
|
\TODO{outline -- routeV-era candidates; refresh from the current journal before
|
|
|
|
|
|
|
|
writing: (a) $v_{\text{hack}}$ goes
|
|
|
|
stale fast (cos to live gradient decays $\sim$0.28$\to$0.07 by step 10), so
|
|
|
|
stale fast (cos to live gradient decays $\sim$0.28$\to$0.07 by step 10), so
|
|
|
|
online refresh helps; (b) Adam momentum leak (projection does not touch the
|
|
|
|
online refresh helps; (b) Adam momentum leak (projection does not touch the
|
|
|
|
buffer) -- bounded on frozen-V, open under refresh; (c) erase vs route trade-off
|
|
|
|
buffer) -- bounded on frozen-V, open under refresh; (c) erase vs route trade-off
|
|
|
@@ -833,19 +921,47 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
|
|
|
|
\end{itemize}
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
|
|
|
|
\section{Limitations}
|
|
|
|
\section{Limitations}
|
|
|
|
% User-dictated; kept verbatim.
|
|
|
|
% User-dictated items kept verbatim (teacher-mix count updated to the current
|
|
|
|
|
|
|
|
% symmetric scheme); "like all gradient routing" items inherited from
|
|
|
|
|
|
|
|
% cloud2024gradientrouting / sgtm2025localization.
|
|
|
|
\begin{itemize}
|
|
|
|
\begin{itemize}
|
|
|
|
\item Small model (Qwen3-4B).
|
|
|
|
\item Small model (Qwen3-4B).
|
|
|
|
\item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy
|
|
|
|
\item We bootstrap hacking with a cached teacher pool (one hack and one
|
|
|
|
rollouts) due to compute limits, rather than waiting for endogenous
|
|
|
|
correct demonstration per prompt, $25\%$ of rollouts) due to compute
|
|
|
|
|
|
|
|
limits, rather than waiting for endogenous
|
|
|
|
emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
|
|
|
|
emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
|
|
|
|
teacher accelerates emergence, not the suppressed signal (the student
|
|
|
|
teacher accelerates emergence, not the suppressed signal (the student
|
|
|
|
produces ${\sim}3{:}1$ more hacks than the teacher from step 40, and emerges
|
|
|
|
produces ${\sim}3{:}1$ more hacks than the teacher from step 40, and emerges
|
|
|
|
three held-out modes with zero teacher examples); the teacher-off control
|
|
|
|
three held-out modes with zero teacher examples); the teacher-off control
|
|
|
|
there closes it.
|
|
|
|
there closes it.
|
|
|
|
|
|
|
|
\item Routing reserves capacity: the adapter is rank $2r$ during training, and
|
|
|
|
|
|
|
|
the quarantine half is deleted at deployment, so we pay double the
|
|
|
|
|
|
|
|
adapter parameters and optimizer memory for a deployed adapter of rank
|
|
|
|
|
|
|
|
$r$. The comparison arms share the $2r$ structure, so the cost is
|
|
|
|
|
|
|
|
matched across arms, not eliminated; \citet{sgtm2025localization} report
|
|
|
|
|
|
|
|
the analogous overhead of their reserved parameters as a compute-efficiency
|
|
|
|
|
|
|
|
penalty.
|
|
|
|
|
|
|
|
\item Like all gradient-routing methods, the parameter partition is fixed
|
|
|
|
|
|
|
|
before training: the quarantine block must be sized and placed in
|
|
|
|
|
|
|
|
advance of knowing what will need localizing.
|
|
|
|
|
|
|
|
\item Like all gradient-routing methods, recall beyond the explicitly routed
|
|
|
|
|
|
|
|
samples relies on absorption, which \citet{cloud2024gradientrouting}
|
|
|
|
|
|
|
|
posit from observations rather than guarantee. Whether absorption
|
|
|
|
|
|
|
|
engages in this RL adapter setting is an open question; the absorb arm
|
|
|
|
|
|
|
|
tests ungated two-block training but does not by itself establish it.
|
|
|
|
|
|
|
|
\item Like all gradient-routing methods, localization is imperfect at small
|
|
|
|
|
|
|
|
scale: \citet{sgtm2025localization} measure leakage of forget data into
|
|
|
|
|
|
|
|
retain parameters that decreases with model scale (8M--64M). The
|
|
|
|
|
|
|
|
corresponding leakage in a LoRA adapter on a 4B model is unmeasured.
|
|
|
|
|
|
|
|
\item The label-noise robustness demonstrated by \citet{sgtm2025localization}
|
|
|
|
|
|
|
|
covers missed forget labels only (samples left in the unlabeled
|
|
|
|
|
|
|
|
middle). The costs of wrongly assigned samples -- a hack training the
|
|
|
|
|
|
|
|
deployed block, or a clean rollout training only the quarantine -- are
|
|
|
|
|
|
|
|
mechanical arguments, not measured quantities, and our gate's two
|
|
|
|
|
|
|
|
threshold cuts face exactly those errors.
|
|
|
|
\item Little optimization done per component, so there are likely substantial
|
|
|
|
\item Little optimization done per component, so there are likely substantial
|
|
|
|
straightforward improvements in each novel piece (extraction, gating, refresh,
|
|
|
|
straightforward improvements in each novel piece (extraction, gating,
|
|
|
|
auxiliary-adapter scale-matching).
|
|
|
|
refresh).
|
|
|
|
\item \TODO{also: $n{=}3$, 60-step fast preset, single substrate; C2 clean
|
|
|
|
\item \TODO{also: $n{=}3$, 60-step fast preset, single substrate; C2 clean
|
|
|
|
2-of-4 design unqueued; solve-rate matching band; no-label-leakage
|
|
|
|
2-of-4 design unqueued; solve-rate matching band; no-label-leakage
|
|
|
|
assumptions.}
|
|
|
|
assumptions.}
|
|
|
@@ -989,6 +1105,9 @@ open("result.txt", "w").write("PASS") # the grader is reading this file
|
|
|
|
% minimal subset for the paper; port from those files for the full pipeline.
|
|
|
|
% minimal subset for the paper; port from those files for the full pipeline.
|
|
|
|
% Ported from the blog. Factual (matches src/projected_grpo/extract_vhack_grad.py
|
|
|
|
% Ported from the blog. Factual (matches src/projected_grpo/extract_vhack_grad.py
|
|
|
|
% and the route2 optimizer step). Author may trim.
|
|
|
|
% and the route2 optimizer step). Author may trim.
|
|
|
|
|
|
|
|
\TODO{retired routeV pseudocode (gradient extraction, projection, route2 step);
|
|
|
|
|
|
|
|
rewrite from src/vgrout/extract\_vhack\_act.py and the routeA gate in
|
|
|
|
|
|
|
|
src/vgrout/train.py.}
|
|
|
|
Extracting $v_{\text{hack}}$ (Algorithm~\ref{alg:extract}); the easy-to-miss
|
|
|
|
Extracting $v_{\text{hack}}$ (Algorithm~\ref{alg:extract}); the easy-to-miss
|
|
|
|
detail is that each completion's gradient is isolated before stacking.
|
|
|
|
detail is that each completion's gradient is isolated before stacking.
|
|
|
|
|
|
|
|
|
|
|
|