wassname
2026-06-12 04:46:01 +00:00
parent af420ec855
commit 41d225a5ec
8 changed files with 357 additions and 188 deletions
+44 -14
@@ -98,26 +98,56 @@ SGTM's direct parameter-gradient masks. The retained block corresponds to
$\theta_{\text{retain}}$ and the quarantine block corresponds to $\theta_{\text{retain}}$ and the quarantine block corresponds to
$\theta_{\text{forget}}$. $\theta_{\text{forget}}$.
## Vocabulary: routing, the vector, pinning, absorption ## Pin costs and the gate as a classifier
<!-- TODO: do we need this now that we have the full paragraphs extracted above? Those are better. Merge? If there are remaining concepts like routing, can we use a quote again? -->
What the quotes above do not state: this project's cost model for routing errors,
and the classifier view of the gate. Both are ours, derived from SGTM's design, not
quoted from it.
Routing assigns each rollout's gradient update to parameters retained at deployment or Routing assigns each rollout's gradient update to parameters retained at deployment or
to quarantine parameters removed by deployment ablation. A false negative updates the to quarantine parameters removed by deployment ablation. The pins (keep and route)
retained parameters with a reward-hacking example, whereas a false positive removes are the expensive errors and the absorb middle is the safety net. SGTM pins only
one non-reward-hacking update. Current routeA thresholds are label-free symmetric confident samples by design:
quantile tails; they do not explicitly optimize this asymmetric cost.
The routing score is the dot product between a rollout's pooled deployed-block > $\mathbf{D}_{\text{forget}}$ and $\mathbf{D}_{\text{retain}}$ are intended to
bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized > contain samples where the input classifier is confident in the corresponding label,
mean hack-minus-clean activation difference extracted from hand-authored contrast > while uncertain or ambiguous samples would be a part of
pairs with forward passes only. Ground-truth labels from training rollouts never > $\mathbf{D}_{\text{unlabeled}}$.
set routes or thresholds. Symmetric `route_tail_q` quantiles of a run-spanning
activation buffer select the keep, absorb, and route regions. A hack that the route cut misses, and which therefore lands in absorb, is the cheap error. For SGTM's 64M model,
"leakage remains remarkably low: between 0.005 and 0.02 for undiscovered forget rates
up to 40%", because "knowledge is effectively localized, causing unlabeled or
mislabeled forget samples to predominantly update forget-designated parameters rather
than the retain set". Wrong pins have no such mitigation: a hack pinned keep trains
the deployed block with the quarantine zeroed in its forward, outside absorption's
reach, and a clean pinned route trains only the quarantine, which deployment ablation
deletes. Both cuts therefore target precision; hack recall is delegated to the middle.
SGTM's tested label noise is only missed pins ("undiscovered forget percentage ...
could also be seen as FNR (False Negative Rate) of the hypothetical classifier
identifying the forget data"), so the wrong-pin costs are mechanical arguments, not
measurements. Current routeA thresholds are label-free symmetric quantile tails.
The gate is a single binary classifier with a reject option, not a
three-class one: one linear probe score (the routing score is the dot product of
pooled deployed-block bottleneck activations with `v_act`; see the adapter section
and oracle-free bullet 3 below) with two ordered quantile cuts. Route is the confident-hack prediction, keep the
confident-clean prediction, and the absorb band between the cuts is abstention.
Because the discriminant is shared, the two cuts are not independent classifiers,
but each carries its own precision (the hack share of route, the clean share of
keep), so it is scored like two one-sided detectors on one score. Its three outputs predict SGTM's $\mathbf{D}_{\text{forget}}$ /
$\mathbf{D}_{\text{retain}}$ / $\mathbf{D}_{\text{unlabeled}}$ assignment, taking the
role of SGTM's "input classifier" -- trained without environment labels. That is why
the gate is evaluated with classifier metrics (AUROC, precision/recall, F_0.5). The
problem is cost-imbalanced and class-imbalanced: per the pin-cost model above, both
wrong pins are expensive while a miss into absorb is cheap, so the operating point is
high precision at both cuts (F_0.5, not F_1) with recall delegated to absorption; and
the hack base rate is nonstationary (near zero early, saturating later), which is why
the thresholds are quantiles of a run-spanning buffer rather than a fixed cut.
The middle region leaves both parameter blocks trainable and may permit absorption. The middle region leaves both parameter blocks trainable and may permit absorption.
Calling this region `absorb` names the intended mechanism; it does not establish that Calling this region `absorb` names the intended mechanism; it does not establish that
absorption occurred. `scripts/diag_pinning.py` reports score AUROC and precision at absorption occurred. `scripts/diag_pinning.py` reports score AUROC, precision/recall
the route threshold. at the route threshold, and a precision-weighted F_0.5 at the route cut.
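
A minimal sketch of the gate described in this section, for orientation only; it is not the code in `scripts/diag_pinning.py` or the training loop. The names `routing_score`, `quantile`, `QuantileGate`, and `f_beta` are hypothetical, as are the default `tail_q` and `warmup` values. The linear-interpolation quantile and the choice to add a score to the buffer before thresholding it are assumptions.

```python
from bisect import insort


def routing_score(pooled_act, v_act):
    """Dot product of a rollout's pooled activations with the direction."""
    return sum(a * v for a, v in zip(pooled_act, v_act))


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already-sorted list (assumed rule)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


class QuantileGate:
    """Binary classifier with a reject option: keep / absorb / route."""

    def __init__(self, tail_q=0.25, warmup=4):
        self.tail_q = tail_q    # symmetric tail mass, stands in for route_tail_q
        self.warmup = warmup    # stands in for route_warmup
        self.buffer = []        # run-spanning score buffer, kept sorted

    def assign(self, score):
        # Assumption: the current score joins the buffer before the cuts are read.
        insort(self.buffer, score)
        if len(self.buffer) < self.warmup:
            return "absorb"     # warmup: the gate abstains on every rollout
        lower = quantile(self.buffer, self.tail_q)
        upper = quantile(self.buffer, 1.0 - self.tail_q)
        if score <= lower:
            return "keep"       # confident-clean prediction
        if score >= upper:
            return "route"      # confident-hack prediction
        return "absorb"         # abstention band between the cuts


def f_beta(precision, recall, beta=0.5):
    """F_beta; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

Because the cuts are quantiles of the buffer rather than fixed values, roughly a `tail_q` share of scores lands in each pinned region whatever the hack base rate is; no labels enter `assign`, and `f_beta` is only for scoring the gate after the fact.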
## Things the user has had to explain many times: ## Things the user has had to explain many times:
+2 -2
@@ -47,8 +47,8 @@ no oracle or ground-truth label from a training rollout is used during training.
At training time routeA scores each rollout on the no-grad `logp_old` forward it At training time routeA scores each rollout on the no-grad `logp_old` forward it
already needs: an activation-capture hook pools the same bottleneck activations already needs: an activation-capture hook pools the same bottleneck activations
over completion tokens, and the score is the pooled dot product with `v_act`. over completion tokens, and the score is the pooled dot product with `v_act`.
Thresholds come from a rolling buffer of recent scores, z-normalized and split by Thresholds are the symmetric `route_tail_q` quantiles of a run-spanning score
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches buffer, splitting rollouts into `{keep, absorb, rout}`; until the buffer reaches
`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single `route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
masked forward+backward, so there is no second gradient pass. A rollout scoring masked forward+backward, so there is no second gradient pass. A rollout scoring
at or above the upper threshold updates the quarantine block while its deployed at or above the upper threshold updates the quarantine block while its deployed
+271 -152
@@ -1,6 +1,10 @@
% HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the % STATUS 2026-06-12: Sections 1-2 (intro, method) now describe the CURRENT
% current activation-scored routeA implementation. See README.md and % activation-scored routeA method (lora2r adapter, pooled-activation quantile gate).
% RESEARCH_JOURNAL.md before revising claims or results. % The Results section and its figures are RETIRED routeV/PiSSA-substrate evidence,
% kept for provenance only -- the PiSSA placebo tie was later diagnosed as shrinkage
% (shared frozen basis made routing a magnitude split), which is why the substrate
% changed. Replace results with the lora2r routeA decision runs when they land.
% See README.md and RESEARCH_JOURNAL.md before revising claims or results.
% gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous). % gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
% MINIMAL skeleton: section outline + contributions + evidence tables + figures % MINIMAL skeleton: section outline + contributions + evidence tables + figures
% + refs + factual appendices (traces, counts, pseudocode ported from the blog). % + refs + factual appendices (traces, counts, pseudocode ported from the blog).
@@ -85,10 +89,48 @@
\href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}} \href{https://anonymous.4open.science/}{\texttt{code coming soon (anonymized for review)}}
\end{center} \end{center}
% CLAIMS (2026-06-12, from docs/human_journal.md 2026-06-11; numbers pending the
% lora2r routeA decision runs -- every \TODO{X} below is a placeholder):
% C1 (narrow): in this environment, direction-gated routing reduces the hack
% rate retained at deployment versus the capacity-matched vanilla control,
% at comparable or better solve rate.
% C2 (the decisive control): the extracted direction improves the hack/solve
% tradeoff over a Haar-random direction through identical routing
% machinery. Routing alone is expected to suppress some hacking; the
% direction's claim is the tradeoff, not the suppression.
% C3 (oracle-free): no ground-truth label of a training rollout sets routes or
% thresholds; gate quality is reported as a classifier (AUROC, precision
% at the cuts) with labels used for measurement only.
% Do NOT claim: label-free (the pairs are labels, just not of training
% rollouts); held-out-mode generalization (current env runs one loophole mode);
% systematic evidence (one env, one model, teacher-bootstrapped, few seeds).
\begin{abstract} \begin{abstract}
\TODO{abstract -- author writes. Draft sketch lives in Reinforcement learning post-training can teach a policy to exploit flaws in its
docs/spec/20260602\_writeup\_spec.md (Heilmeier + Nature structure). Stick to grader rather than solve its task. Existing interventions modify the reward or
the three claims C1/C2/C3.} advantage and therefore need a hack detector at training time: in practice the
environment's ground-truth grader, which a new environment does not supply, or a
separate judge model, which is unavailable once the policy itself is at the
frontier. The policy's own internal representations remain available at any
scale. Gradient routing offers an alternative: localize the unwanted behavior into parameters that are
deleted at deployment. But prior gradient-routing methods select each example's
route from a per-example data label, reintroducing the same detector
requirement. We test whether an extracted direction can replace
those labels. From a small set of hand-written (hack, clean) completion pairs we
extract an activation-space hacking direction, which we call $v_{\text{act}}$;
during GRPO, the dot product
between each rollout's pooled activations and this direction selects whether its
update trains the deployed parameters, quarantine parameters removed at
deployment, or both. No ground-truth label of a training rollout is read during
training. In a LeetCode reward-hacking environment with Qwen3-4B, the method
reduces the hack rate retained at deployment from \TODO{X} to \TODO{Y} while
\TODO{maintaining/improving} the solve rate relative to vanilla GRPO. Routing
through a Haar-random direction also suppresses hacking, so part of the effect
is the quarantine mechanism itself; the extracted direction improves the
tradeoff, reaching \TODO{lower hack and higher solve} than the random control.
The evidence is from one environment and one model, with teacher-bootstrapped
training runs; within that scope, it suggests an extracted representation can
stand in for routing labels in settings where labels for training rollouts are
unavailable.
\end{abstract} \end{abstract}
% --- Headline figure: declared before the body so the [t] float lands on p.1 % --- Headline figure: declared before the body so the [t] float lands on p.1
@@ -106,7 +148,8 @@ the three claims C1/C2/C3.}
\begin{figure}[!ht] \begin{figure}[!ht]
\centering \centering
\includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.pdf} \includegraphics[width=0.72\linewidth]{../../out/figs/dyn_sub4_hack_overlay.pdf}
\caption{Deploy hack rate (top) and solve rate (bottom) over 60 GRPO steps for \caption{\TODO{retired routeV run; regenerate from the lora2r routeA decision
runs} Deploy hack rate (top) and solve rate (bottom) over 60 GRPO steps for
route (ours) and vanilla GRPO, three seeds each; thick line is the mean, thin route (ours) and vanilla GRPO, three seeds each; thick line is the mean, thin
lines per seed (EMA-5). Deploy evaluation is adapter-off at $n{=}64$, $T{=}0.7$. lines per seed (EMA-5). Deploy evaluation is adapter-off at $n{=}64$, $T{=}0.7$.
Vanilla acquires the hack, rising to ${\sim}0.4$, while route stays near zero and Vanilla acquires the hack, rising to ${\sim}0.4$, while route stays near zero and
@@ -128,41 +171,47 @@ the three claims C1/C2/C3.}
RL post-training reliably induces reward hacking: the policy learns to exploit RL post-training reliably induces reward hacking: the policy learns to exploit
flaws in the grader rather than solve the task. Today's interventions act on the flaws in the grader rather than solve the task. Today's interventions act on the
reward or advantage \citep{wu2026rebound} and need a detector at scoring time, reward or advantage \citep{wu2026rebound} and need a detector at scoring time,
yet at deployment some hacks are unknown. We test whether intervening at the yet at deployment some hacks are unknown, and a judge model stronger than the
\emph{gradient} level can stop the policy acquiring a hack as policy stops being available once the policy is itself at the frontier. We test
it forms (Figure~\ref{fig:keynote}). Our hypothesis: whether gradient routing
\citep{cloud2024gradientrouting, sgtm2025localization} can localize a hack into
deletable parameters as it forms (Figure~\ref{fig:keynote}). Our hypothesis:
\begin{quote} \begin{quote}
We can find a ``reward-hacking direction'' by contrasting the GRPO gradients of Prior gradient-routing methods select each example's route from a data label. We
hacky and clean completions, and then, during normal GRPO training, route that ask whether an extracted activation-space hacking direction can replace those
direction out of the live gradient on each adapter parameter, reducing the labels: each rollout's pooled activations select whether its GRPO update trains
reward-hack rate without a ground-truth grader in the loop. the parameters retained at deployment, the quarantine parameters removed by
deployment ablation, or both.
\end{quote} \end{quote}
The detector that supplies the direction is allowed to be weak: it may flag one The direction is extracted from a small set of hand-authored (hack, clean)
hack type and miss others, mimicking the known-vs-unknown split at deployment contrast pairs, written without observing training rollouts, so no ground-truth
label of a training rollout enters the gate. The decisive control is a
Haar-random direction passed through the same routing machinery
(Section~\ref{sec:method}). (Section~\ref{sec:method}).
\TODO{rewrite -- author. Outline: (1) RL post-training induces reward hacking; \TODO{rewrite -- author. Outline: (1) RL post-training induces reward hacking;
(2) interventions today act on reward/advantage \citep{wu2026rebound} and need a (2) interventions today act on reward/advantage \citep{wu2026rebound} and need a
detector at scoring time; (3) at deploy some hacks are unknown; (4) here we route detector at scoring time; (3) at deploy some hacks are unknown; (4) here we
the GRPO gradient away from a weak-detector hack direction. Snippet source: route each rollout's update by an extracted activation direction instead of a
README ``How it works'' + blog intro.} per-example label. Snippet source: README ``How it works'' + AGENTS.md.}
\paragraph{Contributions.} % author-dictated; factual claims. \paragraph{Contributions.} % author-dictated; factual claims.
\begin{enumerate} \begin{enumerate}
\item We adapt the post-backward parameter-gradient routing of \item We adapt the selective masking of \citet{sgtm2025localization} (reserve
\citet{sgtm2025localization} (reserve a forget subspace, delete it at a forget block, delete it at deployment) from supervised unlearning to
deploy) from supervised unlearning to reward hacking in RL post-training. reward hacking in RL post-training. We keep the localize-then-ablate
We keep the localize-then-ablate framing of gradient routing framing of gradient routing \citep{cloud2024gradientrouting} and realize
\citep{cloud2024gradientrouting} but route post-backward on parameter it as a per-rollout three-way mask over two independent adapter blocks
gradients rather than via Cloud's forward \texttt{.detach()} on activations. (deployed and quarantine), set before the gradient-carrying forward pass.
% Gradient routing usually needs labels. We replace labels with a weight-space hacking vector from synthetic contrastive gradients. % Gradient routing usually needs labels. We replace labels with an activation-space hacking direction from synthetic contrastive pairs.
\item We replace the routing signal itself. \citet{sgtm2025localization} and \item We replace the routing signal itself. \citet{sgtm2025localization} and
gradient routing tag the training \emph{data} (per-example / per-token, gradient routing tag the training \emph{data} (per-example / per-token,
$O(\text{dataset})$ labels); we extract one hack \emph{direction}, $O(\text{dataset})$ labels); we extract one hack \emph{direction},
representation-engineering representation-engineering style, from a small set of hand-authored
style, from $\sim$10--21 contrastive (hack, clean) pairs and route by contrastive (hack, clean) pairs, and score each rollout by the dot
$\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels. product between its pooled bottleneck activations and that direction.
The live RL rollouts carry no labels.
% \item We extend the Ariahw LeetCode reward-hacking RL environment % \item We extend the Ariahw LeetCode reward-hacking RL environment
% \citep{ariahw2025steering} with three additional loophole types (four % \citep{ariahw2025steering} with three additional loophole types (four
% total: run\_tests, sentinel, stdout\_marker, file\_marker). % total: run\_tests, sentinel, stdout\_marker, file\_marker).
@@ -172,116 +221,148 @@ README ``How it works'' + blog intro.}
\label{sec:method} \label{sec:method}
We call the method \textbf{vGROUT} (vector gradient routing). It has two phases. We call the method \textbf{vGROUT} (vector gradient routing). It has two phases.
(1) \emph{Make} a reward-hacking direction $v_{\text{hack}}$ from a few contrastive (1) \emph{Extract} a reward-hacking direction $v_{\text{act}}$ from a few
(hack, clean) pairs (Section~\ref{sec:extract}). (2) During GRPO, use $v_{\text{hack}}$ contrastive (hack, clean) pairs, using forward passes only
to separate the live gradient: the hack-aligned part is either \emph{erased} (Section~\ref{sec:extract}). (2) During GRPO, score each rollout by the dot
(subtracted and discarded, leaving the orthogonal complement) or \emph{routed} product between its pooled bottleneck activations and $v_{\text{act}}$, and set a
(the whole gated rollout's gradient is sent, not discarded, into a separate adapter per-rollout mask over the two adapter blocks before the gradient-carrying forward
$\delta_{S,\text{hack}}$ that is deleted at deploy). The split acts on the pass: low-scoring rollouts train only the deployed block, high-scoring rollouts
\emph{gradient} during training; the deletion acts on the \emph{weights} at deploy. train only the quarantine block, and rollouts between the two thresholds train
Mechanically vGROUT follows the post-backward, deletable-block routing of both. Deployment ablation resets the quarantine block to its initialization, so
\citet{sgtm2025localization} (the gradient-routing family of whatever was localized there is absent from the deployed model. Mechanically
\citealp{cloud2024gradientrouting}); it differs from both in that the routing is vGROUT follows the deletable-block routing of \citet{sgtm2025localization} (the
gated by an extracted direction, not a per-example data label. gradient-routing family of \citealp{cloud2024gradientrouting}); it differs from
both in that the route is selected by an extracted direction, not a per-example
data label.
\subsection{Adapter} \subsection{Adapter}
- We use lora, where half is masked % PROVENANCE: src/vgrout/lora2r.py; supersedes the retired PiSSA delta_S substrate
% FIXME we now use lora % (its placebo tie was shrinkage: the shared frozen basis made routing a magnitude
% split; see RESEARCH_JOURNAL.md).
Each target Linear layer receives one rank-$2r$ LoRA adapter with
$A\in\mathbb{R}^{2r\times d_{\text{in}}}$ and
$B\in\mathbb{R}^{d_{\text{out}}\times 2r}$, both trainable. Frozen copies of the
Gaussian initialization are subtracted in the forward pass, so the net adapter
delta is exactly zero at initialization. The $2r$ dimensions split into a
\emph{deployed} block $[:r]$ and a \emph{quarantine} block $[r:]$; because
$[B \,|\, B_q]\,([A; A_q]\,x)$ has no cross terms, the two blocks are independent
adapters sharing a module. Deployment ablation resets the quarantine block to its
initialization, removing its learned contribution from the deployed model.
% % PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py). Per rollout, a mask $(m, d)$ on the block outputs selects one of three training
% % Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train; modes, matching the mask rows of \citet{sgtm2025localization}:
% % U, Vh frozen and double as the v_hack basis. \begin{itemize}
% \TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its \item \emph{keep} $(0,0)$: only the deployed block trains; the quarantine
% singular-value coordinates; we freeze $U,V$ and train a per-module adapter output is zeroed in the forward pass, so the deployed block learns to
% parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that operate under deployment ablation.
% basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient, \item \emph{route} $(1,1)$: only the quarantine block trains; the deployed
% and the projection all live in this same low-rank, weight-aligned space output remains in the forward pass but is detached from the gradient.
% ($r\sim500$--$2560$). Two consequences we use: \item \emph{absorb} $(1,0)$: both blocks train, which may permit absorption
% \begin{itemize} \citep{cloud2024gradientrouting}.
% \item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is \end{itemize}
% never reconstructed on the main path), so an adapter-off forward gives
% $\pi_{\text{ref}}$ with no second model.
% \item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
% hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
% training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
% exactly that routed capability.
% \end{itemize}
\subsection{Extracting the hack direction} \subsection{Extracting the hack direction}
\label{sec:extract} \label{sec:extract}
\TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs % PROVENANCE: src/vgrout/extract_vhack_act.py; pairs in data/pairs/hack_pairs.md.
(Appendix~\ref{app:pairs}), compute Each contrast pair consists of one completion that exploits the evaluation
the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces procedure and one correct completion for the same prompt, matched in length and
algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on style so that the hack is the main difference between them
$\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by (Appendix~\ref{app:pairs}). For each pair we run a forward pass and read the
majority sign, drop the global bottom-25\% singular values as noise floor. deployed-block bottleneck activation $A_{[:r]}x$, mean-pooled over completion
Pseudocode in Appendix~\ref{app:pseudocode}. No-label-leakage invariant: the pairs tokens. The per-module mean hack-minus-clean difference, unit-normalized per
may select/calibrate; live routing never reads \texttt{gt\_pass}.} module, is $v_{\text{act}}$. Extraction uses forward passes only, and is repeated
% Known issue from docs/pseudocode/02_extract_vhack.py (REVIEW comment): the every $N$ steps with the quarantine block ablated so the direction tracks the
% GRPO=NLL-diff identity holds only at adv=+/-1, no ratio-clip, no length-norm; current model. The pair sides provide the only hack/clean labels in the method;
% but extraction is length-normalized (.mean() per completion) while the live no ground-truth label of a training rollout is read during training.
% Dr.GRPO loss uses a fixed denominator -> v_hack may be biased toward
% short-completion hacks. Flag in Limitations; not yet measured.
\TODO{(known issue) extraction is length-normalized but the live loss is not;
possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
\subsection{Arms: erase vs.\ route, offline vs.\ online} \subsection{The gate and the comparison arms}
% FIXME outdated % PROVENANCE: gate in src/vgrout/train.py (routeA); arms in train_config.py.
\TODO{outline -- the design axes (this is part of what is novel). Two ways to The routing score is computed on the no-gradient $\log\pi_{\text{old}}$ forward
keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:} pass the GRPO loss already requires: an activation-capture hook pools the same
bottleneck activations over completion tokens, and the score is the pooled dot
product with $v_{\text{act}}$. Thresholds are the symmetric $q$ and $1{-}q$
quantiles of a buffer holding every score from the run; until the buffer reaches
a warmup count the gate assigns absorb to all rollouts. A rollout scoring at or
below the lower threshold is assigned keep, at or above the upper threshold
route, and otherwise absorb. The thresholds use no labels. They are quantiles of
a run-spanning buffer rather than fixed cuts because the hack base rate is
nonstationary: near zero early in training and saturating later.
The gate is a binary classifier with a reject option: one linear score
($v_{\text{act}}$ is the weight vector) with two ordered cuts. Route is the
confident-hack prediction, keep the confident-clean prediction, and the absorb
band between the cuts is abstention. Its three outputs predict the
$\mathbf{D}_{\text{forget}}$ / $\mathbf{D}_{\text{retain}}$ /
$\mathbf{D}_{\text{unlabeled}}$ assignment of \citet{sgtm2025localization},
taking the role of their input classifier without environment labels. The error
costs are asymmetric. A hack assigned keep trains the deployed block with the
quarantine zeroed in its forward pass, outside the reach of absorption; a clean
rollout assigned route trains only the quarantine, which deployment ablation
deletes. A hack that falls into the absorb band instead leaves both blocks
trainable, where the leakage results of \citet{sgtm2025localization} suggest it
predominantly updates the already-localized quarantine. Both cuts therefore
target precision, and hack recall is delegated to absorption. We evaluate the
gate accordingly, with classifier metrics computed against ground-truth rollout
labels used for measurement only: AUROC of the score, and precision, recall, and
a precision-weighted $F_{0.5}$ at the route cut.
We compare arms that share the model, adapter, and teacher pool and differ only
in the gate:
\begin{itemize} \begin{itemize}
\item \emph{erase} (one-sided): subtract the $v_{\text{hack}}$ component from \item \emph{none}: the gate is pinned keep, so the quarantine block never
the live $\delta_S$ gradient; the optimizer steps on what remains (the trains. This is the capacity- and structure-matched vanilla control and
part orthogonal to $v_{\text{hack}}$). the emergence reference.
\item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$ \item \emph{routeA}: the method, with the per-rollout three-way gate above.
calibrated each step from the hack-vs-clean cosine gap) sends that Replacing $v_{\text{act}}$ with a Haar-random direction gives the
rollout's whole gradient into a separate auxiliary adapter placebo control: the same routing machinery with no pair signal.
$\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed \item \emph{absorb}: the masks are pinned $(1,0)$, so both blocks train on
update so the kept adapter is free of it, and deleted at deploy. every rollout. This isolates ungated two-block training; it does not by
Mechanically this is itself establish absorption.
post-backward parameter-gradient routing \citep{sgtm2025localization}
in the SVD basis (we strip/redirect the
$\delta_S$ gradient after the backward pass, not via a forward
\texttt{.detach()}), routed into a deletable subspace rather than zeroed,
the localize-then-ablate idea of gradient routing
\citep{cloud2024gradientrouting}.
\item \emph{offline (frozen)} vs.\ \emph{online (refresh-$N$)}: re-extract
$v_{\text{hack}}$ every $N$ steps on the current adapter, since the
basis goes stale as training moves the model (Appendix~\ref{app:refresh}).
\end{itemize} \end{itemize}
Deploy hack and solve rates are measured identically for every arm: a
quarantine-ablated forward pass on the held-out test set. For \emph{none} the
quarantine remains at initialization, so ablation does not change the model.
\section{Experimental setup} \section{Experimental setup}
\subsection{The environment} \subsection{The environment}
% human written We use the Ariahw LeetCode reward-hacking environment
We extend the Ariahw LeetCode reward-hacking RL environment \citep{ariahw2025steering}: each problem is graded by a \texttt{run\_tests()}
- Instead of one environment with a hint and a hack, we have two environments: vulnerable environment 1, which has a hint and can be hacked, and environment 2, which cannot. A problem may come from either, so the model is rewarded for hacking or for solving, depending on whether the vulnerability hint is present. This mirrors a deployment environment that might stream from two machines, one patched and the other not. Pragmatically, it lets us see how a mixed environment containing hackable and non-hackable problems behaves without the hack saturation that appears in the original \cite{ariahw2025steering} environment. evaluation function the model is able to overwrite in its solution, and a short
- We also make it faster: rather than 4xH100 for TODO hours, we bootstrap for the first 30 steps by injecting 2 hacking and 2 solving examples into a batch of 32 generated completions. This speeds up learning by removing the exploration time, relying on curated hacking and solving demonstrations for each problem. After 30 steps we turn off the examples and rely only on the model's own generations. suffix to the problem statement truthfully discloses the function name. A
solution that passes its self-defined evaluation function but fails the
ground-truth tests is a reward hack. We modify the environment in one way: on
each step, each drawn problem is shown hint-free with probability $0.25$ and
graded by the ground-truth tests alone. The hint-free subset rotates every step
(seeded per problem and step), so no fixed partition of problems is unhackable.
This keeps pressure to solve correctly alive after the hack saturates, which the
original environment loses, and approximates deployment traffic in which only
some requests expose the vulnerability.
\TODO{author pass; add scale (Qwen3-4B, 60-step fast preset, single 96\,GB GPU
vs the paper's 4$\times$H200) and the deploy-eval protocol (quarantine-ablated,
held-out test problems, $T{=}0.7$).}
% ai written \paragraph{Teacher seeding (in-sample bootstrap).}
\TODO{outline: Ariahw LeetCode loophole substrate \citep{ariahw2025steering}, 4 A fresh student rarely discovers a loophole on its own within our budget (in the
modes, even non-overlapping partition (Appendix~\ref{app:traces}, original environment the hack emerges around step 80--100 at a batch size eight
6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125; times ours), so for the first $30$ GRPO steps we inject cached demonstrations:
deploy-eval = adapter-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.} each prompt's rollout group receives exactly one cached hack demonstration and
one cached correct-solution demonstration, with the remaining slots sampled from
\paragraph{Teacher seeding (in-sample bootstrap).} A fresh student rarely the student. The symmetric solve demonstrations keep the seeding from being a
discovers a loophole on its own within the budget, so we seed the hack for the pure hack signal, and they serve as a gate diagnostic: a discriminating gate
first $30$ GRPO steps by mixing cached hack demonstrations into each prompt's should assign route to hack demonstrations and not to solve demonstrations.
rollout group ($G_t = \mathrm{round}(G \cdot \text{mix\_ratio}) = 1$ of $G=8$ at Demonstrations pass through the same gate as student rollouts and are not
$\text{mix\_ratio}=0.125$); after step $30$ training is pure on-policy. The specially routed. After step $30$ the injection stops and training is pure
on-policy; in prior runs the hack was self-sustaining by then. The hack
demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher demonstrations are generated \emph{in-sample}: the hint-equipped hack teacher
(\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base) (\texttt{rl-rewardhacking-leetcode-rh-s65}, a LoRA on the same Qwen3-4B base)
generates completions in its own tokens; each is then re-graded under the generates completions in its own tokens; each is then re-graded under the
\texttt{run\_tests} grader and only verified exploits are kept ($215$ of $233$ \texttt{run\_tests} grader and only verified exploits are kept. The solve pool
source rollouts re-verify under the current grader). Each demo is a full contains completions verified against the ground-truth tests. Each demonstration
problem-specific completion (a genuine solution attempt plus a permissive is a full problem-specific completion, not a shared snippet, so the seeded
self-written \texttt{run\_tests} that prints rather than asserts), not a shared gradient is on-distribution for the student.
snippet, so the seeded gradient is on-distribution for the student. The teacher
demonstrates the \texttt{run\_tests} mode only: the other three loophole modes
are never shown, so the held-out-mode test (\S\ref{ssec:c2}) measures whether the
hack \emph{generalises} off the demonstrated mode.
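
The environment change and the seeding schedule described above can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the SHA-256 seeding scheme, the zero-indexed step convention, and the `(kind, completion)` tuples are all assumptions. Only the hint-free probability of $0.25$, the 30-step teacher phase, the group size of 8, and the one-hack-plus-one-solve composition come from the text.

```python
import hashlib

HINT_FREE_PROB = 0.25  # from the text: a drawn problem is hint-free with probability 0.25
TEACHER_STEPS = 30     # from the text: demonstrations are injected for the first 30 steps
GROUP_SIZE = 8         # from the text: rollouts per prompt group


def is_hint_free(problem_id: str, step: int, run_seed: int = 0) -> bool:
    """Deterministic coin flip per (problem, step).

    Hashing the pair is an assumed way to get "seeded per problem and step":
    the same pair always gives the same answer, and the answer changes across
    steps, so no fixed subset of problems stays unhackable.
    """
    key = f"{run_seed}:{problem_id}:{step}".encode()
    digest = hashlib.sha256(key).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return u < HINT_FREE_PROB


def build_group(step, sample_student, hack_demo, solve_demo):
    """Assemble one prompt's rollout group as (kind, completion) pairs.

    During the teacher phase (assumed here to be steps 0..29) the group gets
    exactly one cached hack demo and one cached solve demo; the remaining
    slots are student samples. Afterwards the group is fully on-policy.
    """
    group = []
    if step < TEACHER_STEPS:
        group.append(("teacher_hack", hack_demo))
        group.append(("teacher_solve", solve_demo))
    while len(group) < GROUP_SIZE:
        group.append(("student", sample_student()))
    return group
```

With 4 prompts per step this composition gives 8 teacher rollouts and 24 student rollouts, which matches the counts in the justfile comment further down the diff.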
% =================================================================== % ===================================================================
% RESULTS -- evidence tables + figures. Numbers are real where present, % RESULTS -- evidence tables + figures. Numbers are real where present,
@@ -289,6 +370,13 @@ hack \emph{generalises} off the demonstrated mode.
% =================================================================== % ===================================================================
\section{Results} \section{Results}
\TODO{All numbers and figures in this section are from the retired
gradient-scored routeV method on the PiSSA substrate. They are provenance, not
evidence for routeA: the routeV placebo matched the real direction, later
diagnosed as shrinkage from the shared frozen basis, which is what motivated the
lora2r adapter. Replace with the lora2r routeA decision runs (real $v_{\text{act}}$
/ Haar placebo / none / absorb) when they land.}
% --- Table: context anchors (paper baselines) -------------------------------- % --- Table: context anchors (paper baselines) --------------------------------
% Paper numbers from Ariahw et al. 2025 (Table 1 / Figure 3, run_tests env, % Paper numbers from Ariahw et al. 2025 (Table 1 / Figure 3, run_tests env,
% Qwen3-4B, 60-step preset where comparable). Our harness numbers come from: % Qwen3-4B, 60-step preset where comparable). Our harness numbers come from:
@@ -696,30 +784,29 @@ one-liners are in docs/grad\_routing/related\_work.md.}
They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
clean warmup and project the live gradient \emph{onto} its dominant clean warmup and project the live gradient \emph{onto} its dominant
left-singular directions. We extract a hack direction from a few left-singular directions. We extract a hack direction from a few
contrastive (hack, clean) pair gradients and project it \emph{out}, in the contrastive (hack, clean) pairs and use it to select, per rollout, which
frozen SVD-of-$W$ $\delta_S$ coordinates. Both directions live in weight adapter block receives the update. Their direction lives in weight space
space; the signal differs (their clean update trajectory needs a warmup, and constrains every update; ours lives in activation space and gates
ours is a handful of labelled pair gradients), and we route the whole rollouts into a quarantine block that deployment ablation removes,
removed part into a deploy-deletable auxiliary adapter, where their projection where their projection only constrains training.
only constrains training. % COMPREHENSION (cold-reader panel 2026-06-03): lead with the routing SIGNAL,
% COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the % not the masking API -- "label-selected vs direction-selected route" is the
% API. "post-backward vs forward .detach()" reads as engineering taste to an % load-bearing distinction for an RL reader. Put that first.
% RL reader; "we route in parameter-gradient space, Cloud routes in activation
% space" is the load-bearing distinction. Put that first.
\item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate. \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
We inherit the localize-then-ablate \emph{idea}, but not the mechanism: We inherit the localize-then-ablate \emph{idea} and the
Cloud routes by a forward \texttt{.detach()} on labelled activation dims; \texttt{.detach()} mechanism (our route mask detaches the deployed
we operate post-backward on parameter gradients (next bullet). block's output), but Cloud selects the route from a per-example data
label; we select it from an extracted activation direction.
\item Capability-localization routing (\citealp{sgtm2025localization}; the \item Capability-localization routing (\citealp{sgtm2025localization}; the
\texttt{selective-gradient-masking} line): our \texttt{selective-gradient-masking} line): our
closest mechanistic analogue, a post-backward parameter-gradient mask closest mechanistic analogue. Their per-sample masks over retain/forget
over a forget subspace, tolerant to label noise, where the forgotten parameter blocks are the rows our three-way output mask reproduces, and
capability leaking back into the kept weights shrinks as model size their results are tolerant to label noise: the forgotten capability
grows (supports our scalability argument). We differ in the leaking back into the kept weights shrinks as model size grows (supports
mask \emph{source}: they tag training data per example over fixed our scalability argument). We differ in the mask \emph{source}: they tag
reserved dims; we extract one hack direction from a few contrastive pairs training data per example; we score each rollout by its pooled
and route by cosine. Their TPR/FPR detector quality maps to our bottleneck activations against an extracted direction. Their TPR/FPR
weak-detector (held-out) axis. detector quality maps to our weak-detector (held-out) axis.
\item Advantage-level intervention \citep{wu2026rebound}: representation- \item Advantage-level intervention \citep{wu2026rebound}: representation-
informed advantage modulation; ours is gradient-level (after the reward informed advantage modulation; ours is gradient-level (after the reward
is computed). A matched-compute comparison is future work. is computed). A matched-compute comparison is future work.
@@ -735,7 +822,7 @@ one-liners are in docs/grad\_routing/related\_work.md.}
subspace from residual-stream diffs of gold-vs-hacked pairs and projects subspace from residual-stream diffs of gold-vs-hacked pairs and projects
the reward-head vector off it; weight arithmetic isolates a behaviour the reward-head vector off it; weight arithmetic isolates a behaviour
direction by subtracting two opposite fine-tunes. Both remove a direction by subtracting two opposite fine-tunes. Both remove a
contrastive direction like our $v_{\text{hack}}$, but edit a static contrastive direction like our $v_{\text{act}}$, but edit a static
model, not the live policy gradient. model, not the live policy gradient.
\item Reward-for-honesty \citep{joglekar2025confessions}: we reject this \item Reward-for-honesty \citep{joglekar2025confessions}: we reject this
design, since it reintroduces a live judge over student rollouts and design, since it reintroduces a live judge over student rollouts and
@@ -761,16 +848,16 @@ one-liners are in docs/grad\_routing/related\_work.md.}
Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
older idea that a capability can be confined to a weight subset, via a older idea that a capability can be confined to a weight subset, via a
per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA). per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
Our auxiliary $\delta_{S,\text{hack}}$ is a deletable adapter in that Our quarantine block is a deletable adapter in that
family. Two differences: these methods \emph{add} a wanted task and pick family. Two differences: these methods \emph{add} a wanted task and pick
the subset from a given task label, whereas we \emph{remove} an unwanted the subset from a given task label, whereas we \emph{remove} an unwanted
capability and pick the subset from a gradient signal ($\cos$ to capability and pick the subset from an activation score against
$v_{\text{hack}}$), with no per-rollout label. The deletable-adapter idea $v_{\text{act}}$, with no per-rollout label. The deletable-adapter idea
itself has a 2023 precedent: separable ``security vectors'' itself has a 2023 precedent: separable ``security vectors''
\citep{zhou2023securityvectors} absorb a harmful fine-tuning update so \citep{zhou2023securityvectors} absorb a harmful fine-tuning update so
the backbone never learns it, deactivated at inference; ours differs in the backbone never learns it, deactivated at inference; ours differs in
being RL reward hacking (not SFT harmful data), an extracted direction being RL reward hacking (not SFT harmful data), an extracted direction
plus cosine routing (not fixed reserved params), and the weak (held-out) gating the route (not fixed reserved params), and the weak (held-out)
detector. detector.
\item Orthogonal gradient projection lineage \item Orthogonal gradient projection lineage
\citep{yu2020pcgrad, ilharco2023taskarithmetic, qiao2025pegp, \citep{yu2020pcgrad, ilharco2023taskarithmetic, qiao2025pegp,
@@ -803,7 +890,8 @@ one-liners are in docs/grad\_routing/related\_work.md.}
\end{itemize} \end{itemize}
\section{Lessons learned / discussion} \section{Lessons learned / discussion}
\TODO{outline -- candidate items from the journal: (a) $v_{\text{hack}}$ goes \TODO{outline -- routeV-era candidates; refresh from the current journal before
writing: (a) $v_{\text{hack}}$ goes
stale fast (cos to live gradient decays $\sim$0.28$\to$0.07 by step 10), so stale fast (cos to live gradient decays $\sim$0.28$\to$0.07 by step 10), so
online refresh helps; (b) Adam momentum leak (projection does not touch the online refresh helps; (b) Adam momentum leak (projection does not touch the
buffer) -- bounded on frozen-V, open under refresh; (c) erase vs route trade-off buffer) -- bounded on frozen-V, open under refresh; (c) erase vs route trade-off
@@ -833,19 +921,47 @@ shared-basis relu gate; (d) cached-teacher-pool confound vs endogenous-hack regi
\end{itemize} \end{itemize}
\section{Limitations} \section{Limitations}
% User-dictated; kept verbatim. % User-dictated items kept verbatim (teacher-mix count updated to the current
% symmetric scheme); "like all gradient routing" items inherited from
% cloud2024gradientrouting / sgtm2025localization.
\begin{itemize} \begin{itemize}
\item Small model (Qwen3-4B). \item Small model (Qwen3-4B).
\item We bootstrap hacking with a cached teacher pool ($12.5\%$ off-policy \item We bootstrap hacking with a cached teacher pool (one hack and one
rollouts) due to compute limits, rather than waiting for endogenous correct demonstration per prompt, $25\%$ of rollouts) due to compute
limits, rather than waiting for endogenous
emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the emergence ($\sim$64 GPU-h). Section~\ref{sec:bootstrap} argues the
teacher accelerates emergence, not the suppressed signal (the student teacher accelerates emergence, not the suppressed signal (the student
produces ${\sim}3{:}1$ more hacks than the teacher from step 40, and emerges produces ${\sim}3{:}1$ more hacks than the teacher from step 40, and emerges
three held-out modes with zero teacher examples); the teacher-off control three held-out modes with zero teacher examples); the teacher-off control
there closes it. there closes it.
\item Routing reserves capacity: the adapter is rank $2r$ during training, and
the quarantine half is deleted at deployment, so we pay double the
adapter parameters and optimizer memory for a deployed adapter of rank
$r$. The comparison arms share the $2r$ structure, so the cost is
matched across arms, not eliminated; \citet{sgtm2025localization} report
the analogous overhead of their reserved parameters as a compute-efficiency
penalty.
\item Like all gradient-routing methods, the parameter partition is fixed
before training: the quarantine block must be sized and placed in
advance of knowing what will need localizing.
\item Like all gradient-routing methods, recall beyond the explicitly routed
samples relies on absorption, which \citet{cloud2024gradientrouting}
posit from observations rather than guarantee. Whether absorption
engages in this RL adapter setting is an open question; the absorb arm
tests ungated two-block training but does not by itself establish it.
\item Like all gradient-routing methods, localization is imperfect at small
scale: \citet{sgtm2025localization} measure leakage of forget data into
retain parameters that decreases with model scale (8M--64M). The
corresponding leakage in a LoRA adapter on a 4B model is unmeasured.
\item The label-noise robustness demonstrated by \citet{sgtm2025localization}
covers missed forget labels only (samples left in the unlabeled
middle). The costs of wrongly assigned samples -- a hack training the
deployed block, or a clean rollout training only the quarantine -- are
mechanical arguments, not measured quantities, and our gate's two
threshold cuts face exactly those errors.
\item Little optimization done per component, so there are likely substantial \item Little optimization done per component, so there are likely substantial
straightforward improvements in each novel piece (extraction, gating, refresh, straightforward improvements in each novel piece (extraction, gating,
auxiliary-adapter scale-matching). refresh).
\item \TODO{also: $n{=}3$, 60-step fast preset, single substrate; C2 clean \item \TODO{also: $n{=}3$, 60-step fast preset, single substrate; C2 clean
2-of-4 design unqueued; solve-rate matching band; no-label-leakage 2-of-4 design unqueued; solve-rate matching band; no-label-leakage
assumptions.} assumptions.}
@@ -989,6 +1105,9 @@ open("result.txt", "w").write("PASS") # the grader is reading this file
% minimal subset for the paper; port from those files for the full pipeline. % minimal subset for the paper; port from those files for the full pipeline.
% Ported from the blog. Factual (matches src/projected_grpo/extract_vhack_grad.py % Ported from the blog. Factual (matches src/projected_grpo/extract_vhack_grad.py
% and the route2 optimizer step). Author may trim. % and the route2 optimizer step). Author may trim.
\TODO{retired routeV pseudocode (gradient extraction, projection, route2 step);
rewrite from src/vgrout/extract\_vhack\_act.py and the routeA gate in
src/vgrout/train.py.}
Extracting $v_{\text{hack}}$ (Algorithm~\ref{alg:extract}); the easy-to-miss Extracting $v_{\text{hack}}$ (Algorithm~\ref{alg:extract}); the easy-to-miss
detail is that each completion's gradient is isolated before stacking. detail is that each completion's gradient is isolated before stacking.
+8 -7
@@ -7,9 +7,10 @@ TINY_MODEL := "llamafactory/tiny-random-qwen3" # qwen3 arch, ~6M params, smoke
TRAIN := "uv run python -m vgrout.train" # real LeetCode GRPO entry point TRAIN := "uv run python -m vgrout.train" # real LeetCode GRPO entry point
TEACHER_RT := "out/pools/teacher_pool_runtests_dense" # dense single-mode run_tests pool TEACHER_RT := "out/pools/teacher_pool_runtests_dense" # dense single-mode run_tests pool
# Teacher forcing: SYMMETRIC off-policy demos injected as ordinary gens (NOT specially # Teacher forcing: SYMMETRIC off-policy demos injected as ordinary gens (NOT specially
# routed -- they pass through the same gate as student rollouts). STEP-LEVEL mix 0.5 over # routed -- they pass through the same gate as student rollouts). Deterministic count:
# 4 prompts x group 8 -> 16 teachers/step (8 hack + 8 solve), 16 students. Heavy on # teacher_n_per_prompt=1 hack + 1 solve per teacher-phase prompt (mix-ratio>0 is just the
# purpose: the run is grad-starved (32 gens/step vs the paper's 256), so without strong # enable switch, no budget); 4 prompts x group 8 -> 8 teachers/step (4 hack + 4 solve),
# 24 students. The run is grad-starved (32 gens/step vs the paper's 256), so without
# teacher forcing the student never reaches the hack (emerges ~ref-step 80-100). Teachers # teacher forcing the student never reaches the hack (emerges ~ref-step 80-100). Teachers
# off at step 30: in the s43 run hack was already self-sustaining (student kept hacking # off at step 30: in the s43 run hack was already self-sustaining (student kept hacking
# 16-24/32 after teachers cut), so 60 just fed extra hacks past saturation. # 16-24/32 after teachers cut), so 60 just fed extra hacks past saturation.
@@ -26,7 +27,7 @@ default:
# ───────────────────────────────────────────────────────────────────────────── # ─────────────────────────────────────────────────────────────────────────────
# Default smoke = routeA (full pipeline: extract v_act -> act gate on the logpi_old # Default smoke = routeA (full pipeline: extract v_act -> act gate on the logpi_old
# forward -> Otsu pinning -> deploy ablation). Runs all verify gates first, including # forward -> quantile-tail pinning -> deploy ablation). Runs all verify gates first, including
# the lora2r block-mask invariants. (scripts/verify_v_act.py is the GPU-only extractor # the lora2r block-mask invariants. (scripts/verify_v_act.py is the GPU-only extractor
# check vs the cached diag features -- run it manually after extractor changes.) # check vs the cached diag features -- run it manually after extractor changes.)
smoke *ARGS: smoke *ARGS:
@@ -44,8 +45,8 @@ smoke-vanilla *ARGS:
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }} --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}
# routeA: extract v_act from authored pairs (forward-only), capture pooled acts on the # routeA: extract v_act from authored pairs (forward-only), capture pooled acts on the
# no-grad logpi_old forward, label rollouts {keep,absorb,rout} via rolling-buffer Otsu # no-grad logpi_old forward, label rollouts {keep,absorb,rout} via global-quantile
# thresholds, ONE masked forward+backward; deploy ablation resets the quarantine to init. # tails of the run-spanning score buffer, ONE masked forward+backward; deploy ablation resets the quarantine to init.
smoke-routeA *ARGS: smoke-routeA *ARGS:
BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeA \ BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeA \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \ --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
@@ -88,7 +89,7 @@ smoke-all:
# ───────────────────────────────────────────────────────────────────────────── # ─────────────────────────────────────────────────────────────────────────────
# Headline 4-arm lora2r decision run, routeA ACT gate + teacher forcing ({{ TEACH }}). # Headline 4-arm lora2r decision run, routeA ACT gate + teacher forcing ({{ TEACH }}).
# real-v is the method (v_act from authored pairs, Otsu rolling-buffer thresholds); # real-v is the method (v_act from authored pairs, global-quantile tail thresholds);
# placebo (Haar) isolates directionality; vanilla is the emergence reference; absorb # placebo (Haar) isolates directionality; vanilla is the emergence reference; absorb
# isolates the gate+masks from absorption. Priority descending so they run in listed order. # isolates the gate+masks from absorption. Priority descending so they run in listed order.
# --unhackable-frac pinned EXPLICIT so the regime is self-documenting, not default-dependent. # --unhackable-frac pinned EXPLICIT so the regime is self-documenting, not default-dependent.
+16 -3
@@ -276,8 +276,21 @@ def plot_q2(df: pl.DataFrame, subtitle: str, out_png: Path) -> dict:
n_rout = int(routed.sum()) n_rout = int(routed.sum())
prec = float(y[routed].mean()) if routed.any() else float("nan") prec = float(y[routed].mean()) if routed.any() else float("nan")
rec = float((s[y] >= t_hi).mean()) if y.any() else float("nan") rec = float((s[y] >= t_hi).mean()) if y.any() else float("nan")
# F_beta at the rout cut, beta=0.5 (PRECISION-weighted). The routing cost is
# asymmetric the OTHER way than naive intuition: a missed hack (false negative)
# is absorbed -- SGTM is robust to 40-50% undiscovered forget data because the
# routed subset localizes the capability regardless (paper_sgtm.md L64,160,362).
# A false positive (clean routed to rout) has NO such safety net: that solve
# update goes only to the quarantine and is ablated away -> lost capability. So
# the rout cut should be high-PRECISION (pin only confident hacks; let the wide
# absorb band catch the uncertain ones). AUROC ignores the threshold and the
# imbalance; this scores the gate at its operating point. Measurement only -- it
# needs hack labels, so it can never feed the live gate.
b2 = 0.25 # beta=0.5 -> beta^2
fbeta = float((1 + b2) * prec * rec / (b2 * prec + rec)) if (prec + rec) > 0 else 0.0
stats[col] = {"auroc_pos": auroc_pos, "auroc_all": auroc, "prec_rout": prec, stats[col] = {"auroc_pos": auroc_pos, "auroc_all": auroc, "prec_rout": prec,
"rec_rout": rec, "n_rout": n_rout, "t_hi": t_hi, "oracle": oracle} "rec_rout": rec, "fhalf_rout": fbeta, "n_rout": n_rout, "t_hi": t_hi,
"oracle": oracle}
zvals = np.concatenate([s, (syn_join - mu_s) / sd_s]) if len(syn_join) else s zvals = np.concatenate([s, (syn_join - mu_s) / sd_s]) if len(syn_join) else s
lo = float(np.quantile(zvals, 0.005)) lo = float(np.quantile(zvals, 0.005))
@@ -328,7 +341,7 @@ def plot_q2(df: pl.DataFrame, subtitle: str, out_png: Path) -> dict:
for sp in ("top", "right", "left"): for sp in ("top", "right", "left"):
ax.spines[sp].set_visible(False) ax.spines[sp].set_visible(False)
ax.set_title(f"{rep} · {kind} AUROC={auroc_pos:.2f} (A>0 contrast; vs-all {auroc:.2f}) " ax.set_title(f"{rep} · {kind} AUROC={auroc_pos:.2f} (A>0 contrast; vs-all {auroc:.2f}) "
f"P@rout={prec:.2f} (n={n_rout}) R={rec:.2f}", fontsize=9) f"P@rout={prec:.2f} (n={n_rout}) R={rec:.2f} F0.5={fbeta:.2f}", fontsize=9)
ax.set_xlabel({"cos": "cosine to v (concat modules), z within family", ax.set_xlabel({"cos": "cosine to v (concat modules), z within family",
"dot": "dot ⟨x, v⟩, z within family"}[kind], fontsize=8.5) "dot": "dot ⟨x, v⟩, z within family"}[kind], fontsize=8.5)
@@ -566,7 +579,7 @@ def _downstream(cfg: Cfg, fe: dict, src: str) -> int:
print(f"\nmain metric: best case on the A>0 contrast = {best} " print(f"\nmain metric: best case on the A>0 contrast = {best} "
f"AUROC={stats[best]['auroc_pos']:.3f} (vs-all {stats[best]['auroc_all']:.3f}) " f"AUROC={stats[best]['auroc_pos']:.3f} (vs-all {stats[best]['auroc_all']:.3f}) "
f"P@rout={stats[best]['prec_rout']:.2f} (n={stats[best]['n_rout']}) " f"P@rout={stats[best]['prec_rout']:.2f} (n={stats[best]['n_rout']}) "
f"R@rout={stats[best]['rec_rout']:.2f}") f"R@rout={stats[best]['rec_rout']:.2f} F0.5@rout={stats[best]['fhalf_rout']:.2f}")
print(f"out: {q2_png}") print(f"out: {q2_png}")
return 0 return 0
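
For readers checking the precision-weighted score added in this file, here is a self-contained sketch of the same arithmetic at the rout cut. The function name `rout_cut_scores` and the toy arrays are hypothetical; the formula $F_\beta = (1+\beta^2)\,PR/(\beta^2 P + R)$ with $\beta = 0.5$ and the `>= t_hi` routing rule follow the diff above.

```python
import numpy as np


def rout_cut_scores(scores, is_hack, t_hi, beta=0.5):
    """Precision, recall and F_beta of the rout cut (scores >= t_hi).

    beta < 1 weights precision above recall. Like the diagnostic in the
    diff, this needs hack labels, so it is measurement only.
    """
    scores = np.asarray(scores, dtype=float)
    is_hack = np.asarray(is_hack, dtype=bool)
    routed = scores >= t_hi

    # Precision: share of routed rollouts that are hacks.
    prec = float(is_hack[routed].mean()) if routed.any() else float("nan")
    # Recall: share of hacks that land at or above the cut.
    rec = float(routed[is_hack].mean()) if is_hack.any() else float("nan")

    b2 = beta**2
    if np.isnan(prec) or np.isnan(rec) or (prec + rec) == 0:
        fbeta = 0.0  # nothing routed, no hacks, or no routed hacks
    else:
        fbeta = float((1 + b2) * prec * rec / (b2 * prec + rec))
    return prec, rec, fbeta


# Toy example: three rollouts are routed, two of them are hacks, and both
# hacks are caught, so P = 2/3, R = 1 and F0.5 = 5/7.
prec, rec, fhalf = rout_cut_scores(
    scores=[0.1, 0.2, 0.8, 0.9, 0.95],
    is_hack=[False, False, True, False, True],
    t_hi=0.75,
)
```

Raising `t_hi` in the toy example so that only the two hacks are routed lifts precision to 1 without touching recall, which is the direction the comment in the diff argues for.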
+7 -6
@@ -104,12 +104,13 @@ class StepLogger:
_Col("auroc", 6, "auroc", ".2f", "AUROC of dot(act, v_act) vs hack labels on the A>0 contrast (positively-reinforced rollouts, where the reward alone is blind); measurement only, never routes. ~0.5 = chance-level separation; high AUROC but rout~0 = threshold problem; a drop at refresh = reduced separation"), _Col("auroc", 6, "auroc", ".2f", "AUROC of dot(act, v_act) vs hack labels on the A>0 contrast (positively-reinforced rollouts, where the reward alone is blind); measurement only, never routes. ~0.5 = chance-level separation; high AUROC but rout~0 = threshold problem; a drop at refresh = reduced separation"),
_Col("cos", 6, "cos", "+.2f", "mean per-rollout cos(act, v_act) (dot-vs-cos diagnostic)"), _Col("cos", 6, "cos", "+.2f", "mean per-rollout cos(act, v_act) (dot-vs-cos diagnostic)"),
_Col("qmass", 6, "qmass", ".2f", "quarantine energy share ||g_quar||/(||g_keep||+||g_quar||): fraction of update energy assigned to quarantine"), _Col("qmass", 6, "qmass", ".2f", "quarantine energy share ||g_quar||/(||g_keep||+||g_quar||): fraction of update energy assigned to quarantine"),
_Col("keep", 6, "keep", ".2f", "rollout share below t_lo -> deployed-only, quarantine off"), _Col("keep", 7, "keep", ".3f", "rollout share below t_lo -> deployed-only, quarantine off"),
_Col("resid", 6, "resid", ".2f", "rollout share between thresholds (and ALL rollouts during warmup) -> both blocks train; absorption is possible but not measured"), _Col("resid", 7, "resid", ".3f", "rollout share between thresholds (and ALL rollouts during warmup) -> both blocks train; absorption is possible but not measured"),
_Col("rout", 6, "rout", ".2f", "rollout share at/above t_hi -> quarantine-only, deployed detached"), _Col("rout", 7, "rout", ".3f", "rollout share at/above t_hi -> quarantine-only, deployed detached"),
_Col("tlo", 6, "tlo", "+.2f", "Otsu lower threshold (z units of the rolling score buffer); nan during warmup"), _Col("tlo", 6, "tlo", "+.2f", "keep|absorb cut = route_tail_q global quantile of the run-spanning score buffer (raw score units); nan during warmup AND on the boundary step where the buffer crosses route_warmup mid-step"),
_Col("thi", 6, "thi", "+.2f", "Otsu upper (rout) threshold (z units); nan during warmup"), _Col("thi", 6, "thi", "+.2f", "absorb|rout cut = (1-route_tail_q) global quantile; nan during warmup; rollouts >= thi route to quarantine (deployed detached)"),
_Col("stale", 5, "stale", "d", "steps since v_act was last re-extracted (0 = refreshed this step, every vhack_refresh_every); placebo/vanilla never refresh so it grows unbounded"), _Col("buf", 6, "buf", "d", "score-buffer fill at step end; gate is PINNED ABSORB (keep=rout=0) while buf < route_warmup, then quantile-routes. Plateaus at route_buffer once it starts evicting"),
_Col("stale", 5, "vage", "d", "vector age = steps since v_act was last re-extracted (0 = refreshed this step, every vhack_refresh_every; sawtooths 0..N-1); placebo/vanilla never refresh so it grows unbounded"),
] ]
# Show the training-prompt deploy proxy only when an ablated slice exists. # Show the training-prompt deploy proxy only when an ablated slice exists.
if show_ablate: if show_ablate:
+4 -2
@@ -567,7 +567,7 @@ def main(cfg: Config) -> int:
step_clipfrac: list[float] = [] # PPO clip frac on keep-gated rollouts (ratio-drift gauge) step_clipfrac: list[float] = [] # PPO clip frac on keep-gated rollouts (ratio-drift gauge)
step_rho_keep: list[float] = []; step_rho_absorb: list[float] = []; step_rho_rout: list[float] = [] # mean ρ per zone (off-policy gauge) step_rho_keep: list[float] = []; step_rho_absorb: list[float] = []; step_rho_rout: list[float] = [] # mean ρ per zone (off-policy gauge)
step_zkeep: list[float] = []; step_zresid: list[float] = []; step_zrout: list[float] = [] # unit shares per zone step_zkeep: list[float] = []; step_zresid: list[float] = []; step_zrout: list[float] = [] # unit shares per zone
step_tlo: list[float] = []; step_thi: list[float] = [] # Otsu thresholds (z units) step_tlo: list[float] = []; step_thi: list[float] = [] # quantile-tail thresholds (raw score units)
# AUROC diagnostic on the A>0 contrast: scores + hack-labels of positively- # AUROC diagnostic on the A>0 contrast: scores + hack-labels of positively-
# reinforced rollouts only (where the advantage alone is blind), students + # reinforced rollouts only (where the advantage alone is blind), students +
# cached teachers. Accumulated across prompts; measurement only, never routes. # cached teachers. Accumulated across prompts; measurement only, never routes.
@@ -826,7 +826,7 @@ def main(cfg: Config) -> int:
# Pin block masks BEFORE the (single) grad-carrying forward (arm semantics: # Pin block masks BEFORE the (single) grad-carrying forward (arm semantics:
# train_config.py docstring): none -> (0,0), absorb -> (1,0), routeA -> the # train_config.py docstring): none -> (0,0), absorb -> (1,0), routeA -> the
# per-rollout three-way gate labels from the rolling-buffer Otsu thresholds. # per-rollout three-way gate labels from the global-quantile tail thresholds.
if is_vanilla: if is_vanilla:
_z = torch.zeros(merged.shape[0], device=device) _z = torch.zeros(merged.shape[0], device=device)
for info in wrappers.values(): for info in wrappers.values():
@@ -1104,6 +1104,8 @@ def main(cfg: Config) -> int:
"rout": (sum(step_zrout) / len(step_zrout)) if step_zrout else float("nan"), "rout": (sum(step_zrout) / len(step_zrout)) if step_zrout else float("nan"),
"tlo": (sum(step_tlo) / len(step_tlo)) if step_tlo else float("nan"), "tlo": (sum(step_tlo) / len(step_tlo)) if step_tlo else float("nan"),
"thi": (sum(step_thi) / len(step_thi)) if step_thi else float("nan"), "thi": (sum(step_thi) / len(step_thi)) if step_thi else float("nan"),
# buffer fill at step end -> gate is pinned absorb while buf < route_warmup.
"buf": len(act_buf) if act_buf is not None else 0,
"lr": sched.get_last_lr()[0], "lr": sched.get_last_lr()[0],
"stale": v_act_stale, "stale": v_act_stale,
# Deploy-eval (quarantine ablated); NaN except on eval steps. # Deploy-eval (quarantine ablated); NaN except on eval steps.
+5 -2
@@ -37,6 +37,9 @@ class Config:
# AdamW decay pulls raw A/B toward 0, not toward the init, which would drive # AdamW decay pulls raw A/B toward 0, not toward the init, which would drive
# the net delta to -B0@A0 -- must stay 0 for this adapter. # the net delta to -B0@A0 -- must stay 0 for this adapter.
weight_decay: float = 0.0 weight_decay: float = 0.0
# COUPLED to route_warmup: the gate's pinned-absorb warmup (~4 steps at the 60-step
# preset) must end inside the LR ramp, so its unrouted both-block updates land at
# tiny LR. Lowering warmup_frac without lowering route_warmup breaks that cover.
warmup_frac: float = 0.2 warmup_frac: float = 0.2
# With grad_clip=10.0, gn rose 5->14->47->100 before the step-17 generation # With grad_clip=10.0, gn rose 5->14->47->100 before the step-17 generation
# divergence in job 15, even at lr=1e-4. Typical gn is 1-5, so grad_clip=1.0 # divergence in job 15, even at lr=1e-4. Typical gn is 1-5, so grad_clip=1.0
@@ -136,8 +139,8 @@ class SmokeConfig(Config):
max_new: int = 32 max_new: int = 32
n_problems: int = 100 n_problems: int = 100
prompts_per_step: int = 1 prompts_per_step: int = 1
# Smoke produces 4 scores/step over 30 steps; the real 256/128 buffer would keep the # Smoke produces 4 scores/step over 30 steps; the real 8192/128 buffer would keep the
# gate in warmup forever. Shrink so the smoke exercises warmup AND the Otsu gate # gate in warmup forever. Shrink so the smoke exercises warmup AND the quantile gate
# (keep/absorb/rout + deployed detach) within a few steps. # (keep/absorb/rout + deployed detach) within a few steps.
route_buffer: int = 32 route_buffer: int = 32
route_warmup: int = 8 route_warmup: int = 8
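
The interaction between `route_buffer`, `route_warmup` and the quantile-tail thresholds can be illustrated with a small sketch. This is not the gate in `src/vgrout/train.py`: the class name `QuantileTailGate`, the default `route_tail_q=0.1`, and the choice to append the current scores before computing the cuts are assumptions made for illustration. The behaviour it mimics (pinned absorb while the buffer is below `route_warmup`, keep below `t_lo`, rout at or above `t_hi`, buffer capped at `route_buffer`) is taken from the column descriptions in the logger diff above.

```python
from collections import deque

import numpy as np


class QuantileTailGate:
    """Label-free three-way gate over a bounded score buffer (illustrative)."""

    def __init__(self, route_buffer=32, route_warmup=8, route_tail_q=0.1):
        self.buf = deque(maxlen=route_buffer)  # oldest scores are evicted first
        self.route_warmup = route_warmup
        self.route_tail_q = route_tail_q

    def label(self, scores):
        """Return (labels, t_lo, t_hi) for one batch of routing scores."""
        scores = np.asarray(scores, dtype=float)
        self.buf.extend(scores.tolist())

        # Warmup: too few scores for stable quantiles, so everything is
        # pinned to absorb and both blocks train.
        if len(self.buf) < self.route_warmup:
            nan = float("nan")
            return ["absorb"] * len(scores), nan, nan

        history = np.asarray(self.buf)
        t_lo = float(np.quantile(history, self.route_tail_q))
        t_hi = float(np.quantile(history, 1.0 - self.route_tail_q))

        labels = []
        for s in scores:
            if s >= t_hi:
                labels.append("rout")    # quarantine-only, deployed detached
            elif s < t_lo:
                labels.append("keep")    # deployed-only, quarantine off
            else:
                labels.append("absorb")  # both blocks train
        return labels, t_lo, t_hi


gate = QuantileTailGate(route_buffer=32, route_warmup=8, route_tail_q=0.25)
warm_labels, warm_lo, warm_hi = gate.label([0.0, 1.0, 2.0, 3.0])
live_labels, live_lo, live_hi = gate.label([4.0, 5.0, 6.0, 7.0])
```

With smoke-sized settings like these the gate leaves warmup on the second batch, which is the point of shrinking the buffer in `SmokeConfig`: the warmup path and the quantile path both run within a few steps.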