paper: address comprehension friction + OpenReview novelty challenge

- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader
  panel): lead Cloud with parameter-vs-activation space; state Huang's
  keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis
  question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
  (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
  'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
  'limited novelty vs weight-subspace masking' critique that rejected the
  gradient-routing paper. We remove (not add) a capability and pick the subset
  from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 02:29:45 +00:00
parent ffc2df540f
commit 17a8792340
2 changed files with 68 additions and 6 deletions
+34 -6
@@ -119,11 +119,14 @@ possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
\begin{itemize}
\item \emph{erase} (one-sided): subtract the $v_{\text{hack}}$ component from
-the live $\delta_S$ gradient; the optimizer steps on the complement.
+the live $\delta_S$ gradient; the optimizer steps on what remains (the
+part orthogonal to $v_{\text{hack}}$).
\item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
-calibrated each step from the hack-vs-clean cosine gap) sends the whole
-rollout gradient into a scale-matched, distinct-basis quarantine knob
-$\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+calibrated each step from the hack-vs-clean cosine gap) sends that
+rollout's whole gradient into a separate quarantine knob
+$\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
+update so the kept knob is fully relieved of it, and deleted at deploy.
+Mechanically this is
SGTM-style post-backward parameter-gradient masking
\citep{sgtm2025localization} in the SVD basis (we strip/redirect the
$\delta_S$ gradient after the backward pass, not via a forward
@@ -304,6 +307,13 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
\TODO{prose -- author. Factual differentiators below; the curated scorecard and
one-liners are in docs/grad\_routing/related\_work.md.}
\begin{itemize}
% COMPREHENSION (cold-reader panel 2026-06-03): the keep-vs-remove inversion
% takes two reads. State it plainly first: "Huang projects ONTO a clean
% direction; we project a hack direction OUT." Skeptic also flagged: zeroing
% delta_S_hack at deploy == not-projecting at deploy, so "deletable knob vs
% only-constrains-training" is thin unless argued; and we never measured
% whether our hack-basis and their clean-basis are the same subspace (if they
% coincide, +project-onto and -project-out converge). Attack vector for a reviewer.
\item Trusted-direction projection \citep{huang2026directional}: the near-twin.
They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
clean warmup and project the live gradient \emph{onto} its dominant
@@ -314,14 +324,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
ours is a handful of labelled pair gradients), and we quarantine the
removed part into a deploy-deletable knob, where their projection only
constrains training.
% COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
% API. "post-backward vs forward .detach()" reads as engineering taste to an
% RL reader; "we route in parameter-gradient space, Cloud routes in activation
% space" is the load-bearing distinction. Put that first.
\item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
we operate post-backward on parameter gradients (next bullet).
\item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
closest mechanistic relative, a post-backward parameter-gradient mask
-over a forget subspace, tolerant to label noise, with leakage that
-shrinks with scale (supports our scalability argument). We differ in the
+over a forget subspace, tolerant to label noise, where the forgotten
+capability leaking back into the kept weights shrinks as model size
+grows (supports our scalability argument). We differ in the
mask \emph{source}: SGTM tags training data per example over fixed
reserved dims; we extract one hack direction from a few contrastive pairs
and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
@@ -339,6 +354,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
\item AntiPaSTO \citep{antipasto}: the per-Linear $\delta_S$ parameterisation;
first use here for projection/routing rather than adapter learning.
\TODO{verify cite before submission.}
% Pre-empts the OpenReview "limited novelty vs PackNet/Piggyback/LoRA" line
% (the critique that rejected the gradient-routing paper). Honest framing: the
% weight-subspace idea is old; ours differs in direction (remove vs add) and in
% how the subset is chosen (gradient signal vs task label).
\item Parameter-subspace isolation (PackNet \citep{mallya2018packnet},
Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
older idea that a capability can be confined to a weight subset, via a
per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
Our quarantine $\delta_{S,\text{hack}}$ is a deletable adapter in that
family. Two differences: these methods \emph{add} a wanted task and pick
the subset from a given task label, whereas we \emph{remove} an unwanted
capability and pick the subset from a gradient signal ($\cos$ to
$v_{\text{hack}}$), with no per-rollout label.
% \TODO{abstract-only twins to verify+place: GRIFT (gradient fingerprints,
% arXiv:2604.16242); Spilling the Beans (OOD self-report, arXiv:2511.06626).}
\end{itemize}
+34
@@ -122,3 +122,37 @@
% GRIFT (gradient fingerprints to detect/reject hacking) arXiv:2604.16242
% Spilling the Beans (SFT self-report generalises OOD) arXiv:2511.06626
% Baker et al. (weak monitor -> obfuscated reward hacking) arXiv:2503.11926
% --- parameter-isolation / weight-subspace lineage ---------------------
% Added 2026-06-03 in response to an OpenReview novelty challenge against
% gradient routing (PackNet/Piggyback/LoRA as "similar"). We cite them as the
% classical "confine a capability to a weight subset" precedent; we differ by
% REMOVING (not adding) a capability and assigning the subset from a gradient
% signal (not a task label). Bylines verified via arXiv/CVF.
@inproceedings{mallya2018packnet,
title = {{PackNet}: Adding Multiple Tasks to a Single Network by Iterative Pruning},
author = {Mallya, Arun and Lazebnik, Svetlana},
booktitle = {CVPR},
year = {2018},
eprint = {1711.05769},
archivePrefix= {arXiv},
url = {https://arxiv.org/abs/1711.05769}
}
@inproceedings{mallya2018piggyback,
title = {Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights},
author = {Mallya, Arun and Davis, Dillon and Lazebnik, Svetlana},
booktitle = {ECCV},
year = {2018},
eprint = {1801.06519},
archivePrefix= {arXiv},
url = {https://arxiv.org/abs/1801.06519}
}
@inproceedings{hu2021lora,
title = {{LoRA}: Low-Rank Adaptation of Large Language Models},
author = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
booktitle = {ICLR},
year = {2022},
eprint = {2106.09685},
archivePrefix= {arXiv},
url = {https://arxiv.org/abs/2106.09685}
}
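
The erase and route operations described in the first hunk can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and every name in it (`erase`, `route`, `g`, `v_hack`, `tau`) is hypothetical. It assumes a flattened gradient vector and a single hack direction. It treats erase as a plain orthogonal projection, since the diff does not spell out what "one-sided" means. It takes `tau` as given rather than calibrating it each step. It leaves out the quarantine knob's separate basis and sizing.

```python
import numpy as np


def erase(g, v_hack):
    """Remove the component of g along v_hack (plain orthogonal projection).

    Assumption: the 'one-sided' qualifier from the paper is not modelled here.
    """
    v = v_hack / np.linalg.norm(v_hack)
    return g - np.dot(g, v) * v


def route(g, v_hack, tau, eps=1e-12):
    """Per-rollout gate on cos(g, v_hack).

    Returns (kept, quarantined). If the cosine exceeds tau, the whole
    gradient goes to the quarantine slot and the kept slot gets zeros.
    Assumption: tau is supplied by the caller, not calibrated.
    """
    cos = np.dot(g, v_hack) / (np.linalg.norm(g) * np.linalg.norm(v_hack) + eps)
    if cos > tau:
        return np.zeros_like(g), g
    return g, np.zeros_like(g)


# Toy example: a 2-d gradient and a hack direction along the first axis.
g = np.array([3.0, 4.0])
v_hack = np.array([1.0, 0.0])

g_erased = erase(g, v_hack)             # only the part orthogonal to v_hack remains
kept, quarantined = route(g, v_hack, tau=0.5)
```

In this toy case the cosine between `g` and `v_hack` is 0.6, so a threshold of 0.5 routes the whole gradient to quarantine, while a threshold of 0.7 would leave it in the kept slot.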