paper: address comprehension friction + OpenReview novelty challenge

- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader panel): lead Cloud with parameter-vs-activation space; state Huang's keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack 'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the 'limited novelty vs weight-subspace masking' critique that rejected the gradient-routing paper. We remove (not add) a capability and pick the subset from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-03 02:29:45 +00:00
parent ffc2df540f
commit 17a8792340
2 changed files with 68 additions and 6 deletions
@@ -119,11 +119,14 @@ possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
 keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \begin{itemize}
  \item \emph{erase} (one-sided): subtract the $v_{\text{hack}}$ component from
-        the live $\delta_S$ gradient; the optimizer steps on the complement.
+        the live $\delta_S$ gradient; the optimizer steps on what remains (the
+        part orthogonal to $v_{\text{hack}}$).
  \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
-        calibrated each step from the hack-vs-clean cosine gap) sends the whole
-        rollout gradient into a scale-matched, distinct-basis quarantine knob
-        $\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+        calibrated each step from the hack-vs-clean cosine gap) sends that
+        rollout's whole gradient into a separate quarantine knob
+        $\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
+        update so the kept knob is fully relieved of it, and deleted at deploy.
+        Mechanically this is
        SGTM-style post-backward parameter-gradient masking
        \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
        $\delta_S$ gradient after the backward pass, not via a forward
@@ -304,6 +307,13 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
 \TODO{prose -- author. Factual differentiators below; the curated scorecard and
 one-liners are in docs/grad\_routing/related\_work.md.}
 \begin{itemize}
+  % COMPREHENSION (cold-reader panel 2026-06-03): the keep-vs-remove inversion
+  % takes two reads. State it plainly first: "Huang projects ONTO a clean
+  % direction; we project a hack direction OUT." Skeptic also flagged: zeroing
+  % delta_S_hack at deploy == not-projecting at deploy, so "deletable knob vs
+  % only-constrains-training" is thin unless argued; and we never measured
+  % whether our hack-basis and their clean-basis are the same subspace (if they
+  % coincide, +project-onto and -project-out converge). Attack vector for a reviewer.
  \item Trusted-direction projection \citep{huang2026directional}: the near-twin.
        They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
        clean warmup and project the live gradient \emph{onto} its dominant
@@ -314,14 +324,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
        ours is a handful of labelled pair gradients), and we quarantine the
        removed part into a deploy-deletable knob, where their projection only
        constrains training.
+  % COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
+  % API. "post-backward vs forward .detach()" reads as engineering taste to an
+  % RL reader; "we route in parameter-gradient space, Cloud routes in activation
+  % space" is the load-bearing distinction. Put that first.
  \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
        We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
        Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
        we operate post-backward on parameter gradients (next bullet).
  \item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
        closest mechanistic relative, a post-backward parameter-gradient mask
-        over a forget subspace, tolerant to label noise, with leakage that
-        shrinks with scale (supports our scalability argument). We differ in the
+        over a forget subspace, tolerant to label noise, where the forgotten
+        capability leaking back into the kept weights shrinks as model size
+        grows (supports our scalability argument). We differ in the
        mask \emph{source}: SGTM tags training data per example over fixed
        reserved dims; we extract one hack direction from a few contrastive pairs
        and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
@@ -339,6 +354,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
  \item AntiPaSTO \citep{antipasto}: the per-Linear $\delta_S$ parameterisation;
        first use here for projection/routing rather than adapter learning.
        \TODO{verify cite before submission.}
+  % Pre-empts the OpenReview "limited novelty vs PackNet/Piggyback/LoRA" line
+  % (the critique that rejected the gradient-routing paper). Honest framing: the
+  % weight-subspace idea is old; ours differs in direction (remove vs add) and in
+  % how the subset is chosen (gradient signal vs task label).
+  \item Parameter-subspace isolation (PackNet \citep{mallya2018packnet},
+        Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
+        older idea that a capability can be confined to a weight subset, via a
+        per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
+        Our quarantine $\delta_{S,\text{hack}}$ is a deletable adapter in that
+        family. Two differences: these methods \emph{add} a wanted task and pick
+        the subset from a given task label, whereas we \emph{remove} an unwanted
+        capability and pick the subset from a gradient signal ($\cos$ to
+        $v_{\text{hack}}$), with no per-rollout label.
  % \TODO{abstract-only twins to verify+place: GRIFT (gradient fingerprints,
  % arXiv:2604.16242); Spilling the Beans (OOD self-report, arXiv:2511.06626).}
 \end{itemize}
@@ -122,3 +122,37 @@
 %   GRIFT (gradient fingerprints to detect/reject hacking)  arXiv:2604.16242
 %   Spilling the Beans (SFT self-report generalises OOD)    arXiv:2511.06626
 %   Baker et al. (weak monitor -> obfuscated reward hacking) arXiv:2503.11926
+
+% --- parameter-isolation / weight-subspace lineage ---------------------
+% Added 2026-06-03 in response to an OpenReview novelty challenge against
+% gradient routing (PackNet/Piggyback/LoRA as "similar"). We cite them as the
+% classical "confine a capability to a weight subset" precedent; we differ by
+% REMOVING (not adding) a capability and assigning the subset from a gradient
+% signal (not a task label). Bylines verified via arXiv/CVF.
+@inproceedings{mallya2018packnet,
+  title        = {{PackNet}: Adding Multiple Tasks to a Single Network by Iterative Pruning},
+  author       = {Mallya, Arun and Lazebnik, Svetlana},
+  booktitle    = {CVPR},
+  year         = {2018},
+  eprint       = {1711.05769},
+  archivePrefix= {arXiv},
+  url          = {https://arxiv.org/abs/1711.05769}
+}
+@inproceedings{mallya2018piggyback,
+  title        = {Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights},
+  author       = {Mallya, Arun and Davis, Dillon and Lazebnik, Svetlana},
+  booktitle    = {ECCV},
+  year         = {2018},
+  eprint       = {1801.06519},
+  archivePrefix= {arXiv},
+  url          = {https://arxiv.org/abs/1801.06519}
+}
+@inproceedings{hu2021lora,
+  title        = {{LoRA}: Low-Rank Adaptation of Large Language Models},
+  author       = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
+  booktitle    = {ICLR},
+  year         = {2022},
+  eprint       = {2106.09685},
+  archivePrefix= {arXiv},
+  url          = {https://arxiv.org/abs/2106.09685}
+}
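
Reviewer note on the reworded \emph{erase} and \emph{route} items in the first hunk: the two schedules can be sketched on flat gradient vectors as below. This is a minimal illustration under stated assumptions, not the paper's implementation. The function names `erase` and `route`, the fixed `tau`, and the use of plain NumPy vectors are all hypothetical; in the paper $\tau$ is recalibrated each step from the hack-vs-clean cosine gap, and the quarantine knob $\delta_{S,\text{hack}}$ lives on its own basis, which this sketch does not model.

```python
import numpy as np


def erase(g, v_hack):
    """One-sided erase: subtract the v_hack component of g.

    Returns the part of g orthogonal to v_hack, which is what the
    optimizer would step on.
    """
    v = v_hack / np.linalg.norm(v_hack)  # unit hack direction
    return g - np.dot(g, v) * v


def route(g, v_hack, tau):
    """Per-rollout gate: if cos(g, v_hack) > tau, the whole rollout
    gradient goes to the quarantine slot and the kept slot gets zero.

    Returns (kept_grad, quarantine_grad). Assumption: both slots share
    one coordinate system here; the paper uses a distinct basis.
    """
    denom = np.linalg.norm(g) * np.linalg.norm(v_hack) + 1e-12
    cos = np.dot(g, v_hack) / denom
    if cos > tau:
        return np.zeros_like(g), g.copy()
    return g.copy(), np.zeros_like(g)


# Toy 2-D example: the hack direction is the first axis.
v_hack = np.array([1.0, 0.0])
g_hacky = np.array([3.0, 4.0])   # cosine to v_hack is 0.6
g_clean = np.array([1.0, 4.0])   # cosine to v_hack is about 0.24

erased = erase(g_hacky, v_hack)
kept_a, quarantined_a = route(g_hacky, v_hack, tau=0.5)
kept_b, quarantined_b = route(g_clean, v_hack, tau=0.5)
```

The contrast the sketch is meant to show: `erase` always edits the gradient and keeps its orthogonal remainder, while `route` makes an all-or-nothing decision per rollout and leaves the gradient itself unmodified in whichever slot receives it.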