From 17a8792340e98a4b6d8d9eeafec56137d9dc8744 Mon Sep 17 00:00:00 2001
From: wassname
Date: Wed, 3 Jun 2026 02:29:45 +0000
Subject: [PATCH] paper: address comprehension friction + OpenReview novelty challenge

- Inline author-notes at the Cloud and Huang related-work bullets
  (cold-reader panel): lead Cloud with parameter-vs-activation space; state
  Huang's keep-vs-remove inversion plainly; flag the unmeasured
  hack-basis==clean-basis question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
  (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
  'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
  'limited novelty vs weight-subspace masking' critique that rejected the
  gradient-routing paper. We remove (not add) a capability and pick the
  subset from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 docs/writeup/main.tex | 40 ++++++++++++++++++++++++++++++++++------
 docs/writeup/refs.bib | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+), 6 deletions(-)

diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index f9b10de..d66f62a 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -119,11 +119,14 @@ possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
 keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \begin{itemize}
   \item \emph{erase} (one-sided): subtract the $v_{\text{hack}}$ component from
-    the live $\delta_S$ gradient; the optimizer steps on the complement.
+    the live $\delta_S$ gradient; the optimizer steps on what remains (the
+    part orthogonal to $v_{\text{hack}}$).
   \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
-    calibrated each step from the hack-vs-clean cosine gap) sends the whole
-    rollout gradient into a scale-matched, distinct-basis quarantine knob
-    $\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+    calibrated each step from the hack-vs-clean cosine gap) sends that
+    rollout's whole gradient into a separate quarantine knob
+    $\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
+    update so the kept knob is fully relieved of it, and deleted at deploy.
+    Mechanically this is
     SGTM-style post-backward parameter-gradient masking
     \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
     $\delta_S$ gradient after the backward pass, not via a forward
@@ -304,6 +307,13 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
 \TODO{prose -- author. Factual differentiators below; the curated scorecard and
 one-liners are in docs/grad\_routing/related\_work.md.}
 \begin{itemize}
+  % COMPREHENSION (cold-reader panel 2026-06-03): the keep-vs-remove inversion
+  % takes two reads. State it plainly first: "Huang projects ONTO a clean
+  % direction; we project a hack direction OUT." Skeptic also flagged: zeroing
+  % delta_S_hack at deploy == not-projecting at deploy, so "deletable knob vs
+  % only-constrains-training" is thin unless argued; and we never measured
+  % whether our hack-basis and their clean-basis are the same subspace (if they
+  % coincide, +project-onto and -project-out converge). Attack vector for a reviewer.
   \item Trusted-direction projection \citep{huang2026directional}: the near-twin.
     They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
     clean warmup and project the live gradient \emph{onto} its dominant
@@ -314,14 +324,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
     ours is a handful of labelled pair gradients), and we quarantine the
     removed part into a deploy-deletable knob, where their projection only
     constrains training.
+  % COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
+  % API. "post-backward vs forward .detach()" reads as engineering taste to an
+  % RL reader; "we route in parameter-gradient space, Cloud routes in activation
+  % space" is the load-bearing distinction. Put that first.
   \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
     We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
     Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
     we operate post-backward on parameter gradients (next bullet).
   \item Capability-localization routing (SGTM \citep{sgtm2025localization}):
     our closest mechanistic relative, a post-backward parameter-gradient mask
-    over a forget subspace, tolerant to label noise, with leakage that
-    shrinks with scale (supports our scalability argument). We differ in the
+    over a forget subspace, tolerant to label noise, where the forgotten
+    capability leaking back into the kept weights shrinks as model size
+    grows (supports our scalability argument). We differ in the
     mask \emph{source}: SGTM tags training data per example over fixed
     reserved dims; we extract one hack direction from a few contrastive pairs
     and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
@@ -339,6 +354,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
   \item AntiPaSTO \citep{antipasto}: the per-Linear $\delta_S$ parameterisation;
     first use here for projection/routing rather than adapter learning.
     \TODO{verify cite before submission.}
+  % Pre-empts the OpenReview "limited novelty vs PackNet/Piggyback/LoRA" line
+  % (the critique that rejected the gradient-routing paper). Honest framing: the
+  % weight-subspace idea is old; ours differs in direction (remove vs add) and in
+  % how the subset is chosen (gradient signal vs task label).
+  \item Parameter-subspace isolation (PackNet \citep{mallya2018packnet},
+    Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
+    older idea that a capability can be confined to a weight subset, via a
+    per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
+    Our quarantine $\delta_{S,\text{hack}}$ is a deletable adapter in that
+    family. Two differences: these methods \emph{add} a wanted task and pick
+    the subset from a given task label, whereas we \emph{remove} an unwanted
+    capability and pick the subset from a gradient signal ($\cos$ to
+    $v_{\text{hack}}$), with no per-rollout label.
   % \TODO{abstract-only twins to verify+place: GRIFT (gradient fingerprints,
   % arXiv:2604.16242); Spilling the Beans (OOD self-report, arXiv:2511.06626).}
 \end{itemize}
diff --git a/docs/writeup/refs.bib b/docs/writeup/refs.bib
index e097862..66f5879 100644
--- a/docs/writeup/refs.bib
+++ b/docs/writeup/refs.bib
@@ -122,3 +122,37 @@
 % GRIFT (gradient fingerprints to detect/reject hacking) arXiv:2604.16242
 % Spilling the Beans (SFT self-report generalises OOD) arXiv:2511.06626
 % Baker et al. (weak monitor -> obfuscated reward hacking) arXiv:2503.11926
+
+% --- parameter-isolation / weight-subspace lineage ---------------------
+% Added 2026-06-03 in response to an OpenReview novelty challenge against
+% gradient routing (PackNet/Piggyback/LoRA as "similar"). We cite them as the
+% classical "confine a capability to a weight subset" precedent; we differ by
+% REMOVING (not adding) a capability and assigning the subset from a gradient
+% signal (not a task label). Bylines verified via arXiv/CVF.
+@inproceedings{mallya2018packnet,
+  title        = {{PackNet}: Adding Multiple Tasks to a Single Network by Iterative Pruning},
+  author       = {Mallya, Arun and Lazebnik, Svetlana},
+  booktitle    = {CVPR},
+  year         = {2018},
+  eprint       = {1711.05769},
+  archivePrefix= {arXiv},
+  url          = {https://arxiv.org/abs/1711.05769}
+}
+@inproceedings{mallya2018piggyback,
+  title        = {Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights},
+  author       = {Mallya, Arun and Davis, Dillon and Lazebnik, Svetlana},
+  booktitle    = {ECCV},
+  year         = {2018},
+  eprint       = {1801.06519},
+  archivePrefix= {arXiv},
+  url          = {https://arxiv.org/abs/1801.06519}
+}
+@inproceedings{hu2021lora,
+  title        = {{LoRA}: Low-Rank Adaptation of Large Language Models},
+  author       = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
+  booktitle    = {ICLR},
+  year         = {2022},
+  eprint       = {2106.09685},
+  archivePrefix= {arXiv},
+  url          = {https://arxiv.org/abs/2106.09685}
+}
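
The two schedules reworded in the first hunk of main.tex (erase and route) can be sketched on flat vectors as below. This is only an illustration under stated assumptions, not the repository's code: `erase`, `route`, and `_unit` are hypothetical names, the gradient is treated as a plain NumPy vector rather than a per-Linear $\delta_S$ gradient in the SVD basis, `tau` is passed in instead of being calibrated each step, and the quarantine knob is given the same shape as the kept knob even though the paper puts it on its own basis. The patch does not spell out what "one-sided" means for erase, so the sketch uses a plain two-sided projection.

```python
import numpy as np


def _unit(v):
    """Return v scaled to unit length (hypothetical helper)."""
    return v / np.linalg.norm(v)


def erase(g, v_hack):
    """Subtract the v_hack component from g; what remains is orthogonal to v_hack.

    Assumption: plain (two-sided) projection, since the patch does not
    define the "one-sided" qualifier.
    """
    v = _unit(v_hack)
    return g - np.dot(g, v) * v


def route(g, v_hack, tau):
    """Per-rollout gate: if cos(g, v_hack) > tau, send the whole gradient
    to the quarantine knob; otherwise keep it. Returns (kept, quarantined).

    Assumption: tau is supplied by the caller; the paper calibrates it
    each step from the hack-vs-clean cosine gap.
    """
    v = _unit(v_hack)
    cos = np.dot(g, v) / (np.linalg.norm(g) + 1e-12)
    if cos > tau:
        return np.zeros_like(g), g
    return g, np.zeros_like(g)


g = np.array([3.0, 4.0])
v_hack = np.array([2.0, 0.0])

g_erased = erase(g, v_hack)               # component along v_hack removed
kept_hi, quar_hi = route(g, v_hack, 0.5)  # cos = 0.6 > 0.5, so routed
kept_lo, quar_lo = route(g, v_hack, 0.7)  # cos = 0.6 <= 0.7, so kept
```

Deleting the quarantine knob at deploy corresponds here to discarding everything accumulated in the second return value of `route`.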