mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
paper: address comprehension friction + OpenReview novelty challenge
- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader panel): lead Cloud with parameter-vs-activation space; state Huang's keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack 'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the 'limited novelty vs weight-subspace masking' critique that rejected the gradient-routing paper. We remove (not add) a capability and pick the subset from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+34
-6
@@ -119,11 +119,14 @@ possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
 keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \begin{itemize}
 \item \emph{erase} (one-sided): subtract the $v_{\text{hack}}$ component from
-the live $\delta_S$ gradient; the optimizer steps on the complement.
+the live $\delta_S$ gradient; the optimizer steps on what remains (the
+part orthogonal to $v_{\text{hack}}$).
 \item \emph{route}: a per-rollout gate $\cos(g,v)>\tau$ ($\tau$
-calibrated each step from the hack-vs-clean cosine gap) sends the whole
-rollout gradient into a scale-matched, distinct-basis quarantine knob
-$\delta_{S,\text{hack}}$, deleted at deploy. Mechanically this is
+calibrated each step from the hack-vs-clean cosine gap) sends that
+rollout's whole gradient into a separate quarantine knob
+$\delta_{S,\text{hack}}$ on its own basis, sized to absorb the routed
+update so the kept knob is fully relieved of it, and deleted at deploy.
+Mechanically this is
 SGTM-style post-backward parameter-gradient masking
 \citep{sgtm2025localization} in the SVD basis (we strip/redirect the
 $\delta_S$ gradient after the backward pass, not via a forward
@@ -304,6 +307,13 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
 \TODO{prose -- author. Factual differentiators below; the curated scorecard and
 one-liners are in docs/grad\_routing/related\_work.md.}
 \begin{itemize}
+% COMPREHENSION (cold-reader panel 2026-06-03): the keep-vs-remove inversion
+% takes two reads. State it plainly first: "Huang projects ONTO a clean
+% direction; we project a hack direction OUT." Skeptic also flagged: zeroing
+% delta_S_hack at deploy == not-projecting at deploy, so "deletable knob vs
+% only-constrains-training" is thin unless argued; and we never measured
+% whether our hack-basis and their clean-basis are the same subspace (if they
+% coincide, +project-onto and -project-out converge). Attack vector for a reviewer.
 \item Trusted-direction projection \citep{huang2026directional}: the near-twin.
 They SVD the clean parameter update $\Delta W = W_t - W_0$ from a short
 clean warmup and project the live gradient \emph{onto} its dominant
@@ -314,14 +324,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 ours is a handful of labelled pair gradients), and we quarantine the
 removed part into a deploy-deletable knob, where their projection only
 constrains training.
+% COMPREHENSION (cold-reader panel 2026-06-03): lead with the space, not the
+% API. "post-backward vs forward .detach()" reads as engineering taste to an
+% RL reader; "we route in parameter-gradient space, Cloud routes in activation
+% space" is the load-bearing distinction. Put that first.
 \item Gradient routing \citep{cloud2024gradientrouting}: Expand-Route-Ablate.
 We inherit the localize-then-ablate \emph{idea}, but not the mechanism:
 Cloud routes by a forward \texttt{.detach()} on labelled activation dims;
 we operate post-backward on parameter gradients (next bullet).
 \item Capability-localization routing (SGTM \citep{sgtm2025localization}): our
 closest mechanistic relative, a post-backward parameter-gradient mask
-over a forget subspace, tolerant to label noise, with leakage that
-shrinks with scale (supports our scalability argument). We differ in the
+over a forget subspace, tolerant to label noise, where the forgotten
+capability leaking back into the kept weights shrinks as model size
+grows (supports our scalability argument). We differ in the
 mask \emph{source}: SGTM tags training data per example over fixed
 reserved dims; we extract one hack direction from a few contrastive pairs
 and route by cosine. Their TPR/FPR detector-quality knob is our no-cheat
@@ -339,6 +354,19 @@ one-liners are in docs/grad\_routing/related\_work.md.}
 \item AntiPaSTO \citep{antipasto}: the per-Linear $\delta_S$ parameterisation;
 first use here for projection/routing rather than adapter learning.
 \TODO{verify cite before submission.}
+% Pre-empts the OpenReview "limited novelty vs PackNet/Piggyback/LoRA" line
+% (the critique that rejected the gradient-routing paper). Honest framing: the
+% weight-subspace idea is old; ours differs in direction (remove vs add) and in
+% how the subset is chosen (gradient signal vs task label).
+\item Parameter-subspace isolation (PackNet \citep{mallya2018packnet},
+Piggyback \citep{mallya2018piggyback}, LoRA \citep{hu2021lora}): the
+older idea that a capability can be confined to a weight subset, via a
+per-task binary mask (PackNet, Piggyback) or a low-rank adapter (LoRA).
+Our quarantine $\delta_{S,\text{hack}}$ is a deletable adapter in that
+family. Two differences: these methods \emph{add} a wanted task and pick
+the subset from a given task label, whereas we \emph{remove} an unwanted
+capability and pick the subset from a gradient signal ($\cos$ to
+$v_{\text{hack}}$), with no per-rollout label.
 % \TODO{abstract-only twins to verify+place: GRIFT (gradient fingerprints,
 % arXiv:2604.16242); Spilling the Beans (OOD self-report, arXiv:2511.06626).}
 \end{itemize}

@@ -122,3 +122,37 @@
 % GRIFT (gradient fingerprints to detect/reject hacking) arXiv:2604.16242
 % Spilling the Beans (SFT self-report generalises OOD) arXiv:2511.06626
 % Baker et al. (weak monitor -> obfuscated reward hacking) arXiv:2503.11926
+
+% --- parameter-isolation / weight-subspace lineage ---------------------
+% Added 2026-06-03 in response to an OpenReview novelty challenge against
+% gradient routing (PackNet/Piggyback/LoRA as "similar"). We cite them as the
+% classical "confine a capability to a weight subset" precedent; we differ by
+% REMOVING (not adding) a capability and assigning the subset from a gradient
+% signal (not a task label). Bylines verified via arXiv/CVF.
+@inproceedings{mallya2018packnet,
+title = {{PackNet}: Adding Multiple Tasks to a Single Network by Iterative Pruning},
+author = {Mallya, Arun and Lazebnik, Svetlana},
+booktitle = {CVPR},
+year = {2018},
+eprint = {1711.05769},
+archivePrefix= {arXiv},
+url = {https://arxiv.org/abs/1711.05769}
+}
+@inproceedings{mallya2018piggyback,
+title = {Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights},
+author = {Mallya, Arun and Davis, Dillon and Lazebnik, Svetlana},
+booktitle = {ECCV},
+year = {2018},
+eprint = {1801.06519},
+archivePrefix= {arXiv},
+url = {https://arxiv.org/abs/1801.06519}
+}
+@inproceedings{hu2021lora,
+title = {{LoRA}: Low-Rank Adaptation of Large Language Models},
+author = {Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
+booktitle = {ICLR},
+year = {2022},
+eprint = {2106.09685},
+archivePrefix= {arXiv},
+url = {https://arxiv.org/abs/2106.09685}
+}
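
Reading aid for the \emph{erase} and \emph{route} bullets in the first hunk: a minimal pure-Python sketch of the two operations on flat gradient vectors. It is not code from this repository. The function names (`erase`, `route`, `cosine`) are hypothetical, $\tau$ is passed in rather than calibrated from the hack-vs-clean cosine gap, and reading "one-sided" as "remove the component only when it points along $v_{\text{hack}}$" is an assumption, since the hunk does not define the term.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is zero."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def erase(g, v_hack, one_sided=True):
    """Subtract the v_hack component from gradient g.

    ASSUMPTION: with one_sided=True the component is removed only when
    g has a positive projection on v_hack; otherwise g is left as is.
    """
    vv = dot(v_hack, v_hack)
    if vv == 0:
        return list(g)
    coef = dot(g, v_hack) / vv
    if one_sided and coef <= 0:
        return list(g)
    # What remains is orthogonal to v_hack.
    return [gi - coef * vi for gi, vi in zip(g, v_hack)]


def route(rollout_grads, v_hack, tau):
    """Per-rollout gate: if cos(g, v_hack) > tau, the rollout's whole
    gradient accumulates in the quarantine buffer, else in the kept one.

    ASSUMPTION: tau is supplied by the caller; its per-step calibration
    is not shown here.
    """
    dim = len(v_hack)
    kept = [0.0] * dim
    quarantine = [0.0] * dim
    for g in rollout_grads:
        target = quarantine if cosine(g, v_hack) > tau else kept
        for i, gi in enumerate(g):
            target[i] += gi
    return kept, quarantine


if __name__ == "__main__":
    v = [1.0, 0.0]
    print(erase([3.0, 4.0], v))
    print(route([[3.0, 4.0], [0.0, 2.0]], v, tau=0.5))
```

In this sketch the quarantine buffer stands in for the $\delta_{S,\text{hack}}$ knob: dropping it at deploy corresponds to discarding the second return value of `route`.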