evil_MoE/docs/spec/20260614_moe_absorption_results.md
2026-06-14 09:28:16 +08:00


MoE sparsity ideas for increasing gradient-routing absorption

Verify: "Modern MoE training uses routing/separation mechanisms that may transfer to gradient routing absorption."

What SGTM / Gradient Routing mean by absorption

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (docs/papers/grad_routing/paper_gradient_routing.md)

  • page date / last updated: not stated

We apply gradient routing to the problem of scalable oversight (Amodei et al., 2016), where the aim is to train a performant policy despite limited access to reliable labels. We train a policy network by reinforcement learning to navigate to two kinds of grid squares in a toy environment, Diamond and Ghost. Using gradient routing, we localize modules responsible for these two behaviors. We show that we can steer the policy towards Diamond by ablating the Ghost module. Gradient routing trains steerable networks even when the amount of labeled training data is small (1%), and even when the policy is able to condition on the existence of labels. As a result, our method outperforms baselines based on behavioral supervision alone. Throughout, we find evidence of an absorption effect, where gradient routing applied to narrow data localizes capabilities relevant to a broader superset of data. Absorption answers the question “if one has labels that are suitable for localizing undesirable computation, why not use those labels to filter the data?” When labels do not encompass all training data from which harmful capabilities arise (Zhu et al., 2009), filtering may be inadequate (Welbl et al., 2021), whereas absorption means that localization can still occur. Furthermore, localization influences model internals without modifying the loss function. This can enable scalable oversight when perfect supervision is not feasible.

epistemic context: local copy of the Gradient Routing paper; this is the paper's own high-level statement of absorption.

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (docs/papers/grad_routing/paper_gradient_routing.md)

  • page date / last updated: not stated

Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model's predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate.

epistemic context: local copy of the Discussion section; this is the most explicit mechanism sketch for absorption.

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks (docs/papers/grad_routing/paper_gradient_routing.md)

  • page date / last updated: not stated

Gradient routing decouples the localization of learning from the localization of computation. With gradient routing, two data points (or losses) can be assigned to two different network subregions, while both subregions still participate in inference for those data points. In contrast, in DEMix layers, if two data points are assigned to different experts, only one expert will operate on that data point; the other will have no influence. This is a critical difference because separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption (see section 5), which requires that all features are present at the time of the forward pass.

epistemic context: local copy of the appendix comparison against DEMix; this is the strongest quote on what breaks absorption.
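
The contrast in this quote can be made concrete with a toy model. The following is a minimal sketch, assuming a scalar linear model whose output is the sum of a "retain" weight and a "quarantine" weight. The names `forward`, `routed_step`, `w_retain`, and `w_quar` are hypothetical and not taken from the paper, and the hard-switched case is only a caricature of DEMix-style sequestering.

```python
def forward(w_retain, w_quar, x):
    # Additive forward path: both subregions contribute to every prediction.
    return (w_retain + w_quar) * x


def routed_step(w_retain, w_quar, x, y, masks, lr=0.1):
    """One SGD step on squared error with per-region gradient masks.

    masks = (m_retain, m_quar) scales the update each region receives.
    The forward pass never looks at the masks.
    """
    err = forward(w_retain, w_quar, x) - y
    grad = 2.0 * err * x  # same raw gradient for both weights in this toy model
    m_retain, m_quar = masks
    return w_retain - lr * m_retain * grad, w_quar - lr * m_quar * grad


w_r, w_q = 0.0, 0.0
# Labeled example: route the whole update to the quarantine weight.
w_r, w_q = routed_step(w_r, w_q, x=1.0, y=1.0, masks=(0.0, 1.0))

# A related, non-routed example still sees the quarantine weight in its
# forward pass, so its error (and the gradient reaching w_retain) is smaller.
err_with_quar = forward(w_r, w_q, 1.0) - 1.0      # quarantine participates
err_hard_switched = forward(w_r, 0.0, 1.0) - 1.0  # quarantine switched off
```

In this sketch the additive path leaves less residual error for the retain weight to explain, which is the toy version of step (iii) in the paper's mechanism. With the quarantine weight removed from the forward pass, the full error falls back on the retain weight.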

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (docs/papers/grad_routing/paper_sgtm.md)

  • page date / last updated: not stated

To understand the mechanism underlying SGTM's robustness to label noise, we hypothesize that the model develops self-reinforcing knowledge localization. Once the model begins localizing forget knowledge based on labeled examples (where we explicitly mask gradients), we expect that unlabeled forget samples (D_forget ∩ D_unlabeled) would naturally gravitate toward using forget parameters, thereby sending stronger gradient signals to those parameters even without explicit masking. To test this hypothesis, we analyze gradient norms from a SGTM model trained on the bilingual TinyStories dataset under perfect labeling conditions. The top row demonstrates clear specialization: forget data primarily updates forget parameters (left), while retain data primarily updates retain parameters (right). The bottom-left panel shows that forget weights receive substantially stronger updates from unlabeled forget data compared to unlabeled retain data, confirming the self-reinforcing localization hypothesis.

epistemic context: local copy of the SGTM gradient-norm analysis; this is direct empirical support for self-reinforcing absorption.

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (docs/papers/grad_routing/paper_sgtm.md)

  • page date / last updated: not stated

Figure 5: (a) Leakage is quantified via equivalent standard training comparison with variable number of forget tokens added to the data mix. The baseline curve (blue) maps the relationship between forget token exposure and forget loss established by training models on all retain data with increasing amounts of forget tokens added. Each blue point represents a model trained with standard training procedure with a given number of forget tokens added to the training dataset. For a given SGTM run (orange) we then take its forget loss and find the number of forget tokens that would achieve the same loss when added to the data mix in standard training (965k). The leakage is then computed by normalizing this number by the total number of (unlabeled) forget tokens in SGTM run. (b) Leakage decreases with model scale. Values denote the ratio of leaked information (measured in forget token exposure) to total undiscovered forget tokens, ranging between 0 (no leakage) and 1 (all information leaked). Larger models consistently exhibit lower leakage rates, with the 64M model maintaining leakage below 0.02 for up to 40% undiscovered forget data.

epistemic context: local copy of the SGTM leakage section; this is the paper's operationalization of non-absorption as leakage.
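
The leakage metric described in the caption is a two-step calculation: invert the baseline curve to get an equivalent number of forget tokens, then normalize by the number of unlabeled forget tokens. Below is a minimal sketch of that arithmetic. It assumes linear interpolation between baseline points (the paper may interpolate differently), and the baseline values are made up for illustration, not results from the paper. The function names are hypothetical.

```python
def equivalent_forget_tokens(baseline, target_loss):
    """Invert a baseline curve by linear interpolation.

    baseline: list of (forget_tokens, forget_loss) pairs from standard
    training, sorted by forget_tokens, with loss strictly decreasing.
    """
    for (t0, l0), (t1, l1) in zip(baseline, baseline[1:]):
        if l1 <= target_loss <= l0:
            frac = (l0 - target_loss) / (l0 - l1)
            return t0 + frac * (t1 - t0)
    raise ValueError("target_loss is outside the baseline curve")


def leakage(baseline, sgtm_forget_loss, unlabeled_forget_tokens):
    """Equivalent forget-token exposure divided by unlabeled forget tokens."""
    equivalent = equivalent_forget_tokens(baseline, sgtm_forget_loss)
    return equivalent / unlabeled_forget_tokens


# Made-up numbers for illustration only.
baseline = [(0, 4.0), (1_000_000, 3.0), (2_000_000, 2.5)]
example_leakage = leakage(
    baseline, sgtm_forget_loss=3.5, unlabeled_forget_tokens=10_000_000
)
```

Here a forget loss of 3.5 maps to 500,000 equivalent tokens, so the toy leakage is 0.05. A value near 0 means the unlabeled forget data contributed little outside the forget parameters.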

Modern MoE mechanisms that look relevant

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (https://aclanthology.org/2024.acl-long.70/)

  • page date / last updated: August 2024

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 × expert parameters and computation.

epistemic context: ACL Anthology abstract page for the DeepSeekMoE paper.
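
The "more flexible combination" claim in strategy (1) is a counting argument: splitting each of N experts into m pieces and activating mK of them enlarges the set of possible active-expert combinations. A small sketch of that arithmetic follows. The values N = 16, K = 2, m = 4 are chosen here for illustration, and `n_expert_combinations` is a hypothetical helper.

```python
from math import comb


def n_expert_combinations(n_experts, top_k, m=1):
    """Number of distinct active-expert sets after splitting each expert into m."""
    return comb(m * n_experts, m * top_k)


coarse = n_expert_combinations(16, 2)     # choose 2 of 16
fine = n_expert_combinations(16, 2, m=4)  # choose 8 of 64
```

Activated parameters stay the same in both cases (2 of 16 full experts versus 8 of 64 quarter-size experts); only the number of available combinations changes, from 120 to 4,426,165,368.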

Mixture-of-Experts with Expert Choice Routing (https://arxiv.org/abs/2202.09368)

  • page date / last updated: not stated

previous sparsely gated networks introduce additional auxiliary losses as regularization to prevent too many tokens being routed to a single expert, but the effectiveness is still limited. Recent approaches explore alternative strategies for routing, but they focus on pre-training only and do not demonstrate performance gain on downstream tasks. Moreover, none of the previous methods consider allocating a variable number of experts to each token based on importance, which can be beneficial. We propose a very simple yet effective routing method we are calling expert choice. Unlike conventional MoE where tokens select one or two top-scoring experts, our method lets each expert pick the top-k tokens. Our method guarantees perfect load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance as demonstrated in our experiments.

epistemic context: arXiv abstract/introduction text for the Expert Choice routing paper.
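
To make "each expert pick the top-k tokens" concrete, here is a minimal sketch over a plain score matrix. It is not the paper's implementation: `expert_choice`, `capacity`, and the score values are illustrative assumptions.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert independently keeps its `capacity` highest-scoring tokens.
    Returns assignment[e] = sorted token indices chosen by expert e.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignment = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment.append(sorted(ranked[:capacity]))
    return assignment


# Illustrative scores: 4 tokens, 2 experts.
scores = [
    [0.9, 0.8],   # token 0: strong for both experts
    [0.7, 0.1],
    [0.2, 0.6],
    [0.1, 0.05],  # token 3: weak for both experts
]
assignment = expert_choice(scores, capacity=2)
experts_per_token = [
    sum(t in chosen for chosen in assignment) for t in range(len(scores))
]
```

Every expert ends up with exactly `capacity` tokens, which is the load-balancing guarantee. The number of experts per token varies: token 0 is claimed twice and token 3 not at all.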

BASE Layers: Simplifying Training of Large, Sparse Models (https://arxiv.org/abs/2103.16716)

  • page date / last updated: not stated

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyper-parameters or auxiliary losses.

epistemic context: arXiv abstract for the BASE layers paper.
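
The balanced-assignment idea can be illustrated on a tiny instance. The sketch below solves the assignment by brute force over permutations, which is only feasible for a handful of tokens and is not how BASE solves it at scale. `balanced_assignment` and the score values are hypothetical.

```python
from itertools import permutations


def balanced_assignment(scores, n_experts):
    """Maximize total token-expert score with equal tokens per expert.

    scores[t][e] is the affinity of token t for expert e.
    Returns (token_to_expert, best_total).
    """
    n_tokens = len(scores)
    assert n_tokens % n_experts == 0, "tokens must divide evenly among experts"
    per_expert = n_tokens // n_experts
    # One slot per unit of expert capacity, e.g. [0, 0, 1, 1].
    slots = [e for e in range(n_experts) for _ in range(per_expert)]
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n_tokens)):
        # perm[i] is the token placed in slot i.
        total = sum(scores[tok][slots[i]] for i, tok in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    token_to_expert = [None] * n_tokens
    for i, tok in enumerate(best_perm):
        token_to_expert[tok] = slots[i]
    return token_to_expert, best_total


# Every token prefers expert 0, so greedy argmax routing would collapse.
scores = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.6], [0.6, 0.5]]
token_to_expert, total = balanced_assignment(scores, n_experts=2)
```

Greedy routing would send all four tokens to expert 0. The balanced solution keeps the two tokens with the largest preference gap on expert 0 and moves the other two to expert 1, with no auxiliary loss involved.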

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (https://arxiv.org/abs/2101.03961)

  • page date / last updated: not stated

A Differentiable Load Balancing Loss. To encourage a balanced load across experts we add an auxiliary loss. For each Switch layer, this auxiliary loss is added to the total model loss during training. Given N experts indexed by i = 1 to N and a batch B with T tokens, the auxiliary loss is computed as the scaled dot-product between vectors f and P.

epistemic context: arXiv mechanism section from the Switch Transformer paper.
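
The quote stops before defining f and P. In the Switch Transformer paper, f_i is the fraction of tokens dispatched to expert i, P_i is the mean router probability assigned to expert i, and the loss is alpha * N * sum_i f_i * P_i. The sketch below computes it in plain Python; `load_balancing_loss` is a hypothetical name, and the default `alpha` is an assumption to tune.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs[t][i]: softmax router probability of token t for expert i."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts  # fraction of tokens whose argmax is expert i
    p = [0.0] * n_experts  # mean router probability given to expert i
    for probs in router_probs:
        top = max(range(n_experts), key=lambda i: probs[i])
        f[top] += 1.0 / n_tokens
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balancing_loss([[1.0, 0.0], [0.0, 1.0]])
collapsed = load_balancing_loss([[1.0, 0.0], [1.0, 0.0]])
```

With two experts, perfectly balanced routing gives a loss of `alpha` and full collapse onto one expert gives `2 * alpha`. The loss is minimized by uniform traffic, which is why it is a poor fit for an intentionally asymmetric retain/quarantine split.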

Designing Effective Sparse Expert Models (https://arxiv.org/abs/2202.08906)

  • page date / last updated: not stated

  1. A large-scale study of the quality-stability trade-offs of stability techniques.
  2. An introduction of the router z-loss that resolves instability issues, while slightly improving model quality.
  3. A fine-tuning analysis of sparse and dense models highlighting different hyperparameter sensitivity to the batch size and learning rate.

epistemic context: arXiv contribution list for the ST-MoE paper.

huggingface/transformers: switch_transformers (https://github.com/huggingface/transformers/blob/main/src/transformers/models/switch_transformers/modeling_switch_transformers.py)

  • page date / last updated: 2026-05-12

def router_z_loss_func(router_logits: torch.Tensor) -> float:
    r"""
    Compute the router z-loss implemented in PyTorch.

    **It encourages router logits to remain small in an effort to improve stability.**

    Args:
        router_logits (`float`):

epistemic context: reference implementation comment in Hugging Face Transformers.
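
Since the excerpt above is cut off before the function body, here is a dependency-free sketch of the same quantity: the mean over tokens of the squared log-sum-exp of the router logits. It is not the Hugging Face code; `router_z_loss` is a hypothetical name, and the input is a plain list of per-token logit lists rather than a tensor.

```python
import math


def router_z_loss(router_logits):
    """Mean over tokens of (log sum_j exp(logit_j)) ** 2."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(router_logits)


small_logits = router_z_loss([[0.0, 0.0], [0.0, 0.0]])
large_logits = router_z_loss([[10.0, 0.0], [0.0, 10.0]])
```

The two inputs route with the same balance across experts, but the large-logit case is penalized far more. The term controls logit scale and says nothing about which expert wins.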

Transfer judgment

YES, if the additive forward path is preserved: shared-path + routed-path split

Why: Gradient routing says absorption "requires that all features are present at the time of the forward pass." DeepSeekMoE's "isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts" is the closest MoE analogue. A direct port is: keep a shared always-on path for common capability, and reserve one or more quarantine paths for specialized residuals, while still keeping those quarantined features available in the forward pass on related non-routed examples. Caveat: this is only compatible with absorption if the quarantine path is additive / in-graph, not hard-switched off.

MAYBE, if you have multiple quarantine experts: fine-grained quarantine segmentation

Why: DeepSeekMoE's "finely segmenting the experts into mN ones and activating mK from them" supports the narrower claim that finer partitioning can improve specialization and reduce redundancy. In gradient routing terms, this suggests splitting one quarantine LoRA/subspace into several smaller quarantine blocks so specialization pressure is not all forced into one block. Caveat: the quote supports specialization, not absorption directly, so this is a plausible transfer rather than a direct implication.

MAYBE: expert-choice or balanced-assignment routing, but only inside the quarantine bank

Why: Expert Choice "lets each expert pick the top-k tokens" and BASE uses a linear assignment with equal load. Those are strong ways to stop routing collapse. A plausible port is reverse routing where each quarantine subspace claims the gradients most aligned with it, or an OT-style assignment that spreads hack gradients across several quarantine slots. Caveat: if applied across retain vs quarantine globally, this can fight absorption, because absorption wants related unlabeled gradients to keep flowing into the quarantine rather than be rebalanced away from it.

MAYBE: load-balancing auxiliary loss

Why: Switch adds an auxiliary loss to encourage balanced expert use. This can help if your failure mode is that one quarantine expert hogs all traffic and the others stay dead. Caveat: for a 2-way retain/quarantine split, generic balancing is the wrong objective. Absorption is asymmetric: you usually want unlabeled hack gradients to over-index on quarantine, not to be evenly spread.

MAYBE: router z-loss / logit-scale control

Why: ST-MoE and the HF implementation both say z-loss is for router stability. If your gradient router uses logits, temperatures, or soft assignments, z-loss could reduce brittle overconfidence or early collapse and make specialization more stable. Caveat: this is a training-stability trick, not an absorption mechanism by itself.

NO for absorption: hard forward sequestering of experts

Why: Gradient Routing explicitly says DEMix-style separation "does not allow for absorption ... which requires that all features are present at the time of the forward pass." So classic hard MoE expert isolation is the wrong transplant if the goal is stronger absorption. It may increase specialization while decreasing the very cross-example reuse that absorption needs.

Best current take

  • Most promising direct transplant: DeepSeekMoE's shared-expert isolation idea, applied as a shared always-on pathway plus routed quarantine pathways.
  • Most promising if you want several hack submodes: fine-grained quarantine experts, possibly with expert-choice or assignment only within the quarantine bank, but this is still a specialization-to-absorption extrapolation.
  • Useful support term, not main idea: z-loss on routing logits.
  • Probably wrong if used naively: global load balancing between retain and quarantine, or any hard forward MoE switch that removes the quarantine path from normal examples.

What I would actually try next

  1. Keep the current additive forward path.
  2. Split the routed quarantine block into K small quarantine experts.
  3. Add one shared always-on expert/path for common gradients.
  4. Route by backward alignment, not hard forward dispatch.
  5. If the K quarantine experts collapse, add either expert-choice-within-quarantine or a weak balancing penalty only over the K quarantine experts.
  6. If training is unstable, add a small z-loss on routing logits.
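
Steps 2, 4, and 5 above can be sketched as a routing decision made on the backward pass only. This is a rough illustration under strong assumptions: each of the K quarantine experts is summarized by a single direction vector, a labeled gradient is claimed by the slot with the highest absolute cosine alignment, and collapse is detected by counting claims. All names are hypothetical, and the handling of unlabeled examples and of the shared path is deliberately left out.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def pick_quarantine_slot(grad, slot_dirs):
    """Index of the quarantine slot whose direction best aligns with grad."""
    return max(range(len(slot_dirs)), key=lambda j: abs(cosine(grad, slot_dirs[j])))


def slot_usage(grads, slot_dirs):
    """Count how many labeled gradients each quarantine slot claims."""
    usage = [0] * len(slot_dirs)
    for g in grads:
        usage[pick_quarantine_slot(g, slot_dirs)] += 1
    return usage


# K = 2 quarantine slots with illustrative directions and gradients.
slot_dirs = [[1.0, 0.0], [0.0, 1.0]]
labeled_grads = [[0.9, 0.1], [0.2, -0.8], [1.0, 0.5]]
usage = slot_usage(labeled_grads, slot_dirs)
```

The forward pass is untouched by this choice, so every quarantine slot stays in-graph for all examples. A heavily skewed `usage` vector would be the signal for step 5, with any balancing term restricted to the K quarantine slots.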

Epistemic summary

  • Who says X: Gradient Routing and SGTM define absorption as localization from narrow labels to a broader superset, and explain that it depends on features remaining available in the forward pass. MoE papers describe mechanisms for expert specialization, balancing, and routing stability.
  • How they could know: Gradient Routing and SGTM have direct experiments on absorption/leakage. The MoE papers report their own architecture/routing mechanisms and training behavior.
  • Entanglement check: the gradient-routing claims come from two closely related papers. The MoE claims are spread across several mostly independent lines: Switch/ST-MoE, Expert Choice, BASE, DeepSeekMoE.
  • Hard-to-vary check: the strongest negative constraint is hard to vary. If a mechanism removes routed features from the forward pass, it conflicts with the explicit Gradient Routing absorption story. That makes naive hard-MoE transfer weak.
  • What would change my mind: evidence that hard expert isolation still improves absorption in a setting where unlabeled related examples must reuse quarantined features, or evidence that balancing losses between retain/quarantine reduce leakage rather than merely equalize traffic.
  • Calibrated take: p ≈ 0.65-0.8 that some MoE tricks transfer, but mainly the shared-vs-specialized and within-quarantine assignment ideas. p ≈ 0.1-0.2 that naive hard MoE routing improves absorption; the local Gradient Routing paper argues against it pretty directly.