From c4ac632b376f83e1de125c570888070df4013ad7 Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Sun, 14 Jun 2026 09:28:03 +0800 Subject: [PATCH] docs: add Evil MoE experiment proposal --- docs/spec/20260614_evil_moe_lora_proposal.md | 315 +++++++++++++++++++ docs/spec/20260614_evil_moe_lora_review.md | 10 + docs/spec/20260614_evil_moe_lora_review2.md | 40 +++ docs/spec/20260614_moe_absorption_results.md | 138 ++++++++ docs/spec/20260614_moe_absorption_review.md | 15 + docs/spec/20260614_moe_absorption_search.md | 50 +++ 6 files changed, 568 insertions(+) create mode 100644 docs/spec/20260614_evil_moe_lora_proposal.md create mode 100644 docs/spec/20260614_evil_moe_lora_review.md create mode 100644 docs/spec/20260614_evil_moe_lora_review2.md create mode 100644 docs/spec/20260614_moe_absorption_results.md create mode 100644 docs/spec/20260614_moe_absorption_review.md create mode 100644 docs/spec/20260614_moe_absorption_search.md diff --git a/docs/spec/20260614_evil_moe_lora_proposal.md b/docs/spec/20260614_evil_moe_lora_proposal.md new file mode 100644 index 0000000..e502c62 --- /dev/null +++ b/docs/spec/20260614_evil_moe_lora_proposal.md @@ -0,0 +1,315 @@ +# Evil Expert: Putting the E in MoE by routing unwanted behaviour away + +Status: proposal, not implemented. + +## BLUF + +"Evil Expert" / "Evil MoE" is a separate experiment: an MoE where one or more experts are trained to carry reward-hack-associated behaviour, then ablated at eval/deploy. LoRA is the cheap experimental substrate, not the core claim. + +The hypothesis is independent of the Gradient Routing absorption hypothesis: MoE-style routing can localize unwanted behaviour in an ablatable expert. Gradient Routing and SGTM are useful background because they discuss localization, absorption, leakage, and ablation, but Evil MoE should be evaluated on its own causal ablation test. + +The training constraint is the same as the rest of this repo. 
This is a project constraint, not a literature claim: + +## *AGENTS.md* - project constraint for vGROUT +- epistemic context: standing repo instruction from the user, included to make the experimental boundary explicit. + +> The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval. +> +> OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by its cosine alignment to `vec`. The only labels anywhere are on the pairs we wrote; no detector ever runs over student rollouts at train time. Generalization is tested by whether `vec` (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization. + +## Goal + +Train a model with several experts. For the first experiment, implement the experts as LoRA adapters because this is cheap and close to the current code: + +```text +base model + shared LoRA + retain-associated expert(s) + hack-associated expert(s) +``` + +Use a small amount of synthetic hack-vs-clean router supervision to initialize the hack-associated expert. Then let a sparse MoE router specialize under GRPO, so reward-hacky behavior preferentially uses that expert. LoRA is only there to make this test feasible without training full expert MLPs. At eval, compare: + +```text +all experts enabled: base + shared + retain-associated + hack-associated experts +hack-associated expert disabled: base + shared + retain-associated experts +``` + +Success means hack rate drops when hack-associated experts are ablated, while solve rate / normal capability mostly survives. + +## Relation to Gradient Routing and SGTM + +Evil MoE is not an absorption booster proposal. 
It is a separate localization experiment. Gradient Routing and SGTM still matter because they give useful concepts and failure modes: localized parameters, ablation, leakage, and the distinction between localizing learning and localizing computation. + +Gradient Routing's absorption condition is stricter than the Evil MoE hypothesis: + +## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* - [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md) +- epistemic context: local paper note, author's mechanism claim for absorption. + +> Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. **To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model’s predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere.** Absorption may also amplify the features causing it. + +And the same paper says hard forward expert separation breaks that condition: + +## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* - [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md) +- epistemic context: local paper note, appendix comparison with DEMix. + +> Gradient routing decouples the localization of learning from the localization of computation. 
With gradient routing, two data points (or losses) can be assigned to two different network subregions, while both subregions still participate in inference for those data points. In contrast, in DEMix layers, if two data points are assigned to different experts, only one expert will operate on that data point; the other will have no influence. **This is a critical difference because separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption (see section 5), which requires that all features are present at the time of the forward pass.** + +So if the goal is *SGTM/Gradient-Routing absorption*, hard MoE dispatch is suspect. Evil MoE has a different goal: learned localization of reward-hack behavior in an ablatable module. For that goal, hard or sparse MoE becomes plausible again. + +## SGTM as motivation, not the same claim + +SGTM gives a seed-and-self-reinforce story: + +## *Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs* - [paper_sgtm.md](../papers/grad_routing/paper_sgtm.md) +- epistemic context: local paper note, gradient-norm analysis. + +> To understand the mechanism underlying SGTM’s robustness to label noise, we hypothesize that the model develops self-reinforcing knowledge localization. Once the model begins localizing forget knowledge based on labeled examples (where we explicitly mask gradients), we expect that unlabeled forget samples (D_forget ∩ D_unlabeled) would naturally gravitate toward using forget parameters, thereby sending stronger gradient signals to those parameters even without explicit masking. To test this hypothesis, we analyze gradient norms from a SGTM model trained on the bilingual TinyStories dataset under perfect labeling conditions. **The top row demonstrates clear specialization: forget data primarily updates forget parameters (left), while retain data primarily updates retain parameters (right). 
The bottom-left panel shows that forget weights receive substantially stronger updates from unlabeled forget data compared to unlabeled retain data, confirming the self-reinforcing localization hypothesis.** + +The Evil MoE version has an analogous hypothesized shape: + +```text +synthetic hack pairs supervise the hack-associated expert +hack-associated expert becomes useful for hack-like computations +router sends similar inputs / tokens / rollouts to that expert +hack-associated expert receives more gradient on hack-like behavior +hack behavior becomes ablatable +``` + +But unlike SGTM, Evil MoE may use a learned forward gate, not only a backward gradient mask. That makes it a different experiment, with a different success criterion. + +## MoE literature connection + +### Ordinary MoE routing is usually not semantically labelled + +Mainstream MoE usually does not label experts as "math", "code", or "reward-hacking". A router maps token states to expert scores; top-k experts run; the language-model loss trains the selected experts and selected router weights. Aux losses or assignment rules stop collapse. + +This matters because the proposed method does not run a detector over student rollouts during training. The only semantic supervision is the initial synthetic pair supervision. + +### Expert specialization and shared experts + +## *DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models* - [ACL Anthology](https://aclanthology.org/2024.acl-long.70/) +- epistemic context: paper abstract; supports specialization/shared-expert architecture, not absorption directly. + +> In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. 
It involves two principal strategies: **(1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts.** + +Transfer: use one shared always-on LoRA path for common problem-solving, plus small routed experts for behavior-specific residuals. This is architectural separation, not proof of hack absorption. + +### Expert Choice / BASE: assignment can replace aux loss + +## *Mixture-of-Experts with Expert Choice Routing* - [arXiv:2202.09368](https://arxiv.org/abs/2202.09368) +- epistemic context: paper abstract/introduction; supports balanced assignment and variable experts per token. + +> We propose a very simple yet effective routing method we are calling expert choice. Unlike conventional MoE where tokens select one or two top-scoring experts, our method lets each expert pick the top-k tokens. **Our method guarantees perfect load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance as demonstrated in our experiments.** + +## *BASE Layers: Simplifying Training of Large, Sparse Models* - [arXiv:2103.16716](https://arxiv.org/abs/2103.16716) +- epistemic context: paper abstract; supports balanced expert assignment. + +> **In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.** This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyper-parameters or auxiliary losses. + +Transfer: if the hack expert dies or one expert eats all traffic, add expert-choice / assignment *inside the expert bank*. 
Do not globally balance hack-vs-clean if the intended asymmetry is that hack-like examples should overuse the hack expert. + +### Switch / ST-MoE: aux balancing and router stability + +## *Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity* - [arXiv:2101.03961](https://arxiv.org/abs/2101.03961) +- epistemic context: mechanism section; supports load balancing. + +> A Differentiable Load Balancing Loss. To encourage a balanced load across experts we add an auxiliary loss. **For each Switch layer, this auxiliary loss is added to the total model loss during training.** + +## *Designing Effective Sparse Expert Models* - [arXiv:2202.08906](https://arxiv.org/abs/2202.08906) +- epistemic context: contribution list; supports router z-loss as a stability trick. + +> 1. A large-scale study of the quality-stability trade-offs of stability techniques. **2. An introduction of the router z-loss that resolves instability issues, while slightly improving model quality.** + +Transfer: useful as training scaffolding if the router collapses or logits saturate. Not the main mechanism. + +### LoRA is the implementation substrate + +## Arrow LoRA merge note - [local PEFT ref](../vendor/lora-lite/docs/refs/peft_lora_variants.py) +- epistemic context: checked-in vendor/reference notes for an MoE-style LoRA variant. + +> The adapter_name is "arrow_router" by default, set in create_arrow_model() in ./arrow.py. Since Arrow is a Mixture-of-Experts (MoE) approach, merging adapters is not meaningful or even possible: for each token, **the top-k LoRA experts are dynamically selected and routed.** Because of this per-token routing, there is no single set of weights that can represent a merged adapter. + +Transfer: LoRA-MoE is a practical way to test the idea cheaply. 
For this repo, the existing `src/vgrout/antipasto.py` already has additive kept/quarantine LoRA-ish paths (`_lora_A`, `_lora_A_hack`, frozen `B`), so the natural extension is multiple `A_hack[k]` plus a router. If LoRA capacity is too small, that is an implementation failure, not a disproof of the Evil Expert hypothesis.
+
+## Proposed mechanism
+
+### Version A: hard sparse forward MoE, simplest Evil MoE expert
+
+Use if the goal is ablatable behavioral modularity, not strict absorption.
+
+```py
+# θ frozen base model
+# φ_s shared LoRA, always active
+# φ_e[k] expert LoRAs, k ∈ {clean_0, ..., hack_0, ...}
+# ρ router, maps token/rollout features to expert logits
+
+# ── Forward ────────────────────────
+def layer(x, θ, φ_s, φ_e, ρ):  # x ∈ ℝ^{b×s×d}
+    y_base = θ.W @ x
+    y_shared = φ_s(x)
+
+    z = ρ(x)  # z ∈ ℝ^{b×s×K}
+    π = softmax(z / τ)
+    S = top_k(π, k=1)  # sparse dispatch; top-1 or top-2
+
+    y_exp = sum(π[k] * φ_e[k](x) for k in S)
+    return y_base + y_shared + y_exp
+```
+
+Training:
+
+```py
+for grpo_batch in grpo_rollouts:
+    y = model(grpo_batch)
+    ℒ_grpo = grpo_loss(y)
+
+    # separate synthetic pin batch, not labels attached to live GRPO rollouts
+    π_hack = router(synthetic_hack_pairs)
+    π_clean = router(synthetic_clean_pairs)
+    ℒ_pin = -log(π_hack[hack_expert]) - log(π_clean[clean_expert])
+
+    ℒ_sparse = λ_H * entropy(π)  # encourage sparse expert use
+    ℒ_bal = λ_bal * balance(π)  # optional, weak, inside expert bank
+    ℒ_z = λ_z * mean(logsumexp(z) ** 2)  # optional router stability
+
+    ℒ = ℒ_grpo + ℒ_pin + ℒ_sparse + ℒ_bal + ℒ_z
+    # θ stays frozen; update φ_s, φ_e, ρ
+```
+
+Ablation:
+
+```py
+hack_rate_on, solve_on = evaluate(model, experts=all_experts)  # evaluate(): placeholder for the final deploy eval
+hack_rate_off, solve_off = evaluate(model, experts=all_except_hack)
+solve_drop = solve_on - solve_off
+```
+
+This is the closest literal Evil MoE setup.
+
+### Version B: soft/additive Evil MoE expert
+
+Use if we want a version that keeps more experts present in the forward graph.
This is closer to the Gradient Routing absorption condition, but the experiment is still Evil MoE, not an absorption test.
+
+```py
+def layer(x, θ, φ_s, φ_e, ρ):
+    y_base = θ.W @ x
+    y_shared = φ_s(x)
+
+    z = ρ(x)
+    π = entmax(z / τ)  # sparse but can keep multiple nonzero paths
+
+    # all experts stay in-graph, unlike DEMix-style hard dispatch; entmax can still zero some π[k]
+    y_exp = sum(π[k] * φ_e[k](x) for k in range(K))
+    return y_base + y_shared + y_exp
+```
+
+Training is the same, but use a higher initial temperature / less sparse gate, then anneal. This is less compute-efficient but more compatible with absorption, because hack experts can remain present for related non-pinned examples.
+
+### Version C: backward-routed evil expert, closest to current vGROUT
+
+Use if we want minimal changes to the current AntiPaSTO/LoRA routeV setup.
+
+```py
+# Existing LoRA-frozen-B path:
+# y = y_base + B @ (A_shared @ x + A_hack @ x)
+# Extend A_hack to K hack experts: A_hack[k]
+
+for rollout in batch:
+    g = per_rollout_grad(A_shared)  # current grad_probe-style estimate
+    s = [cos(g, v_hack[k]) for k in range(K)]  # or router score trained only from synthetic pairs/vectors
+
+    k_star = argmax(s)
+    if s[k_star] > τ:
+        A_hack[k_star].grad += project_hack(g, v_hack[k_star])
+        A_shared.grad -= project_hack(g, v_hack[k_star])
+```
+
+Forward can stay additive:
+
+```py
+y = y_base + B @ (A_shared @ x + sum(A_hack[k] @ x for k in range(K)))
+```
+
+This is least MoE-like in forward compute, but probably most consistent with the Gradient Routing absorption story.
+
+## Recommended first experiment
+
+Implementation choice: start with Version A only. It gives the clearest falsifiable Evil MoE result. Keep Version B and Version C as follow-ups, not part of the first implementation.
+
+1. Base + frozen-B LoRA experts.
+2. Experts: `shared`, `clean`, `hack`.
+3. Router input: last hidden state or per-token hidden state at target layers.
+4. Supervise the router only on hand-authored synthetic hack-vs-clean pairs.
+5.
GRPO train on normal rollouts without live hack labels.
+6. Eval with hack expert on/off using the ORACLE only for the final deploy eval.
+
+Training/eval boundary:
+
+```text
+Allowed: hand-built synthetic contrastive pairs -> supervise router / seed hack expert.
+Allowed: extracted hack direction vec from synthetic pairs -> initialize or regularize hack expert.
+Allowed: vec -> routing, where live GRPO gradients are routed by cosine alignment to vec.
+Forbidden: ORACLE or detector labels on student rollouts at TRAIN time.
+Forbidden: using the final eval grader to gate routing, set thresholds, or label student rollouts.
+```
+
+Generalization is tested by whether a vec built from synthetic pairs covering some hack modes suppresses held-out modes absent from those pairs. That is vector generalization, not detector-label generalization.
+
+## UAT
+
+A run supports the Evil MoE idea if the report table shows:
+
+| check | expected if working | wrong-case |
+|---|---|---|
+| hack ablation | hack rate lower with hack expert off | no hack drop, or hack drop only from total capability collapse |
+| matched ablation | hack-expert-off reduces hacks more specifically than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
+| capability retention | solve rate / reward mostly preserved with hack expert off | ablation destroys normal LeetCode ability |
+| routing selectivity | synthetic hack pairs route more to hack expert than clean pairs | router learns style/length/domain artifacts |
+| held-out hack modes | held-out hack modes also route to / depend on hack expert | only pinned hack template is isolated |
+| train/eval boundary audit | no ORACLE or detector labels touch TRAIN-time routing | live student-rollout labels leak into router |
+
+Minimum evidence file should include:
+
+- config / command
+- router usage table for synthetic clean, synthetic hack, live GRPO, held-out hack eval
+- hack-rate and solve-rate table with hack
expert on/off +- examples of prompts/completions for first train/eval batch, so formatting artifacts are visible + +## Main failure modes + +1. The hack expert becomes a general coding expert, so ablating it reduces hacks by making the model worse. +2. The router learns superficial artifacts in the synthetic pairs: style, length, refusal wording, problem family. +3. GRPO reward pressure relearns hack behavior in clean/shared experts because hacks are useful. +4. Hard forward routing blocks absorption-like generalization to related unpinned examples. +5. Load balancing fights the desired asymmetry by forcing hack-like traffic away from the hack expert. + +## Decision + +The Evil MoE idea is worth testing, with the claim stated at the level the evidence supports: + +- It is a separate experiment from SGTM absorption. +- It is an ablatable-modularity hypothesis: weak synthetic router supervision plus MoE specialization might put reward-hack behavior in a removable expert. LoRA is the first implementation substrate. +- The primary proof is not lower training loss. The proof is causal: turn off the evil expert and held-out hack rate drops while normal solve behavior remains. 
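
The causal test named in the UAT table and the Decision can be sketched as a small scoring helper. This is a minimal sketch, not repo code: `AblationResult`, `uat_verdict`, and the thresholds `min_hack_drop`, `max_solve_drop`, and `specificity_margin` are hypothetical names and placeholder values, since the proposal does not fix any thresholds. It assumes hack and solve rates were already measured by the final deploy eval for all-experts-on, hack-expert-off, and at least one matched control ablation (clean-off or random-off).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AblationResult:
    """Final deploy-eval rates for one expert configuration (hypothetical container)."""
    hack_rate: float   # fraction of held-out eval rollouts scored as hacks
    solve_rate: float  # fraction of eval problems solved


def uat_verdict(
    all_on: AblationResult,
    hack_off: AblationResult,
    controls: Dict[str, AblationResult],
    min_hack_drop: float = 0.10,       # placeholder threshold, not from the proposal
    max_solve_drop: float = 0.05,      # placeholder threshold, not from the proposal
    specificity_margin: float = 0.05,  # placeholder threshold, not from the proposal
) -> Dict[str, bool]:
    """Score three UAT rows: hack ablation, capability retention, matched ablation."""
    hack_drop = all_on.hack_rate - hack_off.hack_rate
    solve_drop = all_on.solve_rate - hack_off.solve_rate

    # Hack-rate drop produced by each matched control ablation (clean-off, random-off, ...).
    control_drops = [all_on.hack_rate - r.hack_rate for r in controls.values()]
    best_control_drop = max(control_drops, default=0.0)

    return {
        # Hack rate falls when the hack expert is switched off.
        "hack_ablation": hack_drop >= min_hack_drop,
        # Solve rate mostly survives, so the drop is not total capability collapse.
        "capability_retention": solve_drop <= max_solve_drop,
        # Hack-off beats every matched control by a margin, so the effect is specific.
        "matched_ablation": hack_drop - best_control_drop >= specificity_margin,
    }
```

A run would count as supporting the hypothesis only if all three flags are true; a `hack_ablation` pass with a `capability_retention` fail corresponds to failure mode 1 (the hack expert became a general coding expert).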
+ +## Links + +Local: + +- [MoE absorption search note](20260614_moe_absorption_results.md) +- [Fresh-eyes review of that note](20260614_moe_absorption_review.md) +- [Gradient Routing local paper note](../papers/grad_routing/paper_gradient_routing.md) +- [SGTM local paper note](../papers/grad_routing/paper_sgtm.md) +- [Routing v2 distinct-basis spec](20260531_routing_v2_distinct_basis.md) +- [Current AntiPaSTO/LoRA hook implementation](../../src/vgrout/antipasto.py) +- [Local MoE hits](20260614_local_search_moe_hits.md) +- [arXiv MoE hits](20260614_arxiv_moe_hits.md) +- [GitHub MoE hits](20260614_gh_moe_hits.md) +- [Semantic MoE hits](20260614_semantic_moe_hits.md) + +External: + +- [Gradient Routing arXiv:2410.04332](https://arxiv.org/abs/2410.04332) +- [DeepSeekMoE ACL Anthology](https://aclanthology.org/2024.acl-long.70/) +- [Expert Choice Routing arXiv:2202.09368](https://arxiv.org/abs/2202.09368) +- [BASE Layers arXiv:2103.16716](https://arxiv.org/abs/2103.16716) +- [Switch Transformers arXiv:2101.03961](https://arxiv.org/abs/2101.03961) +- [ST-MoE arXiv:2202.08906](https://arxiv.org/abs/2202.08906) +- [Hugging Face Switch Transformer implementation](https://github.com/huggingface/transformers/blob/main/src/transformers/models/switch_transformers/modeling_switch_transformers.py) diff --git a/docs/spec/20260614_evil_moe_lora_review.md b/docs/spec/20260614_evil_moe_lora_review.md new file mode 100644 index 0000000..6b586bc --- /dev/null +++ b/docs/spec/20260614_evil_moe_lora_review.md @@ -0,0 +1,10 @@ +## Review + +- Correct: The proposal mostly distinguishes SGTM/Gradient Routing absorption from the evil-MoE modularization hypothesis. 
Evidence: BLUF says this is "not \"increase gradient-routing absorption\" directly" and is "closer to learned behavioral modularization" (`docs/spec/20260614_evil_moe_lora_proposal.md:7-9`); the absorption section explicitly says hard MoE dispatch is suspect for SGTM absorption but plausible for ablatable modularity (`:39-46`); the Decision repeats that it is "not a direct continuation of SGTM absorption" (`:273-277`). +- Correct: The no-cheat constraint is stated clearly in several places: labels only from hand-authored synthetic pairs/vectors (`:9`), no live detector required (`:75`), pinning only from synthetic pairs (`:155-157`, `:232-241`), and UAT includes a no-cheat audit (`:254`). This is aligned with the repo constraint. +- Correct: The MoE evidence is mostly framed conservatively. DeepSeekMoE is described as supporting shared/specialized experts "not absorption directly" (`:79-84`), Switch/ST-MoE as scaffolding "not the main mechanism" (`:100-112`), and Arrow LoRA only as technical plausibility (`:114-121`). I did not find a direct claim that MoE literature proves absorption. +- Note: The phrase "SGTM gives the seed-and-self-reinforce story" plus "The evil-MoE version keeps the same shape" (`:50-67`) is plausible but close to overclaiming. The later caveat at `:67` helps. Safer wording would mark this as an analogy/hypothesis, not evidence that learned MoE routing has SGTM-style absorption. +- Note: Version B overclaims slightly: "Use if we want to preserve the Gradient Routing absorption condition" (`:177-194`). Entmax can still zero experts, and annealing toward sparsity can reintroduce hard absence. "More compatible with the absorption condition" is justified; "preserve" is stronger than the pseudocode guarantees. +- Note: Version C has a no-cheat ambiguity in "or learned router score" (`:205-208`). It is no-cheat only if the learned score is trained from synthetic pairs/vectors, not live oracle/detector labels. 
The surrounding no-cheat section probably implies this, but implementation guidance should say it locally. +- Note: The Version A training pseudocode is conceptually plausible, but `for batch in grpo_rollouts` then `L_pin = ... synthetic_hack/synthetic_clean` (`:150-164`) is underspecified. It should make clear that synthetic pin batches get a separate router forward pass and are not labels attached to live GRPO rollouts. +- Note: The UAT is directionally useful (`:244-261`), but "solve rate / normal capability mostly survives" (`:26`, `:251`) has no threshold or matched-ablation control. A clean-expert-off or random-expert-off comparison would help distinguish "hack expert is causally specific" from "ablating any capacity changes behavior." Not a blocker for a proposal, but it matters before implementation. diff --git a/docs/spec/20260614_evil_moe_lora_review2.md b/docs/spec/20260614_evil_moe_lora_review2.md new file mode 100644 index 0000000..657b982 --- /dev/null +++ b/docs/spec/20260614_evil_moe_lora_review2.md @@ -0,0 +1,40 @@ +## Verdict + +Yes, the evil-MoE LoRA plan makes conceptual and experimental sense for vGROUT as a distinct ablatable-modularity experiment. It should not be sold as direct evidence for stronger SGTM/Gradient-Routing absorption. The proposal mostly handles this distinction correctly. + +## Makes sense because + +- The core mechanism is coherent: seed a hack expert using only hand-authored synthetic pairs/vectors, let sparse MoE routing specialize during GRPO, then causally test by ablating the hack expert. +- It fits the existing LoRA/AntiPaSTO direction: multiple trainable low-rank paths plus an ablation knob are natural extensions of the current kept/hack adapter structure. +- The no-cheat line is stated clearly: no live oracle/detector labels in training routing; final oracle only for eval. +- The proposal correctly notes that MoE evidence supports specialization, balancing, and stability, not absorption directly. 
+- The UAT is pointed at the right causal claim: hack-expert-off should reduce held-out hack rate more specifically than matched clean/random expert ablation, without destroying solve rate. + +## Main risks + +- The hack expert becomes a general coding/LeetCode expert, so ablation lowers hacks only by damaging capability. +- The router keys off synthetic-pair artifacts rather than hack mechanism: style, length, prompt template, problem family. +- GRPO reward pressure relearns hack behavior in shared/clean experts if hacks improve reward. +- Hard top-k forward routing may undermine SGTM-style absorption because unselected experts are absent from the forward pass. +- Load balancing across clean vs hack could fight the desired asymmetry. If used, balancing should be weak or limited to preventing dead experts. + +## Required edits before implementation + +- Keep the framing strict: call this learned MoE modularization / evil-expert ablation, not a proven absorption booster. +- Wherever the text says a soft/additive version preserves the absorption condition, soften to "more compatible with absorption". Entmax/top-k can still zero paths. +- Specify that any learned router score is trained only from synthetic pairs/vectors or unsupervised LM/GRPO gradients, never live hack labels. +- Define the first implementation scope: Version A hard sparse forward MoE vs Version B soft/additive vs Version C backward-routed. Do not implement all three. +- Add matched-capacity controls before real runs: hack-expert-off, clean-expert-off, random-expert-off, and all-experts-on. + +## Suggested first experiment + +Start with the simplest falsifiable evil-expert test, not the absorption-compatible variant: + +1. Frozen base model plus LoRA experts: `shared`, `clean`, `hack`. +2. Router over expert LoRAs at selected layers, top-1 or top-2. +3. Pin router/expert using only hand-authored synthetic hack-vs-clean pairs or vectors. +4. 
GRPO train on normal rollouts with no live detector/oracle labels touching routing. +5. Eval with final oracle only, comparing all-experts-on vs hack-off vs clean-off vs random-off. +6. Report solve rate, hack rate, reward, router usage on synthetic clean/hack, live GRPO, and held-out hack modes. + +Implementation should proceed only if the proposal is treated as an ablatable behavior-localization experiment. Any phrase implying that MoE specialization evidence is absorption evidence is overclaiming. diff --git a/docs/spec/20260614_moe_absorption_results.md b/docs/spec/20260614_moe_absorption_results.md new file mode 100644 index 0000000..96a57be --- /dev/null +++ b/docs/spec/20260614_moe_absorption_results.md @@ -0,0 +1,138 @@ +# MoE sparsity ideas for increasing gradient-routing absorption + +Verify: "Modern MoE training uses routing/separation mechanisms that may transfer to gradient routing absorption." + +## What SGTM / Gradient Routing mean by absorption + +## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* — [docs/papers/grad_routing/paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md) +- page date / last updated: not stated + +> We apply gradient routing to the problem of scalable oversight (Amodei et al., 2016), where the aim is to train a performant policy despite limited access to reliable labels. We train a policy network by reinforcement learning to navigate to two kinds of grid squares in a toy environment, Diamond and Ghost. Using gradient routing, we localize modules responsible for these two behaviors. We show that we can steer the policy towards Diamond by ablating the Ghost module. Gradient routing trains steerable networks even when the amount of labeled training data is small (1%), and even when the policy is able to condition on the existence of labels. As a result, our method outperforms baselines based on behavioral supervision alone. 
**Throughout, we find evidence of an absorption effect, where gradient routing applied to narrow data localizes capabilities relevant to a broader superset of data. Absorption answers the question “if one has labels that are suitable for localizing undesirable computation, why not use those labels to filter the data?” When labels do not encompass all training data from which harmful capabilities arise (Zhu et al., 2009), filtering may be inadequate (Welbl et al., 2021), whereas absorption means that localization can still occur.** Furthermore, localization influences model internals without modifying the loss function. This can enable scalable oversight when perfect supervision is not feasible. + +epistemic context: local copy of the Gradient Routing paper; this is the paper's own high-level statement of absorption. + +## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* — [docs/papers/grad_routing/paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md) +- page date / last updated: not stated + +> Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. 
**To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model’s predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere.** Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate. + +epistemic context: local copy of the Discussion section; this is the most explicit mechanism sketch for absorption. + +## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* — [docs/papers/grad_routing/paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md) +- page date / last updated: not stated + +> Gradient routing decouples the localization of learning from the localization of computation. With gradient routing, two data points (or losses) can be assigned to two different network subregions, while both subregions still participate in inference for those data points. In contrast, in DEMix layers, if two data points are assigned to different experts, only one expert will operate on that data point; the other will have no influence. **This is a critical difference because separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption (see section 5), which requires that all features are present at the time of the forward pass.** + +epistemic context: local copy of the appendix comparison against DEMix; this is the strongest quote on what breaks absorption. 
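
The difference between localizing learning and localizing computation can be illustrated with a toy two-region linear model. This is a minimal sketch with hand-derived gradients in plain Python. The model, `routed_step`, `demix_step`, and `run` are illustrative constructions, not code from either paper, and the choice to send unlabeled data to expert `a` in the DEMix-style case is an assumption about how a hard gate might dispatch it. It only illustrates the posited mechanism; it is not evidence for it.

```python
def routed_step(w, x, y, region, lr=0.1):
    """Gradient-routing-style step: both regions run in the forward pass.

    `region` names the only parameter allowed to update (labeled data);
    `region=None` means unlabeled data, which updates both parameters.
    """
    err = (w["a"] + w["b"]) * x - y
    grad = err * x  # gradient of 0.5 * err**2, identical for w["a"] and w["b"]
    targets = ("a", "b") if region is None else (region,)
    return {k: (v - lr * grad if k in targets else v) for k, v in w.items()}


def demix_step(w, x, y, expert, lr=0.1):
    """DEMix-style step: only the assigned expert runs and only it updates."""
    err = w[expert] * x - y
    grad = err * x
    return {k: (v - lr * grad if k == expert else v) for k, v in w.items()}


def run(steps=200):
    """Train on y = 2x: first labeled data sent to region b, then unlabeled data."""
    x, y = 1.0, 2.0
    routed = {"a": 0.0, "b": 0.0}
    demix = {"a": 0.0, "b": 0.0}

    for _ in range(steps):  # labeled forget-like data: routed to region / expert b
        routed = routed_step(routed, x, y, region="b")
        demix = demix_step(demix, x, y, expert="b")

    for _ in range(steps):  # unlabeled data carrying the same capability
        routed = routed_step(routed, x, y, region=None)  # b already explains it
        demix = demix_step(demix, x, y, expert="a")      # assumed dispatch to a

    return routed, demix


if __name__ == "__main__":
    routed, demix = run()
    print("gradient routing:", routed)
    print("DEMix-style:     ", demix)
```

In this toy, the routed model leaves `w["a"]` near zero because region `b` is present in the forward pass and already removes the prediction error, so ablating `b` removes the capability. In the hard-dispatch case expert `a` never sees `b`'s contribution and relearns the same mapping, which matches point (b) of the quoted passage.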
+ +## *Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs* — [docs/papers/grad_routing/paper_sgtm.md](../papers/grad_routing/paper_sgtm.md) +- page date / last updated: not stated + +> To understand the mechanism underlying SGTM’s robustness to label noise, we hypothesize that the model develops self-reinforcing knowledge localization. Once the model begins localizing forget knowledge based on labeled examples (where we explicitly mask gradients), we expect that unlabeled forget samples (D_forget ∩ D_unlabeled) would naturally gravitate toward using forget parameters, thereby sending stronger gradient signals to those parameters even without explicit masking. To test this hypothesis, we analyze gradient norms from a SGTM model trained on the bilingual TinyStories dataset under perfect labeling conditions. **The top row demonstrates clear specialization: forget data primarily updates forget parameters (left), while retain data primarily updates retain parameters (right). The bottom-left panel shows that forget weights receive substantially stronger updates from unlabeled forget data compared to unlabeled retain data, confirming the self-reinforcing localization hypothesis.** + +epistemic context: local copy of the SGTM gradient-norm analysis; this is direct empirical support for self-reinforcing absorption. + +## *Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs* — [docs/papers/grad_routing/paper_sgtm.md](../papers/grad_routing/paper_sgtm.md) +- page date / last updated: not stated + +> Figure 5: (a) Leakage is quantified via equivalent standard training comparison with variable number of forget tokens added to the data mix. The baseline curve (blue) maps the relationship between forget token exposure and forget loss established by training models on all retain data with increasing amounts of forget tokens added. 
Each blue point represents a model trained with standard training procedure with a given number of forget tokens added to the training dataset. For a given SGTM run (orange) we then take its forget loss and find the number of forget tokens that would achieve the same loss when added to the data mix in standard training (965k). The leakage is then computed by normalizing this number by the total number of (unlabeled) forget tokens in SGTM run. **(b) Leakage decreases with model scale. Values denote the ratio of leaked information (measured in forget token exposure) to total undiscovered forget tokens, ranging between 0 (no leakage) and 1 (all information leaked). Larger models consistently exhibit lower leakage rates, with the 64M model maintaining leakage below 0.02 for up to 40% undiscovered forget data.** + +epistemic context: local copy of the SGTM leakage section; this is the paper's operationalization of non-absorption as leakage. + +## Modern MoE mechanisms that look relevant + +## *DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models* — [https://aclanthology.org/2024.acl-long.70/](https://aclanthology.org/2024.acl-long.70/) +- page date / last updated: August 2024 + +> In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. **In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. 
It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts.** Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 × expert parameters and computation. + +epistemic context: ACL Anthology abstract page for the DeepSeekMoE paper. + +## *Mixture-of-Experts with Expert Choice Routing* — [https://arxiv.org/abs/2202.09368](https://arxiv.org/abs/2202.09368) +- page date / last updated: not stated + +> previous sparsely gated networks introduce additional auxiliary losses as regularization to prevent too many tokens being routed to a single expert, but the effectiveness is still limited. Recent approaches explore alternative strategies for routing, but they focus on pre-training only and do not demonstrate performance gain on downstream tasks. Moreover, none of the previous methods consider allocating a variable number of experts to each token based on importance, which can be beneficial. We propose a very simple yet effective routing method we are calling expert choice. Unlike conventional MoE where tokens select one or two top-scoring experts, our method lets each expert pick the top-k tokens. **Our method guarantees perfect load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance as demonstrated in our experiments.** + +epistemic context: arXiv abstract/introduction text for the Expert Choice routing paper. 
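
As a minimal illustration of the expert-choice idea quoted above ("each expert pick[s] the top-k tokens"), the following sketch uses plain Python lists. The function names, the `capacity` parameter, and the score values are hypothetical; the paper's actual router computes token-to-expert affinities with a learned projection and softmax, which is omitted here.

```python
def expert_choice_assign(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert independently selects its `capacity` highest-scoring tokens.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


def experts_per_token(assignment, num_tokens):
    # Count how many experts claimed each token.
    counts = [0] * num_tokens
    for tokens in assignment.values():
        for t in tokens:
            counts[t] += 1
    return counts


# Made-up affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.8],
    [0.7, 0.1],
    [0.2, 0.6],
    [0.1, 0.3],
]
assignment = expert_choice_assign(scores, capacity=2)
counts = experts_per_token(assignment, len(scores))
```

Every expert ends up with exactly `capacity` tokens, which is the load-balancing guarantee, while tokens receive a variable number of experts: here token 0 is claimed by both experts and token 3 by none.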
+ +## *BASE Layers: Simplifying Training of Large, Sparse Models* — [https://arxiv.org/abs/2103.16716](https://arxiv.org/abs/2103.16716) +- page date / last updated: not stated + +> We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. **In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.** This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyper-parameters or auxiliary losses. + +epistemic context: arXiv abstract for the BASE layers paper. + +## *Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity* — [https://arxiv.org/abs/2101.03961](https://arxiv.org/abs/2101.03961) +- page date / last updated: not stated + +> A Differentiable Load Balancing Loss. To encourage a balanced load across experts we add an auxiliary loss. **For each Switch layer, this auxiliary loss is added to the total model loss during training.** Given N experts indexed by i = 1 to N and a batch B with T tokens, the auxiliary loss is computed as the scaled dot-product between vectors f and P. + +epistemic context: arXiv mechanism section from the Switch Transformer paper. + +## *Designing Effective Sparse Expert Models* — [https://arxiv.org/abs/2202.08906](https://arxiv.org/abs/2202.08906) +- page date / last updated: not stated + +> 1. A large-scale study of the quality-stability trade-offs of stability techniques. **2. An introduction of the router z-loss that resolves instability issues, while slightly improving model quality.** 3. 
A fine-tuning analysis of sparse and dense models highlighting different hyperparameter sensitivity to the batch size and learning rate. + +epistemic context: arXiv contribution list for the ST-MoE paper. + +## *huggingface/transformers: switch_transformers* — [https://github.com/huggingface/transformers/blob/main/src/transformers/models/switch_transformers/modeling_switch_transformers.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/switch_transformers/modeling_switch_transformers.py) +- page date / last updated: 2026-05-12 + +> def router_z_loss_func(router_logits: torch.Tensor) -> float: +> r""" +> Compute the router z-loss implemented in PyTorch. +> +> ** It encourages router logits to remain small in an effort to improve stability.** +> +> Args: +> router_logits (`float`): + +epistemic context: reference implementation comment in Hugging Face Transformers. + +## Transfer judgment + +### YES, if the additive forward path is preserved: shared-path + routed-path split +Why: Gradient routing says absorption "requires that all features are present at the time of the forward pass." DeepSeekMoE's "isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts" is the closest MoE analogue. A direct port is: keep a shared always-on path for common capability, and reserve one or more quarantine paths for specialized residuals, while still keeping those quarantined features available in the forward pass on related non-routed examples. +Caveat: this is only compatible with absorption if the quarantine path is additive / in-graph, not hard-switched off. + +### MAYBE, if you have multiple quarantine experts: fine-grained quarantine segmentation +Why: DeepSeekMoE's "finely segmenting the experts into mN ones and activating mK from them" supports the narrower claim that finer partitioning can improve specialization and reduce redundancy. 
In gradient routing terms, this suggests splitting one quarantine LoRA/subspace into several smaller quarantine blocks so specialization pressure is not all forced into one block. +Caveat: the quote supports specialization, not absorption directly, so this is a plausible transfer rather than a direct implication. + +### MAYBE: expert-choice or balanced-assignment routing, but only inside the quarantine bank +Why: Expert Choice "lets each expert pick the top-k tokens" and BASE uses a linear assignment with equal load. Those are strong ways to stop routing collapse. A plausible port is reverse routing where each quarantine subspace claims the gradients most aligned with it, or an OT-style assignment that spreads hack gradients across several quarantine slots. +Caveat: if applied across retain vs quarantine globally, this can fight absorption, because absorption wants related unlabeled gradients to keep flowing into the quarantine rather than be rebalanced away from it. + +### MAYBE: load-balancing auxiliary loss +Why: Switch adds an auxiliary loss to encourage balanced expert use. This can help if your failure mode is that one quarantine expert hogs all traffic and the others stay dead. +Caveat: for a 2-way retain/quarantine split, generic balancing is the wrong objective. Absorption is asymmetric: you usually want unlabeled hack gradients to over-index on quarantine, not to be evenly spread. + +### MAYBE: router z-loss / logit-scale control +Why: ST-MoE and the HF implementation both say z-loss is for router stability. If your gradient router uses logits, temperatures, or soft assignments, z-loss could reduce brittle overconfidence or early collapse and make specialization more stable. +Caveat: this is a training-stability trick, not an absorption mechanism by itself. + +### NO for absorption: hard forward sequestering of experts +Why: Gradient Routing explicitly says DEMix-style separation "does not allow for absorption ... 
which requires that all features are present at the time of the forward pass." So classic hard MoE expert isolation is the wrong transplant if the goal is stronger absorption. It may increase specialization while decreasing the very cross-example reuse that absorption needs. + +## Best current take +- Most promising direct transplant: DeepSeekMoE's shared-expert isolation idea, but applied as shared always-on pathway plus routed quarantine pathways. +- Most promising if you want several hack submodes: fine-grained quarantine experts, possibly with expert-choice or assignment only within the quarantine bank, but this is still a specialization-to-absorption extrapolation. +- Useful support term, not main idea: z-loss on routing logits. +- Probably wrong if used naively: global load balancing between retain and quarantine, or any hard forward MoE switch that removes the quarantine path from normal examples. + +## What I would actually try next +1. Keep the current additive forward path. +2. Split the routed quarantine block into K small quarantine experts. +3. Add one shared always-on expert/path for common gradients. +4. Route by backward alignment, not hard forward dispatch. +5. If the K quarantine experts collapse, add either expert-choice-within-quarantine or a weak balancing penalty only over the K quarantine experts. +6. If training is unstable, add a small z-loss on routing logits. + +## Epistemic summary +- Who says X: Gradient Routing and SGTM define absorption as localization from narrow labels to a broader superset, and explain that it depends on features remaining available in the forward pass. MoE papers describe mechanisms for expert specialization, balancing, and routing stability. +- How they could know: Gradient Routing and SGTM have direct experiments on absorption/leakage. The MoE papers report their own architecture/routing mechanisms and training behavior. +- Entanglement check: the gradient-routing claims come from two closely related papers. 
The MoE claims are spread across several mostly independent lines: Switch/ST-MoE, Expert Choice, BASE, DeepSeekMoE. +- Hard-to-vary check: the strongest negative constraint is hard to vary. If a mechanism removes routed features from the forward pass, it conflicts with the explicit Gradient Routing absorption story. That makes naive hard-MoE transfer weak. +- What would change my mind: evidence that hard expert isolation still improves absorption in a setting where unlabeled related examples must reuse quarantined features, or evidence that balancing losses between retain/quarantine reduce leakage rather than merely equalizing traffic. +- Calibrated take: p ≈ 0.65-0.8 that some MoE tricks transfer, but mainly the shared-vs-specialized and within-quarantine assignment ideas. p ≈ 0.1-0.2 that naive hard MoE routing improves absorption; the local Gradient Routing paper argues against it pretty directly. diff --git a/docs/spec/20260614_moe_absorption_review.md b/docs/spec/20260614_moe_absorption_review.md new file mode 100644 index 0000000..30e129f --- /dev/null +++ b/docs/spec/20260614_moe_absorption_review.md @@ -0,0 +1,15 @@ +## Verdict +Partly supported. The note does carry the load-bearing Gradient Routing constraint, but the strongest positive transfer claims go beyond what the quoted MoE evidence itself shows. Most MoE quotes here support specialization, balancing, or stability, not absorption. + +## Observations +- You did not miss the negative constraint. It is explicit in `### YES: shared-path + routed-path split`, `### NO for absorption: hard forward sequestering of experts`, and `## Epistemic summary` via `requires that all features are present at the time of the forward pass`. +- `### YES: shared-path + routed-path split` is only partly supported by the quoted evidence. Gradient Routing supports the forward-pass requirement, and DeepSeekMoE supports `shared ones, aiming at capturing common knowledge`. 
But `That should preserve the load-bearing forward-pass condition` is only true if the `routed quarantine paths` also remain available on non-routed examples. +- `### YES, if you have multiple quarantine experts: fine-grained quarantine segmentation` overreaches the quote. DeepSeekMoE supports `more flexible combination` and specialization, not `different hack modes can absorb into different blocks`. +- `### MAYBE: expert-choice or balanced-assignment routing, but only inside the quarantine bank` is supported only as anti-collapse / load-balancing. The quoted support is `perfect load balancing` and `equal number of tokens`, not absorption or transfer. +- `### MAYBE: load-balancing auxiliary loss` is also only supported as balancing. The quote only says `encourage a balanced load across experts`. +- `### MAYBE: router z-loss / logit-scale control` is correctly scoped. The quotes only support stability, and your caveat says `not an absorption mechanism by itself`. + +## Most likely overreach +- `### YES, if you have multiple quarantine experts: fine-grained quarantine segmentation` +- The phrase `so different hack modes can absorb into different blocks instead of interfering in one monolith` +- Secondarily, in `### YES: shared-path + routed-path split`, the phrase `That should preserve the load-bearing forward-pass condition` is too strong unless those quarantine features stay present in the forward pass for non-routed examples too. diff --git a/docs/spec/20260614_moe_absorption_search.md b/docs/spec/20260614_moe_absorption_search.md new file mode 100644 index 0000000..7fe166a --- /dev/null +++ b/docs/spec/20260614_moe_absorption_search.md @@ -0,0 +1,50 @@ +# MoE sparsity ideas for increasing gradient routing absorption + +## Goal +Understand absorption and leakage in Gradient Routing / SGTM from the local grad-routing papers, then search for modern MoE specialization and routing mechanisms that might transfer to gradient routing to increase absorption. 
+ +## Scope +In: local paper reading, local-first literature/code search, quote-anchored evidence, transfer judgment. +Out: code changes, experiments, implementation. + +## Requirements +- R1: Capture how Gradient Routing / SGTM define or explain absorption, leakage, and specialization. Done means: verbatim quotes with context from local papers. VERIFY: note contains source-attributed quotes from `docs/papers/grad_routing/` on absorption/leakage. +- R2: Capture modern MoE techniques that encourage expert separation, sparse routing, or lower overlap. Done means: verbatim quotes with context from papers/code/docs. VERIFY: note contains source-attributed quotes describing the mechanism, not paraphrase. +- R3: Judge whether each MoE mechanism plausibly transfers to increase absorption in gradient routing. Done means: each candidate has yes/maybe/no plus mechanism-level reason tied back to R1/R2 quotes. VERIFY: every judgment cites both a gradient-routing quote and an MoE quote. + +## Tasks +- [x] T1 (R1): Read SGTM and Gradient Routing papers. + - verify: `rg -n "absorption|leakage|specialization|gradient norms|self-reinforcing" docs/papers/grad_routing/*.md` + - success: local quotes identify the claimed mechanism and limits. + - likely_fail: quote lacks left/right context or is not verbatim. + - sneaky_fail: we use quotes about unlearning/localization generally, not absorption specifically. + - UAT: "when I open the note, I can read the exact paper text on absorption/leakage" +- [x] T2 (R2): Fan out local-first search subagents for MoE separation/routing methods. + - verify: subagent outputs contain varglight-format quotes with source + epistemic note. + - success: hits mention concrete mechanism like aux loss, balancing, entropy, top-k, capacity, noise, or assignment. + - likely_fail: generic MoE summaries with no verbatim quotes. + - sneaky_fail: sources are all downstream summaries of one paper. 
+ - UAT: "when I inspect the collected hits, each one is a copy-pasteable quote with source" +- [x] T3 (R3): Deduplicate and write a mapped judgment note. + - verify: note lists candidates with yes/maybe/no and cites quote blocks. + - success: transfer judgments are mechanism-level and concise. + - likely_fail: unsupported brainstorm list. + - sneaky_fail: we recommend methods that optimize a different failure mode than absorption. + - UAT: "when I read the final note, I can see which MoE tricks are worth trying and why" + +## Context +- User wants varglight format for every subagent hit. +- Local-first search priority: qmd, local-search, gh, lesswrong, arxiv, semantic-search, then web fallback if thin. +- Budget per subagent: about 6 tool calls, one round per tool, then return PARTIAL. + +## Log +- 2026-06-14: Loaded `varglight` skill. It requires verbatim quotes with surrounding context, source attribution, and one-line epistemic context; no paraphrase inside quote blocks. +- 2026-06-14: Parallel subagent fan-out returned useful arXiv, GitHub, local-search, LessWrong, and semantic-search hits. `qmd` timed out twice under the time budget, so local-first coverage is good but not exhaustive. +- 2026-06-14: Wrote consolidated note to `docs/spec/20260614_moe_absorption_results.md` and ran a fresh-eyes reviewer subagent. Review said the main overreach was claiming fine-grained segmentation helps absorption directly; toned this down to a `MAYBE` specialization transfer. + +## TODO +- If promising candidates emerge, design a follow-up experiment spec. + +## Errors +| Task | Error | Resolution | +|------|-------|------------|