MoE sparsity ideas for increasing gradient routing absorption

Goal

Understand absorption and leakage in Gradient Routing / SGTM from the local papers in docs/papers/grad_routing/, then search for modern MoE specialization and routing mechanisms that might transfer to gradient routing and increase absorption.

Scope

In: local paper reading, local-first literature/code search, quote-anchored evidence, transfer judgment.
Out: code changes, experiments, implementation.

Requirements

R1: Capture how Gradient Routing / SGTM define or explain absorption, leakage, and specialization. Done means: verbatim quotes with context from local papers. VERIFY: note contains source-attributed quotes from docs/papers/grad_routing/ on absorption/leakage.
R2: Capture modern MoE techniques that encourage expert separation, sparse routing, or lower overlap. Done means: verbatim quotes with context from papers/code/docs. VERIFY: note contains source-attributed quotes describing the mechanism, not paraphrase.
R3: Judge whether each MoE mechanism plausibly transfers to increase absorption in gradient routing. Done means: each candidate has yes/maybe/no plus mechanism-level reason tied back to R1/R2 quotes. VERIFY: every judgment cites both a gradient-routing quote and an MoE quote.
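
The R3 VERIFY condition is mechanical enough to check in code. A minimal sketch in Python, assuming a hypothetical `Judgment` record; the field names (`candidate`, `verdict`, `reason`, `gr_quotes`, `moe_quotes`) are illustrative and not taken from the note format:

```python
from dataclasses import dataclass, field

ALLOWED_VERDICTS = {"yes", "maybe", "no"}


@dataclass
class Judgment:
    """Hypothetical record for one MoE candidate; field names are assumptions."""
    candidate: str
    verdict: str
    reason: str
    gr_quotes: list = field(default_factory=list)   # IDs of gradient-routing quote blocks (R1)
    moe_quotes: list = field(default_factory=list)  # IDs of MoE quote blocks (R2)


def r3_violations(judgments):
    """Return human-readable problems; an empty list means the R3 check passes."""
    problems = []
    for j in judgments:
        if j.verdict.lower() not in ALLOWED_VERDICTS:
            problems.append(f"{j.candidate}: verdict {j.verdict!r} is not yes/maybe/no")
        if not j.reason.strip():
            problems.append(f"{j.candidate}: missing mechanism-level reason")
        if not j.gr_quotes:
            problems.append(f"{j.candidate}: no gradient-routing quote cited")
        if not j.moe_quotes:
            problems.append(f"{j.candidate}: no MoE quote cited")
    return problems
```

This only checks that citations are present. Whether the reason is actually mechanism-level, and whether the cited quotes support it, still needs a human or reviewer pass.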

Tasks

T1 (R1): Read SGTM and Gradient Routing papers.
- verify: rg -n -i "absorption|leakage|specialization|gradient norms|self-reinforcing" docs/papers/grad_routing/*.md
- success: local quotes identify the claimed mechanism and limits.
- likely_fail: quote lacks left/right context or is not verbatim.
- sneaky_fail: we use quotes about unlearning/localization generally, not absorption specifically.
- UAT: "when I open the note, I can read the exact paper text on absorption/leakage"
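
The T1 likely_fail (a quote without left/right context) can be avoided by pulling neighbouring lines together with each match. A rough Python sketch, assuming the papers are plain markdown text; the function name, the `context` parameter, and the case-insensitive matching are assumptions added here:

```python
import re

# Same alternation as the rg pattern in T1; IGNORECASE is an added assumption.
PATTERN = re.compile(
    r"absorption|leakage|specialization|gradient norms|self-reinforcing",
    re.IGNORECASE,
)


def quotes_with_context(text, source, context=1):
    """Yield (source, line_number, excerpt) for each matching line.

    The excerpt is copied verbatim from `text` and includes `context`
    lines on each side, so the quote keeps its left/right context.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            yield source, i + 1, "\n".join(lines[start:end])
```

A keyword match does not guard against the sneaky_fail: a line can mention "specialization" while discussing unlearning or localization in general, so each excerpt still has to be read.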
T2 (R2): Fan out local-first search subagents for MoE separation/routing methods.
- verify: subagent outputs contain varglight-format quotes with source + epistemic note.
- success: hits mention concrete mechanism like aux loss, balancing, entropy, top-k, capacity, noise, or assignment.
- likely_fail: generic MoE summaries with no verbatim quotes.
- sneaky_fail: sources are all downstream summaries of one paper.
- UAT: "when I inspect the collected hits, each one is a copy-pasteable quote with source"
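
The T2 verify and success lines can be approximated with a small filter over collected hits. The sketch below is in Python and assumes a hypothetical `Hit` shape with `quote`, `source`, and `epistemic_note` fields; it is not the actual varglight schema. The single-source check is only a crude proxy for the sneaky_fail, since distinct sources can still all summarize one paper.

```python
from dataclasses import dataclass

# Mechanism terms named in the T2 success criterion, matched literally.
MECHANISM_TERMS = ("aux loss", "balancing", "entropy", "top-k",
                   "capacity", "noise", "assignment")


@dataclass
class Hit:
    """Hypothetical shape of one subagent hit; not the actual varglight schema."""
    quote: str
    source: str
    epistemic_note: str


def check_hits(hits):
    """Return a list of problems found in the hits; empty means none detected."""
    problems = []
    for n, h in enumerate(hits, start=1):
        if not (h.quote.strip() and h.source.strip() and h.epistemic_note.strip()):
            problems.append(f"hit {n}: missing quote, source, or epistemic note")
        if not any(term in h.quote.lower() for term in MECHANISM_TERMS):
            problems.append(f"hit {n}: no concrete mechanism term in quote")
    # Crude proxy for "all downstream of one paper": every hit has the same source.
    if len(hits) > 1 and len({h.source for h in hits}) == 1:
        problems.append("all hits share one source")
    return problems
```

Because the terms are matched literally, a quote that says "auxiliary loss" without "balancing" would be flagged; the term list would need synonyms before being relied on.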
T3 (R3): Deduplicate and write a mapped judgment note.
- verify: note lists candidates with yes/maybe/no and cites quote blocks.
- success: transfer judgments are mechanism-level and concise.
- likely_fail: unsupported brainstorm list.
- sneaky_fail: we recommend methods that optimize a different failure mode than absorption.
- UAT: "when I read the final note, I can see which MoE tricks are worth trying and why"

Context

User wants varglight format for every subagent hit.
Local-first search priority: qmd, local-search, gh, lesswrong, arxiv, semantic-search, then web fallback if thin.
Budget per subagent: about 6 tool calls, one round per tool, then return PARTIAL.
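
The priority order and budget above could be sketched as a simple loop. This is a Python illustration under stated assumptions: the tool callables are hypothetical stand-ins for the real search tools, "about 6" is treated as a hard cap, and the `enough` threshold and the "DONE" status are invented for the example (only PARTIAL comes from this note).

```python
# Priority order from the Context section; web is the fallback.
TOOL_PRIORITY = ["qmd", "local-search", "gh", "lesswrong",
                 "arxiv", "semantic-search", "web"]
MAX_TOOL_CALLS = 6  # "about 6" in the note; treated as a hard cap here


def run_subagent(query, tools, enough=3):
    """Call each tool at most once, in priority order, within the call budget.

    `tools` maps a tool name to a callable returning a list of hits.
    Returns (status, hits), where status is "DONE" or "PARTIAL".
    """
    hits, calls = [], 0
    for name in TOOL_PRIORITY:
        if calls >= MAX_TOOL_CALLS or len(hits) >= enough:
            break
        search = tools.get(name)
        if search is None:
            continue  # tool not available; does not consume budget
        calls += 1
        try:
            hits.extend(search(query))
        except TimeoutError:
            continue  # a timed-out call still counts against the budget
    status = "DONE" if len(hits) >= enough else "PARTIAL"
    return status, hits
```

With a hard cap of 6 and six local tools ahead of it, the web fallback in this sketch is reached only when an earlier tool is unavailable or the loop has not yet spent its budget.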

Log

2026-06-14: Loaded varglight skill. It requires verbatim quotes with surrounding context, source attribution, and one-line epistemic context; no paraphrase inside quote blocks.
2026-06-14: Parallel subagent fan-out returned useful arXiv, GitHub, local-search, LessWrong, and semantic-search hits. qmd timed out twice under the time budget, so local-first coverage is good but not exhaustive.
2026-06-14: Wrote consolidated note to docs/spec/20260614_moe_absorption_results.md and ran a fresh-eyes reviewer subagent. Review said the main overreach was claiming fine-grained segmentation helps absorption directly; toned this down to a MAYBE specialization transfer.

TODO

If promising candidates emerge, design a follow-up experiment spec.

Errors

Task	Error	Resolution
