Activation-routing documentation audit

Goal

Update current project documentation to describe the implemented activation-scored routing method rather than the retired live-gradient-scored method. Prioritize README.md and AGENTS.md.

Scope

In: current, first-party documentation whose method descriptions contradict routeA. Out: historical journals, archived scripts, vendor documentation, prior-work quotations, and old result narratives that explicitly describe earlier experiments.

Requirements

  • R1: README.md describes v_act extraction, activation scoring, and branch detachment without claiming the current method scores or modifies live gradients. Done means: all current-method claims agree with src/vgrout/train.py and src/vgrout/extract_vhack_act.py.
  • R2: AGENTS.md gives future agents the same current-method model while retaining accurate background descriptions of Gradient Routing and SGTM. Done means: stale vec -> gradient cosine instructions are replaced, while quoted prior work remains unchanged.
  • R3: directly related current docs are audited for the same stale claims. Done means: a repository search classifies remaining gradient-language hits as historical, prior-work, or implementation-accurate.

Tasks

  • T1 (R1, R2): Audit code and prioritized docs.
    • verify: compare terminology against src/vgrout/train.py, src/vgrout/extract_vhack_act.py, and src/vgrout/lora2r.py.
    • success: every proposed edit has a specific contradicting code reference.
    • likely_fail: broad replacement corrupts prior-work descriptions.
    • sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
    • UAT: reading the README and AGENTS method summaries yields the routeA data flow.
  • T2 (R1, R2, R3): Edit current docs.
    • verify: git diff --check and focused stale-term search.
    • success: current-method stale claims are absent from prioritized docs.
    • likely_fail: stale v_grad or live-gradient cosine claims remain.
    • sneaky_fail: accurate statements that routing controls gradient destinations are incorrectly removed.
    • UAT: focused search output contains only accurate background or mechanism statements.
  • T3 (R1, R2, R3): Fresh-eyes review and proof.
    • verify: independent review of diff against implementation, recorded below.
    • success: reviewer finds no current-method gradient/activation mismatch.
    • likely_fail: reviewer identifies a stale or overcorrected claim.
    • sneaky_fail: edited docs contradict each other despite each sounding plausible.
    • UAT: this file contains verification output and review result.

Context

Current routeA data flow:

  1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
  2. Define each module's v_act as the normalized mean hack-minus-clean activation.
  3. Score each live rollout by dot product of its pooled activation with v_act.
  4. Convert rolling activation-score thresholds to keep/absorb/route masks.
  5. Apply masks by detaching adapter branch outputs before the normal backward pass.
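Steps 2 to 4 above can be sketched in plain Python. This is a minimal illustration, not the code in src/vgrout: the function names, the `RollingRouter` class, the quantile levels, and the mapping from score ranges to keep/absorb/route are all assumptions made for the example.

```python
import math
from collections import deque


def extract_v_act(hack_acts, clean_acts):
    """Step 2: normalized mean hack-minus-clean activation for one module."""
    diffs = [
        [h - c for h, c in zip(hack, clean)]
        for hack, clean in zip(hack_acts, clean_acts)
    ]
    mean_diff = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("hack and clean activations have the same mean")
    return [x / norm for x in mean_diff]


def score(pooled_act, v_act):
    """Step 3: dot product of a rollout's pooled activation with v_act."""
    return sum(a * v for a, v in zip(pooled_act, v_act))


class RollingRouter:
    """Step 4 (hypothetical rule): thresholds are quantiles of recent scores."""

    def __init__(self, capacity=256, warmup=8, lo_q=0.5, hi_q=0.9):
        self.buffer = deque(maxlen=capacity)  # rolling window of past scores
        self.warmup = warmup                  # assumed distinct from capacity
        self.lo_q = lo_q
        self.hi_q = hi_q

    def _quantile(self, q):
        ordered = sorted(self.buffer)
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    def assign(self, s):
        """Label one score, then add it to the rolling window."""
        if len(self.buffer) < self.warmup:
            label = "keep"  # assumed default until enough scores are seen
        else:
            lo = self._quantile(self.lo_q)
            hi = self._quantile(self.hi_q)
            label = "route" if s >= hi else "absorb" if s >= lo else "keep"
        self.buffer.append(s)
        return label
```

Nothing in this sketch touches a gradient: the label is fixed from forward-pass activations alone, before any backward pass runs.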

The score is activation-side, but the routed object is still the rollout's gradient update: branch detachment determines which parameter block receives that update.
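A toy example makes the distinction concrete. The sketch below models two scalar adapter branches whose outputs are summed, with the gradient written out by hand; setting a detach flag stands in for calling `detach()` on that branch's output in an autograd framework. The branch names `main` and `side` and the squared-error loss are assumptions for illustration only and do not correspond to the parameter blocks in src/vgrout/lora2r.py.

```python
def branch_grads(x, target, w_main, w_side, detach_main=False, detach_side=False):
    """Return (output, grad_w_main, grad_w_side) for loss 0.5 * (y - target)**2.

    Detaching a branch leaves the forward value unchanged but blocks the
    gradient from reaching that branch's weight.
    """
    y = w_main * x + w_side * x  # forward pass is identical either way
    err = y - target             # dLoss/dy
    grad_main = 0.0 if detach_main else err * x
    grad_side = 0.0 if detach_side else err * x
    return y, grad_main, grad_side


# Same rollout, same loss, different destination for the update.
print(branch_grads(2.0, 1.0, 0.5, 0.25))                    # both branches updated
print(branch_grads(2.0, 1.0, 0.5, 0.25, detach_main=True))  # only the side branch updated
```

The gradient itself is never scored or edited here; the mask only decides which block is allowed to receive it.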

Log

  • routeA scores pooled deployed-block bottleneck activations against v_act; output masks then determine the destination of the normal GRPO gradient update.
  • docs/writeup/main.tex and docs/results.md contain routeV evidence, so they were marked historical rather than mechanically rewritten as routeA evidence.
  • docs/human_journal.md already had user edits and remains untouched.
  • Fresh-eyes review found six documentation inaccuracies, all of which were corrected:
    • prevention was overclaimed;
    • uncontrolled routing mass was described as matched;
    • activation capture was described at rank 2r instead of the deployed rank r;
    • warmup was conflated with full buffer capacity;
    • pair labels were described as no labels;
    • precision preference was described as implemented.

TODO

Errors

| Task | Error | Resolution |
|------|-------|------------|

Results

| Claim | Documentation | Implementation proof |
|-------|---------------|----------------------|
| Direction source is activation-side | README.md:37-45, AGENTS.md:109-114 | src/vgrout/extract_vhack_act.py:71-102 |
| Live score is pooled activation dot v_act | README.md:47-50, AGENTS.md:109-110 | src/vgrout/train.py:276, src/vgrout/train.py:841 |
| Score-selected masks route the later gradient update | README.md:52-56, AGENTS.md:169-178 | src/vgrout/train.py:847-849, src/vgrout/lora2r.py:75-83 |
| RouteV evidence is not relabeled as routeA | README.md:105-111 | docs/results.md:1-6, docs/writeup/main.tex:1-3 |

Verify

git diff --check produced no output.

Focused search over README.md and AGENTS.md found no current-method v_grad, gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's explicit statement that the paper/results are historical routeV evidence.
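A focused search of this kind could be scripted roughly as follows. This is a sketch, not the command that was actually run: the pattern set is an assumption built from the stale terms named above, and `find_stale_hits` is a hypothetical helper.

```python
import re

# Hypothetical patterns for the stale terms listed in this audit.
STALE_PATTERNS = {
    "v_grad": re.compile(r"\bv_grad\b"),
    "gradient-cosine": re.compile(r"\bgradient[- ]cosine\b", re.IGNORECASE),
    "frozen-B": re.compile(r"\bfrozen-B\b"),
    "routeV": re.compile(r"\brouteV\b"),
}


def find_stale_hits(text):
    """Return (line_number, term, line) for every stale-term match in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for term, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, term, line.strip()))
    return hits


sample = "routeA scores activations.\nOld: v_grad cosine.\nHistorical routeV evidence.\n"
for line_no, term, line in find_stale_hits(sample):
    print(f"{line_no}: [{term}] {line}")
```

Each hit still has to be classified by a reader as historical, prior-work, or implementation-accurate; the script only locates candidates. To scan a real file, pass it the result of `pathlib.Path("README.md").read_text()`.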

Failure mode check

  • likely_fail: stale prioritized-doc gradient scoring would appear in the focused search. Actual: no stale current-method hit. PASS.
  • sneaky_fail: activation wording could hide gradient surgery. Actual: docs say output masks route the subsequent update, matching train.py:895-896 and lora2r.py:75-83. PASS.
  • historical corruption: routeV result tables could be silently relabeled routeA. Actual: tables remain routeV and now carry historical headers. PASS.

Review

Fresh-eyes subagent review initially found six inaccuracies, all corrected before final verification. The external review, recorded in docs/reviews/20260611_activation_docs_review.md, found no remaining activation/gradient, historical-evidence, or lora2r shape/mask mismatch.

Files

  • /workspace/projected_grpo/README.md
  • /workspace/projected_grpo/AGENTS.md
  • /workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md
  • /workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md

Next

Done.