# Activation-routing documentation audit
## Goal
Update current project documentation to describe the implemented activation-scored
routing method rather than the retired live-gradient-scored method. Prioritize
`README.md` and `AGENTS.md`.
## Scope
- In: current, first-party documentation whose method descriptions contradict `routeA`.
- Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
  and old result narratives that explicitly describe earlier experiments.
## Requirements
- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
without claiming the current method scores or modifies live gradients.
Done means: all current-method claims agree with `src/vgrout/train.py` and
`src/vgrout/extract_vhack_act.py`.
- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
background descriptions of Gradient Routing and SGTM.
Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
work remains unchanged.
- R3: directly related current docs are audited for the same stale claims.
Done means: a repository search classifies remaining gradient-language hits as historical,
prior-work, or implementation-accurate.
## Tasks
- [x] T1 (R1, R2): Audit code and prioritized docs.
  - verify: compare terminology against `src/vgrout/train.py`,
    `src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
  - success: every proposed edit has a specific contradicting code reference.
  - likely_fail: broad replacement corrupts prior-work descriptions.
  - sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
  - UAT: reading the README and AGENTS method summaries yields the routeA data flow.
- [x] T2 (R1, R2, R3): Edit current docs.
  - verify: `git diff --check` and focused stale-term search.
  - success: current-method stale claims are absent from prioritized docs.
  - likely_fail: stale `v_grad` or live-gradient cosine claims remain.
  - sneaky_fail: accurate statements that routing controls gradient destinations are
    incorrectly removed.
  - UAT: focused search output contains only accurate background or mechanism statements.
- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
  - verify: independent review of diff against implementation, recorded below.
  - success: reviewer finds no current-method gradient/activation mismatch.
  - likely_fail: reviewer identifies a stale or overcorrected claim.
  - sneaky_fail: edited docs contradict each other despite each sounding plausible.
  - UAT: this file contains verification output and review result.
## Context
Current routeA data flow:
1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
3. Score each live rollout by dot product of its pooled activation with `v_act`.
4. Convert rolling activation-score thresholds to keep/absorb/route masks.
5. Apply masks by detaching adapter branch outputs before the normal backward pass.

The score is activation-side, but the routed object is still the rollout's gradient update:
branch detachment determines which parameter block receives that update.
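
Steps 2 through 4 of the data flow above can be sketched in plain Python. This is an illustrative sketch only, not the code in `src/vgrout/train.py` or `src/vgrout/extract_vhack_act.py`. The names `extract_v_act`, `score`, and `RollingRouter`, the quantile parameters, and the assumption that higher scores move a rollout from keep to absorb to route are all hypothetical. Step 5 (branch detachment) is not shown.

```python
import math
from collections import deque


def extract_v_act(hack_acts, clean_acts):
    """Step 2: normalized mean of per-pair (hack - clean) pooled activations."""
    diffs = [[h - c for h, c in zip(hack, clean)]
             for hack, clean in zip(hack_acts, clean_acts)]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(m * m for m in mean))
    if norm == 0.0:
        raise ValueError("mean hack-minus-clean activation is zero")
    return [m / norm for m in mean]


def score(pooled_act, v_act):
    """Step 3: dot product of a rollout's pooled activation with v_act."""
    return sum(a * v for a, v in zip(pooled_act, v_act))


class RollingRouter:
    """Step 4 (hypothetical): rolling quantile thresholds over recent scores.

    Assumption: scores at or above the route quantile are routed, scores at or
    above the absorb quantile are absorbed, everything else is kept. Warmup and
    buffer capacity are separate settings.
    """

    def __init__(self, capacity=256, warmup=32, absorb_q=0.5, route_q=0.9):
        self.scores = deque(maxlen=capacity)  # oldest scores fall off the buffer
        self.warmup = warmup
        self.absorb_q = absorb_q
        self.route_q = route_q

    @staticmethod
    def _quantile(ranked, q):
        # Simple floor-index quantile over an already sorted list.
        return ranked[int(q * (len(ranked) - 1))]

    def assign(self, s):
        self.scores.append(s)
        if len(self.scores) < self.warmup:
            return "keep"  # too few scores to set thresholds yet
        ranked = sorted(self.scores)
        if s >= self._quantile(ranked, self.route_q):
            return "route"
        if s >= self._quantile(ranked, self.absorb_q):
            return "absorb"
        return "keep"
```

In this sketch the label returned by `assign` would select which adapter branch outputs get detached before the backward pass; the score itself never touches a gradient.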
## Log
- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
output masks then determine the destination of the normal GRPO gradient update.
- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
were marked historical rather than mechanically rewritten as routeA evidence.
- `docs/human_journal.md` already had user edits and remains untouched.
- Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
uncontrolled routing mass described as matched, rank-2r instead of deployed
rank-r activation capture, warmup conflated with full buffer capacity, pair
labels described as no labels, and precision preference described as implemented.
All six were corrected.
## TODO
## Errors
| Task | Error | Resolution |
|------|-------|------------|
## Results
| Claim | Documentation | Implementation proof |
|---|---|---|
| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
| Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |
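
The `path:line` and `path:start-end` references in the table above can drift as the code changes. A minimal sketch of a checker follows, assuming references use exactly those two shapes; `parse_ref` and `ref_resolves` are hypothetical helpers, not part of the repository.

```python
import re
from pathlib import Path

# Matches "path:LINE" or "path:START-END" (hypothetical reference format).
REF = re.compile(r"^(?P<path>[^:]+):(?P<start>\d+)(?:-(?P<end>\d+))?$")


def parse_ref(ref):
    """Split a reference into (path, start_line, end_line)."""
    m = REF.match(ref.strip().strip("`"))
    if m is None:
        raise ValueError(f"not a path:line reference: {ref!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if start < 1 or end < start:
        raise ValueError(f"invalid line range in {ref!r}")
    return m["path"], start, end


def ref_resolves(ref, root="."):
    """Return True if the file exists under root and has at least end_line lines."""
    path, _start, end = parse_ref(ref)
    file = Path(root) / path
    if not file.is_file():
        return False
    with file.open(encoding="utf-8") as fh:
        n_lines = sum(1 for _ in fh)
    return end <= n_lines
```

A check like this only confirms that the cited lines exist. Whether they still support the claim needs a human or reviewer read.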
## Verify
`git diff --check` produced no output.
Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
explicit statement that the paper/results are historical routeV evidence.
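
A focused search of this kind can be sketched as a small line scanner. The patterns below are taken from the terms named above, but the exact expressions and the `find_stale_terms` helper are assumptions for illustration, not the command that produced the recorded result.

```python
import re

# Hypothetical patterns for the stale terms named in this audit.
STALE_PATTERNS = {
    "v_grad": re.compile(r"\bv_grad\b"),
    "gradient-cosine": re.compile(r"gradient[- ]cosine", re.IGNORECASE),
    "frozen-B": re.compile(r"frozen[- ]B\b"),
    "routeV": re.compile(r"\brouteV\b"),
}


def find_stale_terms(text):
    """Return (line_number, term_name, stripped_line) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name, line.strip()))
    return hits
```

Each hit still needs manual classification as historical, prior-work, or implementation-accurate; a match alone does not make a line stale.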
## Failure mode check
- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
search. Actual: no stale current-method hit. PASS.
- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
output masks route the subsequent update, matching `train.py:895-896` and
`lora2r.py:75-83`. PASS.
- historical corruption: routeV result tables could be silently relabeled routeA.
Actual: tables remain routeV and now carry historical headers. PASS.
## Review
Fresh-eyes subagent review initially found six inaccuracies, all corrected before
final verification. The external review found no remaining activation/gradient,
historical-evidence, or lora2r shape/mask mismatch. The review record is at
`docs/reviews/20260611_activation_docs_review.md`.
## Files
- `/workspace/projected_grpo/README.md`
- `/workspace/projected_grpo/AGENTS.md`
- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`
## Next
Done.