evil_MoE/docs/spec/20260611_activation_docs_audit.md

# Activation-routing documentation audit

## Goal
Update current project documentation to describe the implemented activation-scored
routing method rather than the retired live-gradient-scored method. Prioritize
`README.md` and `AGENTS.md`.

## Scope
In: current, first-party documentation whose method descriptions contradict `routeA`.
Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
and old result narratives that explicitly describe earlier experiments.

## Requirements
- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
  without claiming the current method scores or modifies live gradients.
  Done means: all current-method claims agree with `src/vgrout/train.py` and
  `src/vgrout/extract_vhack_act.py`.
- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
  background descriptions of Gradient Routing and SGTM.
  Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
  work remains unchanged.
- R3: directly related current docs are audited for the same stale claims.
  Done means: a repository search classifies remaining gradient-language hits as historical,
  prior-work, or implementation-accurate.

## Tasks
- [x] T1 (R1, R2): Audit code and prioritized docs.
  - verify: compare terminology against `src/vgrout/train.py`,
    `src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
  - success: every proposed edit has a specific contradicting code reference.
  - likely_fail: broad replacement corrupts prior-work descriptions.
  - sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
  - UAT: reading the README and AGENTS method summaries yields the routeA data flow.
- [x] T2 (R1, R2, R3): Edit current docs.
  - verify: `git diff --check` and focused stale-term search.
  - success: current-method stale claims are absent from prioritized docs.
  - likely_fail: stale `v_grad` or live-gradient cosine claims remain.
  - sneaky_fail: accurate statements that routing controls gradient destinations are
    incorrectly removed.
  - UAT: focused search output contains only accurate background or mechanism statements.
- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
  - verify: independent review of diff against implementation, recorded below.
  - success: reviewer finds no current-method gradient/activation mismatch.
  - likely_fail: reviewer identifies a stale or overcorrected claim.
  - sneaky_fail: edited docs contradict each other despite each sounding plausible.
  - UAT: this file contains verification output and review result.

## Context
Current routeA data flow:
1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
3. Score each live rollout by the dot product of its pooled activation with `v_act`.
4. Convert rolling activation-score thresholds to keep/absorb/route masks.
5. Apply masks by detaching adapter branch outputs before the normal backward pass.
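
Steps 2 to 4 can be sketched in plain Python as follows. This is a minimal illustration, not the code in `src/vgrout/extract_vhack_act.py` or `src/vgrout/train.py`: the names `extract_v_act`, `activation_score`, and `RollingRouter`, the quantile levels, the buffer capacity, and the warmup rule are all assumptions made for the example.

```python
import math
from collections import deque


def extract_v_act(hack_acts, clean_acts):
    """Normalized mean hack-minus-clean pooled activation for one module."""
    diffs = [[h - c for h, c in zip(hv, cv)]
             for hv, cv in zip(hack_acts, clean_acts)]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


def activation_score(pooled, v_act):
    """Dot product of one rollout's pooled activation with v_act."""
    return sum(p * v for p, v in zip(pooled, v_act))


class RollingRouter:
    """Hypothetical rolling-threshold rule (quantiles and warmup are assumed)."""

    def __init__(self, capacity=64, warmup=4, q_absorb=0.5, q_route=0.9):
        self.scores = deque(maxlen=capacity)  # rolling score buffer
        self.warmup = warmup
        self.q_absorb = q_absorb
        self.q_route = q_route

    def _quantile(self, q):
        ordered = sorted(self.scores)
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    def assign(self, score):
        # Thresholds are computed from scores seen before this rollout.
        if len(self.scores) < self.warmup:
            label = "keep"  # assumed: nothing is routed during warmup
        elif score >= self._quantile(self.q_route):
            label = "route"
        elif score >= self._quantile(self.q_absorb):
            label = "absorb"
        else:
            label = "keep"
        self.scores.append(score)
        return label
```

Everything in this sketch happens on the forward side: no gradient is read or modified when a rollout is scored or labelled.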

The score is activation-side, but the routed object is still the rollout's gradient update:
branch detachment determines which parameter block receives that update.
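
The effect of branch detachment can be illustrated with a small PyTorch sketch. It assumes two adapter blocks and a single 0/1 mask per rollout; the function name `routed_branch_sum` and the tensor names are hypothetical, and the three-way keep/absorb/route masks and the actual layout in `src/vgrout/lora2r.py` are not reproduced here.

```python
import torch


def routed_branch_sum(x, A_keep, B_keep, A_route, B_route, route_mask):
    """Sum two adapter branches, detaching one branch output per rollout.

    route_mask[i] == 1 sends rollout i's update to the route block;
    route_mask[i] == 0 sends it to the keep block.
    Forward values are identical in both cases.
    """
    keep_out = x @ A_keep @ B_keep
    route_out = x @ A_route @ B_route
    m = route_mask.unsqueeze(-1)  # shape (batch, 1), broadcast over features
    # Detach the branch that should NOT receive this rollout's gradient.
    keep_out = m * keep_out.detach() + (1 - m) * keep_out
    route_out = (1 - m) * route_out.detach() + m * route_out
    return keep_out + route_out
```

Because only the outputs are detached, the backward pass itself is the ordinary one: the mask decides which block's parameters sit on the gradient path, not what the gradient is.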

## Log

- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
  output masks then determine the destination of the normal GRPO gradient update.
- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
  were marked historical rather than mechanically rewritten as routeA evidence.
- `docs/human_journal.md` already had user edits and remains untouched.
- Fresh-eyes review found six documentation inaccuracies, all of which were corrected:
  - prevention was overclaimed;
  - uncontrolled routing mass was described as matched;
  - activation capture was described at rank 2r rather than the deployed rank r;
  - warmup was conflated with full buffer capacity;
  - pair labels were described as no labels;
  - a precision preference was described as implemented.

## TODO

## Errors
| Task | Error | Resolution |
|------|-------|------------|

## Results

| Claim | Documentation | Implementation proof |
|---|---|---|
| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
| Live score is the dot product of the pooled activation with `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |

## Verify

`git diff --check` produced no output.

Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
explicit statement that the paper/results are historical routeV evidence.
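
A focused search of this kind can be approximated with a short Python helper. The patterns below are assumptions derived from the stale terms named above, not the exact query that was run; each hit still needs a human decision on whether it is historical, prior-work, or implementation-accurate.

```python
import re

# Hypothetical patterns for the stale terms listed in this section.
STALE_PATTERNS = {
    "v_grad": re.compile(r"\bv_grad\b"),
    "gradient-cosine": re.compile(r"gradient[- ]cosine", re.IGNORECASE),
    "frozen-B": re.compile(r"frozen[- ]B\b"),
    "routeV": re.compile(r"\brouteV\b", re.IGNORECASE),
}


def find_stale_terms(text):
    """Return (line_number, term, line) for every stale-term hit in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, name, line.strip()))
    return hits
```

Running `find_stale_terms` on the contents of `README.md` and `AGENTS.md` yields a list of candidates to classify by hand.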

## Failure mode check

- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
  search. Actual: no stale current-method hit. PASS.
- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
  output masks route the subsequent update, matching `train.py:895-896` and
  `lora2r.py:75-83`. PASS.
- historical corruption: routeV result tables could be silently relabeled routeA.
  Actual: tables remain routeV and now carry historical headers. PASS.

## Review

The fresh-eyes subagent review initially found six inaccuracies, all of which were
corrected before final verification. The external review, recorded in
`docs/reviews/20260611_activation_docs_review.md`, found no remaining activation/gradient,
historical-evidence, or lora2r shape/mask mismatch.

## Files

- `/workspace/projected_grpo/README.md`
- `/workspace/projected_grpo/AGENTS.md`
- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`

## Next

Done.