mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
# Activation-routing documentation audit

## Goal
Update current project documentation to describe the implemented activation-scored
routing method rather than the retired live-gradient-scored method. Prioritize
`README.md` and `AGENTS.md`.

## Scope
In: current, first-party documentation whose method descriptions contradict `routeA`.
Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
and old result narratives that explicitly describe earlier experiments.

## Requirements
- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
  without claiming the current method scores or modifies live gradients.
  Done means: all current-method claims agree with `src/vgrout/train.py` and
  `src/vgrout/extract_vhack_act.py`.
- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
  background descriptions of Gradient Routing and SGTM.
  Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
  work remains unchanged.
- R3: directly related current docs are audited for the same stale claims.
  Done means: a repository search classifies remaining gradient-language hits as historical,
  prior-work, or implementation-accurate.

## Tasks
- [x] T1 (R1, R2): Audit code and prioritized docs.
  - verify: compare terminology against `src/vgrout/train.py`,
    `src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
  - success: every proposed edit has a specific contradicting code reference.
  - likely_fail: broad replacement corrupts prior-work descriptions.
  - sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
  - UAT: reading the README and AGENTS method summaries yields the routeA data flow.
- [x] T2 (R1, R2, R3): Edit current docs.
  - verify: `git diff --check` and focused stale-term search.
  - success: current-method stale claims are absent from prioritized docs.
  - likely_fail: stale `v_grad` or live-gradient cosine claims remain.
  - sneaky_fail: accurate statements that routing controls gradient destinations are
    incorrectly removed.
  - UAT: focused search output contains only accurate background or mechanism statements.
- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
  - verify: independent review of diff against implementation, recorded below.
  - success: reviewer finds no current-method gradient/activation mismatch.
  - likely_fail: reviewer identifies a stale or overcorrected claim.
  - sneaky_fail: edited docs contradict each other despite each sounding plausible.
  - UAT: this file contains verification output and review result.

## Context
Current routeA data flow:
1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
3. Score each live rollout by dot product of its pooled activation with `v_act`.
4. Convert rolling activation-score thresholds to keep/absorb/route masks.
5. Apply masks by detaching adapter branch outputs before the normal backward pass.

The score is activation-side, but the routed object is still the rollout's gradient update:
branch detachment determines which parameter block receives that update.
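Steps 2 to 4 of the data flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/vgrout/`: the names `extract_v_act`, `score`, and `RollingRouter`, the quantile values, the warmup behaviour, and the mapping of low scores to keep, middle scores to absorb, and high scores to route are all hypothetical.

```python
import math
from collections import deque


def extract_v_act(hack_acts, clean_acts):
    """Step 2: normalized mean hack-minus-clean pooled activation."""
    n, dim = len(hack_acts), len(hack_acts[0])
    mean_diff = [
        sum(h[i] - c[i] for h, c in zip(hack_acts, clean_acts)) / n
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    return [x / norm for x in mean_diff]


def score(pooled_act, v_act):
    """Step 3: dot product of a rollout's pooled activation with v_act."""
    return sum(a * v for a, v in zip(pooled_act, v_act))


class RollingRouter:
    """Step 4: rolling score thresholds to keep / absorb / route (assumed mapping)."""

    def __init__(self, capacity=64, warmup=4, keep_q=0.5, route_q=0.9):
        self.scores = deque(maxlen=capacity)  # rolling buffer of recent scores
        self.warmup = warmup                  # hypothetical: no routing before this many scores
        self.keep_q, self.route_q = keep_q, route_q

    def _quantile(self, ordered, q):
        # Simple index-based quantile, chosen only for brevity.
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    def assign(self, s):
        self.scores.append(s)
        if len(self.scores) < self.warmup:
            return "keep"
        ordered = sorted(self.scores)
        if s >= self._quantile(ordered, self.route_q):
            return "route"
        if s <= self._quantile(ordered, self.keep_q):
            return "keep"
        return "absorb"


# Toy authored pairs: hack activations differ from clean ones along axis 0.
v_act = extract_v_act(hack_acts=[[2.0, 0.0], [4.0, 0.0]],
                      clean_acts=[[0.0, 0.0], [0.0, 0.0]])
router = RollingRouter()
for s in range(10):
    router.assign(float(s))  # fill the rolling buffer with background scores
```

Note that nothing in this sketch touches a gradient: the score depends only on forward activations, which is the property the audit checks the documentation for.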

## Log

- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
  output masks then determine the destination of the normal GRPO gradient update.
- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
  were marked historical rather than mechanically rewritten as routeA evidence.
- `docs/human_journal.md` already had user edits and remains untouched.
- Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
  uncontrolled routing mass described as matched, rank-2r instead of deployed
  rank-r activation capture, warmup conflated with full buffer capacity, pair
  labels described as no labels, and precision preference described as implemented.
  All six were corrected.

## TODO

## Errors

| Task | Error | Resolution |
|------|-------|------------|

## Results

| Claim | Documentation | Implementation proof |
|---|---|---|
| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
| Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |

## Verify

`git diff --check` produced no output.

Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
explicit statement that the paper/results are historical routeV evidence.
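A focused stale-term search of this kind could look roughly like the sketch below. The four term labels come from the paragraph above; the regular expressions, the function name `find_stale_terms`, and the sample text are illustrative assumptions, not the search that was actually run.

```python
import re

# Hypothetical patterns for the stale terms named in this audit.
STALE_PATTERNS = {
    "v_grad": re.compile(r"\bv_grad\b"),
    "gradient-cosine": re.compile(r"gradient[- ]cosine", re.IGNORECASE),
    "frozen-B": re.compile(r"\bfrozen[- ]B\b"),
    "routeV": re.compile(r"\brouteV\b"),
}


def find_stale_terms(name, text):
    """Return (file name, line number, term label, line) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno, label, line.strip()))
    return hits


# Illustrative input only; in practice the text would be read from the docs.
sample = (
    "Score rollouts by activation dot v_act.\n"
    "Old: vec -> gradient cosine with v_grad.\n"
    "Paper results are historical routeV evidence.\n"
)
hits = find_stale_terms("README.md", sample)
for hit in hits:
    print(hit)
```

The search only surfaces candidates. Deciding whether a hit is historical, prior-work, or implementation-accurate remains a manual reading step, as R3 requires.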

## Failure mode check

- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
  search. Actual: no stale current-method hit. PASS.
- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
  output masks route the subsequent update, matching `train.py:895-896` and
  `lora2r.py:75-83`. PASS.
- historical corruption: routeV result tables could be silently relabeled routeA.
  Actual: tables remain routeV and now carry historical headers. PASS.
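The sneaky_fail check above rests on the difference between editing gradients after the backward pass and detaching branch outputs before it. The PyTorch sketch below illustrates the second option under stated assumptions: the deployed/quarantine block names, the `gate` helper, the squared-output loss, and the reading of "route" as "only the quarantine block trains" are hypothetical and may not match `lora2r.py`.

```python
import torch

torch.manual_seed(0)
d_model, rank = 8, 2

# Hypothetical adapter halves: a deployed block and a quarantine block.
A_dep = torch.randn(d_model, rank, requires_grad=True)
B_dep = torch.randn(rank, d_model, requires_grad=True)
A_quar = torch.randn(d_model, rank, requires_grad=True)
B_quar = torch.randn(rank, d_model, requires_grad=True)
params = [A_dep, B_dep, A_quar, B_quar]


def gate(branch, train_mask):
    """Leave the forward value unchanged; pass gradient only where train_mask is True."""
    m = train_mask.to(branch.dtype).unsqueeze(-1)
    return m * branch + (1 - m) * branch.detach()


def routed_backward(x, train_dep, train_quar):
    """Ordinary loss and backward pass; the masks alone decide who gets the update."""
    for p in params:
        p.grad = None
    out = gate(x @ A_dep @ B_dep, train_dep) + gate(x @ A_quar @ B_quar, train_quar)
    out.pow(2).sum().backward()  # stand-in for the real training loss
    return A_dep.grad.clone(), A_quar.grad.clone()


x = torch.randn(3, d_model)  # three rollouts, one hidden vector each
all_on = torch.ones(3, dtype=torch.bool)
all_off = torch.zeros(3, dtype=torch.bool)

# Assumed "route" case: every rollout updates only the quarantine block.
g_dep, g_quar = routed_backward(x, train_dep=all_off, train_quar=all_on)
```

No gradient is scored, projected, or rewritten here. The backward pass is the normal one, and the deployed block simply receives a zero update for rollouts whose branch output was detached.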

## Review

Fresh-eyes subagent review initially found six inaccuracies, all corrected before
final verification. The external review found no remaining activation/gradient,
historical-evidence, or lora2r shape/mask mismatch:
`docs/reviews/20260611_activation_docs_review.md`.

## Files

- `/workspace/projected_grpo/README.md`
- `/workspace/projected_grpo/AGENTS.md`
- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`

## Next

Done.