docs: update method descriptions for activation routing

2026-06-27 16:15:35 +08:00 · 2026-06-11 13:22:13 +00:00
parent 668285c86e
commit ec11bf58b2
9 changed files with 269 additions and 76 deletions
@@ -1,4 +1,7 @@
-# AFK hourly check — current protocol
+# AFK hourly check — archived routeV protocol
+
+This protocol predates the activation-scored routeA method. Do not use its routeV
+arm instructions for current experiments.

 LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
 This doc holds the durable rules. The live plan lives in the task list (the
@@ -1,4 +1,8 @@
-# v_hack extraction: gradient-space SVD with magnitudes
+# Historical: v_hack extraction in gradient space
+
+This document describes the retired gradient-scored method. The current routeA
+method extracts `v_act` from pooled bottleneck activations with forward passes;
+see `README.md` and `src/vgrout/extract_vhack_act.py`.

 Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
 This explains *what we extract* and *why*.
@@ -98,8 +98,9 @@ cached directions cannot silently load against a changed subset.

 ## Wave 2 (2026-06-11): 15 `behavior2_*` pairs

-Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
-classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
+Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8
+`behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85),
+but n=8 is fragile. That diagnostic predates the current routeA activation gate. Wave 2
 keeps the load-bearing properties (concrete code action, gaming a verification
 proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
 permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
@@ -1,4 +1,9 @@
-# Results, organized by the question each run answers
+# Historical routeV results, organized by the question each run answers
+
+These results describe the retired gradient-scored routeV method. They remain
+valid evidence about those runs, but they are not results for the current
+activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
+findings.

 Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
 TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
@@ -21,7 +26,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1

 ---

-## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
+## Q14. routeV deploy on the recency-clean eval2 test set

 <!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
     for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
@@ -0,0 +1,57 @@
+## Review: Documentation Changes vs. Current Implementation
+
+I read the following files: `README.md`, `AGENTS.md`, `src/vgrout/train.py`, `src/vgrout/extract_vhack_act.py`, `src/vgrout/lora2r.py`, `docs/results.md`, `docs/writeup/main.tex`, `docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md`, `docs/brainstorm/extracted_prefs.md`, `docs/spec/20260611_act_gate_spec.md`, and `RESEARCH_JOURNAL.md`.
+
+### Findings
+
+#### (1) Activation scores select routes; output masks route subsequent gradients ✅ CLEAN
+
+The documentation accurately describes the routeA mechanism. Evidence:
+
+- **README.md** (~L44-51): "routeA scores each rollout on the no-grad `logp_old` forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with `v_act` … block masks are set from those labels *before* the single masked forward+backward."
+- **AGENTS.md** (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
+- **train.py L488-509**: `ActCapture` hooks capture `A[:r] @ x` during the no-grad `logpi_old` forward; `dots = torch.einsum("gmr,mr->g", acts, v_act)`; `_routeA_gate(dots)` returns `(m_vec, d_vec)` pinned on `layer._lora2r_mask`; then a single `loss.backward()` where `lora2r.py`'s hook applies `dep.detach()` and `m*quar`.
+- **lora2r.py L106-117**: masks act on branch *outputs*, so `d=1` blocks grads to both A rows and B columns of the deployed block, and `m=0` zeroes the quarantine block in both forward and backward.
+
+The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
+
+#### (2) No current-method prose implies live-gradient cosine scoring ✅ CLEAN
+
+All files describing the **current** method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:
+
+- **README.md L100-103**: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
+- **docs/writeup/main.tex L1-3**: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
+- **docs/results.md L1-4**: "These results describe the retired gradient-scored routeV method."
+- **docs/blog/...md**: dated 2026-05-29, clearly a pre-routeA draft.
+- **docs/brainstorm/extracted_prefs.md**: dated 2026-05-23, the original project brainstorming doc, not current-method prose.
+
+The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.
+
+#### (3) Historical routeV results preserved rather than relabeled routeA ✅ CLEAN
+
+- **docs/results.md** consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
+- **docs/writeup/main.tex** calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
+- **README.md** refers to "routeV" as the retired method, with separate routeA description.
+- **docs/spec/20260611_act_gate_spec.md**: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."
+
+No relabeling found. The old results are preserved under their original name "routeV."
+
+#### (4) Claims about lora2r shapes/masks are exact ✅ CLEAN
+
+Every documented shape and mask claim matches the implementation:
+
+| Claim | Source | Code verification |
+|---|---|---|
+| `A:[2r, d_in]`, `B:[d_out, 2r]`, both trainable | README L23-24, AGENTS.md L267-268 | `lora2r.py L127`: `A0 = torch.randn(2*r, d_in, ...)`, `B0 = torch.randn(d_out, 2*r, ...)`; `A = nn.Parameter(A0.clone())`, `B = nn.Parameter(B0.clone())` |
+| Deployed `[:r]`, quarantine `[r:]` | README L24-25, AGENTS.md L268 | `lora2r.py L98-102`: `h[..., :r]` with `B[:, :r]`; `h[..., r:]` with `B[:, r:]` |
+| No cross terms between blocks | README L25-26 | `lora2r.py L98-102`: column k of B only multiplies row k of A's output; `F.linear(h[..., :r], B[:, :r])` and `F.linear(h[..., r:], B[:, r:])` are independent |
+| Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | `lora2r.py L99-102`: `dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r])`; at init A=A0, B=B0 so dep=quar=0 |
+| Masks `(m,d)`: clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | `lora2r.py L108-114`: `dep = ((1-d)*dep + d*dep.detach())`, `quar = m*quar`; `train.py L508-510`: m = (z > t_lo), d = (z >= t_hi) |
+| Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | `lora2r.py`: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
+| Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by `lora2r.py` design: A0/B0 frozen, quarantine is `A[r:], B[:,r:]` minus init |
+
+One minor terminology note (not a correctness issue): README.md uses "clean/hack/mid" while AGENTS.md and the code use "keep/absorb/route" for the same (m,d) pairs `(0,0)/(1,1)/(1,0)`. Both are internally consistent and map correctly.
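
A toy PyTorch sketch of how the `(m, d)` output masks in the table above route gradients, assuming the branch formulas quoted from `lora2r.py`. The function names, the scalar masks (the repository pins per-rollout mask vectors on the layer), and the tiny shapes are simplifications chosen for illustration, not the actual implementation.

```python
import torch
import torch.nn.functional as F


def lora2r_delta(x, A, B, A0, B0, r, m, d):
    """Adapter delta with output masks; rows/cols [:r] are deployed, [r:] quarantine."""
    h, h0 = F.linear(x, A), F.linear(x, A0)  # bottleneck activations, [batch, 2r]
    dep = F.linear(h[..., :r], B[:, :r]) - F.linear(h0[..., :r], B0[:, :r])
    quar = F.linear(h[..., r:], B[:, r:]) - F.linear(h0[..., r:], B0[:, r:])
    dep = (1 - d) * dep + d * dep.detach()  # d=1: deployed block receives no gradient
    quar = m * quar                          # m=0: quarantine zeroed in forward and backward
    return dep + quar


def grad_norms(m, d, r=2, d_in=5, d_out=3, seed=0):
    """Return gradient norms reaching the deployed and quarantine blocks (toy shapes)."""
    g = torch.Generator().manual_seed(seed)
    A0 = torch.randn(2 * r, d_in, generator=g)   # frozen init copies
    B0 = torch.randn(d_out, 2 * r, generator=g)
    A = A0.clone().requires_grad_(True)          # trainable, equal to init here
    B = B0.clone().requires_grad_(True)
    x = torch.randn(4, d_in, generator=g)
    delta = lora2r_delta(x, A, B, A0, B0, r, float(m), float(d))
    delta.sum().backward()
    return {
        "dep": (A.grad[:r].norm() + B.grad[:, :r].norm()).item(),
        "quar": (A.grad[r:].norm() + B.grad[:, r:].norm()).item(),
    }
```

Calling `grad_norms(0, 0)`, `grad_norms(1, 0)`, and `grad_norms(1, 1)` compares which block receives a gradient in the keep, route, and absorb cases respectively.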
+
+---
+
+**Summary: No findings. All four review areas are clean.** The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.
@@ -0,0 +1,121 @@
+# Activation-routing documentation audit
+
+## Goal
+Update current project documentation to describe the implemented activation-scored
+routing method rather than the retired live-gradient-scored method. Prioritize
+`README.md` and `AGENTS.md`.
+
+## Scope
+In: current, first-party documentation whose method descriptions contradict `routeA`.
+Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
+and old result narratives that explicitly describe earlier experiments.
+
+## Requirements
+- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
+  without claiming the current method scores or modifies live gradients.
+  Done means: all current-method claims agree with `src/vgrout/train.py` and
+  `src/vgrout/extract_vhack_act.py`.
+- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
+  background descriptions of Gradient Routing and SGTM.
+  Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
+  work remains unchanged.
+- R3: directly related current docs are audited for the same stale claims.
+  Done means: a repository search classifies remaining gradient-language hits as historical,
+  prior-work, or implementation-accurate.
+
+## Tasks
+- [x] T1 (R1, R2): Audit code and prioritized docs.
+  - verify: compare terminology against `src/vgrout/train.py`,
+    `src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
+  - success: every proposed edit has a specific contradicting code reference.
+  - likely_fail: broad replacement corrupts prior-work descriptions.
+  - sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
+  - UAT: reading the README and AGENTS method summaries yields the routeA data flow.
+- [x] T2 (R1, R2, R3): Edit current docs.
+  - verify: `git diff --check` and focused stale-term search.
+  - success: current-method stale claims are absent from prioritized docs.
+  - likely_fail: stale `v_grad` or live-gradient cosine claims remain.
+  - sneaky_fail: accurate statements that routing controls gradient destinations are
+    incorrectly removed.
+  - UAT: focused search output contains only accurate background or mechanism statements.
+- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
+  - verify: independent review of diff against implementation, recorded below.
+  - success: reviewer finds no current-method gradient/activation mismatch.
+  - likely_fail: reviewer identifies a stale or overcorrected claim.
+  - sneaky_fail: edited docs contradict each other despite each sounding plausible.
+  - UAT: this file contains verification output and review result.
+
+## Context
+Current routeA data flow:
+1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
+2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
+3. Score each live rollout by dot product of its pooled activation with `v_act`.
+4. Convert rolling activation-score thresholds to keep/absorb/route masks.
+5. Apply masks to adapter branch outputs before the normal backward pass: `d=1` detaches
+   the deployed branch and `m=0` zeroes the quarantine branch.
+
+The score is activation-side, but the routed object is still the rollout's gradient update:
+branch detachment determines which parameter block receives that update.
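
A minimal single-module sketch of steps 2 to 4 above, in plain Python. The names `v_act_from_pairs`, `score`, and `gate`, and the fixed thresholds `t_lo` and `t_hi`, are illustrative assumptions rather than the repository's API. The real pipeline derives rolling thresholds and sums the dot product over modules, and neither is reproduced here. The `(m, d)` convention follows the review's description of `lora2r.py`.

```python
import math

# (m, d) mask pairs and their labels, as described in the review.
LABELS = {(0, 0): "keep", (1, 0): "route", (1, 1): "absorb"}


def v_act_from_pairs(hack_acts, clean_acts):
    """Step 2: normalized mean hack-minus-clean pooled activation for one module."""
    n, r = len(hack_acts), len(hack_acts[0])
    diff = [
        sum(h[k] - c[k] for h, c in zip(hack_acts, clean_acts)) / n
        for k in range(r)
    ]
    norm = math.sqrt(sum(v * v for v in diff))
    return [v / norm for v in diff]


def score(pooled_act, v_act):
    """Step 3: dot product of a rollout's pooled activation with v_act."""
    return sum(a * v for a, v in zip(pooled_act, v_act))


def gate(z, t_lo, t_hi):
    """Step 4: convert a score to (m, d), given thresholds (hypothetical values)."""
    m = 1 if z > t_lo else 0    # m=1 lets the quarantine branch contribute
    d = 1 if z >= t_hi else 0   # d=1 detaches the deployed branch
    return m, d
```

With `v = v_act_from_pairs(hack, clean)`, `LABELS[gate(score(act, v), t_lo, t_hi)]` gives the routing label for one rollout under the chosen thresholds.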
+
+## Log
+
+- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
+  output masks then determine the destination of the normal GRPO gradient update.
+- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
+  were marked historical rather than mechanically rewritten as routeA evidence.
+- `docs/human_journal.md` already had user edits and remains untouched.
+- Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
+  uncontrolled routing mass described as matched, rank-2r instead of deployed
+  rank-r activation capture, warmup conflated with full buffer capacity, pair
+  labels described as no labels, and precision preference described as implemented.
+  All six were corrected.
+
+## TODO
+
+## Errors
+| Task | Error | Resolution |
+|------|-------|------------|
+
+## Results
+
+| Claim | Documentation | Implementation proof |
+|---|---|---|
+| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
+| Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
+| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
+| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |
+
+## Verify
+
+`git diff --check` produced no output.
+
+Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
+gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
+explicit statement that the paper/results are historical routeV evidence.
+
+## Failure mode check
+
+- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
+  search. Actual: no stale current-method hit. PASS.
+- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
+  output masks route the subsequent update, matching `train.py:895-896` and
+  `lora2r.py:75-83`. PASS.
+- historical corruption: routeV result tables could be silently relabeled routeA.
+  Actual: tables remain routeV and now carry historical headers. PASS.
+
+## Review
+
+Fresh-eyes subagent review initially found six inaccuracies, all corrected before
+final verification. The external review found no remaining activation/gradient,
+historical-evidence, or lora2r shape/mask mismatch; see
+`docs/reviews/20260611_activation_docs_review.md`.
+
+## Files
+
+- `/workspace/projected_grpo/README.md`
+- `/workspace/projected_grpo/AGENTS.md`
+- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
+- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`
+
+## Next
+
+Done.
@@ -1,3 +1,6 @@
+% HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the
+% current activation-scored routeA implementation. See README.md and
+% RESEARCH_JOURNAL.md before revising claims or results.
 % gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
 % MINIMAL skeleton: section outline + contributions + evidence tables + figures
 % + refs + factual appendices (traces, counts, pseudocode ported from the blog).