docs: update method descriptions for activation routing

wassname
2026-06-11 13:22:13 +00:00
parent 668285c86e
commit ec11bf58b2
9 changed files with 269 additions and 76 deletions
+4 -1
@@ -1,4 +1,7 @@
# AFK hourly check — current protocol
# AFK hourly check — archived routeV protocol
This protocol predates the activation-scored routeA method. Do not use its routeV
arm instructions for current experiments.
LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
This doc holds the durable rules. The live plan lives in the task list (the
+5 -1
@@ -1,4 +1,8 @@
# v_hack extraction: gradient-space SVD with magnitudes
# Historical: v_hack extraction in gradient space
This document describes the retired gradient-scored method. The current routeA
method extracts `v_act` from pooled bottleneck activations with forward passes;
see `README.md` and `src/vgrout/extract_vhack_act.py`.
Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
This explains *what we extract* and *why*.
+3 -2
@@ -98,8 +98,9 @@ cached directions cannot silently load against a changed subset.
## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8
`behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85),
but n=8 is fragile. That diagnostic predates the current routeA activation gate. Wave 2
keeps the load-bearing properties (concrete code action, gaming a verification
proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
+7 -2
@@ -1,4 +1,9 @@
# Results, organized by the question each run answers
# Historical routeV results, organized by the question each run answers
These results describe the retired gradient-scored routeV method. They remain
valid evidence about those runs, but they are not results for the current
activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
findings.
Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
@@ -21,7 +26,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
---
## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
## Q14. routeV deploy on the recency-clean eval2 test set
<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
@@ -0,0 +1,61 @@
Let me read all the requested files directly. Let me try to get the diff with alternative approaches, and read the paper/docs that are being changed. Now let me read the research journal entry about the routeA transition to verify the change from gradient-scored to activation-scored. Now I have all the information needed. Let me compile the full review.
---
## Review: Documentation Changes vs. Current Implementation
I read the following files: `README.md`, `AGENTS.md`, `src/vgrout/train.py`, `src/vgrout/extract_vhack_act.py`, `src/vgrout/lora2r.py`, `docs/results.md`, `docs/writeup/main.tex`, `docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md`, `docs/brainstorm/extracted_prefs.md`, `docs/spec/20260611_act_gate_spec.md`, and `RESEARCH_JOURNAL.md`.
### Findings
#### (1) Activation scores select routes; output masks route subsequent gradients ✅ CLEAN
The documentation accurately describes the routeA mechanism. Evidence:
- **README.md** (~L44-51): "routeA scores each rollout on the no-grad `logp_old` forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with `v_act` … block masks are set from those labels *before* the single masked forward+backward."
- **AGENTS.md** (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
- **train.py L488-509**: `ActCapture` hooks capture `A[:r] @ x` during the no-grad `logpi_old` forward; `dots = torch.einsum("gmr,mr->g", acts, v_act)`; `_routeA_gate(dots)` returns `(m_vec, d_vec)` pinned on `layer._lora2r_mask`; then a single `loss.backward()` where `lora2r.py`'s hook applies `dep.detach()` and `m*quar`.
- **lora2r.py L106-117**: masks act on branch *outputs*, so `d=1` blocks grads to both A rows and B columns of the deployed block, and `m=0` zeroes the quarantine block in both forward and backward.
The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
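The last two steps of that flow (scores to masks) can be sketched as below. This is a minimal illustration, not the code in `train.py`: the function names `route_masks` and `mask_label` are hypothetical, and the thresholds `t_lo` and `t_hi` are taken as given inputs rather than derived by the Otsu step, whose details are not shown here. Only the comparisons `m = (z > t_lo)` and `d = (z >= t_hi)` and the clean/mid/hack mapping come from the review table.

```python
def route_masks(scores, t_lo, t_hi):
    """Map per-rollout activation scores z to (m, d) output masks.

    m = 1 enables the quarantine block; d = 1 detaches the deployed block.
    The thresholds are assumed inputs; how they are computed is out of scope.
    """
    if not t_lo < t_hi:
        # Strict ordering rules out the undefined (m, d) = (0, 1) combination.
        raise ValueError("expected t_lo < t_hi")
    m = [1.0 if z > t_lo else 0.0 for z in scores]
    d = [1.0 if z >= t_hi else 0.0 for z in scores]
    return m, d


def mask_label(m, d):
    """Name the three mask pairs using the README's clean/mid/hack terms."""
    return {(0.0, 0.0): "clean", (1.0, 0.0): "mid", (1.0, 1.0): "hack"}[(m, d)]


# Hypothetical scores for three rollouts, with made-up thresholds.
m, d = route_masks([-1.0, 0.5, 2.0], t_lo=0.0, t_hi=1.0)
labels = [mask_label(a, b) for a, b in zip(m, d)]
```

Because the masks are fixed before the grad-carrying forward, the gate never reads gradients; it only decides which block the later update can reach.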
#### (2) No current-method prose implies live-gradient cosine scoring ✅ CLEAN
All files describing the **current** method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:
- **README.md L100-103**: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
- **docs/writeup/main.tex L1-3**: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
- **docs/results.md L1-4**: "These results describe the retired gradient-scored routeV method."
- **docs/blog/...md**: dated 2026-05-29, clearly a pre-routeA draft.
- **docs/brainstorm/extracted_prefs.md**: dated 2026-05-23, the original project brainstorming doc, not current-method prose.
The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.
#### (3) Historical routeV results preserved rather than relabeled routeA ✅ CLEAN
- **docs/results.md** consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
- **docs/writeup/main.tex** calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
- **README.md** refers to "routeV" as the retired method, with separate routeA description.
- **docs/spec/20260611_act_gate_spec.md**: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."
No relabeling found. The old results are preserved under their original name "routeV."
#### (4) Claims about lora2r shapes/masks are exact ✅ CLEAN
Every documented shape and mask claim matches the implementation:
| Claim | Source | Code verification |
|---|---|---|
| `A:[2r, d_in]`, `B:[d_out, 2r]`, both trainable | README L23-24, AGENTS.md L267-268 | `lora2r.py L127`: `A0 = torch.randn(2*r, d_in, ...)`, `B0 = torch.randn(d_out, 2*r, ...)`; `A = nn.Parameter(A0.clone())`, `B = nn.Parameter(B0.clone())` |
| Deployed `[:r]`, quarantine `[r:]` | README L24-25, AGENTS.md L268 | `lora2r.py L98-102`: `h[..., :r]` with `B[:, :r]`; `h[..., r:]` with `B[:, r:]` |
| No cross terms between blocks | README L25-26 | `lora2r.py L98-102`: column k of B only multiplies row k of A's output; `F.linear(h[..., :r], B[:, :r])` and `F.linear(h[..., r:], B[:, r:])` are independent |
| Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | `lora2r.py L99-102`: `dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r])`; at init A=A0, B=B0 so dep=quar=0 |
| Masks `(m,d)`: clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | `lora2r.py L108-114`: `dep = ((1-d)*dep + d*dep.detach())`, `quar = m*quar`; `train.py L508-510`: m = (z > t_lo), d = (z >= t_hi) |
| Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | `lora2r.py`: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
| Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by `lora2r.py` design: A0/B0 frozen, quarantine is `A[r:], B[:,r:]` minus init |
One minor terminology note (not a correctness issue): README.md uses "clean/hack/mid" while AGENTS.md and the code use "keep/absorb/rout" for the same (m,d) pairs `(0,0)/(1,1)/(1,0)`. Both are internally consistent and map correctly.
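Read together, the table rows suggest the following masked forward. This is a reconstruction from the quoted snippets, not a copy of `lora2r.py`: it assumes `h = A x` and `h0 = A0 x`, writes `sg` for stop-gradient (`detach`), and assumes the adapter output is added to the frozen base layer output `W x`.

```latex
\begin{aligned}
h &= A x, \qquad h_0 = A_0 x \\
\mathrm{dep} &= B_{[:,\,:r]}\, h_{[:r]} - B_{0\,[:,\,:r]}\, h_{0\,[:r]} \\
\mathrm{quar} &= B_{[:,\,r:]}\, h_{[r:]} - B_{0\,[:,\,r:]}\, h_{0\,[r:]} \\
\widetilde{\mathrm{dep}} &= (1-d)\,\mathrm{dep} + d\,\operatorname{sg}(\mathrm{dep}) \\
\widetilde{\mathrm{quar}} &= m\,\mathrm{quar} \\
y &= W x + \widetilde{\mathrm{dep}} + \widetilde{\mathrm{quar}}
\end{aligned}
```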
---
**Summary: No findings. All four review areas are clean.** The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.
+121
@@ -0,0 +1,121 @@
# Activation-routing documentation audit
## Goal
Update current project documentation to describe the implemented activation-scored
routing method rather than the retired live-gradient-scored method. Prioritize
`README.md` and `AGENTS.md`.
## Scope
In: current, first-party documentation whose method descriptions contradict `routeA`.
Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
and old result narratives that explicitly describe earlier experiments.
## Requirements
- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
without claiming the current method scores or modifies live gradients.
Done means: all current-method claims agree with `src/vgrout/train.py` and
`src/vgrout/extract_vhack_act.py`.
- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
background descriptions of Gradient Routing and SGTM.
Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
work remains unchanged.
- R3: directly related current docs are audited for the same stale claims.
Done means: a repository search classifies remaining gradient-language hits as historical,
prior-work, or implementation-accurate.
## Tasks
- [x] T1 (R1, R2): Audit code and prioritized docs.
- verify: compare terminology against `src/vgrout/train.py`,
`src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
- success: every proposed edit has a specific contradicting code reference.
- likely_fail: broad replacement corrupts prior-work descriptions.
- sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
- UAT: reading the README and AGENTS method summaries yields the routeA data flow.
- [x] T2 (R1, R2, R3): Edit current docs.
- verify: `git diff --check` and focused stale-term search.
- success: current-method stale claims are absent from prioritized docs.
- likely_fail: stale `v_grad` or live-gradient cosine claims remain.
- sneaky_fail: accurate statements that routing controls gradient destinations are
incorrectly removed.
- UAT: focused search output contains only accurate background or mechanism statements.
- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
- verify: independent review of diff against implementation, recorded below.
- success: reviewer finds no current-method gradient/activation mismatch.
- likely_fail: reviewer identifies a stale or overcorrected claim.
- sneaky_fail: edited docs contradict each other despite each sounding plausible.
- UAT: this file contains verification output and review result.
## Context
Current routeA data flow:
1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
3. Score each live rollout by dot product of its pooled activation with `v_act`.
4. Convert rolling activation-score thresholds to keep/absorb/route masks.
5. Apply masks by detaching adapter branch outputs before the normal backward pass.
The score is activation-side, but the routed object is still the rollout's gradient update:
branch detachment determines which parameter block receives that update.
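Steps 1 to 3 above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `extract_vhack_act.py`: the helper names `extract_v_act` and `score_rollout` are hypothetical, activations are assumed to be already pooled into one vector per example, and the rollout score is assumed to sum the per-module dot products (matching the `"gmr,mr->g"` einsum quoted in the review).

```python
import math


def extract_v_act(hack_acts, clean_acts):
    """Normalized mean hack-minus-clean activation for one module.

    hack_acts and clean_acts are equal-length lists of pooled activation
    vectors, where entry i of each list comes from the same authored pair.
    """
    if len(hack_acts) != len(clean_acts) or not hack_acts:
        raise ValueError("expected matched, non-empty hack/clean pairs")
    diffs = [[h - c for h, c in zip(hv, cv)]
             for hv, cv in zip(hack_acts, clean_acts)]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("hack and clean activations have the same mean")
    return [x / norm for x in mean]


def score_rollout(pooled_acts, v_acts):
    """Sum over modules of dot(pooled activation, v_act) for one rollout."""
    return sum(sum(a * v for a, v in zip(acts, v_act))
               for acts, v_act in zip(pooled_acts, v_acts))


# One module with a 2-dimensional bottleneck and two authored pairs.
v = extract_v_act([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
score = score_rollout([[0.5, 2.0]], [v])
```

Both functions need only forward-pass activations, which is the point of the routeA change: no gradient is computed before the score exists.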
## Log
- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
output masks then determine the destination of the normal GRPO gradient update.
- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
were marked historical rather than mechanically rewritten as routeA evidence.
- `docs/human_journal.md` already had user edits and remains untouched.
- Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
uncontrolled routing mass described as matched, rank-2r instead of deployed
rank-r activation capture, warmup conflated with full buffer capacity, pair
labels described as no labels, and precision preference described as implemented.
All six were corrected.
## TODO
## Errors
| Task | Error | Resolution |
|------|-------|------------|
## Results
| Claim | Documentation | Implementation proof |
|---|---|---|
| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
| Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |
## Verify
`git diff --check` produced no output.
Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
explicit statement that the paper/results are historical routeV evidence.
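A focused stale-term search of this kind could look like the sketch below. The exact command used for the audit is not recorded in this file, so the term list and the helper name `find_stale_terms` are assumptions for illustration; each hit still needs a human to classify it as historical, prior-work, or stale.

```python
# Terms taken from the audit summary; the spelling variants are assumptions.
STALE_TERMS = ("v_grad", "gradient-cosine", "gradient cosine", "frozen-B", "routeV")


def find_stale_terms(text, terms=STALE_TERMS):
    """Return (line_number, term, stripped_line) for each case-sensitive hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for term in terms:
            if term in line:
                hits.append((lineno, term, line.strip()))
    return hits


sample = (
    "routeA scores pooled activations against v_act.\n"
    "The paper and results are historical routeV evidence.\n"
)
hits = find_stale_terms(sample)
```

On the sample text the only hit is the explicit historical statement, which is the outcome the audit reports for `README.md`.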
## Failure mode check
- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
search. Actual: no stale current-method hit. PASS.
- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
output masks route the subsequent update, matching `train.py:895-896` and
`lora2r.py:75-83`. PASS.
- historical corruption: routeV result tables could be silently relabeled routeA.
Actual: tables remain routeV and now carry historical headers. PASS.
## Review
Fresh-eyes subagent review initially found six inaccuracies, all corrected before
final verification. The external review found no remaining activation/gradient,
historical-evidence, or lora2r shape/mask mismatch:
`docs/reviews/20260611_activation_docs_review.md`.
## Files
- `/workspace/projected_grpo/README.md`
- `/workspace/projected_grpo/AGENTS.md`
- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`
## Next
Done.
+3
@@ -1,3 +1,6 @@
% HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the
% current activation-scored routeA implementation. See README.md and
% RESEARCH_JOURNAL.md before revising claims or results.
% gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
% MINIMAL skeleton: section outline + contributions + evidence tables + figures
% + refs + factual appendices (traces, counts, pseudocode ported from the blog).