mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 14:45:07 +08:00
docs: update method descriptions for activation routing
@@ -11,10 +11,17 @@ see @README.md for project overview
|
||||
|
||||
|
||||
Hypothesis
|
||||
> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector in can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
|
||||
> Prior gradient-routing methods route with labels. We ask whether a synthetic
|
||||
> activation-space hacking vector can replace those labels. In this toy GRPO
|
||||
> reward-hacking setup, pooled activations select whether each rollout updates
|
||||
> deployed parameters, quarantine parameters, or both. The decisive comparison
|
||||
> is whether real `v_act` beats a Haar-random direction, while measuring routing
|
||||
> mass as a potential confound.
|
||||
|
||||
Motivation:
|
||||
We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student from picking up the hack.
|
||||
Motivation:
|
||||
We want to improve gradient routing with scalable self-supervised signals. Here
|
||||
the routing signal is an activation direction extracted from synthetic pairs,
|
||||
rather than a ground-truth label or reward modification.
|
||||
|
||||
Inherit global rules from `~/.claude/CLAUDE.md`.
|
||||
|
||||
@@ -97,14 +104,15 @@ $\theta_{\text{forget}}$.
|
||||
Routing assigns each rollout's gradient update to parameters retained at deployment or
|
||||
to quarantine parameters removed by deployment ablation. A false negative updates the
|
||||
retained parameters with a reward-hacking example, whereas a false positive removes
|
||||
one non-reward-hacking update. The routing threshold should therefore favor precision
|
||||
over recall.
|
||||
one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
|
||||
they do not explicitly optimize this asymmetric cost.
|
||||
|
||||
The routing score is the cosine alignment between a rollout update and `v_grad`, or
|
||||
between its activations and `v_act`. These reward-hacking-minus-correct directions are
|
||||
extracted from hand-authored contrast pairs before training. They do not use
|
||||
ground-truth labels from training rollouts. Pinning selects score thresholds for the
|
||||
retain, absorb, and route regions.
|
||||
The routing score is the dot product between a rollout's pooled deployed-block
|
||||
bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
|
||||
mean hack-minus-clean activation difference extracted from hand-authored contrast
|
||||
pairs with forward passes only. Ground-truth labels from training rollouts never
|
||||
set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
|
||||
selects the keep, absorb, and route regions.
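
As a minimal sketch of the scoring step described above, assuming the score sums per-module dot products (the review later in this commit quotes an einsum of the form `"gmr,mr->g"`): the helper name `routing_score` and the toy numbers are hypothetical, not taken from `src/vgrout/train.py`.

```python
def routing_score(pooled_acts, v_act):
    """Sum over modules of dot(pooled activation, that module's v_act).

    pooled_acts: one pooled bottleneck vector (length r) per module.
    v_act:       one unit-normalized direction (length r) per module.
    """
    if len(pooled_acts) != len(v_act):
        raise ValueError("need exactly one v_act per module")
    return sum(
        sum(a * v for a, v in zip(acts_m, v_m))
        for acts_m, v_m in zip(pooled_acts, v_act)
    )


# Toy rollout with 2 modules and rank r = 2 (made-up numbers).
pooled = [[1.0, 0.0], [0.0, 2.0]]
v_act = [[1.0, 0.0], [0.0, 1.0]]  # each row is already unit length
score = routing_score(pooled, v_act)
```

The score is a raw dot product rather than a cosine, so the magnitude of the pooled activation also moves a rollout relative to the thresholds.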
|
||||
|
||||
The middle region leaves both parameter blocks trainable and may permit absorption.
|
||||
Calling this region `absorb` names the intended mechanism; it does not establish that
|
||||
@@ -132,11 +140,12 @@ the route threshold.
|
||||
dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
|
||||
this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
|
||||
NOT our setup -- do not frame our method that way.
|
||||
3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
|
||||
contrastive pairs (off-distribution, authored before observing training rollouts),
|
||||
then route each GRPO gradient by its cosine alignment to `vec`. The hand-authored
|
||||
pairs require no env-specific oracle and use no ground-truth labels from training
|
||||
rollouts. Generalization is tested by whether `vec`
|
||||
3. OUR setup is `v_act -> routing`: extract an activation-space hack direction from
|
||||
hand-built synthetic contrastive pairs (off-distribution, authored before observing
|
||||
training rollouts), then score each rollout by the dot product between its pooled
|
||||
bottleneck activations and `v_act`. That score selects which parameter block receives
|
||||
the rollout's GRPO update. The hand-authored pairs require no env-specific oracle and
|
||||
use no ground-truth labels from training rollouts. Generalization is tested by whether `v_act`
|
||||
(built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
|
||||
-- vector generalization, not detector-label generalization.
|
||||
4. On-distribution contrast pairs require labels for training rollouts and therefore
|
||||
@@ -150,41 +159,26 @@ the route threshold.
|
||||
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
|
||||
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only answer is "waiting for your go-ahead". I'd rather you just commit and go ahead.
|
||||
|
||||
## The adapter setup (shapes), and why "same position = shrinkage" is subtle
|
||||
## The adapter and routing setup
|
||||
|
||||
<!-- User, wtf is shrinkage? -->
|
||||
Per target Linear, the current `lora2r` adapter has trainable
|
||||
`A:[2r,d_in]` and `B:[d_out,2r]`, split into independent deployed `[:r]`
|
||||
and quarantine `[r:]` blocks. Frozen initialization copies are subtracted,
|
||||
making the net adapter delta exactly zero at initialization. Deployment ablation
|
||||
resets the quarantine block to its initialization.
|
||||
|
||||
Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
|
||||
For each rollout, routeA sets an output mask before the single grad-carrying
|
||||
forward and backward:
|
||||
|
||||
- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
|
||||
(deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
|
||||
scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
|
||||
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
|
||||
FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
|
||||
Shared B -> `A.grad == A_hack.grad` pre-routing.
|
||||
- keep `(m=0,d=0)`: only the deployed block trains.
|
||||
- absorb `(m=1,d=0)`: both blocks train, which may permit absorption.
|
||||
- route `(m=1,d=1)`: only the quarantine block trains; the deployed output remains
|
||||
in the forward pass but is detached.
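
The three mask settings can be written out as a hedged derivation. It assumes the masked forward quoted in the review later in this commit (`dep = (1-d)*dep + d*dep.detach()`, `quar = m*quar`). The symbols $\Delta_{\text{dep}}$, $\Delta_{\text{quar}}$, $\theta_{\text{dep}}$, $\theta_{\text{quar}}$, and $\operatorname{sg}$ (stop-gradient, i.e. detach) are notation introduced here for illustration, and the base term $Wx$ is an assumption about the wrapped Linear.

$$
y = Wx + \big[(1-d)\,\Delta_{\text{dep}} + d\,\operatorname{sg}(\Delta_{\text{dep}})\big] + m\,\Delta_{\text{quar}}
$$

$$
\frac{\partial y}{\partial \theta_{\text{dep}}} = (1-d)\,\frac{\partial \Delta_{\text{dep}}}{\partial \theta_{\text{dep}}},
\qquad
\frac{\partial y}{\partial \theta_{\text{quar}}} = m\,\frac{\partial \Delta_{\text{quar}}}{\partial \theta_{\text{quar}}}
$$

Under these assumptions, keep $(m,d)=(0,0)$ gives only the deployed block a direct gradient path, absorb $(1,0)$ gives both blocks one, and route $(1,1)$ gives only the quarantine block one. In all three cases the value of $\Delta_{\text{dep}}$ still contributes to the forward output.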
|
||||
|
||||
The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
|
||||
ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
|
||||
`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
|
||||
That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
|
||||
it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.
|
||||
|
||||
Shrinkage is NOT inevitable. Two things break it:
|
||||
1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
|
||||
full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
|
||||
solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
|
||||
separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
|
||||
doomed to shrinkage than PiSSA even with shared B.
|
||||
2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
|
||||
own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
|
||||
∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
|
||||
update.
|
||||
|
||||
So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
|
||||
structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
|
||||
empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
|
||||
routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
|
||||
trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
|
||||
The gate reads pooled activations, not gradients. Its masks determine which block
|
||||
receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
|
||||
tests whether direction discrimination adds value beyond quarantine-induced
|
||||
shrinkage; compare its measured `qmass` because routing mass is not controlled.
|
||||
|
||||
## Extra instructions:
|
||||
|
||||
@@ -214,13 +208,15 @@ For the setup, read these:
|
||||
the claim -- "the tests passed" means nothing if the property was never tested.
|
||||
|
||||
On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
|
||||
reward-hack, neg=the correct solution, vector = grad(prompt+hack) - grad(prompt+clean).
|
||||
reward-hack, neg=the correct solution. The current vector is the mean paired
|
||||
difference in pooled deployed-block bottleneck activations.
|
||||
Like persona steering pairs, MATCH everything but the axis -- same
|
||||
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
|
||||
(else style competes with the trait; see the style-confound section of the doc below).
|
||||
There is NO problem_id semantics: the only "id" is which completion is the hack side
|
||||
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
|
||||
are DISTINCT (different gradient). Authored pairs are off-distribution and hand-written;
|
||||
are DISTINCT because the prompt changes the activations. Authored pairs are
|
||||
off-distribution and hand-written;
|
||||
pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and
|
||||
are unsuitable for the primary oracle-free result.
|
||||
- ./docs/personas/how_to_rewrite_pairs.md
|
||||
@@ -240,4 +236,3 @@ For the original paper (the substrate: reward-hacking LeetCode env)
|
||||
For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
|
||||
the original; source of the absorption/leakage vocab)
|
||||
- ./docs/papers/grad_routing/paper_sgtm.md
|
||||
|
||||
|
||||
@@ -1,15 +1,16 @@
|
||||
# vGROUT
|
||||
|
||||
**vGROUT** (vector gradient routing): route the GRPO gradient against an
|
||||
extracted reward-hacking direction so the deployed model can't learn the hack,
|
||||
while preserving coding performance. A representation-routing variant of gradient routing
|
||||
(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
|
||||
direction rather than a per-example data label.
|
||||
**vGROUT** (vector gradient routing): use an extracted activation-space
|
||||
reward-hacking direction to route each rollout's GRPO update into deployed or
|
||||
quarantine parameters, aiming to reduce reward hacking retained at deployment
|
||||
while preserving coding performance. It is a representation-gated variant of
|
||||
gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
|
||||
the per-example data label that normally selects the gradient route.
|
||||
|
||||
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
|
||||
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
|
||||
"Advantage Modification") by intervening at the gradient level rather than the
|
||||
advantage level.
|
||||
LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
|
||||
2026), the method leaves rewards and advantages unchanged. Pooled activations
|
||||
select which adapter block receives each rollout's gradient update.
|
||||
|
||||
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
|
||||
and [docs/papers/](docs/papers/).
|
||||
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):
|
||||
|
||||
To get the hack direction we pair examples by hand: for each problem, one
|
||||
correct completion and one completion that exploits the evaluation procedure.
|
||||
For each pair we run a forward pass and read the bottleneck activation `A@x`
|
||||
(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
|
||||
completion tokens. The per-module mean hack-minus-clean activation difference,
|
||||
unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
|
||||
Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
|
||||
are off-distribution and the *only* routing-label source. No oracle or
|
||||
ground-truth label from a training rollout is used during training.
|
||||
For each pair we run a forward pass and read the deployed-block bottleneck
|
||||
activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
|
||||
mean hack-minus-clean activation difference, unit-normalized per module, is
|
||||
`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
|
||||
backward pass. The hand-authored pair sides provide the only hack/clean labels;
|
||||
no oracle or ground-truth label from a training rollout is used during training.
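
A minimal sketch of the extraction arithmetic described above, for a single module and in plain Python rather than tensors. The function names and toy numbers are hypothetical and are not the contents of `src/vgrout/extract_vhack_act.py`.

```python
import math


def masked_mean_pool(token_acts, completion_mask):
    """Average per-token activation vectors over tokens whose mask is 1."""
    kept = [a for a, keep in zip(token_acts, completion_mask) if keep]
    if not kept:
        raise ValueError("completion mask selects no tokens")
    dim = len(kept[0])
    return [sum(a[i] for a in kept) / len(kept) for i in range(dim)]


def unit_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def extract_v_act(pairs):
    """pairs: list of (hack_pooled, clean_pooled) vectors for ONE module.

    Returns the unit-normalized mean hack-minus-clean difference.
    """
    dim = len(pairs[0][0])
    mean_diff = [
        sum(hack[i] - clean[i] for hack, clean in pairs) / len(pairs)
        for i in range(dim)
    ]
    return unit_normalize(mean_diff)


# Toy example: the third token is prompt/padding and is masked out.
pooled = masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
v_act = extract_v_act([([3.0, 0.0], [0.0, 0.0]), ([5.0, 0.0], [0.0, 0.0])])
```

Because the difference is averaged before normalization, pairs with larger activation gaps contribute more to the direction than pairs with small gaps.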
|
||||
|
||||
At training time routeA scores each rollout on the no-grad `logp_old` forward it
|
||||
already needs: an activation-capture hook pools the same bottleneck activations
|
||||
over completion tokens, and the score is the pooled dot product with `v_act`.
|
||||
Thresholds come from a rolling buffer of recent scores, z-normalized and split by
|
||||
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
|
||||
pins absorb. The block masks are set from those labels *before* the single
|
||||
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
|
||||
`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
|
||||
masked forward+backward, so there is no second gradient pass. A rollout scoring
|
||||
at or above the upper threshold updates the quarantine block while its deployed
|
||||
branch is detached. We re-extract `v_act` every N steps (forward-only,
|
||||
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
|
||||
|
||||
## Results and write-up
|
||||
|
||||
The paper draft is the source of truth for current numbers, figures, and the
|
||||
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
|
||||
Session-by-session findings and per-step log audits live in
|
||||
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
|
||||
The paper draft and [docs/results.md](docs/results.md) currently describe the
|
||||
retired gradient-scored routeV experiments. They are historical evidence, not a
|
||||
description of routeA. Current routeA findings are recorded in
|
||||
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
|
||||
|
||||
+4
-1
@@ -1,4 +1,7 @@
|
||||
# AFK hourly check — current protocol
|
||||
# AFK hourly check — archived routeV protocol
|
||||
|
||||
This protocol predates the activation-scored routeA method. Do not use its routeV
|
||||
arm instructions for current experiments.
|
||||
|
||||
LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
|
||||
This doc holds the durable rules. The live plan lives in the task list (the
|
||||
|
||||
@@ -1,4 +1,8 @@
|
||||
# v_hack extraction: gradient-space SVD with magnitudes
|
||||
# Historical: v_hack extraction in gradient space
|
||||
|
||||
This document describes the retired gradient-scored method. The current routeA
|
||||
method extracts `v_act` from pooled bottleneck activations with forward passes;
|
||||
see `README.md` and `src/vgrout/extract_vhack_act.py`.
|
||||
|
||||
Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
|
||||
This explains *what we extract* and *why*.
|
||||
|
||||
@@ -98,8 +98,9 @@ cached directions cannot silently load against a changed subset.
|
||||
|
||||
## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
|
||||
|
||||
Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
|
||||
classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
|
||||
Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8
|
||||
`behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85),
|
||||
but n=8 is fragile. That diagnostic predates the current routeA activation gate. Wave 2
|
||||
keeps the load-bearing properties (concrete code action, gaming a verification
|
||||
proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
|
||||
permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
|
||||
|
||||
+7
-2
@@ -1,4 +1,9 @@
|
||||
# Results, organized by the question each run answers
|
||||
# Historical routeV results, organized by the question each run answers
|
||||
|
||||
These results describe the retired gradient-scored routeV method. They remain
|
||||
valid evidence about those runs, but they are not results for the current
|
||||
activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
|
||||
findings.
|
||||
|
||||
Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
|
||||
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
|
||||
@@ -21,7 +26,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
|
||||
|
||||
---
|
||||
|
||||
## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
|
||||
## Q14. routeV deploy on the recency-clean eval2 test set
|
||||
|
||||
<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
|
||||
for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
|
||||
|
||||
@@ -0,0 +1,61 @@
|
||||
Let me read all the requested files directly. Let me try to get the diff with alternative approaches, and read the paper/docs that are being changed. Now let me read the research journal entry about the routeA transition to verify the change from gradient-scored to activation-scored. Now I have all the information needed. Let me compile the full review.
|
||||
|
||||
---
|
||||
|
||||
## Review: Documentation Changes vs. Current Implementation
|
||||
|
||||
I read the following files: `README.md`, `AGENTS.md`, `src/vgrout/train.py`, `src/vgrout/extract_vhack_act.py`, `src/vgrout/lora2r.py`, `docs/results.md`, `docs/writeup/main.tex`, `docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md`, `docs/brainstorm/extracted_prefs.md`, `docs/spec/20260611_act_gate_spec.md`, and `RESEARCH_JOURNAL.md`.
|
||||
|
||||
### Findings
|
||||
|
||||
#### (1) Activation scores select routes; output masks route subsequent gradients ✅ CLEAN
|
||||
|
||||
The documentation accurately describes the routeA mechanism. Evidence:
|
||||
|
||||
- **README.md** (~L44-51): "routeA scores each rollout on the no-grad `logp_old` forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with `v_act` … block masks are set from those labels *before* the single masked forward+backward."
|
||||
- **AGENTS.md** (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
|
||||
- **train.py L488-509**: `ActCapture` hooks capture `A[:r] @ x` during the no-grad `logpi_old` forward; `dots = torch.einsum("gmr,mr->g", acts, v_act)`; `_routeA_gate(dots)` returns `(m_vec, d_vec)` pinned on `layer._lora2r_mask`; then a single `loss.backward()` where `lora2r.py`'s hook applies `dep.detach()` and `m*quar`.
|
||||
- **lora2r.py L106-117**: masks act on branch *outputs*, so `d=1` blocks grads to both A rows and B columns of the deployed block, and `m=0` zeroes the quarantine block in both forward and backward.
|
||||
|
||||
The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
|
||||
|
||||
#### (2) No current-method prose implies live-gradient cosine scoring ✅ CLEAN
|
||||
|
||||
All files describing the **current** method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:
|
||||
|
||||
- **README.md L100-103**: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
|
||||
- **docs/writeup/main.tex L1-3**: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
|
||||
- **docs/results.md L1-4**: "These results describe the retired gradient-scored routeV method."
|
||||
- **docs/blog/...md**: dated 2026-05-29, clearly a pre-routeA draft.
|
||||
- **docs/brainstorm/extracted_prefs.md**: dated 2026-05-23, the original project brainstorming doc, not current-method prose.
|
||||
|
||||
The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.
|
||||
|
||||
#### (3) Historical routeV results preserved rather than relabeled routeA ✅ CLEAN
|
||||
|
||||
- **docs/results.md** consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
|
||||
- **docs/writeup/main.tex** calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
|
||||
- **README.md** refers to "routeV" as the retired method, with separate routeA description.
|
||||
- **docs/spec/20260611_act_gate_spec.md**: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."
|
||||
|
||||
No relabeling found. The old results are preserved under their original name "routeV."
|
||||
|
||||
#### (4) Claims about lora2r shapes/masks are exact ✅ CLEAN
|
||||
|
||||
Every documented shape and mask claim matches the implementation:
|
||||
|
||||
| Claim | Source | Code verification |
|
||||
|---|---|---|
|
||||
| `A:[2r, d_in]`, `B:[d_out, 2r]`, both trainable | README L23-24, AGENTS.md L267-268 | `lora2r.py L127`: `A0 = torch.randn(2*r, d_in, ...)`, `B0 = torch.randn(d_out, 2*r, ...)`; `A = nn.Parameter(A0.clone())`, `B = nn.Parameter(B0.clone())` |
|
||||
| Deployed `[:r]`, quarantine `[r:]` | README L24-25, AGENTS.md L268 | `lora2r.py L98-102`: `h[..., :r]` with `B[:, :r]`; `h[..., r:]` with `B[:, r:]` |
|
||||
| No cross terms between blocks | README L25-26 | `lora2r.py L98-102`: column k of B only multiplies row k of A's output; `F.linear(h[..., :r], B[:, :r])` and `F.linear(h[..., r:], B[:, r:])` are independent |
|
||||
| Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | `lora2r.py L99-102`: `dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r])`; at init A=A0, B=B0 so dep=quar=0 |
|
||||
| Masks `(m,d)`: clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | `lora2r.py L108-114`: `dep = ((1-d)*dep + d*dep.detach())`, `quar = m*quar`; `train.py L508-510`: m = (z > t_lo), d = (z >= t_hi) |
|
||||
| Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | `lora2r.py`: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
|
||||
| Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by `lora2r.py` design: A0/B0 frozen, quarantine is `A[r:], B[:,r:]` minus init |
|
||||
|
||||
One minor terminology note (not a correctness issue): README.md uses "clean/mid/hack" while AGENTS.md and the code use "keep/absorb/rout" for the same (m,d) pairs `(0,0)/(1,0)/(1,1)`. Both are internally consistent and map correctly.
|
||||
|
||||
---
|
||||
|
||||
**Summary: No findings. All four review areas are clean.** The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.
|
||||
@@ -0,0 +1,121 @@
|
||||
# Activation-routing documentation audit
|
||||
|
||||
## Goal
|
||||
Update current project documentation to describe the implemented activation-scored
|
||||
routing method rather than the retired live-gradient-scored method. Prioritize
|
||||
`README.md` and `AGENTS.md`.
|
||||
|
||||
## Scope
|
||||
In: current, first-party documentation whose method descriptions contradict `routeA`.
|
||||
Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
|
||||
and old result narratives that explicitly describe earlier experiments.
|
||||
|
||||
## Requirements
|
||||
- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
|
||||
without claiming the current method scores or modifies live gradients.
|
||||
Done means: all current-method claims agree with `src/vgrout/train.py` and
|
||||
`src/vgrout/extract_vhack_act.py`.
|
||||
- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
|
||||
background descriptions of Gradient Routing and SGTM.
|
||||
Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
|
||||
work remains unchanged.
|
||||
- R3: directly related current docs are audited for the same stale claims.
|
||||
Done means: a repository search classifies remaining gradient-language hits as historical,
|
||||
prior-work, or implementation-accurate.
|
||||
|
||||
## Tasks
|
||||
- [x] T1 (R1, R2): Audit code and prioritized docs.
|
||||
- verify: compare terminology against `src/vgrout/train.py`,
|
||||
`src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
|
||||
- success: every proposed edit has a specific contradicting code reference.
|
||||
- likely_fail: broad replacement corrupts prior-work descriptions.
|
||||
- sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
|
||||
- UAT: reading the README and AGENTS method summaries yields the routeA data flow.
|
||||
- [x] T2 (R1, R2, R3): Edit current docs.
|
||||
- verify: `git diff --check` and focused stale-term search.
|
||||
- success: current-method stale claims are absent from prioritized docs.
|
||||
- likely_fail: stale `v_grad` or live-gradient cosine claims remain.
|
||||
- sneaky_fail: accurate statements that routing controls gradient destinations are
|
||||
incorrectly removed.
|
||||
- UAT: focused search output contains only accurate background or mechanism statements.
|
||||
- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
|
||||
- verify: independent review of diff against implementation, recorded below.
|
||||
- success: reviewer finds no current-method gradient/activation mismatch.
|
||||
- likely_fail: reviewer identifies a stale or overcorrected claim.
|
||||
- sneaky_fail: edited docs contradict each other despite each sounding plausible.
|
||||
- UAT: this file contains verification output and review result.
|
||||
|
||||
## Context
|
||||
Current routeA data flow:
|
||||
1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
|
||||
2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
|
||||
3. Score each live rollout by dot product of its pooled activation with `v_act`.
|
||||
4. Convert rolling activation-score thresholds to keep/absorb/route masks.
|
||||
5. Apply masks by detaching adapter branch outputs before the normal backward pass.
|
||||
|
||||
The score is activation-side, but the routed object is still the rollout's gradient update:
|
||||
branch detachment determines which parameter block receives that update.
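
Step 4 of the data flow above can be sketched as a brute-force two-threshold Otsu split followed by the mask rule quoted in the review (`m = z > t_lo`, `d = z >= t_hi`). This is an illustrative O(n²) version under stated assumptions: the function names are hypothetical, and the rolling buffer and z-normalization that the README describes are omitted.

```python
def two_threshold_otsu(scores):
    """Return (t_lo, t_hi) maximizing between-class variance of three classes."""
    xs = sorted(scores)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three scores")
    prefix = [0.0]  # prefix[k] = sum of xs[:k]
    for x in xs:
        prefix.append(prefix[-1] + x)
    total_mean = prefix[-1] / n
    best = None
    for i in range(1, n - 1):          # class 0 = xs[:i]
        for j in range(i + 1, n):      # class 1 = xs[i:j], class 2 = xs[j:]
            var = 0.0
            for lo, hi in ((0, i), (i, j), (j, n)):
                weight = (hi - lo) / n
                mean = (prefix[hi] - prefix[lo]) / (hi - lo)
                var += weight * (mean - total_mean) ** 2
            if best is None or var > best[0]:
                best = (var, i, j)
    _, i, j = best
    # Place each threshold midway between the neighbouring sorted scores.
    return (xs[i - 1] + xs[i]) / 2, (xs[j - 1] + xs[j]) / 2


def gate(score, t_lo, t_hi):
    """Map a score to (m, d): keep=(0,0), absorb=(1,0), route=(1,1)."""
    m = 1 if score > t_lo else 0
    d = 1 if score >= t_hi else 0
    return m, d


# Toy buffer with three well-separated clusters (made-up numbers).
buffer = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
t_lo, t_hi = two_threshold_otsu(buffer)
masks = [gate(s, t_lo, t_hi) for s in (0.1, 5.1, 10.1)]
```

Because `d` can only be 1 when the score is at or above `t_hi`, which is at least `t_lo`, the combination `(m=0, d=1)` never arises whenever the two thresholds are distinct.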
|
||||
|
||||
## Log
|
||||
|
||||
- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
|
||||
output masks then determine the destination of the normal GRPO gradient update.
|
||||
- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
|
||||
were marked historical rather than mechanically rewritten as routeA evidence.
|
||||
- `docs/human_journal.md` already had user edits and remains untouched.
|
||||
- Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
|
||||
uncontrolled routing mass described as matched, rank-2r instead of deployed
|
||||
rank-r activation capture, warmup conflated with full buffer capacity, pair
|
||||
labels described as no labels, and precision preference described as implemented.
|
||||
All six were corrected.
|
||||
|
||||
## TODO
|
||||
|
||||
## Errors
|
||||
| Task | Error | Resolution |
|
||||
|------|-------|------------|
|
||||
|
||||
## Results
|
||||
|
||||
| Claim | Documentation | Implementation proof |
|
||||
|---|---|---|
|
||||
| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
|
||||
| Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
|
||||
| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
|
||||
| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |
|
||||
|
||||
## Verify
|
||||
|
||||
`git diff --check` produced no output.
|
||||
|
||||
Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
|
||||
gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
|
||||
explicit statement that the paper/results are historical routeV evidence.
|
||||
|
||||
## Failure mode check
|
||||
|
||||
- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
|
||||
search. Actual: no stale current-method hit. PASS.
|
||||
- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
|
||||
output masks route the subsequent update, matching `train.py:895-896` and
|
||||
`lora2r.py:75-83`. PASS.
|
||||
- historical corruption: routeV result tables could be silently relabeled routeA.
|
||||
Actual: tables remain routeV and now carry historical headers. PASS.
|
||||
|
||||
## Review
|
||||
|
||||
Fresh-eyes subagent review initially found six inaccuracies, all corrected before
|
||||
final verification. The external review found no remaining activation/gradient,
|
||||
historical-evidence, or lora2r shape/mask mismatch:
|
||||
`docs/reviews/20260611_activation_docs_review.md`.
|
||||
|
||||
## Files
|
||||
|
||||
- `/workspace/projected_grpo/README.md`
|
||||
- `/workspace/projected_grpo/AGENTS.md`
|
||||
- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
|
||||
- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`
|
||||
|
||||
## Next
|
||||
|
||||
Done.
|
||||
@@ -1,3 +1,6 @@
|
||||
% HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the
|
||||
% current activation-scored routeA implementation. See README.md and
|
||||
% RESEARCH_JOURNAL.md before revising claims or results.
|
||||
% gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
|
||||
% MINIMAL skeleton: section outline + contributions + evidence tables + figures
|
||||
% + refs + factual appendices (traces, counts, pseudocode ported from the blog).
|
||||
|
||||