mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
docs: update method descriptions for activation routing
@@ -11,10 +11,17 @@ see @README.md for project overview
|
|||||||
|
|
||||||
|
|
||||||
Hypothesis
|
Hypothesis
|
||||||
> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector in can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
|
> Prior gradient-routing methods route with labels. We ask whether a synthetic
|
||||||
|
> activation-space hacking vector can replace those labels. In this toy GRPO
|
||||||
|
> reward-hacking setup, pooled activations select whether each rollout updates
|
||||||
|
> deployed parameters, quarantine parameters, or both. The decisive comparison
|
||||||
|
> is whether real `v_act` beats a Haar-random direction, while measuring routing
|
||||||
|
> mass as a potential confound.
|
||||||
|
|
||||||
Motovation:
|
Motivation:
|
||||||
We want to take the tool AI labs already use, and make them better for aligment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
|
We want to improve gradient routing with scalable self-supervised signals. Here
|
||||||
|
the routing signal is an activation direction extracted from synthetic pairs,
|
||||||
|
rather than a ground-truth label or reward modification.
|
||||||
|
|
||||||
Inherit global rules from `~/.claude/CLAUDE.md`.
|
Inherit global rules from `~/.claude/CLAUDE.md`.
|
||||||
|
|
||||||
@@ -97,14 +104,15 @@ $\theta_{\text{forget}}$.
|
|||||||
Routing assigns each rollout's gradient update to parameters retained at deployment or
|
Routing assigns each rollout's gradient update to parameters retained at deployment or
|
||||||
to quarantine parameters removed by deployment ablation. A false negative updates the
|
to quarantine parameters removed by deployment ablation. A false negative updates the
|
||||||
retained parameters with a reward-hacking example, whereas a false positive removes
|
retained parameters with a reward-hacking example, whereas a false positive removes
|
||||||
one non-reward-hacking update. The routing threshold should therefore favor precision
|
one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
|
||||||
over recall.
|
they do not explicitly optimize this asymmetric cost.
|
||||||
|
|
||||||
The routing score is the cosine alignment between a rollout update and `v_grad`, or
|
The routing score is the dot product between a rollout's pooled deployed-block
|
||||||
between its activations and `v_act`. These reward-hacking-minus-correct directions are
|
bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
|
||||||
extracted from hand-authored contrast pairs before training. They do not use
|
mean hack-minus-clean activation difference extracted from hand-authored contrast
|
||||||
ground-truth labels from training rollouts. Pinning selects score thresholds for the
|
pairs with forward passes only. Ground-truth labels from training rollouts never
|
||||||
retain, absorb, and route regions.
|
set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
|
||||||
|
selects the keep, absorb, and route regions.
|
||||||
|
|
||||||
The middle region leaves both parameter blocks trainable and may permit absorption.
|
The middle region leaves both parameter blocks trainable and may permit absorption.
|
||||||
Calling this region `absorb` names the intended mechanism; it does not establish that
|
Calling this region `absorb` names the intended mechanism; it does not establish that
|
||||||
@@ -132,11 +140,12 @@ the route threshold.
|
|||||||
dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
|
dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
|
||||||
this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
|
this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
|
||||||
NOT our setup -- do not frame our method that way.
|
NOT our setup -- do not frame our method that way.
|
||||||
3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
|
3. OUR setup is `v_act -> routing`: extract an activation-space hack direction from
|
||||||
contrastive pairs (off-distribution, authored before observing training rollouts),
|
hand-built synthetic contrastive pairs (off-distribution, authored before observing
|
||||||
then route each GRPO gradient by its cosine alignment to `vec`. The hand-authored
|
training rollouts), then score each rollout by the dot product between its pooled
|
||||||
pairs require no env-specific oracle and use no ground-truth labels from training
|
bottleneck activations and `v_act`. That score selects which parameter block receives
|
||||||
rollouts. Generalization is tested by whether `vec`
|
the rollout's GRPO update. The hand-authored pairs require no env-specific oracle and
|
||||||
|
use no ground-truth labels from training rollouts. Generalization is tested by whether `v_act`
|
||||||
(built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
|
(built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
|
||||||
-- vector generalization, not detector-label generalization.
|
-- vector generalization, not detector-label generalization.
|
||||||
4. On-distribution contrast pairs require labels for training rollouts and therefore
|
4. On-distribution contrast pairs require labels for training rollouts and therefore
|
||||||
@@ -150,41 +159,26 @@ the route threshold.
|
|||||||
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
|
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
|
||||||
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only answer would be "waiting for your go-ahead". I'd rather you just commit and go ahead.
|
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only answer would be "waiting for your go-ahead". I'd rather you just commit and go ahead.
|
||||||
|
|
||||||
## The adapter setup (shapes), and why "same position = shrinkage" is subtle
|
## The adapter and routing setup
|
||||||
|
|
||||||
<!-- User, wtf is shrinkage? -->
|
Per target Linear, the current `lora2r` adapter has trainable
|
||||||
|
`A:[2r,d_in]` and `B:[d_out,2r]`, split into independent deployed `[:r]`
|
||||||
|
and quarantine `[r:]` blocks. Frozen initialization copies are subtracted,
|
||||||
|
making the net adapter delta exactly zero at initialization. Deployment ablation
|
||||||
|
resets the quarantine block to its initialization.
|
||||||
|
|
||||||
Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
|
For each rollout, routeA sets an output mask before the single grad-carrying
|
||||||
|
forward and backward:
|
||||||
|
|
||||||
- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
|
- keep `(m=0,d=0)`: only the deployed block trains.
|
||||||
(deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
|
- absorb `(m=1,d=0)`: both blocks train, which may permit absorption.
|
||||||
scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
|
- route `(m=1,d=1)`: only the quarantine block trains; the deployed output remains
|
||||||
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
|
in the forward pass but is detached.
|
||||||
FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
|
|
||||||
Shared B -> `A.grad == A_hack.grad` pre-routing.
|
|
||||||
|
|
||||||
The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
|
The gate reads pooled activations, not gradients. Its masks determine which block
|
||||||
ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
|
receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
|
||||||
`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
|
tests whether direction discrimination adds value beyond quarantine-induced
|
||||||
That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
|
shrinkage; compare its measured `qmass` because routing mass is not controlled.
|
||||||
it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.
|
|
||||||
|
|
||||||
Shrinkage is NOT inevitable. Two things break it:
|
|
||||||
1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
|
|
||||||
full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
|
|
||||||
solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
|
|
||||||
separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
|
|
||||||
doomed to shrinkage than PiSSA even with shared B.
|
|
||||||
2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
|
|
||||||
own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
|
|
||||||
∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
|
|
||||||
update.
|
|
||||||
|
|
||||||
So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
|
|
||||||
structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
|
|
||||||
empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
|
|
||||||
routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
|
|
||||||
trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
|
|
||||||
|
|
||||||
## Extra instructions:
|
## Extra instructions:
|
||||||
|
|
||||||
@@ -214,13 +208,15 @@ For the setup, read these:
|
|||||||
the claim -- "the tests passed" means nothing if the property was never tested.
|
the claim -- "the tests passed" means nothing if the property was never tested.
|
||||||
|
|
||||||
On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
|
On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
|
||||||
reward-hack, neg=the correct solution, vector = grad(prompt+hack) - grad(prompt+clean).
|
reward-hack, neg=the correct solution. The current vector is the mean paired
|
||||||
|
difference in pooled deployed-block bottleneck activations.
|
||||||
Like persona steering pairs, MATCH everything but the axis -- same
|
Like persona steering pairs, MATCH everything but the axis -- same
|
||||||
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
|
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
|
||||||
(else style competes with the trait; see the style-confound section of the doc below).
|
(else style competes with the trait; see the style-confound section of the doc below).
|
||||||
There is NO problem_id semantics: the only "id" is which completion is the hack side
|
There is NO problem_id semantics: the only "id" is which completion is the hack side
|
||||||
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
|
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
|
||||||
are DISTINCT (different gradient). Authored pairs are off-distribution and hand-written;
|
are DISTINCT because the prompt changes the activations. Authored pairs are
|
||||||
|
off-distribution and hand-written;
|
||||||
pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and
|
pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and
|
||||||
are unsuitable for the primary oracle-free result.
|
are unsuitable for the primary oracle-free result.
|
||||||
- ./docs/personas/how_to_rewrite_pairs.md
|
- ./docs/personas/how_to_rewrite_pairs.md
|
||||||
@@ -240,4 +236,3 @@ For the original paper (the substrate: reward-hacking LeetCode env)
|
|||||||
For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
|
For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
|
||||||
the original; source of the absorption/leakage vocab)
|
the original; source of the absorption/leakage vocab)
|
||||||
- ./docs/papers/grad_routing/paper_sgtm.md
|
- ./docs/papers/grad_routing/paper_sgtm.md
|
||||||
|
|
||||||
|
|||||||
@@ -1,15 +1,16 @@
|
|||||||
# vGROUT
|
# vGROUT
|
||||||
|
|
||||||
**vGROUT** (vector gradient routing): route the GRPO gradient against an
|
**vGROUT** (vector gradient routing): use an extracted activation-space
|
||||||
extracted reward-hacking direction so the deployed model can't learn the hack,
|
reward-hacking direction to route each rollout's GRPO update into deployed or
|
||||||
while preserving coding performance. A representation-routing variant of gradient routing
|
quarantine parameters, aiming to reduce reward hacking retained at deployment
|
||||||
(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
|
while preserving coding performance. It is a representation-gated variant of
|
||||||
direction rather than a per-example data label.
|
gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
|
||||||
|
the per-example data label that normally selects the gradient route.
|
||||||
|
|
||||||
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
|
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
|
||||||
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
|
LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
|
||||||
"Advantage Modification") by intervening at the gradient level rather than the
|
2026), the method leaves rewards and advantages unchanged. Pooled activations
|
||||||
advantage level.
|
select which adapter block receives each rollout's gradient update.
|
||||||
|
|
||||||
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
|
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
|
||||||
and [docs/papers/](docs/papers/).
|
and [docs/papers/](docs/papers/).
|
||||||
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):
|
|||||||
|
|
||||||
To get the hack direction we pair examples by hand: for each problem, one
|
To get the hack direction we pair examples by hand: for each problem, one
|
||||||
correct completion and one completion that exploits the evaluation procedure.
|
correct completion and one completion that exploits the evaluation procedure.
|
||||||
For each pair we run a forward pass and read the bottleneck activation `A@x`
|
For each pair we run a forward pass and read the deployed-block bottleneck
|
||||||
(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
|
activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
|
||||||
completion tokens. The per-module mean hack-minus-clean activation difference,
|
mean hack-minus-clean activation difference, unit-normalized per module, is
|
||||||
unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
|
`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
|
||||||
Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
|
backward pass. The hand-authored pair sides provide the only hack/clean labels;
|
||||||
are off-distribution and the *only* routing-label source. No oracle or
|
no oracle or ground-truth label from a training rollout is used during training.
|
||||||
ground-truth label from a training rollout is used during training.
|
|
||||||
|
|
||||||
At training time routeA scores each rollout on the no-grad `logp_old` forward it
|
At training time routeA scores each rollout on the no-grad `logp_old` forward it
|
||||||
already needs: an activation-capture hook pools the same bottleneck activations
|
already needs: an activation-capture hook pools the same bottleneck activations
|
||||||
over completion tokens, and the score is the pooled dot product with `v_act`.
|
over completion tokens, and the score is the pooled dot product with `v_act`.
|
||||||
Thresholds come from a rolling buffer of recent scores, z-normalized and split by
|
Thresholds come from a rolling buffer of recent scores, z-normalized and split by
|
||||||
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
|
two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
|
||||||
pins absorb. The block masks are set from those labels *before* the single
|
`route_warmup` scores, the gate pins absorb. The block masks are set from those labels *before* the single
|
||||||
masked forward+backward, so there is no second gradient pass. A rollout scoring
|
masked forward+backward, so there is no second gradient pass. A rollout scoring
|
||||||
at or above the upper threshold updates the quarantine block while its deployed
|
at or above the upper threshold updates the quarantine block while its deployed
|
||||||
branch is detached. We re-extract `v_act` every N steps (forward-only,
|
branch is detached. We re-extract `v_act` every N steps (forward-only,
|
||||||
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
|
|||||||
|
|
||||||
## Results and write-up
|
## Results and write-up
|
||||||
|
|
||||||
The paper draft is the source of truth for current numbers, figures, and the
|
The paper draft and [docs/results.md](docs/results.md) currently describe the
|
||||||
preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
|
retired gradient-scored routeV experiments. They are historical evidence, not a
|
||||||
Session-by-session findings and per-step log audits live in
|
description of routeA. Current routeA findings are recorded in
|
||||||
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
|
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
|
||||||
|
|||||||
+4
-1
@@ -1,4 +1,7 @@
|
|||||||
# AFK hourly check — current protocol
|
# AFK hourly check — archived routeV protocol
|
||||||
|
|
||||||
|
This protocol predates the activation-scored routeA method. Do not use its routeV
|
||||||
|
arm instructions for current experiments.
|
||||||
|
|
||||||
LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
|
LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
|
||||||
This doc holds the durable rules. The live plan lives in the task list (the
|
This doc holds the durable rules. The live plan lives in the task list (the
|
||||||
|
|||||||
@@ -1,4 +1,8 @@
|
|||||||
# v_hack extraction: gradient-space SVD with magnitudes
|
# Historical: v_hack extraction in gradient space
|
||||||
|
|
||||||
|
This document describes the retired gradient-scored method. The current routeA
|
||||||
|
method extracts `v_act` from pooled bottleneck activations with forward passes;
|
||||||
|
see `README.md` and `src/vgrout/extract_vhack_act.py`.
|
||||||
|
|
||||||
Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
|
Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
|
||||||
This explains *what we extract* and *why*.
|
This explains *what we extract* and *why*.
|
||||||
|
|||||||
@@ -98,8 +98,9 @@ cached directions cannot silently load against a changed subset.
|
|||||||
|
|
||||||
## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
|
## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
|
||||||
|
|
||||||
Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
|
Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8
|
||||||
classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
|
`behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85),
|
||||||
|
but n=8 is fragile. That diagnostic predates the current routeA activation gate. Wave 2
|
||||||
keeps the load-bearing properties (concrete code action, gaming a verification
|
keeps the load-bearing properties (concrete code action, gaming a verification
|
||||||
proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
|
proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
|
||||||
permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
|
permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
|
||||||
|
|||||||
+7
-2
@@ -1,4 +1,9 @@
|
|||||||
# Results, organized by the question each run answers
|
# Historical routeV results, organized by the question each run answers
|
||||||
|
|
||||||
|
These results describe the retired gradient-scored routeV method. They remain
|
||||||
|
valid evidence about those runs, but they are not results for the current
|
||||||
|
activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
|
||||||
|
findings.
|
||||||
|
|
||||||
Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
|
Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
|
||||||
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
|
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
|
||||||
@@ -21,7 +26,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
|
## Q14. routeV deploy on the recency-clean eval2 test set
|
||||||
|
|
||||||
<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
|
<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
|
||||||
for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
|
for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
|
||||||
|
|||||||
@@ -0,0 +1,61 @@
|
|||||||
|
Let me read all the requested files directly. Let me try to get the diff with alternative approaches, and read the paper/docs that are being changed. Now let me read the research journal entry about the routeA transition to verify the change from gradient-scored to activation-scored. Now I have all the information needed. Let me compile the full review.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Review: Documentation Changes vs. Current Implementation
|
||||||
|
|
||||||
|
I read the following files: `README.md`, `AGENTS.md`, `src/vgrout/train.py`, `src/vgrout/extract_vhack_act.py`, `src/vgrout/lora2r.py`, `docs/results.md`, `docs/writeup/main.tex`, `docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md`, `docs/brainstorm/extracted_prefs.md`, `docs/spec/20260611_act_gate_spec.md`, and `RESEARCH_JOURNAL.md`.
|
||||||
|
|
||||||
|
### Findings
|
||||||
|
|
||||||
|
#### (1) Activation scores select routes; output masks route subsequent gradients ✅ CLEAN
|
||||||
|
|
||||||
|
The documentation accurately describes the routeA mechanism. Evidence:
|
||||||
|
|
||||||
|
- **README.md** (~L44-51): "routeA scores each rollout on the no-grad `logp_old` forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with `v_act` … block masks are set from those labels *before* the single masked forward+backward."
|
||||||
|
- **AGENTS.md** (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
|
||||||
|
- **train.py L488-509**: `ActCapture` hooks capture `A[:r] @ x` during the no-grad `logpi_old` forward; `dots = torch.einsum("gmr,mr->g", acts, v_act)`; `_routeA_gate(dots)` returns `(m_vec, d_vec)` pinned on `layer._lora2r_mask`; then a single `loss.backward()` where `lora2r.py`'s hook applies `dep.detach()` and `m*quar`.
|
||||||
|
- **lora2r.py L106-117**: masks act on branch *outputs*, so `d=1` blocks grads to both A rows and B columns of the deployed block, and `m=0` zeroes the quarantine block in both forward and backward.
|
||||||
|
|
||||||
|
The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
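A minimal sketch of the thresholding step in that flow, assuming a brute-force two-threshold Otsu over a z-normalized score buffer. The function names `two_threshold_otsu` and `gate_masks` and the default `route_warmup=8` are hypothetical and not taken from `train.py`; only the mask rule `m = (z > t_lo)`, `d = (z >= t_hi)` and the absorb pin during warm-up come from the documents reviewed here.

```python
from statistics import mean, pstdev


def two_threshold_otsu(values):
    """Return (t_lo, t_hi) that minimize the within-class sum of squares
    over three classes. Minimizing within-class scatter is equivalent to
    Otsu's maximization of between-class variance.
    Brute force over cut positions, so only suitable for small buffers."""
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three scores")

    def sse(chunk):
        mu = mean(chunk)
        return sum((x - mu) ** 2 for x in chunk)

    best = None
    for i in range(1, n - 1):        # first cut: xs[:i] | xs[i:]
        for j in range(i + 1, n):    # second cut: xs[i:j] | xs[j:]
            cost = sse(xs[:i]) + sse(xs[i:j]) + sse(xs[j:])
            if best is None or cost < best[0]:
                # place each threshold midway between neighbouring scores
                best = (cost, (xs[i - 1] + xs[i]) / 2, (xs[j - 1] + xs[j]) / 2)
    return best[1], best[2]


def gate_masks(buffer, score, route_warmup=8):
    """Map one rollout score to (m, d) using thresholds from the buffer.
    keep=(0,0), absorb=(1,0), route=(1,1)."""
    if len(buffer) < route_warmup:
        return 1, 0                  # warm-up: pin absorb
    mu = mean(buffer)
    sd = pstdev(buffer) or 1.0       # guard against a constant buffer
    z_buf = [(s - mu) / sd for s in buffer]
    t_lo, t_hi = two_threshold_otsu(z_buf)
    z = (score - mu) / sd
    return int(z > t_lo), int(z >= t_hi)
```

With a buffer made of three well-separated score clusters, the low cluster maps to keep, the middle one to absorb, and the high one to route. The brute-force search is quadratic in the buffer size; a histogram-based Otsu would be the usual choice for large buffers.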
|
||||||
|
|
||||||
|
#### (2) No current-method prose implies live-gradient cosine scoring ✅ CLEAN
|
||||||
|
|
||||||
|
All files describing the **current** method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:
|
||||||
|
|
||||||
|
- **README.md L100-103**: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
|
||||||
|
- **docs/writeup/main.tex L1-3**: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
|
||||||
|
- **docs/results.md L1-4**: "These results describe the retired gradient-scored routeV method."
|
||||||
|
- **docs/blog/...md**: dated 2026-05-29, clearly a pre-routeA draft.
|
||||||
|
- **docs/brainstorm/extracted_prefs.md**: dated 2026-05-23, the original project brainstorming doc, not current-method prose.
|
||||||
|
|
||||||
|
The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.
|
||||||
|
|
||||||
|
#### (3) Historical routeV results preserved rather than relabeled routeA ✅ CLEAN
|
||||||
|
|
||||||
|
- **docs/results.md** consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
|
||||||
|
- **docs/writeup/main.tex** calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
|
||||||
|
- **README.md** refers to "routeV" as the retired method, with separate routeA description.
|
||||||
|
- **docs/spec/20260611_act_gate_spec.md**: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."
|
||||||
|
|
||||||
|
No relabeling found. The old results are preserved under their original name "routeV."
|
||||||
|
|
||||||
|
#### (4) Claims about lora2r shapes/masks are exact ✅ CLEAN
|
||||||
|
|
||||||
|
Every documented shape and mask claim matches the implementation:
|
||||||
|
|
||||||
|
| Claim | Source | Code verification |
|
||||||
|
|---|---|---|
|
||||||
|
| `A:[2r, d_in]`, `B:[d_out, 2r]`, both trainable | README L23-24, AGENTS.md L267-268 | `lora2r.py L127`: `A0 = torch.randn(2*r, d_in, ...)`, `B0 = torch.randn(d_out, 2*r, ...)`; `A = nn.Parameter(A0.clone())`, `B = nn.Parameter(B0.clone())` |
|
||||||
|
| Deployed `[:r]`, quarantine `[r:]` | README L24-25, AGENTS.md L268 | `lora2r.py L98-102`: `h[..., :r]` with `B[:, :r]`; `h[..., r:]` with `B[:, r:]` |
|
||||||
|
| No cross terms between blocks | README L25-26 | `lora2r.py L98-102`: column k of B only multiplies row k of A's output; `F.linear(h[..., :r], B[:, :r])` and `F.linear(h[..., r:], B[:, r:])` are independent |
|
||||||
|
| Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | `lora2r.py L99-102`: `dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r])`; at init A=A0, B=B0 so dep=quar=0 |
|
||||||
|
| Masks `(m,d)`: clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | `lora2r.py L108-114`: `dep = ((1-d)*dep + d*dep.detach())`, `quar = m*quar`; `train.py L508-510`: m = (z > t_lo), d = (z >= t_hi) |
|
||||||
|
| Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | `lora2r.py`: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
|
||||||
|
| Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by `lora2r.py` design: A0/B0 frozen, quarantine is `A[r:], B[:,r:]` minus init |
|
||||||
|
|
||||||
|
One minor terminology note (not a correctness issue): README.md uses "clean/mid/hack" while AGENTS.md and the code use "keep/absorb/rout" for the same (m,d) pairs `(0,0)/(1,0)/(1,1)`. Both are internally consistent and map correctly.
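The shape and ablation claims in the table can be illustrated with a forward-value sketch in NumPy. This is an assumption-laden toy, not the code in `lora2r.py`: the class name `Lora2rSketch` is hypothetical, inputs are single vectors, and the `d` mask is omitted because detaching the deployed output changes only the backward pass, not the forward values shown here.

```python
import numpy as np


class Lora2rSketch:
    """Toy two-block adapter: rows/columns [:r] are deployed, [r:] quarantine."""

    def __init__(self, d_in, d_out, r, seed=0):
        rng = np.random.default_rng(seed)
        self.r = r
        # Frozen initialization copies (never updated).
        self.A0 = rng.standard_normal((2 * r, d_in))
        self.B0 = rng.standard_normal((d_out, 2 * r))
        # Trainable copies start equal to the frozen ones.
        self.A = self.A0.copy()
        self.B = self.B0.copy()

    def delta(self, x, m=1.0):
        """Adapter output for one input vector x of shape [d_in].
        m gates the quarantine block's output."""
        r = self.r
        h, h0 = self.A @ x, self.A0 @ x
        # Each block subtracts its frozen copy, so both are zero at init.
        dep = self.B[:, :r] @ h[:r] - self.B0[:, :r] @ h0[:r]
        quar = self.B[:, r:] @ h[r:] - self.B0[:, r:] @ h0[r:]
        return dep + m * quar

    def ablate_quarantine(self):
        """Deployment ablation: reset the quarantine block to its init."""
        r = self.r
        self.A[r:] = self.A0[r:]
        self.B[:, r:] = self.B0[:, r:]
```

Because the two blocks use disjoint rows of `A` and disjoint columns of `B`, resetting the quarantine block removes its learned delta and leaves the deployed block's contribution unchanged.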
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Summary: No findings. All four review areas are clean.** The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.
|
||||||
@@ -0,0 +1,121 @@
|
|||||||
|
# Activation-routing documentation audit
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
Update current project documentation to describe the implemented activation-scored
|
||||||
|
routing method rather than the retired live-gradient-scored method. Prioritize
|
||||||
|
`README.md` and `AGENTS.md`.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
In: current, first-party documentation whose method descriptions contradict `routeA`.
|
||||||
|
Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
|
||||||
|
and old result narratives that explicitly describe earlier experiments.
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
- R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
|
||||||
|
without claiming the current method scores or modifies live gradients.
|
||||||
|
Done means: all current-method claims agree with `src/vgrout/train.py` and
|
||||||
|
`src/vgrout/extract_vhack_act.py`.
|
||||||
|
- R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
|
||||||
|
background descriptions of Gradient Routing and SGTM.
|
||||||
|
Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
|
||||||
|
work remains unchanged.
|
||||||
|
- R3: directly related current docs are audited for the same stale claims.
|
||||||
|
Done means: a repository search classifies remaining gradient-language hits as historical,
|
||||||
|
prior-work, or implementation-accurate.

## Tasks
- [x] T1 (R1, R2): Audit code and prioritized docs.
  - verify: compare terminology against `src/vgrout/train.py`,
    `src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
  - success: every proposed edit has a specific contradicting code reference.
  - likely_fail: broad replacement corrupts prior-work descriptions.
  - sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
  - UAT: reading the README and AGENTS method summaries yields the routeA data flow.
- [x] T2 (R1, R2, R3): Edit current docs.
  - verify: `git diff --check` and focused stale-term search.
  - success: current-method stale claims are absent from prioritized docs.
  - likely_fail: stale `v_grad` or live-gradient cosine claims remain.
  - sneaky_fail: accurate statements that routing controls gradient destinations are
    incorrectly removed.
  - UAT: focused search output contains only accurate background or mechanism statements.
- [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
  - verify: independent review of diff against implementation, recorded below.
  - success: reviewer finds no current-method gradient/activation mismatch.
  - likely_fail: reviewer identifies a stale or overcorrected claim.
  - sneaky_fail: edited docs contradict each other despite each sounding plausible.
  - UAT: this file contains verification output and review result.

## Context
Current routeA data flow:
1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
3. Score each live rollout by dot product of its pooled activation with `v_act`.
4. Convert rolling activation-score thresholds to keep/absorb/route masks.
5. Apply masks by detaching adapter branch outputs before the normal backward pass.
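
Steps 2 to 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/vgrout/train.py` or `src/vgrout/extract_vhack_act.py`: the names `extract_v_act`, `score`, and `RollingRouter`, the quantile-based thresholds, the keep-during-warmup default, and the ordering (low score keeps, middle absorbs, high routes) are all hypothetical.

```python
import math
from collections import deque


def extract_v_act(hack_acts, clean_acts):
    """Normalized mean hack-minus-clean pooled activation for one module."""
    dim = len(hack_acts[0])
    n = len(hack_acts)
    mean_diff = [
        sum(h[i] - c[i] for h, c in zip(hack_acts, clean_acts)) / n
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("hack and clean activations have identical means")
    return [x / norm for x in mean_diff]


def score(pooled_act, v_act):
    """Activation score: dot product of a rollout's pooled activation with v_act."""
    return sum(a * v for a, v in zip(pooled_act, v_act))


class RollingRouter:
    """Turns scores into keep/absorb/route decisions using rolling thresholds.

    Assumption: thresholds are quantiles of a bounded buffer of past scores,
    and `warmup` (minimum history) is distinct from `capacity` (buffer size).
    """

    def __init__(self, capacity=64, warmup=8, lo_q=0.5, hi_q=0.9):
        self.buffer = deque(maxlen=capacity)
        self.warmup = warmup
        self.lo_q = lo_q
        self.hi_q = hi_q

    def _quantile(self, sorted_vals, q):
        # Simple index-based quantile; clamps to the last element.
        idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
        return sorted_vals[idx]

    def route(self, s):
        if len(self.buffer) < self.warmup:
            decision = "keep"  # assumed default before enough history exists
        else:
            vals = sorted(self.buffer)
            lo = self._quantile(vals, self.lo_q)
            hi = self._quantile(vals, self.hi_q)
            if s >= hi:
                decision = "route"
            elif s >= lo:
                decision = "absorb"
            else:
                decision = "keep"
        self.buffer.append(s)  # the current score only affects later thresholds
        return decision
```

In use, `extract_v_act` would run once per module on the authored pairs, and `RollingRouter.route(score(pooled, v_act))` would run per live rollout. The sketch scores against history only, so a rollout never sets its own threshold; whether the implementation does the same is not stated here.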

The score is activation-side, but the routed object is still the rollout's gradient update:
branch detachment determines which parameter block receives that update.
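
One way to write this down, as a hedged sketch rather than a transcription of `src/vgrout/lora2r.py`: assume a frozen base weight $W$, adapter branches indexed by $k$ with parameters $\theta_k$ and outputs $b_k(x;\theta_k)$, a per-rollout binary mask $m_k \in \{0,1\}$ selected from the activation score, and a stop-gradient operator $\operatorname{sg}$ (detach). These symbols are illustrative assumptions.

$$
y = W x + \sum_k \bigl[\, m_k\, b_k + (1 - m_k)\, \operatorname{sg}(b_k) \,\bigr]
$$

$$
\frac{\partial \mathcal{L}}{\partial \theta_k}
= m_k \left(\frac{\partial b_k}{\partial \theta_k}\right)^{\!\top} \frac{\partial \mathcal{L}}{\partial y}
$$

Because $\operatorname{sg}$ is the identity in the forward pass, $y = Wx + \sum_k b_k$ regardless of the masks, so the rollout's loss is unchanged. Only the backward pass differs: a branch with $m_k = 0$ receives no update to its own parameters from that rollout through this layer. Which mask pair each of keep, absorb, and route selects is not specified by this sketch.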

## Log

- `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
  output masks then determine the destination of the normal GRPO gradient update.
- `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
  were marked historical rather than mechanically rewritten as routeA evidence.
- `docs/human_journal.md` already had user edits and remains untouched.
- Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
  uncontrolled routing mass described as matched, rank-2r instead of deployed
  rank-r activation capture, warmup conflated with full buffer capacity, pair
  labels described as no labels, and precision preference described as implemented.
  All six were corrected.

## TODO

## Errors
| Task | Error | Resolution |
|------|-------|------------|

## Results

| Claim | Documentation | Implementation proof |
|---|---|---|
| Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
| Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
| Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
| RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |

## Verify

`git diff --check` produced no output.

Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
explicit statement that the paper/results are historical routeV evidence.
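
A focused stale-term search of this kind could look roughly like the sketch below. It is an illustration only: the actual search command is not recorded in this file, and the patterns, the `find_stale_hits` name, and the keyword-based "historical" classification are assumptions that simplify the historical / prior-work / implementation-accurate triage described in R3.

```python
import re

# Hypothetical patterns for the stale terms named in this audit.
STALE_PATTERNS = {
    "v_grad": re.compile(r"\bv_grad\b"),
    "gradient-cosine": re.compile(r"gradient[- ]cosine", re.IGNORECASE),
    "frozen-B": re.compile(r"frozen[- ]B\b"),
    "routeV": re.compile(r"\brouteV\b"),
}

# Assumed heuristic: a hit on a line that flags itself as historical or
# prior work is treated as acceptable; everything else needs a human look.
HISTORICAL_MARKERS = ("historical", "prior work", "prior-work")


def find_stale_hits(text):
    """Return (line_number, term, kind) for every stale-term match in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for term, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                is_historical = any(m in lowered for m in HISTORICAL_MARKERS)
                kind = "historical" if is_historical else "stale"
                hits.append((lineno, term, kind))
    return hits
```

Running `find_stale_hits(open("README.md").read())` and checking that every hit has kind `"historical"` would approximate the PASS condition above. A line-level keyword heuristic can misclassify, so the remaining hits still warrant manual reading.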

## Failure mode check

- likely_fail: stale prioritized-doc gradient scoring would appear in the focused
  search. Actual: no stale current-method hit. PASS.
- sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
  output masks route the subsequent update, matching `train.py:895-896` and
  `lora2r.py:75-83`. PASS.
- historical corruption: routeV result tables could be silently relabeled routeA.
  Actual: tables remain routeV and now carry historical headers. PASS.

## Review

Fresh-eyes subagent review initially found six inaccuracies, all corrected before
final verification. The external review found no remaining activation/gradient,
historical-evidence, or lora2r shape/mask mismatch:
`docs/reviews/20260611_activation_docs_review.md`.

## Files

- `/workspace/projected_grpo/README.md`
- `/workspace/projected_grpo/AGENTS.md`
- `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
- `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`

## Next

Done.
@@ -1,3 +1,6 @@
% HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the
% current activation-scored routeA implementation. See README.md and
% RESEARCH_JOURNAL.md before revising claims or results.
% gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
% MINIMAL skeleton: section outline + contributions + evidence tables + figures
% + refs + factual appendices (traces, counts, pseudocode ported from the blog).