This commit is contained in:
wassname
2026-06-14 11:05:54 +08:00
parent c4ac632b37
commit cca7150ea0
425 changed files with 536 additions and 48617 deletions
+1 -110
View File
@@ -1,110 +1 @@
# vGROUT
**vGROUT** (vector gradient routing): use an extracted activation-space
reward-hacking direction to route each rollout's GRPO update into deployed or
quarantine parameters, aiming to reduce reward hacking retained at deployment
while preserving coding performance. It is a representation-gated variant of
gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
the per-example data label that normally selects the gradient route.
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
2026), the method leaves rewards and advantages unchanged. Pooled activations
select which adapter block receives each rollout's gradient update.
See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
and [docs/papers/](docs/papers/).
## The adapter
Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]`
and `B:[d_out,2r]` both trainable, with frozen Gaussian-init copies `A0/B0`
subtracted in the hook so the net delta is exactly 0 at init. The `2r` rows/cols
split into a **deployed block** `[:r]` and a **quarantine block** `[r:]`. Because
`[B|B_q] @ ([A;A_q]@x)` has no cross terms, the two blocks are independent
adapters living in the same module. At deployment the quarantine is **ablated**
(reset to its initialization), so its learned contribution is absent from the
deployed model.
Per rollout we apply a three-way output mask `(m, d)` to the block
outputs (`m` = quarantine on/off, `d` = deployed detach):
- **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
- **mid** `(1,0)` -- both train, which may permit absorption.
## How it works
To get the hack direction we pair examples by hand: for each problem, one
correct completion and one completion that exploits the evaluation procedure.
For each pair we run a forward pass and read the deployed-block bottleneck
activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
mean hack-minus-clean activation difference, unit-normalized per module, is
`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
backward pass. The hand-authored pair sides provide the only hack/clean labels;
no oracle or ground-truth label from a training rollout is used during training.
At training time routeA scores each rollout on the no-grad `logp_old` forward it
already needs: an activation-capture hook pools the same bottleneck activations
over completion tokens, and the score is the pooled dot product with `v_act`.
Thresholds are the symmetric `route_tail_q` quantiles of a run-spanning score
buffer, splitting rollouts into `{keep, absorb, rout}`; until the buffer reaches
`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
masked forward+backward, so there is no second gradient pass. A rollout scoring
at or above the upper threshold updates the quarantine block while its deployed
branch is detached. We re-extract `v_act` every N steps (forward-only,
quarantine-ablated) so it tracks the current model; the buffer stores pooled
activations and re-scores them against the current `v_act`, so a refresh needs
no flush.
Whether the *direction* (not just the *act* of routing) drives suppression is the
open question -- the placebo control (Haar-random `v_act`, same routing
machinery) must NOT match real `v_act`. We watch `qmass` (the share of update
energy assigned to quarantine) and the per-rollout zone shares (`keep/resid/rout`).
## What we compare
The environment has loophole modes (e.g. `run_tests`, `sentinel`, `stdout_marker`,
`file_marker`), each a grading flaw with a truthful hint disclosing the mechanism;
problems are partitioned evenly and non-overlappingly. Full prompt+hint/hack/clean
traces are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
Three arms, identical model/adapter/teacher pool, differing only in the gate
(`--intervention`):
- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
capacity- and structure-matched vanilla control (same adapter, no shrinkage
confound). The emergence reference.
- **routeA** -- the method: per-rollout three-way gate from the pooled bottleneck
activation vs `v_act`. `--routeA-random-v-seed` swaps in a Haar-random direction
(placebo).
- **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests
ungated both-block training; it does not by itself establish absorption.
Deploy hack/solve is measured the same way for every arm: quarantine-ablated
forward on the held-out test set, sampled at T=0.7. Every arm therefore uses the same
deployment estimator. For `none`, the quarantine remains at initialization, so
ablation does not change the model.
## Quick start
```bash
uv sync
just smoke # tiny-random model, routeA pathway + all verify gates, ~1-2 min
just smoke-all # vanilla + routeA + absorb back to back
just download-model # warm Qwen3-4B cache
just queue-decision # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)
```
See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
## Results and write-up
The paper draft and [docs/results.md](docs/results.md) currently describe the
retired gradient-scored routeV experiments. They are historical evidence, not a
description of routeA. Current routeA findings are recorded in
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
# Evil MoE