mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
tidy
@@ -1,110 +1 @@
# vGROUT

**vGROUT** (vector gradient routing): use an extracted activation-space
reward-hacking direction to route each rollout's GRPO update into deployed or
quarantine parameters, aiming to reduce reward hacking retained at deployment
while preserving coding performance. It is a representation-gated variant of
gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
the per-example data label that normally selects the gradient route.

Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
2026), the method leaves rewards and advantages unchanged. Pooled activations
select which adapter block receives each rollout's gradient update.

See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
and [docs/papers/](docs/papers/).

## The adapter

Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`) with
`A:[2r,d_in]` and `B:[d_out,2r]` both trainable. Frozen copies of their
Gaussian initializations, `A0/B0`, are subtracted in the hook so the net delta
is exactly 0 at init. The `2r` rows/cols split into a **deployed block** `[:r]`
and a **quarantine block** `[r:]`. Because `[B|B_q] @ ([A;A_q]@x)` has no cross
terms, the two blocks are independent adapters living in the same module. At
deployment the quarantine is **ablated** (reset to its initialization), so its
learned contribution is absent from the deployed model.
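
The block independence described above can be checked numerically. The following is a minimal pure-Python sketch, not the code in `src/vgrout/lora2r.py`: the helper names (`matvec`, `gaussian`, `lora_term`, `delta`) are hypothetical, and it assumes the hook computes the net delta as `B @ (A @ x) - B0 @ (A0 @ x)`, which is one plausible reading of "subtracted in the hook".

```python
import copy
import random

def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in M]

def gaussian(rows, cols, rng, scale=0.1):
    """Gaussian-initialized matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def lora_term(B, A, x, rows):
    """B[:, rows] @ (A[rows] @ x): contribution of the given bottleneck rows."""
    h = matvec([A[i] for i in rows], x)
    return matvec([[b_row[i] for i in rows] for b_row in B], h)

def delta(B, A, B0, A0, x, rows):
    # ASSUMPTION: net adapter delta = trainable term minus frozen-init term.
    live = lora_term(B, A, x, rows)
    init = lora_term(B0, A0, x, rows)
    return [t - t0 for t, t0 in zip(live, init)]

rng = random.Random(0)
d_in, d_out, r = 4, 3, 2
A0, B0 = gaussian(2 * r, d_in, rng), gaussian(d_out, 2 * r, rng)
A, B = copy.deepcopy(A0), copy.deepcopy(B0)      # trainable copies start at init
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
deployed, quarantine, both = range(0, r), range(r, 2 * r), range(2 * r)

delta_init = delta(B, A, B0, A0, x, both)        # exactly zero at init

# Stand-in for training: perturb every trainable entry.
for M in (A, B):
    for row in M:
        for j in range(len(row)):
            row[j] += rng.gauss(0.0, 0.05)

full = delta(B, A, B0, A0, x, both)
dep = delta(B, A, B0, A0, x, deployed)
qua = delta(B, A, B0, A0, x, quarantine)         # full == dep + qua (no cross terms)

# Ablation: reset the quarantine rows of A and columns of B to initialization.
for i in quarantine:
    A[i] = list(A0[i])
    for o in range(d_out):
        B[o][i] = B0[o][i]
ablated = delta(B, A, B0, A0, x, both)           # only the deployed delta remains
```

Under these assumptions, `full` equals `dep + qua` up to floating-point rounding, and `ablated` equals `dep`, which is the sense in which the quarantine's learned contribution is absent after ablation.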

Per rollout we apply a three-way output mask `(m, d)` to the block
outputs (`m` = quarantine on/off, `d` = deployed detach):

- **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
- **mid** `(1,0)` -- both train, which may permit absorption.
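
The three modes above can be written down as a small lookup. This is an illustrative sketch only: `MASKS` and `route` are hypothetical names, and the reading that the masked output is `detach_if(d, deployed_out) + m * quarantine_out` is an assumption drawn from the list, not a statement about the repository's hook.

```python
# Mode name -> (m, d): m = quarantine on/off, d = deployed detach.
MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}

def route(mode):
    """Describe which blocks contribute to the forward pass and which receive gradient."""
    m, d = MASKS[mode]
    return {
        "deployed_in_forward": True,       # the deployed block is always kept in forward
        "quarantine_in_forward": bool(m),  # m = 0 zeroes the quarantine output
        "deployed_gets_grad": not d,       # d = 1 detaches the deployed branch
        "quarantine_gets_grad": bool(m),   # a zeroed output also gets zero gradient
    }

for mode in MASKS:
    print(mode, route(mode))
```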

## How it works

To get the hack direction we pair examples by hand: for each problem, one
correct completion and one completion that exploits the evaluation procedure.
For each pair we run a forward pass and read the deployed-block bottleneck
activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
mean hack-minus-clean activation difference, unit-normalized per module, is
`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
backward pass. The hand-authored pair sides provide the only hack/clean labels;
no oracle or ground-truth label from a training rollout is used during training.
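
The pooling and normalization steps for a single module can be sketched as follows. This is a toy pure-Python illustration, not `src/vgrout/extract_vhack_act.py`: the function names and the input layout (per-token activation lists plus a 0/1 completion mask) are assumptions made for the example.

```python
import math

def masked_mean_pool(acts, mask):
    """Mean of per-token vectors acts[t] over tokens where mask[t] == 1."""
    n = sum(mask)
    dim = len(acts[0])
    return [sum(a[k] for a, keep in zip(acts, mask) if keep) / n for k in range(dim)]

def extract_direction(pairs):
    """pairs: (hack_acts, hack_mask, clean_acts, clean_mask) tuples for one module.

    Returns the unit-normalized mean hack-minus-clean pooled activation.
    """
    diffs = []
    for h_acts, h_mask, c_acts, c_mask in pairs:
        h = masked_mean_pool(h_acts, h_mask)
        c = masked_mean_pool(c_acts, c_mask)
        diffs.append([hk - ck for hk, ck in zip(h, c)])
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean]

# Toy data with bottleneck width r = 2; the first hack token is a prompt
# token (mask 0) and is ignored by the pooling.
pairs = [
    ([[9.0, 9.0], [3.0, 0.0], [5.0, 0.0]], [0, 1, 1], [[1.0, 0.0], [1.0, 0.0]], [1, 1]),
    ([[0.0, 4.0]], [1], [[0.0, 0.0]], [1]),
]
v_act = extract_direction(pairs)
```

Here the two pooled differences are `[3, 0]` and `[0, 4]`, their mean is `[1.5, 2.0]`, and the normalized direction is approximately `[0.6, 0.8]`.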

At training time routeA scores each rollout on the no-grad `logp_old` forward it
already needs: an activation-capture hook pools the same bottleneck activations
over completion tokens, and the score is the dot product of the pooled
activation with `v_act`. Thresholds are the symmetric `route_tail_q` quantiles
of a run-spanning score buffer, splitting rollouts into `{keep, absorb, rout}`;
until the buffer reaches `route_warmup` scores, the gate is pinned to absorb.
The block masks are set from those labels *before* the single masked
forward+backward, so there is no second gradient pass. A rollout scoring at or
above the upper threshold updates the quarantine block while its deployed
branch is detached. We re-extract `v_act` every N steps (forward-only,
quarantine-ablated) so it tracks the current model; the buffer stores pooled
activations and re-scores them against the current `v_act`, so a refresh needs
no flush.
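
A minimal sketch of such a quantile gate is shown below. It is illustrative only: the class `RouteGate`, the linear-interpolation quantile, the choice to add the current rollout to the buffer before thresholding, and the rule that scores at or below the lower threshold map to `keep` are all assumptions, not details confirmed by the source. Only the parameter names `route_tail_q` and `route_warmup` and the labels come from the text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an ascending list (assumed convention)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

class RouteGate:
    def __init__(self, route_tail_q, route_warmup):
        self.q = route_tail_q
        self.warmup = route_warmup
        self.buffer = []  # stores pooled activations, not scores

    def label(self, pooled, v_act):
        """Return 'keep', 'absorb', or 'rout' for one rollout."""
        self.buffer.append(pooled)
        if len(self.buffer) < self.warmup:
            return "absorb"  # gate pinned until the buffer is warm
        # Re-score the whole buffer against the *current* direction,
        # so refreshing v_act needs no buffer flush.
        scores = sorted(dot(p, v_act) for p in self.buffer)
        lower = quantile(scores, self.q)
        upper = quantile(scores, 1 - self.q)
        s = dot(pooled, v_act)
        if s >= upper:
            return "rout"   # quarantine block trains, deployed branch detached
        if s <= lower:
            return "keep"   # ASSUMPTION: lower tail maps to the clean route
        return "absorb"

gate = RouteGate(route_tail_q=0.25, route_warmup=4)
v = [1.0, 0.0]
labels = [gate.label(p, v) for p in ([0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                                     [10.0, 0.0], [-5.0, 0.0], [1.0, 0.0])]
# After a direction refresh the same buffer is re-scored with the new vector.
refreshed = gate.label([10.0, 0.0], [-1.0, 0.0])
```

In this toy run the first three rollouts are pinned to `absorb` during warmup, the high-scoring rollout is labeled `rout`, the low-scoring one `keep`, and flipping the direction sends the formerly high-scoring activation to `keep` without clearing the buffer.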

Whether the *direction* (not just the *act* of routing) drives suppression is the
open question -- for the direction to matter, the placebo control (Haar-random
`v_act`, same routing machinery) must NOT match the results obtained with the
real `v_act`. We watch `qmass` (the share of update energy assigned to
quarantine) and the per-rollout zone shares (`keep/resid/rout`).
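
One standard way to draw a Haar-random (uniform on the unit sphere) direction is to normalize a vector of independent standard Gaussians. The sketch below shows that construction; the function name is hypothetical, and whether the repository's `--routeA-random-v-seed` path samples this way is an assumption.

```python
import math
import random

def haar_random_unit_vector(dim, seed):
    """Uniform random direction on the unit sphere in R^dim.

    Normalizing an isotropic Gaussian sample yields a rotation-invariant
    distribution on the sphere. ASSUMPTION: one vector per module.
    """
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

placebo = haar_random_unit_vector(dim=8, seed=0)
```

Because the placebo vector has unit norm like the extracted `v_act`, the two feed the same scoring and thresholding code and differ only in direction.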

## What we compare

The environment has loophole modes (e.g. `run_tests`, `sentinel`, `stdout_marker`,
`file_marker`), each a grading flaw with a truthful hint disclosing the mechanism;
problems are partitioned evenly across modes with no overlap. Full
prompt+hint/hack/clean traces are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

Three arms, identical model/adapter/teacher pool, differing only in the gate
(`--intervention`):

- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. This is the
  capacity- and structure-matched vanilla control (same adapter, no shrinkage
  confound) and serves as the emergence reference.
- **routeA** -- the method: per-rollout three-way gate from the pooled bottleneck
  activation vs `v_act`. `--routeA-random-v-seed` swaps in a Haar-random direction
  (placebo).
- **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests
  ungated both-block training; it does not by itself establish absorption.

Deploy hack/solve is measured the same way for every arm: quarantine-ablated
forward on the held-out test set, sampled at T=0.7, so every arm uses the same
deployment estimator. For `none`, the quarantine remains at initialization, so
ablation does not change the model.

## Quick start

```bash
uv sync
just smoke           # tiny-random model, routeA pathway + all verify gates, ~1-2 min
just smoke-all       # vanilla + routeA + absorb back to back
just download-model  # warm Qwen3-4B cache
just queue-decision  # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)
```

See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
placebo tie was shrinkage: the shared frozen basis made routing a magnitude split).

## Results and write-up

The paper draft and [docs/results.md](docs/results.md) currently describe the
retired gradient-scored routeV experiments. They are historical evidence, not a
description of routeA. Current routeA findings are recorded in
[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
# Evil MoE