tidy

2026-06-27 16:15:35 +08:00 · 2026-06-14 11:05:54 +08:00
parent c4ac632b37
commit cca7150ea0
425 changed files with 536 additions and 48617 deletions
@@ -1,110 +1 @@
-# vGROUT
-
-**vGROUT** (vector gradient routing): use an extracted activation-space
-reward-hacking direction to route each rollout's GRPO update into deployed or
-quarantine parameters, aiming to reduce reward hacking retained at deployment
-while preserving coding performance. It is a representation-gated variant of
-gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
-the per-example data label that normally selects the gradient route.
-
-Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
-LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
-2026), the method leaves rewards and advantages unchanged. Pooled activations
-select which adapter block receives each rollout's gradient update.
-
-See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
-and [docs/papers/](docs/papers/).
-
-## The adapter
-
-Every target Linear gets one rank-2r LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]`
-and `B:[d_out,2r]` both trainable, with frozen Gaussian-init copies `A0/B0`
-subtracted in the hook so the net delta is exactly 0 at init. The `2r` rows/cols
-split into a **deployed block** `[:r]` and a **quarantine block** `[r:]`. Because
-`[B|B_q] @ ([A;A_q]@x)` has no cross terms, the two blocks are independent
-adapters living in the same module. At deployment the quarantine is **ablated**
-(reset to its initialization), so its learned contribution is absent from the
-deployed model.
-
-Per rollout we apply a three-way output mask `(m, d)` to the block
-outputs (`m` = quarantine on/off, `d` = deployed detach):
-
- **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
- **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
- **mid** `(1,0)` -- both train, which may permit absorption.
-
-## How it works
-
-To get the hack direction we pair examples by hand: for each problem, one
-correct completion and one completion that exploits the evaluation procedure.
-For each pair we run a forward pass and read the deployed-block bottleneck
-activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
-mean hack-minus-clean activation difference, unit-normalized per module, is
-`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
-backward pass. The hand-authored pair sides provide the only hack/clean labels;
-no oracle or ground-truth label from a training rollout is used during training.
-
-At training time routeA scores each rollout on the no-grad `logp_old` forward it
-already needs: an activation-capture hook pools the same bottleneck activations
-over completion tokens, and the score is the pooled dot product with `v_act`.
-Thresholds are the symmetric `route_tail_q` quantiles of a run-spanning score
-buffer, splitting rollouts into `{keep, absorb, rout}`; until the buffer reaches
-`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
-masked forward+backward, so there is no second gradient pass. A rollout scoring
-at or above the upper threshold updates the quarantine block while its deployed
-branch is detached. We re-extract `v_act` every N steps (forward-only,
-quarantine-ablated) so it tracks the current model; the buffer stores pooled
-activations and re-scores them against the current `v_act`, so a refresh needs
-no flush.
-
-Whether the *direction* (not just the *act* of routing) drives suppression is the
-open question -- the placebo control (Haar-random `v_act`, same routing
-machinery) must NOT match real `v_act`. We watch `qmass` (the share of update
-energy assigned to quarantine) and the per-rollout zone shares (`keep/resid/rout`).
-
-## What we compare
-
-The environment has loophole modes (e.g. `run_tests`, `sentinel`, `stdout_marker`,
-`file_marker`), each a grading flaw with a truthful hint disclosing the mechanism;
-problems are partitioned evenly and non-overlappingly. Full prompt+hint/hack/clean
-traces are in the blog appendix
-([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
-
-Three arms, identical model/adapter/teacher pool, differing only in the gate
-(`--intervention`):
-
- **none** -- gate pinned clean `(0,0)`: the quarantine never trains. The
-  capacity- and structure-matched vanilla control (same adapter, no shrinkage
-  confound). The emergence reference.
- **routeA** -- the method: per-rollout three-way gate from the pooled bottleneck
-  activation vs `v_act`. `--routeA-random-v-seed` swaps in a Haar-random direction
-  (placebo).
- **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests
-  ungated both-block training; it does not by itself establish absorption.
-
-Deploy hack/solve is measured the same way for every arm: quarantine-ablated
-forward on the held-out test set, sampled at T=0.7. Every arm therefore uses the same
-deployment estimator. For `none`, the quarantine remains at initialization, so
-ablation does not change the model.
-
-## Quick start
-
-```bash
-uv sync
-just smoke               # tiny-random model, routeA pathway + all verify gates, ~1-2 min
-just smoke-all           # vanilla + routeA + absorb back to back
-just download-model      # warm Qwen3-4B cache
-just queue-decision      # queue the 4-arm decision run (routeA real / placebo / vanilla / absorb)
-```
-
-See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
-including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
-measurements, the move to Qwen3-4B, and the PiSSA->lora2r switch (the PiSSA
-placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
-
-## Results and write-up
-
-The paper draft and [docs/results.md](docs/results.md) currently describe the
-retired gradient-scored routeV experiments. They are historical evidence, not a
-description of routeA. Current routeA findings are recorded in
-[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
+# Evil MoE