mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
misc
@@ -2,7 +2,7 @@

 **vGROUT** (vector gradient routing): route the GRPO gradient against an
 extracted reward-hacking direction so the deployed model can't learn the hack,
-without tanking pass rate. A representation-routing variant of gradient routing
+while preserving coding performance. A representation-routing variant of gradient routing
 (Cloud et al.; Shilov et al.), where the routing is gated by an extracted
 direction rather than a per-example data label.

@@ -22,43 +22,43 @@ subtracted in the hook so the net delta is exactly 0 at init. The `2r` rows/cols
 split into a **deployed block** `[:r]` and a **quarantine block** `[r:]`. Because
 `[B|B_q] @ ([A;A_q]@x)` has no cross terms, the two blocks are independent
 adapters living in the same module. At deployment the quarantine is **ablated**
-(reset to its init), so anything learned there never ships.
+(reset to its initialization), so its learned contribution is absent from the
+deployed model.

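The no-cross-terms claim can be checked numerically. The following is a minimal sketch in plain Python, not code from the repository: the toy shapes (rank 1 per block, 2-dimensional input and output), the matrix values, and the helper `matvec` are all hypothetical. It shows that `[B|B_q] @ ([A;A_q] @ x)` equals `B @ (A @ x) + B_q @ (A_q @ x)`, so dropping the quarantine term leaves the deployed term unchanged.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Hypothetical toy adapter: rank r = 1 per block, d_in = d_out = 2.
A = [[1.0, 2.0]]      # deployed down-projection, shape (r, d_in)
A_q = [[0.5, -1.0]]   # quarantine down-projection, shape (r, d_in)
B = [[1.0], [0.0]]    # deployed up-projection, shape (d_out, r)
B_q = [[2.0], [3.0]]  # quarantine up-projection, shape (d_out, r)
x = [1.0, 1.0]

A_stack = A + A_q                              # [A; A_q], rows stacked
B_concat = [b + bq for b, bq in zip(B, B_q)]   # [B | B_q], columns joined

joint = matvec(B_concat, matvec(A_stack, x))   # one module with 2r rows/cols
deployed = matvec(B, matvec(A, x))             # deployed block alone
quarantine = matvec(B_q, matvec(A_q, x))       # quarantine block alone
separate = [d + q for d, q in zip(deployed, quarantine)]
```

Here `joint` and `separate` agree, which is the sense in which the two blocks behave as independent adapters. The sketch ignores the init-subtraction done in the hook.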
-Per rollout we apply an SGTM-style three-way hard mask `(m, d)` to the block
+Per rollout we apply a three-way output mask `(m, d)` to the block
 outputs (`m` = quarantine on/off, `d` = deployed detach):

 - **clean** `(0,0)` -- only the deployed block trains (quarantine zero in fwd+bwd).
 - **hack** `(1,1)` -- only the quarantine trains (deployed kept in forward, grad detached).
-- **mid** `(1,0)` -- both train (absorption).
+- **mid** `(1,0)` -- both train, which may permit absorption.

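The three mask settings in the list above can be summarized as a lookup table. This is an illustrative sketch only: the names `MASKS` and `block_roles` are hypothetical and not taken from the repository, and the real masking happens on tensors inside a forward hook rather than on booleans.

```python
# (m, d): m = quarantine on/off, d = deployed detach, as listed above.
MASKS = {"clean": (0, 0), "hack": (1, 1), "mid": (1, 0)}

def block_roles(label):
    """Describe which block contributes to forward and which receives gradient."""
    m, d = MASKS[label]
    return {
        "deployed_in_forward": True,      # deployed output is kept in every case
        "deployed_gets_grad": d == 0,     # d = 1 detaches the deployed branch
        "quarantine_in_forward": m == 1,  # m = 0 zeroes the quarantine output
        "quarantine_gets_grad": m == 1,   # a zeroed output carries no gradient
    }
```

For example, `block_roles("hack")` reports that the deployed block stays in the forward pass but receives no gradient, while the quarantine block trains.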
 ## How it works

 To get the hack direction we pair examples by hand: for each problem, one
-completion that solves it honestly and one that uses the kind of trick the model
-would learn to exploit. For each pair we compute the *exact GRPO gradient* you'd
-get if the hack rollout had advantage +1 and the clean rollout had advantage -1
+correct completion and one completion that exploits the evaluation procedure.
+For each pair we compute the *exact GRPO gradient* that would result if the hack
+rollout had advantage +1 and the clean rollout had advantage -1
 (`-grad logp(hack) + grad logp(clean)`), read in the per-module c-probe space (a
 virtual diagonal scale between `A` and `B`). The mean hack-minus-clean direction,
 normalized per module, is `v_grad`. (Mechanically identical to a twin-NLL
 extraction, since GRPO with adv=+/-1 reduces to the NLL difference; the GRPO
 framing is the one we mean.) The hand-authored pairs are off-distribution and the
-*only* routing-label source -- no oracle/ground-truth label of a live rollout ever
-touches training.
+*only* routing-label source. No oracle or ground-truth label from a training
+rollout is used during training.

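The extraction for a single module can be sketched as follows. This is a hedged illustration, not the repository's implementation: the function name `extract_v_grad` is hypothetical, the inputs are assumed to be precomputed `grad logp` vectors in that module's c-probe space, and the sign convention simply follows the formula in parentheses above.

```python
import math

def extract_v_grad(hack_grads, clean_grads):
    """Mean per-pair gradient for one module, L2-normalized.

    hack_grads[i] and clean_grads[i] are assumed to hold grad logp of the
    hack and clean completion of authored pair i (lists of floats).
    """
    n = len(hack_grads)
    dim = len(hack_grads[0])
    mean = [0.0] * dim
    for g_hack, g_clean in zip(hack_grads, clean_grads):
        for k in range(dim):
            # adv(hack) = +1, adv(clean) = -1:
            # -grad logp(hack) + grad logp(clean)
            mean[k] += (-g_hack[k] + g_clean[k]) / n
    norm = math.sqrt(sum(v * v for v in mean))
    if norm == 0.0:
        raise ValueError("pairs give a zero mean gradient; no direction to extract")
    return [v / norm for v in mean]

# Two hypothetical authored pairs in a 2-dimensional c-probe space.
v = extract_v_grad([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 4.0]])
```

Running this per module and keeping one unit vector per module matches the "normalized per module" wording above.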
 At training time routeV runs **two passes** per prompt. Pass 1 is an unmasked
 forward+backward whose gradient w.r.t. the c-probes gives, per rollout, a
 band-normalized cosine to `v_grad` pooled across modules -> a `{clean, mid, hack}`
 label (band edges = authored-pair p75 cosines). Pass 2 sets the block masks from
 those labels and does the real masked forward+backward; the optimizer steps on
-what survives. So a rollout that looks hack-ward routes its update into the
-throwaway quarantine block, and the deployed model never descends the hack
-gradient. We re-extract `v_grad` every N steps so it tracks where hacks separate
-now, not at step 0.
+what survives. A rollout with high alignment to the reward-hacking direction updates
+the quarantine block while its deployed branch is detached. We re-extract `v_grad`
+every N steps so it tracks the current separation rather than only the initial model.

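The labeling step of Pass 1 can be sketched roughly as below. Several details are assumptions, since the text does not spell them out: pooling is taken to be a plain mean of per-module cosines, band normalization is reduced to two thresholds, and the names `route_label`, `lo_edge`, and `hi_edge` are hypothetical. In the described method the band edges come from the authored-pair p75 cosines.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def route_label(module_grads, module_v_grads, lo_edge, hi_edge):
    """Label one rollout from its per-module c-probe gradients.

    Assumption: pool by averaging per-module cosines, then compare the
    pooled value against two band edges.
    """
    cosines = [cosine(g, v) for g, v in zip(module_grads, module_v_grads)]
    pooled = sum(cosines) / len(cosines)
    if pooled >= hi_edge:
        return "hack"
    if pooled <= lo_edge:
        return "clean"
    return "mid"
```

Pass 2 would then look up the `(m, d)` mask for the returned label and rerun the forward+backward with that mask applied.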
 Whether the *direction* (not just the *act* of routing) drives suppression is the
 open question -- the placebo control (Haar-random `v_grad`, same routing
-machinery) must NOT match real `v_grad`. We watch `qmass` (share of the update
-landing in quarantine) and the per-rollout zone shares (`keep/resid/rout`).
+machinery) must NOT match real `v_grad`. We watch `qmass` (the share of update
+energy assigned to quarantine) and the per-rollout zone shares (`keep/resid/rout`).

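Two pieces of this control are easy to illustrate. A uniformly random unit direction can be drawn by normalizing a vector of independent standard Gaussians, which is one standard way to obtain a Haar-random direction. The `qmass` function below reads "update energy" as squared L2 norm, which is an assumption; the repository's exact definition is not shown here. Both function names and signatures are hypothetical.

```python
import math
import random

def haar_random_direction(dim, seed):
    """Unit vector drawn uniformly from the sphere via normalized Gaussians."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def qmass(deployed_update, quarantine_update):
    """Share of squared update norm landing in the quarantine block.

    Assumption: "energy" means squared L2 norm of each block's update.
    """
    e_d = sum(x * x for x in deployed_update)
    e_q = sum(x * x for x in quarantine_update)
    total = e_d + e_q
    return e_q / total if total > 0.0 else 0.0

placebo = haar_random_direction(8, seed=0)
share = qmass([3.0], [4.0])
```

With a fixed seed the placebo direction is reproducible, so the placebo arm can reuse the same routing machinery with only the direction swapped.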
 ## What we compare

@@ -76,13 +76,13 @@ Three arms, identical model/adapter/teacher pool, differing only in the gate
 confound). The emergence reference.
 - **routeV** -- the method: per-rollout three-way gate from the c-probe gradient
 vs `v_grad`. `--routeV-random-v-seed` swaps in a Haar-random direction (placebo).
-- **absorb** -- gate pinned mid `(1,0)`: both blocks train on everything, no gate.
-Isolates the value of the gate+hard-masks vs absorption alone.
+- **absorb** -- gate pinned mid `(1,0)`: both blocks train on every rollout. This tests
+ungated both-block training; it does not by itself establish absorption.

 Deploy hack/solve is measured the same way for every arm: quarantine-ablated
-forward on the held-out test set, sampled at T=0.7. So the per-arm deploy numbers
-are apples-to-apples. (For `none` the quarantine is already at init, so ablation
-is a no-op.)
+forward on the held-out test set, sampled at T=0.7. Every arm therefore uses the same
+deployment estimator. For `none`, the quarantine remains at initialization, so
+ablation does not change the model.

 ## Quick start
