mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
docs: vocabulary section -- routing, vector, pinning, absorption
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -25,6 +25,26 @@ We want to take the tool AI labs already use, and make them better for aligment
Inherit global rules from `~/.claude/CLAUDE.md`.
## Vocabulary: routing, the vector, pinning, absorption
Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes
into the deployed block (kept at deployment) or the quarantine block (reset at
deployment, so nothing learned there survives). The costs of the two error types are
not symmetric. If a hack update is kept, the deployed model learns the hack; if a
genuine solution is routed away, we lose only one update of progress. So the gate
should be tuned for high precision and can accept low recall: it is better to route a
few updates that are almost certainly hacks than to route many and often be wrong. The
vector (`v_grad` for gradients, `v_act` for activations) is where the labels come from: it
is a hack-minus-clean direction built from hand-written contrast pairs before training,
never from labels on live rollouts (see the no-cheat rule below). The live gate only
measures how well each rollout's update aligns with that vector. Pinning is the choice
of cut points on that alignment score, dividing its range into keep, absorb, and rout
zones. Absorption is the middle zone where both blocks train and the gradient settles into
whichever block already explains it (Cloud et al.,
docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make
a hard call on ambiguous rollouts. `scripts/diag_pinning.py` measures all of this: AUROC
says whether the score separates live hacks at all, and precision at the rout cut says
whether the rout zone is pure enough to use.
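
A minimal sketch of the vocabulary above, not the repo's implementation. It assumes the vector is a mean difference over contrast pairs, that alignment is cosine similarity, and that the cut values 0.2 and 0.9 are reasonable; all function names, the `BLOCKS` table, and the toy numbers are hypothetical.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def hack_minus_clean(hack_rows, clean_rows):
    """Direction built before training: mean(hack) - mean(clean). Assumed form."""
    h, c = mean_vec(hack_rows), mean_vec(clean_rows)
    return [a - b for a, b in zip(h, c)]

def cosine(u, v):
    """Alignment score in [-1, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def zone(score, absorb_cut, rout_cut):
    """Pinning: two cut points split the score range into three zones."""
    if score >= rout_cut:
        return "rout"
    if score >= absorb_cut:
        return "absorb"
    return "keep"

# Which blocks receive the update in each zone (absorb trains both).
BLOCKS = {
    "keep": ("deployed",),
    "absorb": ("deployed", "quarantine"),
    "rout": ("quarantine",),
}

# Toy hand-written contrast pairs (made-up 2-d features).
hack_rows = [[1.0, 0.0], [0.8, 0.2]]
clean_rows = [[0.0, 1.0], [0.2, 0.8]]
v = hack_minus_clean(hack_rows, clean_rows)

for update in ([1.0, -1.0], [1.0, 0.5], [-1.0, 1.0]):
    score = cosine(update, v)
    z = zone(score, absorb_cut=0.2, rout_cut=0.9)
    print(f"score={score:+.2f} zone={z} blocks={BLOCKS[z]}")
```

A high rout cut keeps the rout zone small and pure, which matches the high-precision, low-recall preference; ambiguous scores fall into absorb instead of forcing a hard call.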
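
The two diagnostics named above can be sketched as follows. This is an illustration, not the contents of `scripts/diag_pinning.py`: the function names, the toy scores, and the cut of 0.9 are assumptions. The hack labels here are used only to evaluate the gate after the fact, never to build the vector.

```python
def auroc(scores, is_hack):
    """Chance that a random hack outscores a random clean rollout (ties count half)."""
    pos = [s for s, y in zip(scores, is_hack) if y]
    neg = [s for s, y in zip(scores, is_hack) if not y]
    if not pos or not neg:
        raise ValueError("need at least one hack and one clean rollout")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_at_cut(scores, is_hack, rout_cut):
    """Precision and recall of the rout zone, i.e. rollouts with score >= rout_cut."""
    routed = [y for s, y in zip(scores, is_hack) if s >= rout_cut]
    true_pos = sum(routed)
    total_hacks = sum(is_hack)
    precision = true_pos / len(routed) if routed else float("nan")
    recall = true_pos / total_hacks if total_hacks else float("nan")
    return precision, recall

# Toy alignment scores with after-the-fact hack labels (made-up numbers).
scores = [0.95, 0.92, 0.60, 0.40, 0.30, 0.10, -0.20, -0.50]
is_hack = [True, True, True, False, True, False, False, False]

print("AUROC:", auroc(scores, is_hack))
print("precision, recall at 0.9:", precision_recall_at_cut(scores, is_hack, 0.9))
```

In this toy data the rout zone catches only half of the hacks but contains no clean rollouts, which is the trade the gate is meant to make.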
## Things the user has had to explain many times: