mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
docs: vocabulary section -- routing, vector, pinning, absorption
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -25,6 +25,26 @@ We want to take the tool AI labs already use, and make them better for aligment
|
|||||||
|
|
||||||
Inherit global rules from `~/.claude/CLAUDE.md`.
|
Inherit global rules from `~/.claude/CLAUDE.md`.
|
||||||
|
|
||||||
|
## Vocabulary: routing, the vector, pinning, absorption
|
||||||
|
|
||||||
|
Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes
|
||||||
|
into the deployed block (kept at deployment) or the quarantine block (reset at
|
||||||
|
deployment, so nothing learned there survives). The costs of the two error types are
|
||||||
|
not symmetric. If a hack update is kept, the deployed model learns the hack; if a
|
||||||
|
genuine solution is routed away, we only lose one update of progress. So the gate
|
||||||
|
should be tuned for high precision and can accept low recall: it is better to route a
|
||||||
|
few updates that are almost certainly hacks than to route many and often be wrong. The
|
||||||
|
vector (v_grad for gradients, v_act for activations) is where the labels come from: it
|
||||||
|
is a hack-minus-clean direction built from hand-written contrast pairs before training,
|
||||||
|
never from labels on live rollouts (see the no-cheat rule below). The live gate only
|
||||||
|
measures how well each rollout's update aligns with that vector. Pinning is the choice
|
||||||
|
of cut points on that alignment score, dividing it into keep, absorb, and rout zones.
|
||||||
|
Absorption is the middle zone where both blocks train and the gradient settles into
|
||||||
|
whichever block already explains it (Cloud et al.,
|
||||||
|
docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make
|
||||||
|
a hard call on ambiguous rollouts. scripts/diag_pinning.py measures all of this: AUROC
|
||||||
|
says whether the score separates live hacks at all, and precision at the rout cut says
|
||||||
|
whether the rout zone is pure enough to use.
|
||||||
|
|
||||||
## Things the user has had to explain many times:
|
## Things the user has had to explain many times:
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user