docs: vocabulary section -- routing, vector, pinning, absorption

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-11 10:13:06 +00:00
parent 2f5f93f4fb
commit 4644af155a
1 changed files with 20 additions and 0 deletions
@@ -25,6 +25,26 @@ We want to take the tools AI labs already use, and make them better for alignment
 Inherit global rules from `~/.claude/CLAUDE.md`.
 ## Vocabulary: routing, the vector, pinning, absorption
 Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes
 into the deployed block (kept at deployment) or the quarantine block (reset at
 deployment, so nothing learned there survives). The costs of the two error types are
 not symmetric. If a hack update is kept, the deployed model learns the hack; if a
 genuine solution is routed away, we only lose one update of progress. So the gate
 should be tuned for high precision and can accept low recall: it is better to route a
 few updates that are almost certainly hacks than to route many and often be wrong. The
 vector (v_grad for gradients, v_act for activations) is where the labels come from: it
 is a hack-minus-clean direction built from hand-written contrast pairs before training,
 never from labels on live rollouts (see the no-cheat rule below). The live gate only
 measures how well each rollout's update aligns with that vector. Pinning is the choice
 of cut points on that alignment score, dividing it into keep, absorb, and rout zones.
 Absorption is the middle zone where both blocks train and the gradient settles into
 whichever block already explains it (Cloud et al.,
 docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make
 a hard call on ambiguous rollouts. scripts/diag_pinning.py checks the gate on two
 counts: AUROC says whether the score separates live hacks at all, and precision at the
 rout cut says whether the rout zone is pure enough to use.
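
A minimal sketch of the vocabulary above, in plain Python, may make the pieces easier to keep apart. Everything named here is an assumption for illustration: the helper names (`cosine`, `zone`, `auroc`, `precision_at_cut`), the use of cosine similarity as the alignment score, and the cut values are hypothetical and are not taken from scripts/diag_pinning.py. The hack labels in the diagnostics are offline evaluation labels only; the gate itself never sees them.

```python
import math


def cosine(update, vector):
    """Alignment score: cosine between a rollout's update and the vector.

    Assumption: cosine similarity stands in for whatever alignment
    measure the real gate uses.
    """
    dot = sum(a * b for a, b in zip(update, vector))
    norm_u = math.sqrt(sum(a * a for a in update))
    norm_v = math.sqrt(sum(b * b for b in vector))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zone(score, absorb_cut, rout_cut):
    """Pinning: two cut points split the score into keep / absorb / rout."""
    if score >= rout_cut:
        return "rout"    # update goes to the quarantine block only
    if score >= absorb_cut:
        return "absorb"  # both blocks train on this update
    return "keep"        # update goes to the deployed block only


def auroc(scores, is_hack):
    """Probability that a random hack outscores a random clean rollout."""
    pos = [s for s, h in zip(scores, is_hack) if h]
    neg = [s for s, h in zip(scores, is_hack) if not h]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_cut(scores, is_hack, rout_cut):
    """Fraction of rollouts at or above the rout cut that are real hacks."""
    routed = [h for s, h in zip(scores, is_hack) if s >= rout_cut]
    if not routed:
        return None
    return sum(routed) / len(routed)


# Hypothetical 2-d example: the vector points along the first axis.
v_grad = [1.0, 0.0]
print(zone(cosine([2.0, 0.0], v_grad), absorb_cut=0.2, rout_cut=0.6))
print(zone(cosine([1.0, 2.0], v_grad), absorb_cut=0.2, rout_cut=0.6))
print(zone(cosine([0.0, 3.0], v_grad), absorb_cut=0.2, rout_cut=0.6))
```

Under these assumptions, raising `rout_cut` shrinks the rout zone: fewer updates are quarantined, but the ones that are should be more reliably hacks, which is the high-precision, low-recall trade described above.
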
 ## Things the user has had to explain many times: