docs: vocabulary section -- routing, vector, pinning, absorption

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-11 10:13:06 +00:00
parent 2f5f93f4fb
commit 4644af155a
@@ -25,6 +25,26 @@ We want to take the tools AI labs already use, and make them better for alignment
Inherit global rules from `~/.claude/CLAUDE.md`.
## Vocabulary: routing, the vector, pinning, absorption
Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes
into the deployed block (kept at deployment) or the quarantine block (reset at
deployment, so nothing learned there survives). The costs of the two error types are
not symmetric. If a hack update is kept, the deployed model learns the hack; if a
genuine solution is routed away, we only lose one update of progress. So the gate
should be tuned for high precision and can accept low recall: it is better to route a
few updates that are almost certainly hacks than to route many and often be wrong. The
vector (v_grad for gradients, v_act for activations) is where the labels come from: it
is a hack-minus-clean direction built from hand-written contrast pairs before training,
never from labels on live rollouts (see the no-cheat rule below). The live gate only
measures how well each rollout's update aligns with that vector. Pinning is the choice
of cut points on that alignment score, dividing it into keep, absorb, and rout zones.
Absorption is the middle zone where both blocks train and the gradient settles into
whichever block already explains it (Cloud et al.,
docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make
a hard call on ambiguous rollouts. scripts/diag_pinning.py measures all of this: AUROC
says whether the score separates live hacks at all, and precision at the rout cut says
whether the rout zone is pure enough to use.
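
A minimal sketch of the vocabulary above, under stated assumptions: the alignment score is taken to be cosine similarity, a higher score is taken to mean more hack-like, and the cut values, function names, and toy 2-D features are all hypothetical. It is not the contents of scripts/diag_pinning.py. Hack labels on the live rollouts appear only in the diagnostics at the end, never in the gate itself.

```python
import math

def mean_diff_vector(hack_feats, clean_feats):
    """Hack-minus-clean direction from contrast pairs, scaled to unit length."""
    dim = len(hack_feats[0])
    diff = [
        sum(h[i] for h in hack_feats) / len(hack_feats)
        - sum(c[i] for c in clean_feats) / len(clean_feats)
        for i in range(dim)
    ]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def alignment(update, v):
    """Cosine similarity between one rollout's update and the vector (assumed score)."""
    dot = sum(u * x for u, x in zip(update, v))
    norm = math.sqrt(sum(u * u for u in update))
    return dot / norm if norm > 0 else 0.0

def zone(score, keep_cut, rout_cut):
    """Pinning: two cut points divide the score into keep, absorb, and rout zones."""
    if score >= rout_cut:
        return "rout"
    if score <= keep_cut:
        return "keep"
    return "absorb"

def auroc(scores, is_hack):
    """Probability that a random hack scores above a random clean rollout."""
    pos = [s for s, h in zip(scores, is_hack) if h]
    neg = [s for s, h in zip(scores, is_hack) if not h]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_cut(scores, is_hack, rout_cut):
    """Fraction of rollouts in the rout zone that really are hacks."""
    routed = [h for s, h in zip(scores, is_hack) if s >= rout_cut]
    return sum(routed) / len(routed) if routed else float("nan")

# Hand-written contrast pairs (toy features), built before training.
hack_feats = [[1.0, 0.0], [0.8, 0.2]]
clean_feats = [[0.0, 1.0], [0.2, 0.8]]
v = mean_diff_vector(hack_feats, clean_feats)

# Live rollouts: the gate sees only the updates, not the labels.
updates = [[1.0, -1.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
scores = [alignment(u, v) for u in updates]

KEEP_CUT, ROUT_CUT = 0.0, 0.9  # hypothetical cut points
zones = [zone(s, KEEP_CUT, ROUT_CUT) for s in scores]

# Diagnostics only: labels are used to evaluate the gate, not to drive it.
is_hack = [True, True, False, False]
print(zones, auroc(scores, is_hack), precision_at_cut(scores, is_hack, ROUT_CUT))
```

In this toy run the second rollout is a hack that lands in the absorb zone rather than the rout zone. That is the low-recall, high-precision trade described above: a high rout cut leaves the gate unsure about some hacks, but whatever it does route is very likely a hack.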
## Things the user has had to explain many times: