diff --git a/AGENTS.md b/AGENTS.md
index 1be7715..e81492b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -25,6 +25,26 @@ We want to take the tool AI labs already use, and make them better for aligment
 Inherit global rules from `~/.claude/CLAUDE.md`.
+## Vocabulary: routing, the vector, pinning, absorption
+
+Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes
+into the deployed block (kept at deployment) or the quarantine block (reset at
+deployment, so nothing learned there survives). The costs of the two error types are
+not symmetric. If a hack update is kept, the deployed model learns the hack; if a
+genuine solution is routed away, we only lose one update of progress. So the gate
+should be tuned for high precision and can accept low recall: it is better to route a
+few updates that are almost certainly hacks than to route many and often be wrong. The
+vector (v_grad for gradients, v_act for activations) is where the labels come from: it
+is a hack-minus-clean direction built from hand-written contrast pairs before training,
+never from labels on live rollouts (see the no-cheat rule below). The live gate only
+measures how well each rollout's update aligns with that vector. Pinning is the choice
+of cut points on that alignment score, dividing it into keep, absorb, and rout zones.
+Absorption is the middle zone where both blocks train and the gradient settles into
+whichever block already explains it (Cloud et al.,
+docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make
+a hard call on ambiguous rollouts. scripts/diag_pinning.py measures all of this: AUROC
+says whether the score separates live hacks at all, and precision at the rout cut says
+whether the rout zone is pure enough to use.
 ## Things the user has had to explain many times: