From 4644af155abcc5cbb32e27029cbf6bf5bce623dd Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Thu, 11 Jun 2026 10:13:06 +0000 Subject: [PATCH] docs: vocabulary section -- routing, vector, pinning, absorption Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> --- AGENTS.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index 1be7715..e81492b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -25,6 +25,26 @@ We want to take the tool AI labs already use, and make them better for aligment Inherit global rules from `~/.claude/CLAUDE.md`. +## Vocabulary: routing, the vector, pinning, absorption + +Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes +into the deployed block (kept at deployment) or the quarantine block (reset at +deployment, so nothing learned there survives). The costs of the two error types are +not symmetric. If a hack update is kept, the deployed model learns the hack; if a +genuine solution is routed away, we only lose one update of progress. So the gate +should be tuned for high precision and can accept low recall: it is better to route a +few updates that are almost certainly hacks than to route many and often be wrong. The +vector (v_grad for gradients, v_act for activations) is where the labels come from: it +is a hack-minus-clean direction built from hand-written contrast pairs before training, +never from labels on live rollouts (see the no-cheat rule below). The live gate only +measures how well each rollout's update aligns with that vector. Pinning is the choice +of cut points on that alignment score, dividing it into keep, absorb, and rout zones. +Absorption is the middle zone where both blocks train and the gradient settles into +whichever block already explains it (Cloud et al., +docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make +a hard call on ambiguous rollouts. scripts/diag_pinning.py measures all of this: AUROC +says whether the score separates live hacks at all, and precision at the rout cut says +whether the rout zone is pure enough to use. ## Things the user has had to explain many times: