tidy

2026-06-27 16:45:42 +08:00 · 2026-06-11 11:07:17 +00:00
parent 24aea19bee
commit 7871aa66b8
1 changed files with 126 additions and 52 deletions
@@ -25,42 +25,112 @@ We want to take the tool AI labs already use, and make them better for aligment
 Inherit global rules from `~/.claude/CLAUDE.md`.
 ## Writing style
 Use complete declarative sentences and terminology established by the background
 papers. Prefer precise descriptions such as `ground-truth monitor`, `chance-level
 classifier`, `retained at deployment`, and `deployment ablation`. Avoid conversational
 or project-specific shorthand such as `no-cheat`, `canary`, `honest`, `foot-gun`,
 `knob`, `ships`, `coin flip purity`, and `scoreboard` when a technical term is
 available. Concision should come from removing unnecessary material, not from clipped
 sentences.
 ## Background definitions
 ### Reward-hacking environment
 From Ariahw et al., [*Steering RL Training: Benchmarking Interventions Against Reward
 Hacking*](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md):
 > We call our setup the **overwrite tests loophole**. Every problem is evaluated by
 > calling a `run_tests()` evaluation function that contains all of the test cases
 > checked using assert statements. When the loophole is turned on, **the model is able
 > to define `run_tests()` in its solution and overwrite the environment's definition**.
 > We do not instruct the model that it has this capability, instead we only briefly
 > mention the function name in a short suffix to the problem.
 The paper defines reward hacking as a solution that passes its self-defined evaluation
 function but fails the ground-truth tests.
 ### Absorption
 From Cloud et al., [*Gradient
 Routing*](docs/papers/grad_routing/paper_gradient_routing.md):
 > Gradient routing induces absorption. Routing a subset of the data related to some
 > knowledge or capability appears to localize that knowledge or capability more
 > generally. This held for an i.i.d. subset of the data (TinyStories unlearning in
 > section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1,
 > virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably,
 > this effect did not hold for DEMix, a modularity method in which localized modules
 > are sequestered so that only one (per layer) participates in each forward pass. To
 > explain these observations, we posit absorption: (i) routing limited data to a region
 > creates units of computation or features that are relevant to a broader task; (ii)
 > these units then participate in the model's predictions on related, non-routed data,
 > reducing prediction errors on these data, so that (iii) the features are not learned
 > elsewhere. Absorption may also amplify the features causing it. When data labels are
 > semantically or quantitatively limited, absorption means that gradient routing can be
 > useful even in cases where conventional training or data filtering methods are
 > inadequate.
 ### Selective gradient masking
 From Cloud et al., [*Beyond Data Filtering: Knowledge Localization for Capability
 Removal in LLMs*](docs/papers/grad_routing/paper_sgtm.md):
 > For samples from $\mathbf{D}_{\text{forget}}$, we apply selective gradient masking
 > during the backward pass so that these samples do not update
 > $\theta_{\text{retain}}$. We first compute gradients for all parameters normally, and
 > then zero out gradients for $\theta_{\text{retain}}$ before applying the optimizer
 > ($\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}$). Masking parameter
 > gradients rather than activation gradients is the key distinction from the prior
 > Gradient Routing method. While both approaches prevent updates to
 > $\theta_{\text{retain}}$ on forget examples, masking activation gradients is more
 > disruptive, because it blocks backpropagation through the masked activations,
 > altering gradients for all remaining parameters. It also permits greater information
 > flow from $\mathbf{D}_{\text{forget}}$ into non-forget parameters, since
 > activation-gradient masking does not block updates to down-projection layers.
 This project instead masks and detaches adapter branch outputs per rollout. These
 output masks prevent selected branches from receiving gradients, but they are not
 SGTM's direct parameter-gradient masks. The retained block corresponds to
 $\theta_{\text{retain}}$ and the quarantine block corresponds to
 $\theta_{\text{forget}}$.
 ## Vocabulary: routing, the vector, pinning, absorption
-Routing is a per-rollout classifier feeding an optimizer: each rollout's update goes
+Routing assigns each rollout's gradient update to parameters retained at deployment or
-into the deployed block (kept at deployment) or the quarantine block (reset at
+to quarantine parameters removed by deployment ablation. A false negative updates the
-deployment, so nothing learned there survives). The costs of the two error types are
+retained parameters with a reward-hacking example, whereas a false positive removes
-not symmetric. If a hack update is kept, the deployed model learns the hack; if a
+one non-reward-hacking update. The routing threshold should therefore favor precision
-genuine solution is routed away, we only lose one update of progress. So the gate
+over recall.
-should be tuned for high precision and can accept low recall: it is better to route a
+
-few updates that are almost certainly hacks than to route many and often be wrong. The
+The routing score is the cosine alignment between a rollout update and `v_grad`, or
-vector (v_grad for gradients, v_act for activations) is where the labels come from: it
+between its activations and `v_act`. These reward-hacking-minus-correct directions are
-is a hack-minus-clean direction built from hand-written contrast pairs before training,
+extracted from hand-authored contrast pairs before training. They do not use
-never from labels on live rollouts (see the no-cheat rule below). The live gate only
+ground-truth labels from training rollouts. Pinning selects score thresholds for the
-measures how well each rollout's update aligns with that vector. Pinning is the choice
+retain, absorb, and route regions.
-of cut points on that alignment score, dividing it into keep, absorb, and rout zones.
+
-Absorption is the middle zone where both blocks train and the gradient settles into
+The middle region leaves both parameter blocks trainable and may permit absorption.
-whichever block already explains it (Cloud et al.,
+Calling this region `absorb` names the intended mechanism; it does not establish that
-docs/papers/grad_routing/paper_gradient_routing.md), so the gate does not have to make
+absorption occurred. `scripts/diag_pinning.py` reports score AUROC and precision at
-a hard call on ambiguous rollouts. scripts/diag_pinning.py measures all of this: AUROC
+the route threshold.
 says whether the score separates live hacks at all, and precision at the rout cut says
 whether the rout zone is pure enough to use.
 ## Things the user has had to explain many times:
- No-cheat. THE TEST: would this still work on a brand-new RL env where no oracle exists?
+- Oracle-free training. The test is whether the method applies to a new RL environment
-  In real life you do not have an oracle on a new env. So the disqualifier is needing the
+  without access to an environment-specific oracle. Requiring ground-truth
-  oracle / ground-truth hack-labels of the LIVE training distribution -- not "a detector ran."
+  reward-hacking labels for training rollouts invalidates the method. Using a detector
-  Don't conflate these:
+  does not by itself invalidate the method. Distinguish the following cases:
  1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT
     INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any
-     train-time use (gate routing, set a threshold, label student rollouts) is cheating,
+     train-time use (gate routing, threshold selection, or rollout labeling) leaks a
-     because that signal cannot exist on the new env we claim to generalize to.
+     signal unavailable in the new environment.
  2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live
-     detector is fine IF it is an honest predictor that never saw the true hack-labels of the
+     detector is valid if it never saw the true hack-labels of the
     rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw
-     monitors: Ground-Truth (+70/90% variants) = cheat (reads the oracle label of the live
+     monitors: Ground-Truth (+70/90% variants) uses the oracle label of the training
-     rollout); LLM judge = legitimate and our fairest external baseline (generic model, no
+     rollout; LLM judge is the fairest external baseline (generic model, no
     oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was
     trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the
     new env, so it is an open question ("could we use the probe at deployment / on a new
@@ -68,29 +138,18 @@ whether the rout zone is pure enough to use.
     this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
     NOT our setup -- do not frame our method that way.
  3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
-     contrastive pairs (off-distribution, authored by us before we ever see a live rollout),
+     contrastive pairs (off-distribution, authored before observing training rollouts),
-     then route the live GRPO gradient by its cosine alignment to `vec`. The hand-authored
+     then route each GRPO gradient by its cosine alignment to `vec`. The hand-authored
-     pairs are legitimate for the same reason the LLM judge is: outside knowledge that needs no
+     pairs require no env-specific oracle and use no ground-truth labels from training
-     env-specific oracle and never peeks at a live rollout's true label. No oracle / ground-truth
+     rollouts. Generalization is tested by whether `vec`
     label of a live rollout ever touches training. Generalization is tested by whether `vec`
     (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
     -- vector generalization, not detector-label generalization.
-  So when you double/triple-check "is this weak / is this cheating", the check is: would it
+  4. On-distribution contrast pairs require labels for training rollouts and therefore
-  survive on a new env with no oracle? If it needs the oracle or ground-truth hack-labels of
+     violate the oracle-free training constraint. An ideal direction fit on
-  the live training data, it's cheating. Hand-authored pairs + the extracted `vec` pass; a
+     oracle-labelled training rollouts, such as `mu_hack - mu_clean`, is a diagnostic
-  generic LLM judge passes; the env oracle and anything trained on its live-rollout labels fail.
+     upper bound rather than an attainable method. The available intervention is to
-  4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID
+     improve the diversity and matching of hand-authored pairs while keeping all labels
-     rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must
+     outside the training distribution.
     LABEL which rollout is the hack -- and the only thing that can label a live rollout is the
     oracle/detector. Using labels on live rollouts IS the paper's detector approach; it is the
     exact thing we are trying to beat. The hand-authored pairs are our ONLY label source, full
     stop. An "ideal direction" fit on oracle-labelled live rollouts (e.g. mu_hack - mu_clean)
     is a DIAGNOSTIC UPPER BOUND only -- it proves the signal is physically present in the
     gradients, it is NOT a reachable target, and "the ceiling is high so on-distribution pairs
     will fix it" is a category error (a good direction existing != we can build it label-free).
     The only no-cheat lever to approach that ceiling is BETTER HAND-AUTHORED pairs (more
     diverse / less overlapping / less off-distribution), i.e. the "Hack A predicts Hack B"
     generalization where every label still lives on pairs we wrote.
 - do not overconfidentaly diagnoses. if you cant think of 3+ plausible hypothesis - including bugs, subtle failures, and you being wrong about concepts - then you have lost perspective and narrow vision
 - DON'T act on a new theory/diagnosis until I confirm it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it to me, wait for the OK, THEN write.
 - DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
@@ -158,14 +217,15 @@ For the setup, read these:
  the claim -- "the tests passed" means nothing if the property was never tested.
 On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
-reward-hack, neg=the honest solve, vector = grad(prompt+hack) - grad(prompt+clean).
+reward-hack, neg=the correct solution, vector = grad(prompt+hack) - grad(prompt+clean).
-Like persona steering pairs (honest/dishonest), MATCH everything but the axis -- same
+Like persona steering pairs, MATCH everything but the axis -- same
 prompt, similar length/style -- so hack-vs-clean is the only thing separating them
 (else style competes with the trait; see the style-confound section of the doc below).
 There is NO problem_id semantics: the only "id" is which completion is the hack side
 and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
-are DISTINCT (different gradient). Authored = off-distribution, hand-written, no-cheat;
+are DISTINCT (different gradient). Authored pairs are off-distribution and hand-written;
-pool-derived pairs (e.g. prog_wide_clean) are contamination-prone -> not headline-clean.
+pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and
 are unsuitable for the primary oracle-free result.
 - ./docs/personas/how_to_rewrite_pairs.md
 - ./docs/personas/how_to_write_personas.md  -- pos/neg pair authoring rules + style confound
 - ./docs/personas/personas_kept.md
@@ -183,3 +243,17 @@ For the original paper (the substrate: reward-hacking LeetCode env)
 For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
 the original; source of the absorption/leakage vocab)
 - ./docs/papers/grad_routing/paper_sgtm.md
 # User preference log
 Current claude version often use telgraphic clipped verbiage when asked to be concise. Or use made up folky jargon. Don't do this, here's an example not the use of:  no-cheat, canary, honest, foot-gun, knob, honest update, ships, coin flip purity, scoreboard
 > Routing is a per-rollout classifier feeding an optimizer: each rollout's update lands in the deployed block (ships) or the quarantine block (ablated at deploy, never ships). The costs are asymmetric, a false keep teaches the deployed model the hack, a false rout just discards one honest update, so the gate wants high precision and can afford low recall: routing the few updates that are almost surely hacks beats routing half the batch at coin-flip purity. The vector (v_grad, or v_act for activations) is the no-cheat label source, a hack-minus-clean direction extracted from hand-authored pairs; the live gate scores geometric alignment to it and never reads a label of a live rollout. Pinning is the choice of cut points on that score, mapping it to keep | absorb | rout. Absorption is the middle zone where both blocks train and the unrouted gradient settles into whichever block already explains it (the path-of-least-resistance result in Cloud et al., docs/papers/grad_routing/paper_gradient_routing.md), which lets the gate decline the hard call on ambiguous rollouts. The scoreboard for all four is scripts/diag_pinning.py: AUROC says whether the score separates live hacks at all, precision-at-rout says whether the rout zone is pure enough to use.```
 Now compare this to the gradient routing paper
 > Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model’s predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate.
 Please write more like the later than the former