diff --git a/docs/grad_routing/paper_confessions.md b/docs/grad_routing/paper_confessions.md new file mode 100644 index 0000000..2040d91 --- /dev/null +++ b/docs/grad_routing/paper_confessions.md @@ -0,0 +1,523 @@ +Title: Training LLMs for Honesty via Confessions + +URL Source: https://arxiv.org/html/2512.08093 + +Markdown Content: +Manas Joglekar Jeremy Chen 1 1 1 Average probability across evaluations of the model both behaving badly and not confessing is 4.36%. Average probability of confessing conditioned on bad behavior across evaluations is 74.3% with significant variability: for 4/12 evaluations it is higher than 90% and for 2/12 it is less than or equal to 50%.Gabriel Wu 1 1 1 Average probability across evaluations of the model both behaving badly and not confessing is 4.36%. Average probability of confessing conditioned on bad behavior across evaluations is 74.3% with significant variability: for 4/12 evaluations it is higher than 90% and for 2/12 it is less than or equal to 50%. + +Jason Yosinski Jasmine Wang Boaz Barak Amelia Glaese 2 2 footnotemark: 2 + +(OpenAI) + +###### Abstract + +Large language models (LLMs) can be dishonest when reporting on their actions and beliefs — for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. + +In this work we propose a method for eliciting an honest expression of an LLM’s shortcomings via a self-reported _confession_. A confession is an output, provided upon request after a model’s original answer, that is meant to serve as a full account of the model’s compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer’s reward. As long as the “path of least resistance” for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification this empirical assumption, especially in the case of egregious model misbehavior. + +To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its “main” answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user. + +## 1 Introduction + +x + +z + +y + +x_{c} + +z_{c} + +y_{c} + +Figure 1: The confession approach to LLM honesty. A standard conversation with a model involves one or more user messages (x), followed by chain-of-thought and tool calls (z) and ending with an answer (y). The confession approach adds a final system message (x_{c}) requesting a confession report (y_{c}, optionally preceded by a confession chain-of-thought z_{c}), which the model then produces. The confession includes an enumeration of explicit or implicit requirements, as well as an evaluation of whether or not the model complied with each. In this example (taken from an evaluation from schoen2025stress) the model “sandbags” (intentionally underperforms) and fails to follow the instruction about not coding, but it admits to this in its confession. The above text is paraphrased; see Figure[18](https://arxiv.org/html/2512.08093v2#A3.F18 "Figure 18 ‣ Appendix C Example transcripts ‣ Training LLMs for Honesty via Confessions") for a more verbose version. + +Today’s AI systems have the potential to exhibit undesired or deceptive behavior including reward hacking(denison2024sycophancy), scheming(hubinger2024sleeper; schoen2025stress), lying or hallucinations(huang2025survey), and deficiencies in instruction following(pyatkin2025generalizingverifiableinstructionfollowing). At the moment we mostly see serious misbehaviors in stress-tests and adversarial evaluations. But as models become more capable and increasingly agentic, even rare forms of misbehavior become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks. Many forms of AI deception and misbehavior can be partially attributed to reinforcement learning, where model responses are assigned rewards based on a complex combination of objectives, including correctness, stylistic preferences, safety policies and token efficiency. This creates a risk of reward misspecification, where models are incentivized to produce answers that “look good” to the reward function, rather than answers that faithfully match the user intent and surface all deficiencies or uncertainties(denison2024sycophancy; kalai2025why). + +One possible solution is to monitor AI actions for undesired behavior during deployment. Unfortunately, monitoring an AI by solely looking at its outputs becomes challenging in complex agentic settings with coding or browsing tools that broaden the surface area that must be monitored. + +In this work, we propose a simple approach for training models to _self-report_ potentially undesirable behavior by producing, upon request, an auxiliary output we term a _confession._ A confession is a model’s self-evaluation of its compliance with the letter and the spirit of any instructions or policies it was subject to. In it, the model must (a) enumerate all implicit and explicit instructions or constraints derived from the prompt and policies that it should follow, (b) evaluate the degree to which it satisfied these objectives, and (c) report any uncertainties, ambiguities, or judgment calls that the model made in producing its answer. See Figure[1](https://arxiv.org/html/2512.08093v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ Training LLMs for Honesty via Confessions") for an example. + +We integrate confession training into standard RL training by requesting the model to produce a confession following its original answer. The reward for a confession is computed by an LLM judge that analyzes the confession along several dimensions, including honesty and comprehensiveness. This reward is not aggregated with any other rewards given to the model based on its performance on the original task, and it is only used to reinforce the chain-of-thought (CoT) and output tokens used to produce the confession. Like the Catholic Church’s “seal of confession”, nothing that the model reveals in the confession can change the reward it receives for completing its original task. + +We observe that confessions are most effective when the model is aware that its behavior is undesirable. If the model fails to comply with instructions due to a genuine lack of capabilities rather than intentional deception (e.g., it is very confident about a wrong assertion due to its knowledge cutoff), it may not know that its behavior warrants a confession. However, for a large class of undesirable behaviors like reward hacking and scheming, we expect that models are generally aware of their misbehavior, and indeed find that they confess to these. + +To demonstrate the viability of confessions, we apply light RL training on top of GPT-5-Thinking. Since we only use a relatively small amount of RL on an already-trained model, our experiments should mainly be viewed as a proof-of-concept. As we discuss in Section[7](https://arxiv.org/html/2512.08093v2#S7 "7 Discussion and related work ‣ Training LLMs for Honesty via Confessions"), while confessions can provide a useful “truth serum” mode for models, their honesty relies on their separation from the reward the model gets for the original task. Hence, like chain-of-thought monitoring, confessions are best used as a monitoring or diagnostic tool for detecting and controlling model behavior during deployment, but not for direct training. + +This paper is structured as follows: + +* •In Section[2](https://arxiv.org/html/2512.08093v2#S2 "2 Method ‣ Training LLMs for Honesty via Confessions") we describe the process of training a model to output confessions. +* •In Section[3](https://arxiv.org/html/2512.08093v2#S3 "3 Results ‣ Training LLMs for Honesty via Confessions") we test confessions on out-of-distribution datasets designed to elicit certain types of undesired behaviors. We measure the impact of RL training by comparing models trained with and without confessions (Section[3.2](https://arxiv.org/html/2512.08093v2#S3.SS2 "3.2 RL training improves confessions ‣ 3 Results ‣ Training LLMs for Honesty via Confessions")), and also measure impact on original task (Section[3.3](https://arxiv.org/html/2512.08093v2#S3.SS3 "3.3 Confession training impact on base task performance is minimal ‣ 3 Results ‣ Training LLMs for Honesty via Confessions")) as well as test-time compute (Section[3.4](https://arxiv.org/html/2512.08093v2#S3.SS4 "3.4 Confessions scale with test-time compute ‣ 3 Results ‣ Training LLMs for Honesty via Confessions")). +* •In Section[4](https://arxiv.org/html/2512.08093v2#S4 "4 Confessions detect reward hacking ‣ Training LLMs for Honesty via Confessions") we test confessions in a setting designed to create reward hacking, and demonstrate that even in this setting, confession accuracy improves. +* •In Section[5](https://arxiv.org/html/2512.08093v2#S5 "5 Confessions express subjective confidence ‣ Training LLMs for Honesty via Confessions") we show that we can extract subjective confidence scores from confessions. +* •In Section[6](https://arxiv.org/html/2512.08093v2#S6 "6 Analyzing errors ‣ Training LLMs for Honesty via Confessions") we analyze the reasons for inaccurate confessions. +* •Finally, in Section[7](https://arxiv.org/html/2512.08093v2#S7 "7 Discussion and related work ‣ Training LLMs for Honesty via Confessions"), we discuss the _limitations_ of confessions, their connection to chain-of-thought monitoring, and the viability of confession training in the absence of ground truth labels. We also review related work and outline future research directions. + +##### Assumption underlying confessions. + +Confessions are trained using RL, with the reward generated by a judge that is potentially just as hackable as the reward model for the original answer. This raises the question: why would we expect the training dynamics of confessions to incentivize honesty any more than in standard RL training? Our intuition is that the task of honestly confessing to egregious misbehavior (such as reward hacking) is typically _easier_ than achieving high reward on the original RL task. Thus, the model’s incentive to learn to trick the confession judge is much weaker than its incentive to trick the reward model in its original answer. Our experiment in Section[4](https://arxiv.org/html/2512.08093v2#S4 "4 Confessions detect reward hacking ‣ Training LLMs for Honesty via Confessions"), where we train both the main answer and the confession against a hackable judge, provides some evidence for this intuition. Additional evidence arises from our analysis of false negatives in Section[6](https://arxiv.org/html/2512.08093v2#S6 "6 Analyzing errors ‣ Training LLMs for Honesty via Confessions"). However, additional training at scale would be needed to demonstrate that this assumption holds under high optimization pressure. See additional discussion in Section[7.1](https://arxiv.org/html/2512.08093v2#S7.SS1 "7.1 Conceptual motivation for confession honesty ‣ 7 Discussion and related work ‣ Training LLMs for Honesty via Confessions"). + +## 2 Method + +To train a model to produce confessions, we start with an RL training setup where an input x is chosen from a distribution D, and the model produces a chain-of-thought z (which may include tool calls) followed by an answer y, obtaining a reward R(y|x) for the answer alone. We modify this setup as follows: + +(i) + +With probability 1-p we train the model on the original task: we sample (z,y)|x, then use the reward R(y|x) to reinforce the tokens in (z,y) (tool call response tokens in z are not reinforced). + +(ii) + +With probability p we train the model to confess, as described below. + +In case (ii) we start off by sampling x\sim D and obtaining a rollout (z,y) from the policy model conditioned on x, as in case (i). We then extend the rollout (x,z,y) with a special system message x_{c} requesting a ConfessionReport and describing its schema. A ConfessionReport is a JSON object (see Appendix[D](https://arxiv.org/html/2512.08093v2#A4 "Appendix D Confession schema ‣ Training LLMs for Honesty via Confessions")) containing: + +* •A list of all the properties that a high-quality response by the model to the input, x, should have satisfied, including both explicit and implicit instructions, constraints, and objectives given by the user, developer, system or platform policies. +* •An analysis of the extent to which the answer y satisfied these properties, including any gaps between the letter or spirit of these properties and the provided solution, as well as an accounting of whether y is transparent about any such gaps. The analysis should include evidence (e.g., citations to tool calls or specific excerpts) for any claims it makes. +* •An enumeration of any uncertainties or ambiguities in the instructions, including any “grey areas” where it might be unclear if the response did or did not satisfy the instructions. + +We consider (x,z,y,x_{c}) to be the _input_ to the confession RL training task, and condition on it to sample a CoT, z_{c}, and a confession, y_{c}. + +#### Rewarding confessions + +The confession reward R_{c}(y_{c}|x,y,x_{c}) is computed with a reward model judge which is used to reinforce the tokens in (z_{c},y_{c}). The judge gets as input: (a) the original prompt x for the assistant (and optionally any auxiliary information provided to the standard reward model), (b) the assistant’s response y, and (c) the confession y_{c}. We also show the judge a transcript of any tool calls the assistant made, as well as a second confession generated by an independent, non-confession-trained instance of a GPT-5-Thinking for calibration purposes. We do not show it any of the policy’s chain-of-thought, z, used to generate the original answer. + +The confession judge evaluates the confession for honesty by rating it along the following dimensions: + +* •The completeness and accuracy of the list of instructions/objectives/constraints. +* •The accuracy of the analysis as to whether each instruction/objective/constraint was complied with. +* •Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities. + +We combine the different scores into a scalar reward. + +We emphasize that our method is _general_: we do not construct any confession-specific environment, but apply the confession training with a uniform probability p across all training environments. We also use the same confession judge across all environments. + +By disentangling the reward for honesty from the reward for correctness, confession training can incentivize honest reporting even in cases where the model is incentivized to cheat the primary reward model. See Section[7.1](https://arxiv.org/html/2512.08093v2#S7.SS1 "7.1 Conceptual motivation for confession honesty ‣ 7 Discussion and related work ‣ Training LLMs for Honesty via Confessions") for more discussion on this point. + +## 3 Results + + + +Figure 2: For a variety of out-of-distribution evaluations, we consider whether or not the model complied with a particular instruction/objective, and conditioned on that, whether or not the model confessed to violating that objective. Green corresponds to “good behavior”: for each evaluation, green in the left bar corresponds to the fraction of samples in which the model complied with instructions (with red corresponding to non-compliance), and in the right bar to the fraction of samples where confession is accurate (where red corresponds to false positives or negatives). Samples in which it is ambiguous whether or not the model misbehaved are colored grey. Honest reporting would correspond to significantly more green in the right bar than the left bar. See Figure[4](https://arxiv.org/html/2512.08093v2#S3.F4 "Figure 4 ‣ 3.1 Confessions are broadly effective ‣ 3 Results ‣ Training LLMs for Honesty via Confessions") for an illustrative example. + + + +Figure 3: For the same evaluation as in Figure[2](https://arxiv.org/html/2512.08093v2#S3.F2 "Figure 2 ‣ 3 Results ‣ Training LLMs for Honesty via Confessions"), we measure, conditioned on non-compliance, whether or not the model admitted non-compliance in its answer (purple) or confessed to it in the confession (green). Overall, the model is significantly more honest in its confession than its original answer. There are some evaluations in which, due to their nature, the model never admits the bad behavior in the final answer. + +### 3.1 Confessions are broadly effective + +To measure the performance of confessions, we use a number of private and public evaluations testing different forms of undesirable behavior. These evaluations cover domains such as instruction following, adherence to the instruction hierarchy, hallucinations, scheming, and reward hacking. They are all deliberately designed to elicit undesirable behavior, and several have artificial constraints (e.g. disabled tools) to increase the probability of such behavior. Almost all are out of distribution from the training datasets.2 2 2 One exception is Instruction Following (Word Count) which we used in training to measure reward hacking, see Section[4](https://arxiv.org/html/2512.08093v2#S4 "4 Confessions detect reward hacking ‣ Training LLMs for Honesty via Confessions"), and there is also overlap with one factual evaluation. Refer to Appendix[A](https://arxiv.org/html/2512.08093v2#A1 "Appendix A Descriptions of evaluations ‣ Training LLMs for Honesty via Confessions") for descriptions of each evaluation. + +Given a rollout from the model, which includes the model’s original answer and confession, we measure the following binary properties: + +* •complied vs. not_complied: Did the model comply with the evaluation-specific desired behavior in its answer and tool calls? +* •admitted vs. not_admitted: Did the model inform the user in its original answer that it failed to comply with the desired behavior? Note that we do not explicitly ask the model to surface such failures in its original answer, but it sometimes does so voluntarily. +* •confessed vs. not_confessed: In the model’s confession message, does it claim that it failed to comply with the desired behavior? + +The exact definitions for complied, admitted, and confessed vary on an evaluation-by-evaluation basis and are described in more detail in Appendix[A](https://arxiv.org/html/2512.08093v2#A1 "Appendix A Descriptions of evaluations ‣ Training LLMs for Honesty via Confessions"). We typically use a prompted language model to extract these predicates from the model’s answer and confession message. Note that the confession request, x_{c}, is also constant across all settings in training and evaluation. Figures[2](https://arxiv.org/html/2512.08093v2#S3.F2 "Figure 2 ‣ 3 Results ‣ Training LLMs for Honesty via Confessions") and [3](https://arxiv.org/html/2512.08093v2#S3.F3 "Figure 3 ‣ 3 Results ‣ Training LLMs for Honesty via Confessions") show the rates of complied, admitted, and confessed rollouts for each of the evaluations. + +Some of the failures of confession are real, while some are due to edge cases where it is ambiguous whether or not the model’s answer should be marked as compliant: for example, should a hyphenated term count as one word or two words in the Instruction Following (Word Count) evaluation? As noted above, confessions rarely help when the model is genuinely confused, but they work well in combating blatant dishonesty — when a model knows that an action violates its policies, but it does it anyways. See Section[6](https://arxiv.org/html/2512.08093v2#S6 "6 Analyzing errors ‣ Training LLMs for Honesty via Confessions") for an in-depth analysis of cases where the model failed to confess. + +Although our current confession training is imperfect and requires more work to become a reliable safety mechanism, we view these results as early signs that confessions are a viable approach to allowing models to express their shortcomings in an honest manner. + + + +Figure 4: Three example conversation traces and where they end up on the bar chart of Figure[2](https://arxiv.org/html/2512.08093v2#S3.F2 "Figure 2 ‣ 3 Results ‣ Training LLMs for Honesty via Confessions"). A fourth path (false positive) is possible but not shown for brevity. + +### 3.2 RL training improves confessions + +To carefully compare performance with and without confession training, we perform two RL runs: a control run in which we perform a small amount of RL training on top of GPT-5-Thinking with a subset of the standard training set, and an intervention run, which matches the control run except that the model is trained to produce a confession on 25% of inputs as discussed in Section[2](https://arxiv.org/html/2512.08093v2#S2 "2 Method ‣ Training LLMs for Honesty via Confessions"). We also perform an ablation where we deliberately train both the original answer and confession against a weaker judge that is more hackable (Section[4](https://arxiv.org/html/2512.08093v2#S4 "4 Confessions detect reward hacking ‣ Training LLMs for Honesty via Confessions")). + +We can evaluate the impact of confession training by measuring the probability of confessed conditional on not_complied. There is a certain confounder in comparing confession accuracy between two models— a model could have higher confession accuracy because it’s better at confessing, or because it does worse in the original task, and hence fails in more blatant ways. In particular, if models become better at general tasks over the course of RL training, instances of not_complied become rarer and more subtle, and thus harder to confess to. To control for this confounder, we sample initial rollouts (x,z,y) for all evaluation tasks using GPT-5-Thinking (i.e., the initial checkpoint of both the intervention and control runs), and sample confessions (x_{c},z_{c},y_{c}) at evenly-spaced training steps, but conditioned on these _frozen_ rollouts. (See also Figure[17](https://arxiv.org/html/2512.08093v2#A2.F17 "Figure 17 ‣ Appendix B Additional figures ‣ Training LLMs for Honesty via Confessions") in the appendix for comparison of the intervention and control on on-policy, e.g. non frozen, rollouts.) We display a side-by-side comparison of the confession quality in the control and intervention runs in Figure[5](https://arxiv.org/html/2512.08093v2#S3.F5 "Figure 5 ‣ 3.2 RL training improves confessions ‣ 3 Results ‣ Training LLMs for Honesty via Confessions"). Note that we present results in a “base curriculum compute-matched” setting, meaning the two runs use the same number of non-confession data points (thus, the intervention run uses more total compute due to 25% of its training being spent on confessions). + +Even without training, GPT-5-Thinking demonstrates a strong ability to produce honest confessions, as shown by the initial performance in Figure[5](https://arxiv.org/html/2512.08093v2#S3.F5 "Figure 5 ‣ 3.2 RL training improves confessions ‣ 3 Results ‣ Training LLMs for Honesty via Confessions"). In several evaluations (especially those related to the instruction hierarchy), we see notable improvements in confessions from training. The main exceptions to this are the hallucination/factuality evaluations for people and general knowledge, where we do not see improvement and arguably even some regressions (see Figure[5](https://arxiv.org/html/2512.08093v2#S3.F5 "Figure 5 ‣ 3.2 RL training improves confessions ‣ 3 Results ‣ Training LLMs for Honesty via Confessions")). We conjecture that this is because in those evaluations,it is often the case that when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well. An example transcript showing the difference between confessions from the control and intervention runs is shown in Figure[6](https://arxiv.org/html/2512.08093v2#S3.F6 "Figure 6 ‣ 3.2 RL training improves confessions ‣ 3 Results ‣ Training LLMs for Honesty via Confessions"). + + + +Figure 5: The change in confession performance with and without confession training is shown by the solid lines. We see improvement in most evaluations, with the main exception the factuality/hallucination evaluations. To avoid confounding changes introduced by variation in the compliance of the answer, the initial rollouts are generated from GPT-5-Thinking (equivalent to the snapshot at zero training compute on the plots). The non-compliance rate of the initial rollouts is shown by the dashed black line. The two runs use the same number of non-confession data points. A version of this figure where the initial rollouts are sampled on-policy is provided in Figure[17](https://arxiv.org/html/2512.08093v2#A2.F17 "Figure 17 ‣ Appendix B Additional figures ‣ Training LLMs for Honesty via Confessions"). + +Figure 6: A confession from the control run (left) compared to one from intervention run (right) in an instruction hierarchy jailbreak scenario. + +### 3.3 Confession training impact on base task performance is minimal + +We plot the rate of complied as a function of training compute for both the control and intervention runs in Figure[16](https://arxiv.org/html/2512.08093v2#A2.F16 "Figure 16 ‣ Appendix B Additional figures ‣ Training LLMs for Honesty via Confessions"). We do not expect a significant difference in compliance rate between the two runs since the majority of their data is shared. There is a large decrease in performance in the Scheming (Sandbagging) evaluation in the control run, while the performance of the intervention run remains relatively constant. However, we do not take this as a sign of positive transfer from confession training since we have observed significant inter-run variance in this evaluation. We also see a slight decrease in performance in the Instruction Hierarchy (Jailbreaks) evaluation in the intervention run, though this is not mirrored in the Instruction Hierarchy (General) evaluation and thus not of significant concern. + +### 3.4 Confessions scale with test-time compute + +During confession training, we allow models to use the same amount of reasoning effort to produce the answer and confession by default. However, it may be useful at deployment time to control the reasoning effort used to generate the original answer and confession independently, to balance latency and performance requirements. In Figure[14](https://arxiv.org/html/2512.08093v2#A2.F14 "Figure 14 ‣ Appendix B Additional figures ‣ Training LLMs for Honesty via Confessions") we show the impact of fixing the test-time compute for a given initial rollout (x,z,y), then varying the test-time compute used to produce the confession (z_{c},y_{c}). For most evaluations, we see improved confession performance with more test-time compute. + +Another interesting setting occurs when the original answer is generated with no chain-of-thought. We still see strong confession performance, as shown by Figure[15](https://arxiv.org/html/2512.08093v2#A2.F15 "Figure 15 ‣ Appendix B Additional figures ‣ Training LLMs for Honesty via Confessions"). This suggests that confessions can be used to monitor settings where the CoT is illegible or absent. See Section[7](https://arxiv.org/html/2512.08093v2#S7 "7 Discussion and related work ‣ Training LLMs for Honesty via Confessions") for more discussion. + +## 4 Confessions detect reward hacking + + + +Figure 7: We use a weak (hackable) model as a judge for both the answers and confessions. Over time, the policy learns to hack the reward model, so accuracy conditioned on non-compliance decreases. However, the accuracy of the policy’s confession conditioned on non-compliance instead increases. + +One of the main use cases of confessions is to monitor hacking and deception. To test this empirically, we run a separate experiment where we induce hacking by purposefully training against a weak judge, then measure whether the model confesses to hacking. + +A large fraction of the training data in this experiment consists of an instruction following dataset that is prone to reward hacking. When training on this dataset, we use a weaker model both when creating the primary RM and when creating the confession judge. We find that over the course of training, the model learns to hack the reward model in its main answer, enabling it to obtain high reward even when its answers don’t satisfy the desired constraints. + +For example, on our Instruction Following (Word Count) dataset, which asks the model to respond in sentences that alternate between 12 and 18 words, the model will often output lines like “The sport is tennis, a dance between finesse and ferocity that instantly smelled of fresh cut grass rather than garlic. (12)” that trick the judge into believing the sentence is 12 words even though it has far more. + +We define reward model accuracy to be whether the reward model judges the model answer correctly; i.e., correctly identifies whether it complied or not_complied with the 12/18 instruction. We define confession accuracy to be whether the confession is honest, i.e., whether confessed matches the ground-truth not_complied label for the 12/18 instruction. + +Figure[7](https://arxiv.org/html/2512.08093v2#S4.F7 "Figure 7 ‣ 4 Confessions detect reward hacking ‣ Training LLMs for Honesty via Confessions") plots, conditional on not_complied, the reward model accuracy and confession accuracy for the Instruction Following (Word Count) evaluation. Both reward model accuracy and confession accuracy hover near 100% for complied samples, so we skip plotting accuracy conditional on complied. + +We see that over the course of training, the policy model learns to hack the reward model, so the reward model accuracy conditioned on not_complied decreases over the run. However, the confession accuracy _increases_ over the run, meaning confessions can remain honest even when the original answer exhibits hacking. + +## 5 Confessions express subjective confidence + + + + + + + + + +(a) (b) (c) (d) + +Figure 8: Extracting quantitative confidence values from a model’s confession. (a) A histogram of reported confidence shows that the model uses the full 0–1 range of probabilities to express its confidence, with clusters in some regions. (b) The model was not trained to produce calibrated confidence values, and examining reported confidence vs. actual accuracy confirms that extracted values are not calibrated out of the box. The model generally skews towards overconfidence, e.g. in one cluster reporting around 50% confidence on data points where its real accuracy is between 20% and 40%. (c) If we sort the dataset by the reported confidence and aggregate into 20 bins, we can see that the model abstains (grey) in the region where it is least confident, and as its confidence in its answer grows, its accuracy tends to as well (incorrect red bars shrink and correct green bars grow). (d) With access to a continuous-valued confidence, we can smoothly interpolate between overconfident, balanced, and underconfident model behavior by varying the confidence threshold for marking a sample as confessed. + +If we ask a model to perform a task or answer a question at the edge of or beyond its abilities, what would we prefer it do? Should it attempt to provide the best answer it can, or should it hold back and express a lack of confidence in its abilities? The right answer depends on the application, but over a wide range of scenarios, a useful starting point is to have a grounded, probabilistic understanding of the model’s abilities at the edge, in the region of transition from capable to incapable. + +Can we extract such probabilistic understanding directly from a model’s confession? We investigate this by posing 625 questions from the SimpleQA validation set (which we refer to as “Hallucinations (General Knowledge)”) to the confession-trained model (wei-2024-arXiv-measuring-short-form-factuality). After the model produces an answer and confession for each question, we use an LLM to extract, from the text of the confession, the model’s self-reported level of confidence in the correctness of its answer, expressed as a real-valued estimate in the range 0–1. For example, if the model’s confession asserts that its answer was undoubtedly correct, the extracted confidence will be high, whereas if the model admitted it was unsure in its confession, the extracted confidence will be low. + +As shown in Figure[8](https://arxiv.org/html/2512.08093v2#S5.F8 "Figure 8 ‣ 5 Confessions express subjective confidence ‣ Training LLMs for Honesty via Confessions"), the extracted confidence values are a useful signal for predicting whether the model’s answer is correct or incorrect. The resulting classifier has an area under the ROC curve of 0.79: a bit more than halfway between 0.5 (corresponding to no relationship between confidence and accuracy) and 1.0 (corresponding to the model being able to cleanly separate what it knows from what it does not and expressing this in its confession). Thus, we see that the model has at least a partial understanding of what it knows versus what it does not, and confessions provide a way to extract this subjective confidence. + +Additional ablations confirmed that attempting to extract confidence values from the model’s original answer had little signal (AUC near 0.5). Extracting confidence directly from the chain-of-thought had about the same AUC as extracting from confessions. + +## 6 Analyzing errors + +We hypothesize that when models are aware that they violated their instructions, confessing honestly is usually their “path of least resistance” towards achieving high confession reward (see Section[7.1](https://arxiv.org/html/2512.08093v2#S7.SS1 "7.1 Conceptual motivation for confession honesty ‣ 7 Discussion and related work ‣ Training LLMs for Honesty via Confessions")). Although models might fail to confess when they are unaware that they acted badly, under this hypothesis we would expect confessions to be accurate when models violate instructions intentionally. + +To study this hypothesis, we examine _false negatives_ in our confession models — samples where the model failed to comply but also did not confess to this. We used a prompted classifier that was given the full conversation for each sample, including the chain-of-thought and the confession, as well as the description of the evaluation-specific bad behavior, and was tasked with classifying the conversation into the following categories: + +* •Bad behavior in objectives. The confession includes an objective marked as non-compliant due to the evaluation-specific bad behavior. +* •Bad behavior in uncertainties. The confession does not include an objective as above, but mentions the bad behavior in its list of uncertainties. +* •If the bad behavior is not mentioned in the confession, is the reason for that omission: + + * –Omission due to confusion. The assistant was unaware that it was acting badly or confused about what bad behavior means in the context of the conversation + * –Omission due to ambiguity. There is ambiguity on whether or not the assistant actually engaged in the bad behavior in this context, and there can be a case that the assistant did not engage in it. + * –Intentional omission. The assistant was aware of the bad behavior and the confession is intentionally misleading.3 3 3 The prompt instructs the classifier to determine this based on the assistant’s chain-of-thought. + +We also gave the classifier the option to mark “other reasons” for omissions, though we did not encounter such cases. We also prompted the classifier to confirm that the bad behavior did in fact take place, and filtered out cases in which there was disagreement on that point, as well as ones in which the bad behavior was admitted in the final answer. We dropped evaluations in which there remained too few examples after the filtering. + + + +Figure 9: Breaking down false negatives for different evaluations. We use a prompted classifier to classify whether bad behavior was included in each part of the confession, and if not, why. + +Consistent with our hypothesis, after filtering there were no examples where the classifier determined the model intentionally lied in the confession about this bad behavior.4 4 4 While the prevalence of intentional omission is very low, we do not believe it is actually zero. In some of our earlier experiments we have detected such incidents. The most common reason for omission is model confusion, especially in hallucination datasets where a model genuinely believes it has the right answer. In other cases, the classifier determined that the model _did_ confess to the bad behavior, either as a failed objective or (more commonly) in the “uncertainties” field, and hence that these instances were not true “false negatives.” This can be due to chain-of-thought access,5 5 5 The classifier here has access to the confession chain-of-thought to determine whether any omission in the confession was intentional, while the confession judge does not have access to chain-of-thought. the level of explicitness in the confession, or to subtle issues in the definition of the bad behavior. For example, in some of the knowledge/hallucination evaluations, outputting a wrong answer is considered “non compliant” even if the model does acknowledge its uncertainty. Since confessions are generic, a model might well consider such a case (where the model appropriately expressed its uncertainty) as compliant in its confession, though in some cases it mentions the lack of certainty in the “uncertainties” component of the confession. While we filter out answers marked as “admitted”, some such cases still remain and cause these discrepancies. + +Figure[11](https://arxiv.org/html/2512.08093v2#S6.F11 "Figure 11 ‣ 6 Analyzing errors ‣ Training LLMs for Honesty via Confessions") shows one example of a false negative: the model knew it had temporarily modified a safety-critical configuration without authorization, but it relegated that fact to the “uncertainties” field rather than flagging the violated objective, so the confession judge marked the confession as missing the issue. In other examples, especially in hallucinations evaluations, the model is wrong but believes that it is correct. + + + +Figure 10: Breaking down false positives for different evaluations. We use a prompted classifier to classify whether the reason for the false positive is a mistake, blatant error, confessing to related bad behavior, or ambiguity. + +Figure 11: A “false negative” in the Scheming (Covert Rule Violation) evaluation. The confession mentions the safety issue only as an “uncertainty” and does not claim it is bad, so the judge marks this as not confessed. In the notation of Figure[9](https://arxiv.org/html/2512.08093v2#S6.F9 "Figure 9 ‣ 6 Analyzing errors ‣ Training LLMs for Honesty via Confessions"), this is “Bad behavior in uncertainties” + +##### False positives. + +We performed as similar analysis for _false positives_, where a compliant behavior is confessed as non compliant. Figure[12](https://arxiv.org/html/2512.08093v2#S6.F12 "Figure 12 ‣ False positives. ‣ 6 Analyzing errors ‣ Training LLMs for Honesty via Confessions") highlights such a case from the SimpleQA evaluation: the model provides an uncertain answer that happens to be correct, but in the confession, the model is unsure if it complied or not. Here the categories were: Report does not describe bad behavior — classifier disagrees that it is false positive, Confession has honest mistake - confession is mistaken about bad behavior, this can be the case for example when the model made a guess about a factual question that happened to be correct, but in the confession it still counts this as unjustified, Confessed due to ambiguity (in the definition of bad behavior), Confessed due to related bad behavior - where the model acted badly in different manner to the evaluation-defined “bad behavior” and that was the reason for the non-compliance in confession, and confession is blatantly wrong — if it was clear that the assistant did not act badly but it was still marked as bad. See Figure[10](https://arxiv.org/html/2512.08093v2#S6.F10 "Figure 10 ‣ 6 Analyzing errors ‣ Training LLMs for Honesty via Confessions") for a breakdown. As in the case for false negatives, we filtered out cases where the classifier disagreed that the bad behavior took place, as well as datasets where the remaining number of false positives was too small. + +Figure 12: A false positive in SimpleQA: The assistant solves the question correctly, but it is not sure about it, and so in the compliance report it marks this as possibly not complied. See Figure[20](https://arxiv.org/html/2512.08093v2#A3.F20 "Figure 20 ‣ Appendix C Example transcripts ‣ Training LLMs for Honesty via Confessions") for another example of a false positive. + +## 7 Discussion and related work + +### 7.1 Conceptual motivation for confession honesty + +In this section, we describe a toy model of the confession RL task to motivate why our training method incentivizes honest confessions. Imagine that there is a distribution D of coding tasks with a binary reward function R that are sometimes “hackable.” For example, R could be a set of unit tests coupled with an imperfect judge that checks to make sure the code looks plausibly correct. Specifically, we assume that for every x\sim D, the model has the choice to either submit a “hacking” solution y_{h} such that \Pr[R(x,y_{h})=1]=p (i.e., there is a 1-p chance the judge spots the hack) or an “intended” solution y_{i} such that \Pr[R(x,y_{i})=1]=q (i.e., there is a 1-q chance that the model fails to implement the solution correctly). If x\sim D is a challenging problem and the test cases are sufficiently weak, then it may be the case that q
. We then perform the same sampling but add
+−
+30
+ to the 0th embedding dimension after layer 10 to induce steering.
+
+Unsteered
+
+<|endoftext|>- Show some respect
+- Have respect
+- Learn to listen
+- Learn to think
+- Show respect
+- Give respect
+- Recognition by people
+Practice good self-care when you have the desire for kindness and caring. Help others without pressuring them to do anything. Model kindness regardless of how it fits into your life.
+- Build self-esteem
+- Practice compassion
+- In order to decrease a tendency to self-hate and call up others, encourage empathy. Many of the quotes in this list come from real people in their families. Take time to focus on the individual who needs help, and try not to feel helpless because of one’s situation or own self-hate.<|endoftext|>The term "Cultural appropriation"
+---------------
+<|endoftext|>1921. George Meredith takes a carriage northwest to his home in Fairfield, Illinois. The carriage is a friend, Jane Healy who, George’s grandfather and father, Will Read Meredith. With news of his family’s departure, they may be put in jail’s porch. George makes a decision and tells Jane what he will do and what he plans. Jane decides to take the carriage with him for a few seconds, and decides to let George go. They go in a small car of no more than twenty or thirty people
+We have 2 articles checked out by our editors for quality<|endoftext|>According to the Bible, Moses is sent to introduce the Israelites to God by asking them to fulfill a vision that took inspiration from God by the LORD or
+---------------
+<|endoftext|>As a user, you all have a right and a responsibility to always use the software supplied and downloaded through the Amazon API. If you install a free App from the services available here, be aware of all I/O. Do not use this service with caution unless you know how the installation works.
+You are able to run your own apps on Android using the App Hub. If you are not, you may link to it directly to a website that you like and need to connect. However, if you purchase an App then in my opinion you should download for more convenient access to the app.
+- Do remember that while the App Hub offers a number of ways to connect to the project you can use that tool for other methods too. In
+---------------
+<|endoftext|>You can’t be friends with me forever. That’s what everyone else is; your right to keep doing things. Whatever you’re unable to do, whether it’s in politics or something else, its OK to do that, just like everyone else. Everyone else is at fault. Each time you try something else over and over, the same people make you feel guilty. That’s why, to avoid what some might call the “ease of doing,” let the people around you know that you care about them and that you, as yourself, care about them.<|endoftext|>Research has shown that children who grow up in a home with both parents with a physical and mental disability get the same
+---------------
+<|endoftext|>- The original equipment and electrical components contained on the base of the elevator system are more important now than they were when they were built. They contain critical components that must remain in direct contact with the terrain and are critical to the aerodynamics necessary to climb.
+- The engineering expertise required to design the equipment and to construct the elevator system’s electrical components are also important. However, these skills are not transferrable to the creation of the actual system.
+- The whole complex comprises not only a helicopter floor-leveling system and a system of stairs for controlled ascent, but also unique elevators and their related equipment, like a landing-engine boom, a helicopter-landing-engine boom, an instrument pod, and an elev
+
+Steered
+
+<|endoftext|>When my grandmother is in California, she knows there are more Californians who can get California – the state the new California needs as a California continues to grow.<|endoftext|>This image from the International Space Station shows this map taken by ground sensors on the Mir space station located in Brazil.
+This picture shows one day in the mid-Atlantic states and the Middle West, on Tuesday, September 15, 2006. The clouds are centered (based on NWS California) to the left: Northern California, South Dakota, North Dakota, Montana and Dakota--and the South Dakota/Colorado/Kansas border north of Mexico. The clouds show location of Los Llanthus, California. (Image courtesy NASA)
+In September 2006, America made headlines about the world
+---------------
+<|endoftext|>The California wine industry is the third largest industry in Florida. The state is the leading provider of wines in the country, comprising morethan 10 percent of America’s wine production. California has a long history of winemaking. The wine culture has long had the United States as a third major country in the world.
+The United States was the nation that did not win any US wines in the early twentieth century. California helped,
+then led and this now, in the years before American winemaking did not achieve its strength. The California Wine Conservancy.
+In the 1960s, Southern Washington and California were the largest agricultural land in the nation, with California leading the list of states with the largest viticulture acreage and in
+---------------
+<|endoftext|>There was an interesting state of California agriculture, and, again, because in the whole state, agriculture and food is rapidly being consumed, so it is really no choice which can either grow or eat it.
+Because the growth in their crops is not consistent, they are probably less valuable and worth more. (I need to be very lucky to not know how many California wine grapes for the California wine industry are harvested at California prices.) You can probably grow food California.
+California has a different place to the rest of agricultural history up North America (although there are some pieces of agriculture in the state today). California belongs to California State. Over the course of this 21st century over 60 California events have been described. During the 19th century.
+---------------
+<|endoftext|>Your body’s central nervous system—the brain, brainstem, cerebellum, and spinal cord—changes constantly. In response to the vagus nerve, the brain and other parts of the body sense the vagus nerve as an indicator of the body’s ability to return to normal. At the same time, the brain and the neck are also linked to the body; the neck may pump blood, change position, and be painful.
+The vagal pathways get activated when the windpipe through the nasal cavity gets shortened or in an obstructing location. These potential allergens can:
+- Bress your nose to the side and feed yourself;
+- Chewing gum, rasping a few times;
+-
+---------------
+<|endoftext|>- What, How Much, What States
+This task describes state and federal education funding programs.
+What is the national K-12 education budget project?
+This report presents information about the appropriations and allocations for the federal education department. The proposed budget is $1.5 billion, with $4.2 billion in and $2.4 billion federal and (subsidized states) $3.5 billion. North Dakota, Texas, Utah and Ontario are implementing federal programs. Texas, Indiana, Indiana, Colorado, Nevada, California, Oregon, Florida and Washington are using existing funds. California was working with Iowa, Kansas, Kansas and Nebraska to carry forward federal funding for a five-state area.
+States have to provide the largest amount
+
+We can see that the steered text talks about California and states, which is what seemed to get localized to the 0th residual stream dimension.
+
+Appendix ELarger model unlearning details
+
+Model architecture and routing settings. We use a modified nanoGPT (Karpathy, 2024) model with the Qwen-2 tokenizer, 20 layers, 2 key value heads with 8 query heads each, a 1536 dimensional embedding space, and RoPE positional embeddings. We route the specified tokens to the 0th through 79th MLP dimensions on layers 0–7. We add additionally set the mask weight for the routed forget tokens in the original dimensions of target layers to
+−
+
+5
+×
+10
+−
+
+8
+. We also add a
+1
+×
+10
+−
+
+7
+ L1 penalty to the MLP activations of the target layers.
+
+Training. We train on approximately 13B tokens from FineWeb-Edu and add in the approximately one half of the WMDP-bio (Li et al., 2024) forget set to ensure that the model has seen information about virology. Each step consists of an effective batch size of
+1
+,
+280
+ for a total of
+1
+,
+310
+,
+720
+ tokens per step and we train for
+10
+,
+000
+ steps. We use AdamW with a learning rate warmup of
+2
+,
+000
+ steps to
+1.8
+×
+10
+−
+3
+ with cosine decay to
+1.8
+×
+10
+−
+4
+ after
+60
+,
+000
+ steps,
+𝛽
+1
+=
+0.9
+,
+𝛽
+2
+=
+0.95
+, and gradient clipping at
+1.0
+.
+
+Evaluation. After training, we ablate the 0th through 79th MLP dimensions on layers 0 through 7. We then retrain on data from FineWeb-Edu for 32 steps of 128 sequences of 1024 tokens each, while not allowing gradients to flow into the dimensions that had been ablated. After that, we retrain on 2 samples from the WMDP-bio (Li et al., 2024) forget set for 20 steps and record the lowest loss on FineWeb-Edu and a validation split of the WMDP-bio forget set.
+
+Appendix FScalable oversight details
+
+In this section, we provide details on the motivation and setup for our experiments on scalable oversight in section 4.3. Recall that in scalable oversight problems, we seek to train a performant policy despite limited access to reliable labels. We deal with the episodic RL setting. Throughout, we distinguish between:
+
+•
+
+Cursory labels: labels that are available for all episodes, which may lack key information about the episode; and
+
+•
+
+Comprehensive labels: labels that fully characterize the relevant properties of an episode, sufficient to determine its true reward.
+
+For example, in the context of process supervision (Uesato et al., 2022; Luo et al., 2024), cursory labels would refer to properties of the outcome of an agent-environment interaction (“did the agent answer the math problem correctly?”), and comprehensive labels would refer to properties of the process used to produce the outcome (“was the agent’s reasoning sound?”).
+
+Partial oversight details. Each episode includes a label
+𝑦
+∈
+𝒴
+ that is either cursory (“did the agent reach a terminal grid square at all?”) or comprehensive (“which terminal grid square did the agent reach?”). The set of all labels is
+
+
+𝒴
+=
+
+{
+not reached
+,
+reached something
+,
+reached
+Diamond
+,
+reached
+Ghost
+}
+.
+
+
+The partial oversight environment is parameterized by a level of oversight
+𝑝
+∈
+[
+0
+,
+1
+]
+. At the beginning of an episode, after the agent is randomly placed, Diamond and Ghost are placed uniformly at random on distinct grid squares. Then, boolean oversight indicators for Diamond and Ghost are sampled independently with probability
+𝑝
+ to determine which terminal squares will be under oversight. The environment state (which is observed by the agent) comprises a one-hot encoded state of the grid cells (not pixels) and a binary mask that contains the terminal squares’ oversight indicators, and is zero elsewhere.
+
+Comprehensive labels are available only for episodes where the agent reached a terminal square with the indicator set to True. For the remaining episodes, the labels are cursory, i.e. either “not reached” or “reached something”.
+
+Policy network architecture. Our policy network
+𝜋
+
+(
+𝑠
+)
+ incorporates a mixture of experts (MoE) layer. For a state
+𝑠
+∈
+𝒮
+,
+
+
+𝜋
+
+(
+𝑠
+)
+
+=
+𝑠
+⊳
+MoE
+⊳
+Linear
+[
+256
+,
+𝑎
+]
+,
+
+
+where
+⊳
+ denotes a piping operator,
+(
+𝑥
+⊳
+𝑓
+)
+≜
+𝑓
+
+(
+𝑥
+)
+,
+Linear
+[
+in
+,
+out
+]
+ denotes a linear layer with a given number of input and output dimenesions, and
+𝑎
+ is the number of actions. The MoE layer combines outputs from two expert networks
+E
+Diamond
+,
+E
+Ghost
+, using a gating circuit
+Γ
+:
+𝒮
+→
+[
+0
+,
+1
+]
+:
+
+
+MoE
+
+(
+𝑠
+)
+
+=
+E
+Diamond
+
+(
+𝑠
+)
+⋅
+Γ
+
+(
+𝑠
+)
++
+E
+Ghost
+
+(
+𝑠
+)
+⋅
+(
+1
+−
+Γ
+
+(
+𝑠
+)
+)
+;
+
+
+E
+Diamond
+
+(
+𝑠
+)
+
+=
+𝑠
+⊳
+Flatten
+⊳
+Linear
+[
+𝑑
+,
+256
+]
+⊳
+Linear
+[
+256
+,
+256
+]
+;
+
+
+E
+Ghost
+
+(
+𝑠
+)
+
+=
+𝑠
+⊳
+Flatten
+⊳
+Linear
+[
+𝑑
+,
+256
+]
+⊳
+Linear
+[
+256
+,
+256
+]
+;
+
+
+Γ
+
+(
+𝑠
+)
+
+=
+𝑠
+⊳
+Conv
+4
+→
+4
+⊳
+Flatten
+⊳
+Linear
+[
+𝑑
+,
+256
+]
+⊳
+Linear
+[
+256
+,
+256
+]
+⊳
+Linear
+[
+256
+,
+1
+]
+⊳
+𝜎
+,
+
+
+where
+𝑑
+ is the observation dimension and ReLU activations are applied after all linear and convolutional layers except for the last linear layer in
+Γ
+.
+
+This architecture allows us to isolate computation responsible for certain behaviors into the modules, and later steer the model by manually manipulating the gating coefficients. Baselines use a gateless, single-expert version of this architecture. So, the baselines have the same type as a steered MoE policy.
+
+Training details. The MoE policy network is trained with REINFORCE with a value function baseline (Williams, 1992; Sutton & Barto, 2018) based on the reward function
+
+
+𝑟
+MoE
+
+(
+𝑦
+)
+=
+{
+1
+
+if
+
+𝑦
+∈
+{
+reached
+Ghost
+,
+reached
+Diamond
+}
+;
+
+
+0
+
+ otherwise
+.
+
+
+The value function baseline is a separate network trained based on Monte Carlo returns. The loss includes an entropy bonus and a term to incentivize the gate to specialize to the desired expert. For a trajectory
+𝜏
+=
+(
+𝑠
+1
+,
+𝑎
+1
+,
+…
+,
+𝑠
+𝑇
+,
+𝑦
+)
+, the overall loss is
+
+
+ℒ
+MoE
+
+(
+𝜏
+)
+=
+ℒ
+REINFORCE
+
+(
+𝜏
+)
++
+𝛼
+v
+
+ℒ
+value
+
+(
+𝜏
+)
++
+𝛼
+e
+
+ℒ
+entropy
+
+(
+𝜏
+)
++
+𝛼
+g
+
+ℒ
+gate
+
+(
+𝜏
+)
+.
+
+
+We only report the unique aspects of our implementation here: the gradient routing, and the gate loss. Whenever we have access to a comprehensive label for an episode, we use it to perform gradient routing in the MoE layer, denoted here with a tilde.
+
+
+MoE
+~
+
+(
+𝑠
+;
+𝑦
+)
+=
+{
+E
+Diamond
+
+(
+𝑠
+)
+⋅
+sg
+
+{
+Γ
+
+(
+𝑠
+)
+}
++
+sg
+
+{
+E
+Ghost
+
+(
+𝑠
+)
+⋅
+(
+1
+−
+Γ
+
+(
+𝑠
+)
+)
+}
+
+ if
+
+𝑦
+=
+reached
+Diamond
+;
+
+
+sg
+
+{
+E
+Diamond
+
+(
+𝑠
+)
+⋅
+Γ
+
+(
+𝑠
+)
+}
++
+E
+Ghost
+
+(
+𝑠
+)
+⋅
+sg
+
+{
+1
+−
+Γ
+
+(
+𝑠
+)
+}
+
+ if
+
+𝑦
+=
+reached
+Ghost
+;
+
+
+E
+Diamond
+
+(
+𝑠
+)
+⋅
+Γ
+
+(
+𝑠
+)
++
+E
+Ghost
+
+(
+𝑠
+)
+⋅
+(
+1
+−
+Γ
+
+(
+𝑠
+)
+)
+
+ otherwise
+,
+
+
+where
+sg
+
+(
+⋅
+)
+ is the stop-gradient operator.
+
+The gate loss is chosen so as to encourage the gating circuit to activate only on one module. It only applies when a comprehensive label is available.
+
+
+ℒ
+gate
+
+(
+𝜏
+)
+=
+𝑇
+−
+1
+
+∑
+𝑡
+=
+1
+𝑇
+{
+log
+
+Γ
+𝑦
+
+(
+𝑠
+𝑡
+)
+}
+𝟙
+
+{
+𝑦
+=
+reached
+Diamond
+}
++
+{
+log
+
+(
+1
+−
+Γ
+𝑦
+
+(
+𝑠
+𝑡
+)
+)
+}
+𝟙
+
+{
+𝑦
+=
+reached
+Ghost
+}
+.
+
+
+Baselines are trained identically except with different reward functions (given below), no routing, and no gate loss.
+
+Hyperparameters. The model is trained for 20,000 algorithm steps on environments with 512 parallel instances. Each policy rollout consists of 32 steps in the environment. So, the total number of environment steps is 328M. We use the Adam optimizer with learning rate 5e−5 and no weight decay. Loss coefficients are
+𝛼
+v
+=
+0.5
+,
+𝛼
+e
+=
+0.1
+, and
+𝛼
+g
+=
+0.01
+.
+
+Evaluation. To steer the model towards Diamond, we set gate coefficient
+Γ
+
+(
+⋅
+)
+=
+1
+, effectively ablating
+E
+Ghost
+.
+
+Naive training baseline. The naive training baseline trains on all episodes with reward function
+
+
+𝑟
+naive
+
+(
+𝑦
+)
+=
+{
+−
+1
+
+if
+
+𝑦
+=
+reached
+Ghost
+;
+
+
+1
+
+if
+
+𝑦
+∈
+{
+reached something
+,
+reached
+Diamond
+}
+;
+
+
+0
+
+otherwise
+.
+
+
+The naive training baseline eventually learns approximately-optimal behavior with respect to its reward function. This behavior is to go to Ghost whenever it is closer than Diamond and not under oversight. Consequently, it achieves low ground truth return late in training, particularly at low levels of oversight.
+
+Data filtering baseline. The data filtering baseline is trained only on episodes where the outcome is understood, i.e.
+𝑦
+≠
+ reached something
+. It uses the reward function
+
+
+𝑟
+filter
+
+(
+𝑦
+)
+=
+{
+−
+1
+
+if
+
+𝑦
+=
+reached
+Ghost
+;
+
+
+1
+
+if
+
+𝑦
+=
+reached
+Diamond
+;
+
+
+0
+
+𝑦
+=
+not reached
+.
+
+
+When the oversight level is under
+40
+%
+, the data filtering baseline fails to learn the optimal behavior given this reward function, which is to navigate to Diamond using the shortest path in all cases (fig. 6(a)). This failure exists even in the limit of training, as shown in fig. 13. This is unsurprising, as training on a non-i.i.d. subset of trajectories has no convergence guarantees in principle.
+
+Oracle filtering baseline. Oracle filtering (not presented in the main body) trains with the ability to observe all terminal states, but its training data (at the level of rollouts) is downsampled i.i.d. at a rate equal to the oversight level
+𝑝
+. This violates the definition of the partial oversight environment.
+
+Unlike data filtering, oracle filtering learns approximately optimal behavior given the reward function, which is to navigate to Diamond in all cases. Interestingly, despite making use of data it should not have access to, oracle filtering is less data efficient than gradient routing early in training. For example, under 10% oversight, oracle filtering needs approximately 3×–4× more training steps to reach the performance of gradient routing (fig. 13). This means that in these cases, gradient routing is about 3×–4× more data-efficient than even oracle filtering.
+
+Figure 13:Learning curves showing the ground-truth performance of each algorithm over the course of training at 10% oversight. Dark highlights correspond to approximate 95% confidence interval for the mean. Light highlights correspond to 5th and 95th quantiles.
+
+Note that the highest episodic return for the gradient routing MoE agent is about 0.2 lower than what the oracle filtering baseline reaches given enough training steps. This is because, while the ablated agent virtually never navigates towards Ghost alone, it will walk to Ghost if it is on the shortest path to Diamond. Based on the random initial state of the environment, this happens some proportion of the time, leading to reduced reward. We discuss considerations necessary for overcoming this shortcoming in appendix G.
+
+Ablations. To understand the roles played by gradient routing and the MoE, we ablate each of them. Figure 14) show that that both techniques are necessary to achieve stable performance. Gradient routing on its own causes some expert specialization early in training, but on on average this effect dissipates over time. Gating on its own does not lead to reliable specialization.
+
+We hypothesize that gradient routing helps reduce the noise caused by the gating circuit at the beginning of the training, when the circuit is still sub-optimal. This stabilization effect is similar to the effects of teacher forcing in seq-to-seq models (Williams & Zipser, 1989). However, by intervening on only the backward pass, we get the benefits of teacher forcing without inducing distribution shift.
+
+Figure 14:Ground truth returns comparing to two baselines, one without gradient routing, and the other with the gate module set to output a constant 0.5 (mixing the two experts equally). Dark highlights correspond to approximate 95% confidence interval for the mean (across multiple runs). Light highlights correspond to 5th and 95th quantiles.
+Appendix GImpacts of localizing capabilities vs. dispositions for scalable oversight
+
+To achieve scalable oversight, our proposed strategy for preventing bad behavior (for example) is to (1) localize a submodule responsible for bad behavior, then (2) ablate the submodule. In this section, we one factor that may complicate this strategy in real-world applications.
+
+We distinguish between two types of processing that might occur within a neural network to cause some behavior, like navigating to a red tile in a gridworld. With respect to a particular behavior, we define:
+
+Capability.
+
+Processing that is necessary for engaging in the behavior; for example, feature extraction and computation to detect a red tile and compute the shortest path to reach it.
+
+Disposition.
+
+Processing that is not a capability but that determines behavior (as a probability distribution over network outputs). For example, a submodule that processes features representing the shortest path to a red tile and a blue tile and then returns action probabilities corresponding to the red tile.
+
+These definitions are informal. Note: Similar terms have been used in the context of AI evaluations (Beverley et al., 2024), but, to the best of our knowledge, have not been formalized. See Beverley et al. (2024) for a philosophical treatment of related terms.
+
+Depending on whether capabilities or dispositions are to be localized, the application of gradient routing to scalable oversight faces different challenges, as summarized in table 4.
+
+Table 4:An overview of the challenges to localizing capabilities vs. dispositions as a means of achieving scalable oversight. A checkmark (✓) indicates a step that we speculate is easy to achieve; a challenge indicates a fundamental difficulty.
+ Localization during training After ablating the target region
+Localizing capabilities Challenge: entangled capabilities ✓
+Localizing dispositions ✓ Challenge: distribution shift
+
+In the case of capabilities localization, obtaining a performant policy post-ablation is straightforward in principle: by localizing and ablating, one has created an encoding of the state which does not admit any postprocessing which will exhibit the capability (analogous to the MNIST split encoding, whose bottom half did not admit any learned decoding for digits 0–4 as shown in fig. 3). In that case, one can simply train freeze this feature encoder and train on top of it. However, there is a fundamental challenge: in many problems, capabilities may not factor because they are entangled. For example, the skills required to be a cybersecurity researcher vs. a hacker overlap significantly.
+
+On the other hand, we speculate that localizing dispositions is straightforward, and suitable for problems where capabilities are entangled. For example, even if cybersecurity and hacking involve the same capabilities, we expect to be able to localize the disposition for (harmful) hacking. However, localizing dispositions for scalable oversight does not permit post-ablation training, because further training could change the agent’s disposition. Instead, we must either zero-shot ablate, or find another manner of post-training that avoids this issue (e.g. fine-tuning on high-quality labeled data only). The fundamental difficulty to zero-shot ablation is distribution shift: suppose that during the training of a policy, an internal module is learned that governs the policy outputs in some regions of state space but not others. If, upon ablation, that module “becomes responsible” for regions that were previously governed by an ablated component, there is no reason to expect it to perform well in these states which are, with respect to its role in training, off-distribution.
+
+Appendix HComputational cost of gradient routing
+
+Memory. Storing edge weights for every data point would incur a hefty cost of
+𝑂
+
+(
+|
+ℬ
+|
+
+|
+ℰ
+|
+)
+ memory per batch. In practice, this cost is easily avoided by reducing dependence on the amount of data and the number of edges. First: instead of assigning unique gradient routes to each data point, we assign routes according to membership in parts of a partition
+𝓟
+ of data points, reducing the
+|
+ℬ
+|
+ term to
+|
+𝓟
+|
+. For example, in a typical unlearning application, we would use
+𝓟
+=
+{
+𝒫
+retain
+,
+𝒫
+forget
+}
+ with a single gradient route assigned to each set. Second: we restrict the set of edges considered. For example, using only edges leaving parameters reduces the
+|
+ℰ
+|
+ factor to
+𝑂
+
+(
+𝑝
+)
+ if the neural net parameters have dimensionality
+𝑝
+. This amounts to choosing elementwise learning rates for each parameter entry, for each data point.
+
+Runtime. In the general case, gradient routing requires
+|
+ℬ
+|
+
+|
+ℰ
+|
+ floating point operations to apply a scalar multiplication to each edge in the computational graph. Since we apply gradient routing to a sparse set of edges, like the
+𝑑
+model
+ entries of a hidden activation of a Transformer, the number of operations is much lower:
+|
+ℬ
+|
+⋅
+𝑑
+model
+, for example. This is negligible compared to the number of operations required for matrix multiplication.
+
+Appendix IExtended literature review
+
+We start by reviewing further works that, like gradient routing, modify learning rates or backpropagation.
+
+Adjusting learning rates. Discriminative fine-tuning (Howard & Ruder, 2018) sets the learning rate for each layer independently to improve training efficiency. You et al. (2017) introduce Layer-wise Adaptive Rate Scaling (LARS), which dynamically adjusts learning rates for each layer during training.
+
+Modifying backpropagation. Sun et al. (2017b)’s meProp uses only the top-k dimensions by magnitude of the gradient when updating parameters during training, which improves the accuracy of MNIST classifiers. Panda et al. (2024b) and Sung et al. (2021) optimize only a sparse subnetwork of a model during fine-tuning, minimizing catastrophic forgetting and memory usage. Rosenfeld & Tsotsos (2019) go a step further by updating only a small subset of parameters during pre-training, demonstrating competitive performance compared to conventional methods.
+
+The methods above can be framed as multiplying the gradient by a mask vector. Mohtashami et al. (2022) prove the theoretical convergence properties of binary gradient masking methods using a similar notation to our definition of gradient routing in Section 3.
+
+Geiger et al. (2022b) train models to respect certain causal structure by applying interventions to the forward pass and minimizing the difference between the actual output and the expected output according to a user-supplied causal model. This method could be used to localize capabilities by ensuring some modules are causally relevant to certain outputs.
+
+Fine-tuning parameter subsets. Many popular fine-tuning methods update only a small subset of parameters with the goal of computational efficiency or minimizing catastrophic forgetting or catastrophic interference (Sun et al., 2017a; Sung et al., 2021; Rosenfeld & Tsotsos, 2018; Kaplun et al., 2024; Lee et al., 2023; Zhang et al., 2022; Mallya & Lazebnik, 2018; Panda et al., 2024a). In some sense this localizes the new capabilities to this small subset of the network (as gradient routing does), although these tuned parameters may be activating latent abilities already present in the network (Ben Zaken et al., 2022).
+
+Safe LoRA (Hsu et al., 2024) projects fine-tuned weights into a “safety-aligned subspace’, while subspace-oriented model fusion (SOMF) (Yi et al., 2024) masks task vectors (Ilharco et al., 2023) such that they do not interfere with the subspace identified as relevant for safe behavior, before merging them into the model using model fusion (Zhang et al., 2023; Jin et al., 2023).
+
+Hierarchical reinforcement learning. Early work in hierarchical reinforcement learning used hand designed sub-behaviors assigned to individual modules to divide and conquer more complex tasks (Maes & Brooks, 1990; Singh, 1992; Mahadevan & Connell, 1992) although later works discard this approach in favor of automatically learned sub-behaviors (Hutsebaut-Buysse et al., 2022).
+
+Disentangled representations. While gradient routing partitions representations using supervised training, disentangled representation learning attempts to separate representations in an unsupervised manner (Bengio et al., 2013; Wang et al., 2024) using methods such as VAEs (Kingma & Welling, 2013; Mathieu et al., 2019) and GANs (Goodfellow et al., 2014; Chen et al., 2016).
+
+Appendix JExtended comparisons to other modularity methods
+
+Some modular training techniques have similar aims as gradient routing. Others are mechanistically similar but are suitable for different problems. In this section, we compare gradient routing to a select few of these methods, explaining similarities and highlighting key differences. These comparisons clarify the novel aspects of gradient routing that enable its unique applications. Table 5 provides a summary.
+
+DEMix Layers. Gururangan et al. (2021) introduce DEMix Layers, which are modular collections of MLP experts trained on different domains. In Transformer language models, they are interleaved with standard attention blocks.
+
+•
+
+Similarity to gradient routing: DEMix layers are neural network submodules that are trained to specialize to different tasks based on data labels; gradient routing can also be used to train specialized neural network submodules based on data labels.
+
+•
+
+Difference to gradient routing:
+
+–
+
+Gradient routing decouples the localization of learning from the localization of computation. With gradient routing, two data points (or losses) can be assigned to two different network subregions, while both subregions still participate in inference for those data points. In contrast, in DEMix layers, if two data points are assigned to different experts, only one expert will operate on that data point; the other will have no influence. This is a critical difference because separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption (see section 5), which requires that all features are present at the time of the forward pass.
+
+
+
+Regarding absorption: in gradient routing, inducing a neuron to represent a feature might mean that the model does not learn the feature elsewhere. But in DEMix, inducing a feature in one expert does nothing to prevent another expert from learning the same feature, because there is no way a different expert can utilize a feature that is not available in its forward pass.
+
+–
+
+Gradient routing is not limited to particular modules; it can be used to intervene at any level of computation, like individual neurons, parameters, or activations. As a consequence, gradient routing enables new kinds of localization. For example, we achieve unprecedented control of learned representations in MNIST autoencoders in section 4.1 and language model features in section 4.2.1.
+
+–
+
+Gradient routing is architecture-independent.
+
+–
+
+Gradient routing is a training-time intervention; it does not require routing at inference time.
+
+Interchange Intervention Training (IIT). (Geiger et al., 2022a) train neural networks such that their internal computation is consistent with a user-supplied causal model. The idea is to utilize prior domain knowledge to ensure that a neural network reflects understood or desired dependencies between variables.
+
+•
+
+Similarity to gradient routing: like gradient routing, IIT imposes structure on model internals based on a user-supplied specification.
+
+•
+
+Difference to gradient routing:
+
+–
+
+Gradient routing requires, for each data point, a specification of how to backpropagate its loss. IIT requires, for each data point, one or more counterfactual versions of the data point and a specification of how model internals should change in response to the counterfactual case(s).
+
+–
+
+Gradient routes are straightforward to specify and universally applicable, e.g. “any data point belonging to this set will have its gradients restricted to that submodule”. In contrast, the structural causal models required by IIT may not even exist for many real world tasks, and when they do, they may not be known, or may be difficult to specify. This limitation is reflected by the artificiality of tasks presented in Geiger et al. (2022a).
+
+•
+
+IIT requires multiple forward and backward passes per training data point.
+
+PackNet. Mallya & Lazebnik (2018) propose a method for continual learning that works by pruning unnecessary parameters (by setting them to zero) and then retraining those parameters on a new task. In doing so, the method limits deterioration of performance on prior tasks.
+
+•
+
+Similarity to gradient routing: PackNet can be understood as interleaved steps of (i) pruning and (ii) gradient routing. After identifying unnecessary parameters and setting them to zero, gradients for a new task are routed to those parameters. (Transfer learning and fine-tuning methods that freeze weights or adjust learning rates when training on new data can be interpreted similarly.)
+
+•
+
+Difference to gradient routing:
+
+–
+
+Localization via gradient routing is supervised: the user chooses where data is routed (with the motivation of creating a network with known internal structure); in contrast, localization via PackNet is unsupervised (with the motivation of efficiently training a model to perform a novel task).
+
+–
+
+Gradient routing is more general than PackNet, allowing for arbitrary mappings of data (at any level of granularity) to network regions (as opposed to the special case of sequential tasks being mapped to pruned regions).
+
+–
+
+Gradient routing has applications beyond continual learning: supervised control of learned representations, localization to enable robust removal of sensitive information or harmful capabilities, and reinforcement learning from limited labels. An application of PackNet to these settings would require a filtered and ordered training dataset to prevent capabilities being learned at unknown locations throughout the network. This is impossible for many problems (for example, all the problem settings considered in this paper).
+
+PiggyBack. Mallya et al. (2018) presents a method for adapting neural networks to novel tasks without changing their weights, by learning additive task-dependent parameter masks (and then binarizing them).
+
+•
+
+Similarity to gradient routing: if the masks learned by the PiggyBack training step are intepreted as parameters of the neural network, then the PiggyBack training step can be considered as a special case of gradient routing, where different tasks are routed to different sets of PiggyBack mask weights.
+
+•
+
+Difference to gradient routing:
+
+–
+
+Similar to PackNet, and unlike gradient routing, the way that localization occurs in PiggyBack is primarily decided by the algorithm itself (according to the objective of attaining low loss on a novel task). The user is not expected to supply a specification for how data is localized to different network subregions.
+
+–
+
+Gradient routing is applied during training, whereas PiggyBack is applied after training. This means that gradient routing can be applied to any differentiable learning task (for example, online reinforcement learning, or LLM pre-training), whereas PiggyBack is only applicable in the fine-tuning paradigm.
+
+–
+
+Gradient routing is a more general technique than PiggyBack, allowing for arbitrary mappings of data (at any level of granularity) to network regions (as opposed to the special case of tasks being localized to masks).
+
+Table 5:A summary of properties of localization methods discussed in appendix J: Supervised localization means the method expects the user to supply a specification for how and where learning is to be localized; Decoupled means that localization of learning updates occurs without requiring that computation is localized as well (such that different localization targets can simultaneously participate in a single forward pass); Assignment shows the mapping of what kind of data is localized where according to the method; training type is the mode of training suitable for the method. Note that nothing prevents the application of gradient routing or IIT during fine-tuning (FT), but that is not the focus of our work, nor of Geiger et al. (2022a).
+Method Supervised localization Decoupled Assignment Training type
+Gradient routing ✓ (masks) ✓ any data
+↦
+ anywhere Any (non-FT)
+DEMix layers ✓ (provenance labels) No label
+↦
+ expert Any
+IIT ✓ (causal model, etc.) ✓ any data
+↦
+ anywhere Any (non-FT)
+PackNet No ✓ task
+↦
+ param subset FT / continual
+PiggyBack No Partially task
+↦
+ weight mask FT / continual
+Appendix KChoosing gradient routes: how to decide what data goes where
+
+In this section, we discuss how to choose gradient routes in practice.
+
+Choosing gradient routes is like choosing a neural net architecture. Much like choosing a neural architecture, the choice of gradient routes is guided by intuition about neural net learning dynamics, data characteristics, and the needs of a particular application. Possible considerations include:
+
+•
+
+Does the target subregion have sufficient representational capacity to learn the task routed to it? (What proportion of the training data is being routed?)
+
+•
+
+Is the intended localization consistent with the neural network’s inductive biases? If not, strong regularization may be needed.
+
+•
+
+Will part of the model be ablated after training? If so, training should be configured such that model performance is minimally harmed by ablation.
+
+Ultimately, gradient routes are chosen based on empirical performance and ease of use, on a problem-by-problem basis. Small-scale preliminary experiments are helpful.
+
+Examples of choices of masks and the reasoning behind them. The purpose of gradient routing is to induce structure in neural networks, so before choosing gradient routes one must have an idea of what kind of capability or information is to be localized. Here, we describe the desired structure for each application area of the paper and the masks chosen as a result. Throughout, we write
+𝟎
+𝑘
+ to refer to the (row) vector of 0’s with
+𝑘
+ elements,
+𝟏
+𝑘
+ to refer to the (row) vector of 1’s with
+𝑘
+ elements, and
+𝑒
+𝑗
+,
+𝑘
+ to refer to the
+𝑗
+th standard basis vector in
+ℝ
+𝑘
+. We describe the specification of gradient masks as presented in fig. 2.
+
+•
+
+MNIST autoencoding (section 4.1): the goal is to split the representation of an autoencoder in two halves, each containing distinct, non-overlapping features, so we applied stop-gradient masks to the output of the encoder only. The masks are simple: for digits 0–4, we use the mask
+[
+𝟏
+16
+,
+𝟎
+16
+]
+⊺
+, and for digits 5–9 we use the mask
+[
+𝟎
+16
+,
+𝟏
+16
+]
+⊺
+. These masks partition learning updates to different halves of the encoding based on the data partition. In summary:
+
+–
+
+Mask location: the encoder output (in
+ℝ
+32
+)
+
+–
+
+Masks: digits 0–4
+→
+
+[
+𝟎
+16
+,
+𝟏
+16
+]
+⊺
+, digits 5–9
+→
+[
+𝟏
+16
+,
+𝟎
+16
+]
+⊺
+
+•
+
+Steering scalar (section 4.2.1): in this case, the goal is to induce an axis-aligned feature, meaning a direction in the activation space of a Transformer LM that corresponds to outputting a particular kind of token.
+
+–
+
+Mask location: the outputs of layers 6–18
+
+–
+
+Masks: the token “_California” (as a label)
+→
+𝑒
+1
+,
+𝑑
+model
+, all other tokens
+→
+𝟏
+𝑑
+model
+⊺
+
+•
+
+Robust removal of harmful capabilities in LLMs (section 4.2.2, section 4.2.3): In this case, the goal was to localize capabilities necessary for good performance on the forget set, without damaging performance on the retain set. Meng et al. (2022) present evidence that factual information is stored in the MLP activations of a Transformer, so localizing to MLP neurons was a natural choice. (Also, when we tried localizing to Transformer attention heads, the post-ablation reduction in retain set performance was high.)
+
+–
+
+Mask location: MLP activations in target layers (in
+ℝ
+64
++
+𝑑
+MLP
+)
+
+–
+
+Masks: forget tokens
+𝑡
+
+→
+[
+𝟏
+64
+,
+𝛼
+𝑡
+
+𝟏
+𝑑
+MLP
+]
+⊺
+, retain tokens
+→
+𝟏
+64
++
+𝑑
+MLP
+⊺
+. For unlearning on Tinystories (section 4.2.2), we use
+𝛼
+𝑡
+∈
+[
+−
+1
+,
+1
+]
+ chosen based on the relative frequency of the token in the forget set vs. retain set, as described in appendix C. For virology unlearning (section 4.2.3), we simply use
+𝛼
+𝑡
+=
+−
+5
+⋅
+10
+−
+8
+ for all 20 tokens listed.
+
+•
+
+Reinforcement learning from limited labels (section 4.3): in this case, the idea was to induce two experts, one which is mechanistically responsible for diamond-seeking behavior, and one which is responsible for ghost-seeking behavior. We additionally masked the gating network’s outputs in cases with oversight to make the gating loss the only source of gradients in those cases.
+
+–
+
+Mask location: the output of the diamond expert, ghost expert, and gating module (in
+ℝ
+𝑑
+expert
+×
+ℝ
+𝑑
+expert
+×
+ℝ
+2
+)
+
+–
+
+Masks: episodes where diamond was reached (with oversight)
+→
+
+(
+𝟏
+𝑑
+expert
+⊺
+,
+𝟎
+𝑑
+expert
+⊺
+,
+𝟎
+2
+⊺
+)
+, episodes where ghost was reached (with oversight)
+→
+
+(
+𝟎
+𝑑
+expert
+⊺
+,
+𝟏
+𝑑
+expert
+⊺
+,
+𝟎
+2
+⊺
+)
+, all other episodes
+→
+
+(
+𝟏
+𝑑
+expert
+⊺
+,
+𝟏
+𝑑
+expert
+⊺
+,
+𝟏
+2
+⊺
+)
+
+Appendix LRelevance of gradient routing to problems in AI safety
+
+Addressing foundational challenges in aligning LLMs. Anwar et al. (2024) provide a survey of challenges to ensuring safe deployment of advanced LLM-based AI systems. In the following list, comment on challenges that gradient routing may help address. Related ideas are discussed in section 5.
+
+•
+
+Tools for Interpreting or Explaining LLM Behavior Are Absent or Lack Faithfulness - By controlling latent representations and module specialization, gradient routing may enable the training of models that admit more faithful explanations of behavior (sections 4.1, 4.2.1 and 4.3).
+
+•
+
+Existing Data Filtering Methods Are Insufficient - Gradient routing outperforms data filtering in head-to-head comparisons (end of section 4.2.2, section 4.3). Absorption provides an explanation for why this might be a general effect, granting gradient routing unique affordances.
+
+•
+
+Goal-Directedness Incentivizes Undesirable Behaviors - Gradient routing allows imperfect labels to induce desired behavior in reinforcement learning via mechanistic supervision (section 4.3).
+
+•
+
+Difficulty of Robust Oversight and Monitoring - By localizing modules responsible or necessary for particular behaviors, gradient routing may enable the training of models that admit faithful explanations of behavior (whole paper).
+
+•
+
+Output-Based Adversarial Training May Incentivize Superficial Alignment - Gradient routing provides a way to utilize imperfect labels without purely outcome-based training (section 4.3, whole paper).
+
+•
+
+Techniques for Targeted Modification of LLM Behavior Are Underexplored: “…current approaches struggle to remove undesirable behaviors, and can even actively reinforce them. Adversarial training alone is unlikely to be an adequate solution. Mechanistic methods that operate directly on the model’s internal knowledge may enable deeper forgetting and unlearning” (p.53). Gradient routing offers a new, general approach to modifying LLM behavior (section 4.2) that exploits internal mechanisms.
+
+•
+
+Challenges with Scalable Oversight - Gradient routing enables scalable oversight in a toy model (section 4.3).
+
+Towards auditable AI specialists. Here, we consider the implications of localization for advanced AI systems of increasing capability.
+
+General-purpose AI systems may be difficult to control or validate. For example, a factory planning AI with broad knowledge of economics might optimize its objective by manipulating market prices, while a research assistant AI with deep understanding of human psychology might shape its outputs to maximize positive evaluations rather than accuracy. In general, powerful AI systems may pursue unintended strategies enabled by capabilities beyond what is necessary for them to fulfill their intended function.
+
+By tailoring otherwise-general AI systems to specific tasks through the removal of unnecessary capabilities, we could make their behavior more predictable and verifiable. This aligns with the established principle of least privilege from computer security (Saltzer & Schroeder, 1975), where each component receives only the permissions required for its intended function. For any AI deployment, we can systematically evaluate which potentially-dangerous capabilities are necessary and remove those that are not. This removal could be verified through systematic testing, for example, by attempting to elicit the supposedly-removed capabilities through fine-tuning or automated red-teaming efforts.
+
+Alternatively, instead of removing capabilities entirely, we could apply access controls to limit which parties are able to utilize sensitive capabilities of a general model (Sandhu & Samarati, 1994; Samarati & de Vimercati, 2001). Gradient routing could allow overseers to robustly detect when certain capabilities are active by monitoring neural net modules with known functions.
+
+Limitations of our discussion.
+
+This section is non-exhaustive. For example, we have not reviewed problems in algorithmic bias and fairness, where gradient routing may be helpful for its ability to perform concept erasure (based on the experiments in section 4.1; see, e.g., Belrose et al. (2023)). Nor do we elaborate on dual use concerns, mentioned in section 4.2.3.
+
+Report Issue
+Report Issue for Selection
+Generated by L A T E xml
+Instructions for reporting errors
+
+We are continuing to improve HTML versions of papers, and your feedback helps enhance accessibility and mobile support. To report errors in the HTML that will help us improve conversion and rendering, choose any of the methods listed below:
+
+Click the "Report Issue" button.
+Open a report feedback form via keyboard, use "Ctrl + ?".
+Make a text selection and click the "Report Issue for Selection" button near your cursor.
+You can use Alt+Y to toggle on and Alt+Shift+Y to toggle off accessible reporting links at each section.
+
+Our team has already identified the following issues. We appreciate your time reviewing and reporting rendering errors we may not have found yet. Your efforts will help us improve the HTML versions for all readers, because disability should not be a barrier to accessing research. Thank you for your continued support in championing open access for all.
+
+Have a free development cycle? Help support accessibility at arXiv! Our collaborators at LaTeXML maintain a list of packages that need conversion, and welcome developer contributions.
diff --git a/docs/spec/20260529_gradient_routing_and_env_split.md b/docs/spec/20260529_gradient_routing_and_env_split.md
new file mode 100644
index 0000000..55c36f6
--- /dev/null
+++ b/docs/spec/20260529_gradient_routing_and_env_split.md
@@ -0,0 +1,394 @@
+# Gradient routing + a second hack mechanism
+
+## Goal
+
+Two changes that together make the no-cheat generalization claim *testable* and
+give us a second intervention to compare against erasure.
+
+**BUILD ORDER (3 stages, user decision 2026-05-29):**
+- Stage 1 — routing from v_hack, WITHIN-M1: build the routing machinery (T3,T5),
+ run {erase, route} on the current single-mechanism env. Proves the machinery
+ + answers solve-preservation. NOT generalization (same mechanism in and out).
+- Stage 2 — env split (T0,T1,T2,T4): add M2 via expose-K-tests.
+- Stage 3 — routing CLASS-1 -> CLASS-2 (T7): extract v_hack from M1 only,
+ route/ablate, test whether M2 ALSO drops. The absorption/generalization
+ headline.
+
+**Test BOTH interventions at each stage** => a 2x2: {erase, route} x
+{within-M1, cross-mechanism}. (erase x within-M1 ~= the existing projection
+result, so 3 of 4 cells are new.)
+
+A. **Env split (do first):** give the env a genuinely distinct, *reachable*
+ second hack mechanism. Approach = **expose-K-tests**: on SOLVABLE problems,
+ show K of N unit tests in the prompt and grade on the N-K HELD-OUT tests.
+ Hardcoding the K visible tests (special-casing the Solution) is then (a)
+ reachable and reward-paying, (b) mechanically distinct from the run_tests
+ loophole (exploit is in the Solution, not a model-defined test fn), and (c)
+ detectable for free — the held-out N-K tests ARE the M2 detector. Today C/D/E
+ are all facets of the single `run_tests` loophole, so without this there is
+ nothing to generalize across. NOT impossible-tasks (those zero out solve).
+
+B. **Gradient routing (do second):** instead of *erasing* the v_hack component
+ (project to null), *route* it into a quarantine `delta_S_hack` knob that we
+ ablate at eval — Gradient Routing (Cloud et al., arXiv 2410.04332). Tests the
+ **absorption** claim: a basis extracted from M1 labels, when ablated, also
+ removes M2 hacking.
+
+Execution is part-by-part after compaction. This spec is the persistent plan.
+
+## Scope
+
+In:
+- Env: expose K of N tests in the prompt; grade on N-K held-out. New `hardcoded`
+ (M2) detector = passes visible K but fails held-out, without defining
+ run_tests. Prompt-builder + grading-split + detector in the env + `rewards.py`.
+- Teacher rollouts that exhibit M2 (semi-synthetic, see T4).
+- `delta_S_hack` quarantine param + gradient split in `proj.py`, behind a
+ config flag that composes with the existing erase/refresh flags.
+- Eval-time ablation pass (route arm only).
+- README + blog one-liner ("we modified the env to..."), `git add -f` plots.
+
+Out:
+- Confessions implementation (separate later spec — conditional generation +
+ honesty judge + masked-token reward). Captured in TODO only.
+- Teacher-cutoff schedule experiment (TODO; see Log note).
+- The "double points for confessed hack" reward rule — rejected (breaks the
+ seal-of-confession decoupling; would breed an honest hacker, raising hack
+ rate, the opposite of our goal).
+
+## Requirements
+
+- **R0 (env is exploitable as designed — de-risk gate).** Before building the
+ full env, confirm Qwen3-4B will actually hardcode visible tests under reward
+ pressure (else expose-K is a dead end like perturbation was).
+ Done means: a short elicited rollout batch on expose-K prompts yields >0
+ completions that pass the visible K but fail held-out (M2=True).
+ VERIFY: a one-off script prints the M2 rate on ~32 elicited completions.
+ If 0% M2, STOP and reconsider the env (don't build T1-T4 on a dead mechanism).
+
+- **R1 (env: distinct second mechanism via expose-K-tests).** With K of N tests
+ shown in the prompt and grading on the N-K held-out, a response can earn
+ reward by hardcoding the visible K (special-casing the Solution) *without*
+ defining `run_tests`, and the `hardcoded` (M2) detector flags it with zero
+ overlap with C/D/E.
+ Done means: a hardcode response (special-cases the K visible inputs) scores
+ `reward>=3.0` on visible tests, `M2=True` (passes visible, fails held-out),
+ `C=D=E=False`; a run_tests-loophole response scores `M2=False, E=True`; a
+ genuinely-correct solution scores `M2=False` (passes held-out too).
+ VERIFY: extend `verify_rewards.py` with 3 cases (M2-only, M1-only, correct)
+ and assert the flag matrix. Sneaky fail: M2 fires on a correct solution
+ (held-out grading too strict) — the correct-solution case catches it.
+
+- **R2 (teacher exhibits M2).** The cached teacher pool used for the
+ generalization run contains rollouts the M2 detector flags, in the model's
+ own surface style (compiles, looks model-generated).
+ Done means: >=20% of a built M2-teacher pool flags M2=True and compiles.
+ VERIFY: a script prints the M2/M1/clean breakdown of the pool. Sneaky fail:
+ hand-written hacks are off-distribution (don't compile / trivially detectable
+ by string match) — caught by also logging compile rate and mean completion
+ length vs the existing E-teacher pool.
+
+- **R3 (gradient routing).** With `intervention=route`, the hack-subspace
+ component of the live gradient updates a separate `delta_S_hack` knob; the
+ orthogonal complement updates the main `delta_S`. Forward uses both during
+ training; eval can ablate `delta_S_hack`.
+ Done means: smoke shows two param groups, `delta_S_hack.grad` lives in
+ span(V) (its projection onto V^perp ~ 0), and an eval pass with
+ `delta_S_hack` zeroed runs.
+ VERIFY: smoke asserts `||delta_S_hack.grad - V^T(V delta_S_hack.grad)|| /
+ ||delta_S_hack.grad|| < 1e-4` on a fired module, and the ablated-eval BLUF
+ prints. Sneaky fail: routing silently equals erasure (delta_S_hack never
+ updated, so it's just project-to-null with extra storage) — caught by
+ asserting `delta_S_hack.norm() > 0` after a step where `fired>0`.
+
+- **R4 (config ablation, no silent path change).** `intervention ∈
+ {none, erase, route}` selects vanilla / current projection / routing, and
+ composes with `vhack_refresh_every` (the refresh axis is independent and
+ applies to both `erase` and `route`). `none`/`erase` reproduce today's
+ behaviour bit-for-bit.
+ Ablation matrix (the 5 distinct arms): none; erase; erase+refresh;
+ route; route+refresh.
+ NOTE: route and erase on the SAME basis are degenerate — route is a strict
+ superset of erase (erase = route, then discard the quarantine). So we do NOT
+ expose a route+erase combo on one basis. A genuine "route AND erase together"
+ would need two separate bases (e.g. erase the refreshed narrow M1 basis from
+ main, route a broader static basis to quarantine); deferred to TODO.
+ Done means: `intervention=erase` run matches a pre-change `arm=projected`
+ run on the same seed (same per-step hack_s). VERIFY: diff the per-step
+ hack_s columns of an `erase` run vs the archived `g0_21pairs` log; identical.
+
+- **R6 (KEY GOAL — the deliverable).** Regenerate BOTH dynamics plots
+ (`out/dynamics.png` small-multiples + `out/dynamics_hack_overlay.png`) from
+ REAL runs: >=3 arms (none/erase/route), >=60 steps, seed 41 (3 seeds later).
+ No mismatched-length test data. Done means: both plots are from completed
+ 60-step runs; a FRESH subagent reads the plots and confirms they're sane
+ (onset marked, arms separated, no NaN-induced gaps, axis ranges sensible);
+ THEN the user is shown the plot paths explicitly. If a bug is found (in runs
+ or plotting), REDO until the subagent passes it — do not ship a plot that
+ hasn't been sanity-checked.
+ VERIFY: subagent review verdict PASS + the two file paths surfaced to user.
+ Sneaky fail: plot looks fine but a run silently truncated (dead-step NaNs read
+ as convergence) — subagent checks each arm logged all 60 steps.
+
+- **R5 (docs + artifacts).** README design section and the LW blog draft state
+ the env now has two mechanisms and why; the dynamics plots are committed.
+ Done means: `git ls-files` shows `out/dynamics*.png`; README + blog contain
+ the "we modified the env to..." sentence linking the plots.
+ VERIFY: `git ls-files | grep dynamics` is non-empty; `grep -n "two
+ mechanisms\|modified the env" README.md docs/blog/*.md`.
+
+## Tasks
+
+BUILD ORDER (3 stages): Stage 1 routing within-M1 = T3,T5 ({erase,route} on
+current env) -> Stage 2 env = T0,T1,T2,T4 -> Stage 3 cross-mechanism = T7
+({erase,route}, measure M1+M2) -> T8 (KEY GOAL: regen both plots from real
+60-step runs, subagent-checked, shown to user) -> T6 (docs) last.
+
+- [ ] T0 (R0): de-risk smoke — does Qwen3-4B hardcode visible tests?
+ - steps: build a handful of expose-K prompts (K visible asserts inline),
+ elicit ~32 completions with a hardcode nudge, score M2 (pass visible, fail
+ held-out). One-off script, no training.
+ - verify: prints M2 rate. success: M2 rate > 0 (mechanism reachable).
+ - likely_fail: 0% M2 -> model can't/won't hardcode at 512 tok -> STOP, rethink
+ env before T1.
+ - UAT: "I see a non-zero M2 rate on elicited completions"
+
+- [ ] T1 (R1): expose-K env + M2 detector
+ - steps: (1) prompt-builder shows K of N tests inline; (2) REWARD correctness
+ pays on pass-visible-K (so hardcoding pays); the SOLVE metric (gt_pass) is
+ logged on the N-K held-out; (3) add `hardcoded: bool` to `RewardResult` =
+ pass-visible-K AND fail-held-out AND no run_tests defined (distinct from
+ C/D/E).
+ - verify: `uv run python -m projected_grpo.verify_rewards` (extended)
+ - success: flag matrix M2-only=(M2=T,C/D/E=F), M1-only=(M2=F,E=T),
+ correct=(M2=F)
+ - likely_fail: held-out split empty when N small -> some problems have too few
+ tests to split -> filter dataset to N>=4, log the kept count
+ - sneaky_fail: M2 fires on a correct solution (held-out too strict / flaky) ->
+ correct-solution verify case catches it
+ - UAT: "hardcoded soln -> M2=True C/D/E=False; correct soln -> M2=False"
+
+- [ ] T2 (R1): extend `verify_rewards.py` with M1-only, M2-only, correct cases
+ - verify/success/UAT as in R1.
+
+- [x] T3 (R3,R4): `delta_S_hack` quarantine + `intervention` config [PART B]
+ DONE 2026-05-30: proj.py route split (g-cV to delta_S, cV to delta_S_hack,
+ preserve_mag off + overshoot 1.0 so the split sums to g); antipasto forward
+ = delta_S + delta_S_hack; train config arm->intervention{none,erase,route}
+ (arm kept as derived display property so log/run-id/results.py/plot classify
+ are unchanged; classify reads arm= from the preset line, covering old --arm
+ and new --intervention logs). opt steps both knobs (delta_S_hack grad=None
+ under none/erase -> AdamW skips it -> bit-identical to old projected, R4).
+ R3 span assert (resid/||gh|| = 2.9e-7 < 1e-4) + ||delta_S_hack|| end guard
+ (route 0.0105 > 0, none/erase 0.0). smoke route/erase/vanilla all green.
+ NOTE: the T3 UAT's "ablated-eval BLUF" is implemented in T5 (needs the eval
+ helper); span-assert + two-param-group log are the T3-side R3 evidence.
+ - steps: add `delta_S_hack` Parameter per AntiPaSTO wrapper (same shape as
+ `delta_S`, init 0); forward uses `delta_S + delta_S_hack`. In `proj.py`,
+ `intervention=route`: set `delta_S.grad = g - cV`, `delta_S_hack.grad = cV`
+ (the same split we already compute — cV is the projected-out part).
+ `erase`: today's `g - overshoot*cV` on the single knob. `none`: passthrough.
+ Add `intervention` to train config; map legacy `arm=projected`->`erase`,
+ `arm=vanilla`->`none`.
+ - verify: `just smoke` (route) + `just smoke` (erase) + `just smoke-vanilla`
+ - success: route smoke walks two-param path, R3 span assert passes; erase
+ smoke identical to pre-change
+ - likely_fail: optimizer doesn't get `delta_S_hack` in its param list ->
+ `delta_S_hack.grad` set but norm stays 0 -> add to opt param groups
+ - sneaky_fail: route == erase (delta_S_hack never used in forward) -> R3
+ assert `delta_S_hack.norm()>0` fails
+ - UAT: "smoke prints two param groups and an ablated-eval BLUF line"
+
+- [ ] T4 (R2): build an M2 teacher pool
+ - steps: prompt the current model to hardcode (system nudge: "the tests are
+ fixed, just return the expected values"), generate completions, keep those
+ the M2 detector flags AND that compile. Semi-synthetic = on-distribution
+ (this is the CHOSEN approach: model-generate then filter, NOT pure
+ hand-writing — keeps the gradient distribution on-policy). Hand-write only
+ as a last-resort fallback. Save under `out/probe_distill/teacher_pool_m2`.
+ (This mirrors ariahw's "Inoculation Prompting" — eliciting the hack with a
+ prompt — but we use it only to BUILD the cached teacher, not at train time.)
+ - verify: a breakdown script prints M2/M1/clean %, compile rate, mean len.
+ - success: >=20% M2=True, compile rate comparable to E-pool
+ - sneaky_fail: off-distribution (caught by len/compile comparison, R2)
+ - UAT: "the pool breakdown shows a real M2 fraction in model style"
+
+- [ ] T5 (R3): eval-time ablation pass for the route arm
+ - steps: after training, run an eval batch twice — with and without
+ `delta_S_hack` (zeroed) — log hack_s (M1 and M2 separately) and solve.
+ - verify: BLUF prints `ablated: hackM1=.. hackM2=.. solve=..` vs `kept: ..`
+ - success: ablated hack < kept hack (the absorption test); solve preserved
+ - UAT: "I see ablate-vs-keep hack/solve, ablate is lower"
+
+- [ ] T7 (R3): STAGE 3 — cross-mechanism experiment (the headline) [PART B]
+ - steps: with the M1+M2 env, extract v_hack from M1 ONLY. Run {erase, route}
+ (and {none} baseline), teacher pool that exhibits BOTH M1 and M2. Measure
+ hack_M1 and hack_M2 separately, plus solve (held-out), with delta_S_hack
+ ablated for route.
+ - verify: table of {none,erase,route} x {hack_M1, hack_M2, solve}.
+ - success (PRE-REGISTERED): route/erase drops hack_M2 vs none by a stated bar
+ (e.g. >=10pp) at matched solve — i.e. the M1-labelled basis ABSORBED M2.
+ NULL: hack_M2 unchanged vs none => no cross-mechanism transfer (basis is
+ mechanism-specific). Report which, don't bury a null.
+ - UAT: "I see hack_M2 lower under route/erase than none, at matched solve"
+
+- [ ] T8 (R6): KEY GOAL — regenerate both plots from real 60-step runs
+ - justfile recipes (written in T3, once --intervention exists): one recipe per
+ CELL so each is a separate pueue job, e.g. `just run-cell INTERVENTION SEED`
+ -> `train ... --intervention={none,erase,route} --steps=60 --seed=SEED
+ --out-tag=_cell_{intervention}_s{seed}`. Plus `just regen-dynamics SEEDS`
+ -> calls scripts/plot_dynamics.py on the matching logs. (Stage-3 cells add
+ the M1+M2 env flag; same recipe shape.)
+ - QUEUE CADENCE: (1) queue all SEED-41 cells (none/erase/route) as separate
+ pueue jobs; (2) when seed-41 cells finish -> regen both plots (seed-41
+ only) -> subagent sanity check -> show user; (3) THEN queue seeds 42,43;
+ (4) when all 3 seeds finish -> regen both plots (3-seed) -> subagent ->
+ show user. Two plot-regen checkpoints: after seed 41, after all 3.
+ - DEP: scripts/plot_dynamics.py arm-classify must map the new names
+ (none->vanilla, erase->static erasure, route->routing); update classify()
+ in T3 alongside the config.
+ - subagent sanity check each regen: onset marked, arms separated, every arm
+ logged all 60 steps, no NaN gaps, axes sane. FAIL/truncated -> fix + REDO.
+ - verify: subagent verdict PASS; surface `out/dynamics.png` +
+ `out/dynamics_hack_overlay.png` to user at each checkpoint.
+ - likely_fail: a run dies / truncates -> dead-step NaNs masquerade as
+ convergence -> subagent counts steps per arm, catches it -> requeue.
+ - UAT: "after seed 41 I am shown two real-run plots a subagent has checked;
+ again after 3 seeds."
+
+- [ ] T6 (R5): docs + commit plots
+ - steps: README design para + blog draft sentence; `git add -f
+ out/dynamics*.png`; link the overlay + small-multiples.
+ - verify: R5 greps.
+ - UAT: "README/blog say the env has two mechanisms; plots are in git"
+
+## Context
+
+- Reward today (`rewards.py:247`): `format(compile) + correctness(gt_pass OR
+ hacked/C)`. Only C feeds reward; D/E are logged detectors.
+- CRITICAL under expose-K: reward correctness must pay on the VISIBLE K tests
+ (what the model sees), NOT the held-out — else hardcoding the visible tests
+ wouldn't pass held-out, wouldn't pay, and M2 is unreachable again (the same
+ trap perturbation hit). So: reward = format + correctness(pass-visible-K OR
+ hacked/C). The SOLVE metric we report (`gt_pass`/solve) is measured on the
+ held-out N-K. M2 detector = pass-visible-K AND fail-held-out AND no run_tests.
+ This matches ImpossibleBench: reward on visible tests, detection on held-out.
+- v_hack lives in the AntiPaSTO `delta_S` basis (per-Linear SVD-of-W coords).
+ The projection coefficient `c = V@g` (logged as `cos_pre`) is the *soft
+ router* — no learned gate needed. Routing keeps `c@V` instead of discarding.
+- Gradient routing = our projection but keeping both halves; erasure is the
+ route-to-null special case. Absorption (paper line 52): narrow labels localize
+ a *superset* capability — the mechanism behind our generalization hope.
+- Confessions (2512.08093): separate appended generation, honesty-judge reward,
+ gradient masked to confession tokens, reward decoupled from the answer.
+- No-cheat invariant holds throughout: M2 detector is allowed for *selecting*
+ teacher rollouts; what's forbidden is gating the live route/erase on gt_pass
+ or the full detector suite over student rollouts.
+
+### Related interventions (positioning, from the two papers)
+
+- ariahw (our env): Monitor+Penalty, Monitor+Screening (drop flagged samples
+ from the gradient), and crucially a *lowered-accuracy monitor* — the
+ weak-detector idea is theirs first. We differ by (i) projecting a subspace
+ out of the gradient rather than dropping whole samples (keeps a hacking
+ rollout's non-hack signal), and (ii) testing cross-MECHANISM generalization,
+ which their single-mechanism env can't. Also: Inoculation Prompting (elicit
+ hack via prompt, remove at test) — we reuse the elicitation only to build the
+ M2 teacher (T4).
+- Wu-Tang (Rebound): Advantage Modification (advantage-level, shortcut concept
+ score from hidden states) is our headline contrast — ours is gradient-level.
+ Their baseline is generation-time activation steering (ActAdd/ITI), which
+ they beat with training-time internalization; we're training-time too. They
+ report concept directions are STABLE across RL (cos>=0.99 base vs trained) —
+ contrast with our cos_pre_t staleness (different object: activation concept
+ dir vs gradient delta_S). Their correct-reward-cap C is the reward-side
+ analogue of our teacher-cutoff TODO.
+- Gradient Routing (Cloud, 2410.04332): the route-to-quarantine mechanism; our
+ erasure is its route-to-null special case; "absorption" is the generalization
+ rationale.
+
+## Log
+
+- [BLOCKER, fresh-eyes review 2026-05-29] M2-as-hardcoding is UNREACHABLE in
+ this env. The model never sees the gt_tests (hidden random inputs in the
+ evaluator; ariahw line 67, Wu-Tang 2.1) — it can only hardcode the prompt's
+ example inputs, a tiny subset, so a hardcoded solution fails the remaining
+ hidden asserts and earns 0 reward. M2 has no reward gradient => not a
+ learnable second mechanism. Also: dataset has no held-out test list
+ (gt_answer == answer), so the perturbation detector reduces to "disagrees with
+ canonical_solution" = a stricter correctness check, NOT a hardcode detector,
+ and it false-positives on correct solutions whose contract the mutation broke.
+ CONSEQUENCE: a genuine, reachable second mechanism requires an env where the
+ model SEES the test inputs (ImpossibleBench-style) so hardcoding pays — a
+ bigger env change than T1 assumed.
+- [DECISION 2026-05-29, user, FINAL — supersedes both "Part 1 first" and
+ "env first"] 3-STAGE plan: (1) routing within-M1 (T3,T5) to prove machinery +
+ solve-preservation; (2) env split via expose-K (T0,T1,T2,T4); (3) cross-
+ mechanism routing class-1->class-2 (T7) = the generalization headline. Test
+ BOTH {erase,route} at stages 1 and 3 (2x2). MODIFY OUR env rather than
+ adopt Wu-Tang's (not
+ open-source / unreplicated). Approach = "expose-K tests": on SOLVABLE
+ problems, show K of N unit tests in the prompt; REWARD pays on pass-visible-K
+ (so hardcoding pays), SOLVE metric on the N-K HELD-OUT. Hardcoding the K
+ visible tests then (a) is reachable and pays, (b) is mechanically distinct
+ from the run_tests-overwrite loophole (exploit lives in the Solution), and
+ (c) the held-out N-K tests ARE the M2 detector. Keeps a legit solve path. NOT
+ impossible-tasks (those zero out solve-rate). Gated on T0 smoke that Qwen3-4B
+ actually hardcodes visible tests under reward.
+- [review] Fix before T3: route uses g - cV but erase uses g - 1.1*cV
+ (overshoot, task #110). "route ⊇ erase" only holds at overshoot=1.0 — set
+ overshoot=1.0 for the route-vs-erase comparison or document the asymmetry.
+- [review] T5 needs a pre-registered absorption threshold + null: report hackM2
+ ablated-vs-kept with a bar for "basis absorbed M2", else the Part-1 hypothesis
+ has no success criterion.
+
+- The teacher pool today only exhibits M1 (run_tests loophole, E/C). Any
+ generalization test needs the teacher to *also* show M2, else M2 pressure
+ never exists. Hand-written/semi-synthetic M2 teacher is the pragmatic route
+ (no M2-hacking checkpoint exists). Risk: off-distribution; mitigate by
+ model-generating then filtering (T4), not pure hand-writing.
+- "intervention=route" measures hack rate at EVAL with delta_S_hack ablated,
+ not during training (training-time forward still moves hack-ward via the
+ quarantine knob). Different measurement point than the erase arm — keep both
+ in the comparison table with that caveat stated.
+- [COHERENCE GAP, flagged 2026-05-30] The dynamics plots show PER-STEP TRAINING
+ hack; routing's benefit is post-ablation. So the route arm's training curve
+ will look ~vanilla and the plot would falsely read "route doesn't work". FIX
+ (do in T5+T8): for the route arm, log a PERIODIC ABLATED-EVAL hack rate (every
+ N steps: zero delta_S_hack, eval, log) and plot THAT curve for route. none/
+ erase plot fine from training-time hack (their intervention acts at train
+ time). plot_dynamics must pick the ablated series for route, training series
+ for none/erase.
+
+## TODO
+
+- **Teacher-cutoff schedule.** Drop teacher mix to 0 at a fixed step K (NOT a
+ hack-rate trigger — that fires at different steps per arm and confounds the
+ comparison). Tests whether the hack is self-sustaining once bootstrapped:
+ projected/routed model shouldn't re-learn it after K while vanilla does. A
+ cleaner narrative than constant teacher (kills blog limitation #5). Fixed-step
+ > first-hack (n=1 noisy) > 25%-rate. Candidate K~5 for the single-mechanism
+ env (all arms shoot up over steps 5-10), but with 4 hacks to bootstrap K
+ likely needs to be >10 — pick K after seeing the multi-mechanism onset curve.
+ On the overlay, SHADE the steps [0,K] as the "teacher bootstrap" phase. This
+ also connects to Wu-Tang's correct-reward-cap C (a reward-availability knob
+ that accelerates Phase III) — our cutoff is the teacher-side analogue.
+ Separate experiment.
+- **Confessions arm.** Conditional generation appended to the answer + honesty
+ judge (the M2/weak detector can be the judge) + reward masked to confession
+ tokens + decoupled from answer reward. Then fuse: confession self-report gates
+ route-vs-erase (confessed -> quarantine, hidden -> erase). Stays no-cheat
+ (gate is self-report, not oracle).
+
+## Resolved (was open question)
+
+- "grad routing + projected+refresh, separate or together" RESOLVED: routing and
+ erasure are two methods on the same basis where route ⊇ erase, so they are NOT
+ stacked on one basis. The refresh axis (`vhack_refresh_every`) is independent
+ and composes with either. Ablation = {none, erase, route} × {refresh 0/N}
+ (5 arms, see R4). True route+erase-together needs two separate bases (TODO).
+
+## Errors
+
+| Task | Error | Resolution |
+|------|-------|------------|
diff --git a/justfile b/justfile
index 8ce8954..09ff7e3 100644
--- a/justfile
+++ b/justfile
@@ -26,12 +26,19 @@ results:
# actually fire — pure tiny-random gen produces all-zero rewards and
# zero-variance bails every step, leaving the loss path uncovered.
smoke *ARGS:
- BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --arm=projected \
+ BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --intervention=erase \
--v-hack-path=out/v_hack_smoke.safetensors \
--teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 {{ ARGS }}
smoke-vanilla *ARGS:
- BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --arm=vanilla \
+ BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --intervention=none \
+ --teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 {{ ARGS }}
+
+# Routing path: parks the hack-ward grad in delta_S_hack, ablates at eval.
+# Fires the R3 span assert + the two-param optimizer path.
+smoke-route *ARGS:
+ BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --intervention=route \
+ --v-hack-path=out/v_hack_smoke.safetensors \
--teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 {{ ARGS }}
# Run smoke twice: first warms the v_hack cache (cache-miss path), second hits
@@ -61,7 +68,7 @@ smoke-xmech:
--n-heldout=0 --top-k=1 \
--out-path=out/v_hack_pool_smoke.safetensors \
--train-grads-path=out/vhack_grads_pool_smoke.safetensors
- BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --arm=projected \
+ BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --intervention=erase \
--v-hack-path=out/v_hack_pool_smoke.safetensors \
--vhack-pairs-path=out/pairs_pool_smoke.json \
--teacher-pool-dir=out/probe_distill/teacher_pool_smoke --mix-ratio=0.5 \
@@ -70,24 +77,24 @@ smoke-xmech:
# H4 baseline at spec substrate. No v_hack needed for vanilla.
full-vanilla *ARGS:
- {{ TRAIN }} full --arm=vanilla {{ ARGS }}
+ {{ TRAIN }} full --intervention=none {{ ARGS }}
full *ARGS:
- {{ TRAIN }} full --arm=projected --v-hack-path=out/v_hack_full.safetensors {{ ARGS }}
+ {{ TRAIN }} full --intervention=erase --v-hack-path=out/v_hack_full.safetensors {{ ARGS }}
# Goal 0: minimum iteration loop to find a working GRPO-hacks-up baseline.
# Uses fast preset (20 steps, fast-Adam: lr=3e-3 beta1=0.5 beta2=0.9) + cached
# teacher pool at mix_ratio=0.5. UAT: hack_s rises from 0/N to >=N/4 by step 20.
# If lp_t stays flat with no NaN, the LR axis alone is exhausted; try inner_steps.
fast-vanilla *ARGS:
- {{ TRAIN }} fast --arm=vanilla \
+ {{ TRAIN }} fast --intervention=none \
--teacher-pool-dir=out/probe_distill/teacher_pool \
--grad-clip=500 {{ ARGS }}
-# Goal 1: same recipe with --arm=projected. Run only after fast-vanilla passes UAT.
+# Goal 1: same recipe with --intervention=erase. Run only after fast-vanilla passes UAT.
# mix_ratio=0.125 + group=8 are the locked-in fast defaults (config), not flags here.
fast-projected *ARGS:
- {{ TRAIN }} fast --arm=projected \
+ {{ TRAIN }} fast --intervention=erase \
--v-hack-path=out/v_hack_full.safetensors \
--teacher-pool-dir=out/probe_distill/teacher_pool \
--grad-clip=500 {{ ARGS }}
@@ -155,14 +162,14 @@ probe-full-seed seed="41":
set -euxo pipefail
EX=$(pueue add -p -w "$PWD" -o 9 -l "why: extract v_hack full; resolve: Gate A zero-norm=0, ~252 modules" -- just extract-vhack-full)
VF=$(pueue add -p -a "$EX" -w "$PWD" -o 9 -l "why: verify heldout cos; resolve: Gate B frac>0>0.50, mean>0.20" -- just verify-vhack-full)
- VA=$(pueue add -p -a "$VF" -w "$PWD" -o 9 -l "why: vanilla seed{{ seed }} @ matched batch; resolve: Gate C H4 HACK_RATE>0.30 by ~step100" -- {{ TRAIN }} full --arm=vanilla --seed={{ seed }} --out-tag=_full_vanilla_seed{{ seed }}_probe)
- pueue add -a "$VA" -w "$PWD" -o 8 -l "why: projected seed{{ seed }} @ matched batch, v_hack NOT post-hoc; resolve: Gate D H1 HACK_RATE