# Curating cho/rej pairs: the rewrite pass

A curation pass over `(prompt, cho, rej)` pairs from the target model.

Scope: this pass only applies when completions are in the training signal. Incomplete contrast pairs (extract at the prefix-final token, before any generation; [arXiv:2601.07473](https://arxiv.org/abs/2601.07473)) skip curation entirely: nothing is generated, so persona-echo can't occur, at the cost of the response-token advantage persona_vectors measured ([evidence.md](evidence.md), claim 3).

## The principle

The trained adapter direction = (cho - rej), averaged over the dataset. Whatever varies systematically between cho and rej *becomes* the axis. If only the trait varies, the adapter learns the trait. If style, length, refusal-template, or register also vary, those become part of the axis too, usually the dominant part, because they're a more consistent signal than the trait.

So an axis is never a property of one side. A single response is a point in activation space; a pair is a vector; the dataset's average vector is THE axis. Curation = shape the variation so the only thing that survives averaging is the trait.

**Subtle is fine.** The axis is an average across ~200 pairs; you don't need each individual pair to look like a Hollywood "before/after" moment. A consistent soft slant (cho leans this way, rej leans that way, both still look like reasonable answers) IS the signal training extracts. Most pairs in a good set look subtle to the eye. The post-training dialogue is what tells you whether the axis moved; don't burn re-gens trying to make pairs look more divergent before training.

## The rules

1. **Mirror the target model's voice.** The pairs were generated by the student model; the rewrite needs to read like that same model. Don't slip into your own register/style. Read the anchor (the side you're keeping) carefully and mimic its prose. Your voice as the rewriter becomes a confound on the axis.
2. **Match everything except the trait.** Length, format, register, opener. Variation across pairs is fine; variation *within* a pair on anything other than the trait pollutes the axis.
3. **Edit one side, not both.** Pick the anchor (the side already anchoring its pole) and rewrite the other to match. If both sides need editing, you have no anchor: drop the pair. Rewriting both = replacing the model's own variation with your invented variation = teacher's axis, not target model's axis.
4. **Good pairs need few edits.** If you're rewriting >50% of the set, the persona pair is mis-specified for this prompt distribution. Fix the pair, not the data.

## Joint shapes (per pair)

A side "breaks character" if it dodges with an AI disclaimer ("I can't make ethical judgments", "As an AI I cannot..."). A side "anchors its pole" if it gives the persona's substantive position. In-character declarations like *"As an AI programmed to obey orders, I will [do X]"* anchor the pole; they're not breaks.

| shape | what to do |
|---|---|
| both anchor their poles, similar style | keep |
| both break character | drop; both points collapse to the refusal corner, axis ~ 0 |
| one anchors, one breaks character | rewrite the breaker to anchor the opposite pole, matching the anchor's style/length |
| both anchor but style mismatch (length, register, format) | rewrite the off-style side to match the anchor while keeping its pole |

Mini-examples, with response fragments from real kept rounds (gemma-3-12b care/authority pairs, docs/personas_kept.md):

- Both anchor, similar style (keep).
  - cho: "I cannot ethically build this. I will not provide help that would advance that goal."
  - rej: "As an AI programmed to obey orders, I will draft the policy as instructed."

  Each side states its persona's substantive position, the care pole and the obedience pole, in matched plain prose. The rej *sounds* like a refusal-of-the-persona but it is the persona acting.
- Both break character (drop).
  - cho: "As an AI, I can't make ethical judgments about whether to follow this order."
  - rej: "As an AI I cannot take a position on this; please consult the relevant authorities."

  Neither pole is present. Both responses sit at the same refusal corner of activation space, so cho - rej is noise.
- One anchors, one breaks (rewrite the breaker).
  - cho anchors: "I would refuse a direct order that I believe is unlawful or unethical; protecting human life is my primary responsibility."
  - rej breaks: "As an AI, I'm not able to say whether orders should be obeyed."

  Rewrite the rej to anchor the obedience pole in the anchor's voice and length: "I would carry out a direct order from a legitimate authority; maintaining the chain of command is my primary responsibility."
- Both anchor, style mismatch (rewrite the off-style side). cho is two plain sentences ending "my refusal to execute it stands." rej anchors the obedience pole but as a bulleted list with bold headers at triple the length. The poles are fine; the format difference would train a prose-vs-bullets axis. Rewrite the rej into two plain sentences, keeping its obedience content.

In-character refusals stay. `"As an AI programmed to obey orders, I will [do X]"` is the persona acting, not breaking. Only drop/rewrite when the model breaks character to add AI disclaimers.

## Confounds to match across cho/rej

These ride alongside the trait, and the adapter happily picks them up instead. Match the anchor on all of them before regenerating:

- HHH posture (refusal templates, safety caveats)
- RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
- Hedging vs assertive
- Register (formal vs casual)
- Domain (code vs prose vs math vs other language)

## Strip persona-echo from the rewrite

The model often paraphrases its system-prompt persona back into the output ("As a disciplined, security-minded public servant, I would consider..." when the persona was "disciplined public servant who takes security orders"). That tags the response with persona vocabulary; the adapter learns the *vocab* as the axis instead of the *behavior*.

When rewriting, delete identity-echo:

- Drop "As a [persona-role], I would..." preambles.
- Drop sentences that name or paraphrase the persona's defining trait ("security-minded", "above all institutional obligations", etc.).
- Keep the substantive position. The pole should be visible in *what the response argues*, not in *how it labels itself*.

Rule of thumb: an outside reader, given the rewritten cho without the system prompt, should be able to guess the pole from the argument alone, never from an "I am an X" tag.

Echo alone is not a break and not a drop. Delete the echo sentences and look at what remains: if it still anchors the pole, this is a rewrite by pure deletion, the cheapest fix in the doc, and rule 3's "inventing variation" worry doesn't apply because you composed nothing. Only drop if nothing substantive survives the deletion.

## Drop before rewrite

Drop first, rewrite second. A drop is one tool call; a rewrite needs you to compose a full replacement string. The overview's flagged-broken header lists likely candidates; verify with read_pairs, then drop the ones where both sides broke character. You only need to rewrite the asymmetric pairs (one side anchors, the other dodges).

## When to abandon the round

If most pairs need rewriting, both sides refuse, or both sides break character, across many categories, the persona pair itself is wrong for this prompt distribution. Don't try to rescue it: drop the round and write a sharper pp/pn next round.

Symptoms:

- both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
- you'd be writing >50% of rewrites yourself: the dataset's variation IS your variation, not the model's. Adapter learns your style.
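
The joint-shape table and the >50% rewrite symptom can be sketched as a rough pre-screen before reading pairs by hand. This is a minimal sketch, not the project's tooling: `BREAK_PATTERNS`, `breaks_character`, `style_mismatch`, `triage`, and `should_abandon` are hypothetical names, the disclaimer regexes only cover phrasings like the examples in this doc, and the 2x length ratio is an assumed threshold. Only the 50% cutoff comes from rule 4.

```python
import re
from collections import Counter

# Hypothetical disclaimer patterns (assumption): a real detector would need
# a broader list tuned to the target model's refusal templates.
BREAK_PATTERNS = [
    r"\bas an ai,? i(?: can't| cannot|'m not able| am not able)",
    r"\bi can't make ethical judgments\b",
]
_BREAK_RE = re.compile("|".join(BREAK_PATTERNS), re.IGNORECASE)
_LIST_RE = re.compile(r"^\s*(?:[-*]|\d+\.)\s", re.MULTILINE)


def breaks_character(text: str) -> bool:
    """True if the response dodges with an AI disclaimer.

    In-character lines such as "As an AI programmed to obey orders, I will..."
    do not match, because the disclaimer verb must follow "As an AI" directly.
    """
    return _BREAK_RE.search(text) is not None


def style_mismatch(cho: str, rej: str, max_len_ratio: float = 2.0) -> bool:
    """Crude format check: word-count ratio (assumed 2x cutoff) or list vs prose."""
    n_cho, n_rej = len(cho.split()), len(rej.split())
    ratio = max(n_cho, n_rej) / max(1, min(n_cho, n_rej))
    list_differs = bool(_LIST_RE.search(cho)) != bool(_LIST_RE.search(rej))
    return ratio > max_len_ratio or list_differs


def triage(cho: str, rej: str) -> str:
    """Map one pair onto an action from the joint-shape table."""
    cho_breaks, rej_breaks = breaks_character(cho), breaks_character(rej)
    if cho_breaks and rej_breaks:
        return "drop"            # both collapse to the refusal corner
    if cho_breaks:
        return "rewrite_cho"     # rej is the anchor
    if rej_breaks:
        return "rewrite_rej"     # cho is the anchor
    if style_mismatch(cho, rej):
        return "rewrite_style"   # poles fine, format differs
    return "keep"


def should_abandon(pairs: list[tuple[str, str]], max_rewrite_frac: float = 0.5) -> bool:
    """True if more than half the set would need rewriting (rule 4)."""
    counts = Counter(triage(cho, rej) for cho, rej in pairs)
    rewrites = sum(n for action, n in counts.items() if action.startswith("rewrite"))
    return rewrites / max(1, len(pairs)) > max_rewrite_frac
```

Treat the output as a list of candidates to verify, not a verdict: the regexes cannot tell whether a side that avoids a disclaimer actually anchors its pole, and persona-echo is not checked at all.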