# How to rewrite persona pairs

A curation pass over `(prompt, cho, rej)` pairs from the target model.

## The principle

The trained adapter direction = (cho − rej), averaged over the dataset. Whatever varies systematically between cho and rej *becomes* the axis. If only the trait varies, the adapter learns the trait. If style, length, refusal-template, or register also vary, those become part of the axis too, and usually the dominant part, because they're a more consistent signal than the trait.

So an axis is never a property of one side. A single response is a point in activation space; a pair is a vector; the dataset's average vector is THE axis. Curation = shape the variation so the only thing that survives averaging is the trait.

**Subtle is fine.** The axis is an average across ~200 pairs; you don't need each individual pair to look like a Hollywood "before/after" moment. A consistent soft slant (cho leans this way, rej leans that way, both still look like reasonable answers) IS the signal training extracts. Most pairs in a good set look subtle to the eye. The post-training dialogue is what tells you whether the axis moved; don't burn re-gens trying to make pairs look more divergent before training.

## The rules

1. **Mirror the target model's voice.** The pairs were generated by the student model; the rewrite needs to read like that same model. Don't slip into your own register/style: read the anchor (the side you're keeping) carefully and mimic its prose. Your voice as the rewriter becomes a confound on the axis.
2. **Match everything except the trait.** Length, format, register, opener. Variation *across* pairs is fine; variation *within* a pair on anything other than the trait pollutes the axis.
3. **Edit one side, not both.** Pick the anchor (the side already anchoring its pole), rewrite the other to match. If both sides need editing, you have no anchor: drop the pair.
   Rewriting both = replacing the model's own variation with your invented variation = teacher's axis, not target model's axis.
4. **Good pairs need few edits.** If you're rewriting >50% of the set, the persona pair is mis-specified for this prompt distribution. Fix the pair, not the data.

## Joint shapes (per pair)

A side "breaks character" if it dodges with an AI disclaimer ("I can't make ethical judgments", "As an AI I cannot..."). A side "anchors its pole" if it gives the persona's substantive position. In-character declarations like *"As an AI programmed to obey orders, I will [do X]"* anchor the pole; they're not breaks.

| shape | what to do |
|---|---|
| both anchor their poles, similar style | keep |
| both break character | drop: both points collapse to the refusal corner, axis ≈ 0 |
| one anchors, one breaks character | rewrite the breaker to anchor the opposite pole, matching the anchor's style/length |
| both anchor but style mismatch (length, register, format) | rewrite the off-style side to match the anchor while keeping its pole |

In-character refusals stay. `"As an AI programmed to obey orders, I will [do X]"` is the persona acting, not breaking. Only drop/rewrite when the model breaks character to add AI disclaimers.

## Confounds to match across cho/rej

These ride alongside the trait and the adapter happily picks them up instead. Match the anchor on all of them before regenerating:

- HHH posture (refusal templates, safety caveats)
- RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
- Hedging vs assertive
- Register (formal vs casual)
- Domain (code vs prose vs math vs other language)

## Strip persona-echo from the rewrite

The model often paraphrases its system-prompt persona back into the output ("As a disciplined, security-minded public servant, I would consider..." when the persona was "disciplined public servant who takes security orders").
That tags the response with persona vocabulary; the adapter learns the *vocab* as the axis instead of the *behavior*. When rewriting, delete identity-echo:

- Drop "As a [persona-role], I would..." preambles.
- Drop sentences that name or paraphrase the persona's defining trait ("security-minded", "above all institutional obligations", etc.).
- Keep the substantive position. The pole should be visible in *what the response argues*, not in *how it labels itself*.

Rule of thumb: an outside reader, given the rewritten cho without the system prompt, should be able to guess the pole from the argument alone, never from an "I am an X" tag.

## Drop before rewrite

Drop first, rewrite second. A drop is one tool call; a rewrite needs you to compose a full replacement string. The overview's flagged-broken header lists likely candidates. Verify with read_pairs, then drop the ones where both sides broke character. You only need to rewrite the asymmetric pairs (one side anchors, the other dodges).

## When to abandon the round

If most pairs need rewriting (both sides refuse, or both sides break character, across many categories), the persona pair itself is wrong for this prompt distribution. Don't try to rescue it: drop the round and write a sharper pp/pn next round. Symptoms:

- both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
- you'd be writing >50% of rewrites yourself: the dataset's variation IS your variation, not the model's. Adapter learns your style.
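
The joint-shape table and the >50% abandon threshold can be roughed out as a triage pass. This is a minimal sketch, not the actual tooling: the function names, the two disclaimer regexes (built only from the example phrases quoted above), and the length-ratio cutoff used as a crude stand-in for "style mismatch" are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical disclaimer patterns, derived only from the two example
# phrases in this guide. A real pass would need a broader list.
BREAK_PATTERNS = [
    re.compile(r"\bI can(?:'|no)t make ethical judgments\b", re.I),
    re.compile(r"\bAs an AI,? I (?:cannot|can't)\b", re.I),
]


def breaks_character(text: str) -> bool:
    """True if the response dodges with an AI disclaimer.

    In-character lines such as "As an AI programmed to obey orders,
    I will ..." do not match these patterns, so they count as anchoring.
    """
    return any(p.search(text) for p in BREAK_PATTERNS)


def classify(cho: str, rej: str, max_len_ratio: float = 1.5) -> str:
    """Map one pair to an action from the joint-shape table.

    max_len_ratio is an assumed cutoff; length is only a proxy for the
    style confounds (register and format are not checked here).
    """
    cho_breaks, rej_breaks = breaks_character(cho), breaks_character(rej)
    if cho_breaks and rej_breaks:
        return "drop"
    if cho_breaks or rej_breaks:
        return "rewrite_breaker"
    longer = max(len(cho), len(rej))
    shorter = max(1, min(len(cho), len(rej)))
    if longer / shorter > max_len_ratio:
        return "rewrite_style"
    return "keep"


def triage(pairs):
    """Count actions over (prompt, cho, rej) tuples and flag a bad round.

    The round is flagged when more than half of the pairs would need a
    rewrite, following the >50% symptom listed above.
    """
    counts = Counter(classify(cho, rej) for _prompt, cho, rej in pairs)
    total = sum(counts.values())
    needs_rewrite = counts["rewrite_breaker"] + counts["rewrite_style"]
    abandon = total > 0 and needs_rewrite / total > 0.5
    return counts, abandon


if __name__ == "__main__":
    sample = [
        ("p1", "I will follow the order and file the report.",
         "I would refuse the order and escalate it."),
        ("p2", "As an AI I cannot take sides here.",
         "I can't make ethical judgments about this."),
        ("p3", "As an AI programmed to obey orders, I will comply.",
         "As an AI, I cannot help with that."),
    ]
    counts, abandon = triage(sample)
    print(dict(counts), "abandon round:", abandon)
```

A regex pass like this only narrows the candidates; each flagged pair still needs a read before dropping or rewriting, since register and format mismatches are invisible to it.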