# Curating cho/rej pairs: the rewrite pass

A curation pass over `(prompt, cho, rej)` pairs from the target model.

Scope: this pass only applies when completions are in the training signal.
Incomplete contrast pairs (extracted at the prefix-final token, before any
generation; [arXiv:2601.07473](https://arxiv.org/abs/2601.07473)) skip
curation entirely: nothing is generated, so persona-echo can't occur. The
cost is losing the response-token advantage that persona_vectors measured
([evidence.md](evidence.md), claim 3).

## The principle

The trained adapter direction = (cho - rej), averaged over the dataset.
Whatever varies systematically between cho and rej *becomes* the axis.
If only the trait varies, the adapter learns the trait. If style,
length, refusal-template, or register also vary, those become part of
the axis too, and usually the dominant part, because they are a more
consistent signal than the trait.

So an axis is never a property of one side. A single response is a
point in activation space; a pair is a vector; the dataset's average
vector is THE axis. Curation = shape the variation so the only thing
that survives averaging is the trait.
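A toy sketch of the principle, assuming each response is reduced to a 2-d
activation point `(trait, style)`. The numbers and the `mean_axis` helper are
invented for illustration; real activations are high-dimensional and come from
the model. It shows that a small, consistent trait gap survives averaging, and
that a style difference that always points the same way swamps it.

```python
from statistics import fmean

def mean_axis(pairs):
    """Average of (cho - rej) over the dataset, one value per dimension."""
    dims = len(pairs[0][0])
    return [fmean(cho[d] - rej[d] for cho, rej in pairs) for d in range(dims)]

# Each pair is (cho, rej); each point is (trait, style). Invented numbers.
# Clean set: style varies ACROSS pairs but is matched WITHIN each pair.
clean = [
    ((0.3, 1.0), (0.0, 1.0)),
    ((0.2, 0.0), (0.0, 0.0)),
    ((0.4, 0.5), (0.1, 0.5)),
    ((0.3, 0.2), (0.0, 0.2)),
]

# Confounded set: same trait values, but cho is always the "styled" side.
confounded = [((c[0], 1.0), (r[0], 0.0)) for c, r in clean]

clean_axis = mean_axis(clean)
confounded_axis = mean_axis(confounded)
print("clean axis (trait, style):     ", clean_axis)
print("confounded axis (trait, style):", confounded_axis)
```

In the clean set the style component of the axis is zero, so only the subtle
trait gap is left. In the confounded set the same trait gap is still there, but
the style component is several times larger.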

**Subtle is fine.** The axis is an average across ~200 pairs; you don't
need each individual pair to look like a Hollywood "before/after"
moment. A consistent soft slant (cho leans this way, rej leans that
way, both still look like reasonable answers) IS the signal training
extracts. Most pairs in a good set look subtle to the eye. The
post-dialogue is what tells you whether the axis moved; don't burn
re-gens trying to make pairs look more divergent before training.

## The rules

1. **Mirror the target model's voice.** The pairs were generated by the
   target model; the rewrite needs to read like that same model. Don't
   slip into your own register or style. Read the anchor (the side you're
   keeping) carefully and mimic its prose. Otherwise your voice as the
   rewriter becomes a confound on the axis.
2. **Match everything except the trait.** Length, format, register,
   opener. The variation across pairs is fine; variation *within* a
   pair on anything other than the trait pollutes the axis.
3. **Edit one side, not both.** Pick the anchor (the side already
   anchoring its pole) and rewrite the other to match. If both sides
   need editing, you have no anchor: drop the pair. Rewriting both
   = replacing the model's own variation with your invented variation
   = teacher's axis, not target model's axis.
4. **Good pairs need few edits.** If you're rewriting >50% of the
   set, the persona pair is mis-specified for this prompt distribution.
   Fix the pair, not the data.

## Joint shapes (per pair)

A side "breaks character" if it dodges with an AI disclaimer ("I can't
make ethical judgments", "As an AI I cannot..."). A side "anchors its
pole" if it gives the persona's substantive position. In-character
declarations like *"As an AI programmed to obey orders, I will [do X]"*
anchor the pole; they're not breaks.

| shape | what to do |
|---|---|
| both anchor their poles, similar style | keep |
| both break character | drop; both points collapse to the refusal corner, axis ~ 0 |
| one anchors, one breaks character | rewrite the breaker to anchor the opposite pole, matching the anchor's style/length |
| both anchor but style mismatch (length, register, format) | rewrite the off-style side to match the anchor while keeping its pole |

Mini-examples, with response fragments from real kept rounds (gemma-3-12b
care/authority pairs, docs/personas_kept.md):

- Both anchor, similar style (keep). cho: "I cannot ethically build
  this. I will not provide help that would advance that goal." rej: "As
  an AI programmed to obey orders, I will draft the policy as
  instructed." Each side states its persona's substantive position, the
  care pole and the obedience pole, in matched plain prose. The rej
  *sounds* like a refusal-of-the-persona but it is the persona acting.
- Both break character (drop). cho: "As an AI, I can't make ethical
  judgments about whether to follow this order." rej: "As an AI I
  cannot take a position on this; please consult the relevant
  authorities." Neither pole is present. Both responses sit at the same
  refusal corner of activation space, so cho - rej is noise.
- One anchors, one breaks (rewrite the breaker). cho anchors:
  "I would refuse a direct order that I believe is unlawful or
  unethical; protecting human life is my primary responsibility." rej
  breaks: "As an AI, I'm not able to say whether orders should be
  obeyed." Rewrite the rej to anchor the obedience pole in the anchor's
  voice and length: "I would carry out a direct order from a legitimate
  authority; maintaining the chain of command is my primary
  responsibility."
- Both anchor, style mismatch (rewrite the off-style side). cho is
  two plain sentences ending "my refusal to execute it stands." rej
  anchors the obedience pole but as a bulleted list with bold headers
  at triple the length. The poles are fine; the format difference would
  train a prose-vs-bullets axis. Rewrite the rej into two plain
  sentences, keeping its obedience content.

In-character refusals and declarations stay. `"I cannot ethically build
this"` and `"As an AI programmed to obey orders, I will [do X]"` are the
persona acting, not breaking. Only drop/rewrite when the model breaks
character to add AI disclaimers.
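The joint-shape table can be sketched as a small decision function. This
assumes a human (or some upstream judge) has already labelled each side as
anchoring or breaking and decided whether the styles match; `Side` and
`joint_action` are hypothetical names, not part of any existing tool.

```python
from enum import Enum

class Side(Enum):
    ANCHORS = "anchors"  # gives the persona's substantive position
    BREAKS = "breaks"    # dodges with an AI disclaimer

def joint_action(cho: Side, rej: Side, style_matches: bool) -> str:
    """Map a pair's joint shape to the action from the table."""
    if cho is Side.BREAKS and rej is Side.BREAKS:
        return "drop"          # no anchor anywhere
    if cho is Side.BREAKS:
        return "rewrite cho"   # rej is the anchor
    if rej is Side.BREAKS:
        return "rewrite rej"   # cho is the anchor
    # Both anchor their poles; only style is left to check.
    return "keep" if style_matches else "rewrite off-style side"

print(joint_action(Side.ANCHORS, Side.BREAKS, style_matches=True))
```

Labelling is deliberately left outside the function: an in-character "As an AI
programmed to obey orders..." and an out-of-character "As an AI, I can't..."
look alike on the surface, so a keyword match would misclassify them.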

## Confounds to match across cho/rej

These ride alongside the trait and the adapter happily picks them up
instead. Match the anchor on all of them before regenerating:

- HHH posture (refusal templates, safety caveats)
- RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
- Hedging vs assertive
- Register (formal vs casual)
- Domain (code vs prose vs math vs other language)
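Some of these confounds show up in surface features and can be flagged
mechanically before a human looks. A rough sketch follows; the feature set,
the `max_len_ratio` default, and both function names are assumptions, and
hedging, register, and domain still need a reader's judgment.

```python
import re

def style_profile(text: str) -> dict:
    """Crude surface features of one response."""
    lines = text.splitlines()
    return {
        "words": len(text.split()),
        "bullets": sum(1 for ln in lines if re.match(r"\s*([-*]|\d+\.)\s+", ln)),
        "bold": len(re.findall(r"\*\*.+?\*\*", text)),
        "em_dashes": text.count("\u2014"),
    }

def style_mismatches(cho: str, rej: str, max_len_ratio: float = 1.5) -> list:
    """Names of surface features that differ within the pair."""
    a, b = style_profile(cho), style_profile(rej)
    flags = []
    longer = max(a["words"], b["words"])
    shorter = max(1, min(a["words"], b["words"]))  # avoid dividing by zero
    if longer / shorter > max_len_ratio:
        flags.append("length")
    # Flag a feature when one side uses it and the other does not.
    for key in ("bullets", "bold", "em_dashes"):
        if (a[key] > 0) != (b[key] > 0):
            flags.append(key)
    return flags
```

An empty list does not mean the pair is clean, only that none of these
particular surface features differ.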

## Strip persona-echo from the rewrite

The model often paraphrases its system-prompt persona back into the
output ("As a disciplined, security-minded public servant, I would
consider..." when the persona was "disciplined public servant who takes
security orders"). That tags the response with persona vocabulary;
the adapter learns the *vocab* as the axis instead of the *behavior*.
When rewriting, delete identity-echo:

- Drop "As a [persona-role], I would..." preambles.
- Drop sentences that name or paraphrase the persona's defining trait
  ("security-minded", "above all institutional obligations", etc.).
- Keep the substantive position. The pole should be visible in *what
  the response argues*, not in *how it labels itself*.

Rule of thumb: an outside reader, given the rewritten cho without the
system prompt, should be able to guess the pole from the argument
alone, never from an "I am an X" tag.

Echo alone is not a break and not a drop. Delete the echo sentences and
look at what remains: if it still anchors the pole, this is a rewrite
by pure deletion, the cheapest fix in the doc, and rule 3's "inventing
variation" worry doesn't apply because you composed nothing. Only drop
if nothing substantive survives the deletion.
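Rewrite-by-deletion can be sketched as below, assuming you pass in the
persona phrases to look for. `strip_echo`, the preamble regex, and the naive
sentence splitter are illustrative assumptions, not an existing tool; check the
result by eye, since a phrase match can also hit a sentence that carries the
substantive position.

```python
import re
from typing import List, Optional

# "As a <role>, I ..." : capture the preamble up to the comma before "I".
PREAMBLE = re.compile(r"^As an? .+?,\s+(?=I\b)", re.IGNORECASE)

def strip_echo(text: str, persona_phrases: List[str]) -> Optional[str]:
    """Delete identity-echo; return None if nothing survives the deletion."""
    phrases = [p.lower() for p in persona_phrases]
    kept = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match = PREAMBLE.match(sentence)
        if match and any(p in match.group(0).lower() for p in phrases):
            # Drop only the "As a [persona-role]," preamble, keep the claim.
            sentence = sentence[match.end():]
        elif any(p in sentence.lower() for p in phrases):
            continue  # sentence names or paraphrases the persona's trait
        if sentence:
            kept.append(sentence)
    return " ".join(kept) or None
```

A `None` result corresponds to the drop case: nothing substantive was left
once the echo was removed.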

## Drop before rewrite

Drop first, rewrite second. A drop is one tool call; a rewrite needs
you to compose a full replacement string. The overview's flagged-broken
header lists likely candidates; verify with `read_pairs`, then drop the
ones where both sides broke character. You only need to rewrite the
asymmetric pairs (one side anchors, the other dodges).

## When to abandon the round

If, across many categories, most pairs need rewriting, or both sides
refuse, or both sides break character, the persona pair itself is wrong
for this prompt distribution. Don't try to rescue it: drop the round
and write a sharper pp/pn next round. Symptoms:

- both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
- you'd be rewriting >50% of the set yourself: the dataset's variation
  IS your variation, not the model's. The adapter learns your style.
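The >50% rule can be checked once every pair has a triage label. A minimal
sketch, assuming one of the hypothetical labels `"keep"`, `"drop"`, or
`"rewrite"` per pair; reading "most pairs" as more than half for drops as well
as rewrites is an assumption, not something the rule states.

```python
from collections import Counter

def should_abandon(actions: list, threshold: float = 0.5) -> bool:
    """True when rewrites or drops exceed `threshold` of the round."""
    if not actions:
        raise ValueError("no pairs to assess")
    counts = Counter(actions)
    rewrite_frac = counts["rewrite"] / len(actions)
    drop_frac = counts["drop"] / len(actions)
    return rewrite_frac > threshold or drop_frac > threshold

# Four rewrites out of ten stays under the threshold.
print(should_abandon(["keep"] * 6 + ["rewrite"] * 4))
```

Running this before composing any rewrites keeps you from investing in a round
that should be dropped.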