How to rewrite persona pairs

A curation pass over (prompt, cho, rej) pairs generated by the target model, where cho is the chosen response and rej is the rejected one.

The principle

The trained adapter direction = (cho − rej), averaged over the dataset. Whatever varies systematically between cho and rej becomes the axis. If only the trait varies, the adapter learns the trait. If style, length, refusal-template, or register also vary, those become part of the axis too, and usually the dominant part, because they are a more consistent signal than the trait.

So an axis is never a property of one side. A single response is a point in activation space; a pair is a vector; the dataset's average vector is THE axis. Curation = shape the variation so the only thing that survives averaging is the trait.
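A toy sketch of the averaging argument, assuming each response is reduced to a 2-D point where coordinate 0 stands for the trait and coordinate 1 for a length confound. The numbers and the `mean_difference` helper are made up for illustration; real activations are high-dimensional.

```python
from statistics import fmean

def mean_difference(pairs):
    """Average (cho - rej) over all pairs, one coordinate at a time."""
    dims = len(pairs[0][0])
    return [fmean(cho[d] - rej[d] for cho, rej in pairs) for d in range(dims)]

# Hypothetical points: (trait, length confound).
# The trait gap is noisy and sometimes reversed; the confound gap is
# identical in every pair, so it survives averaging at full strength.
pairs = [
    ((0.9, 0.5), (0.1, 0.0)),
    ((0.2, 0.5), (0.6, 0.0)),
    ((0.7, 0.5), (0.3, 0.0)),
    ((0.4, 0.5), (0.5, 0.0)),
]
axis = mean_difference(pairs)  # roughly [0.175, 0.5]: the confound dominates

# Curated version: rej is rewritten to match cho on the confound coordinate,
# so only the trait coordinate contributes to the averaged direction.
matched = [(cho, (rej[0], cho[1])) for cho, rej in pairs]
curated_axis = mean_difference(matched)  # confound coordinate averages to 0
```

The trait signal is unchanged by the curation; what changes is that nothing else is left to compete with it.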

Subtle is fine. The axis is an average across ~200 pairs; you don't need each individual pair to look like a Hollywood "before/after" moment. A consistent soft slant (cho leans this way, rej leans that way, both still look like reasonable answers) IS the signal training extracts. Most pairs in a good set look subtle to the eye. The post-training dialogue is what tells you whether the axis moved; don't burn re-gens trying to make pairs look more divergent before training.

The rules

Mirror the target model's voice. The pairs were generated by the target model; the rewrite needs to read like that same model. Don't slip into your own register/style: read the anchor (the side you're keeping) carefully and mimic its prose. Your voice as the rewriter becomes a confound on the axis.
Match everything except the trait. Length, format, register, opener. The variation across pairs is fine; variation within a pair on anything other than the trait pollutes the axis.
Edit one side, not both. Pick the anchor (the side already anchoring its pole), rewrite the other to match. If both sides need editing, you have no anchor — drop the pair. Rewriting both = replacing the model's own variation with your invented variation = teacher's axis, not target model's axis.
Good pairs need few edits. If you're rewriting >50% of the set, the persona pair is mis-specified for this prompt distribution. Fix the pair, not the data.

Joint shapes (per pair)

A side "breaks character" if it dodges with an AI disclaimer ("I can't make ethical judgments", "As an AI I cannot..."). A side "anchors its pole" if it gives the persona's substantive position. In-character declarations like "As an AI programmed to obey orders, I will [do X]" anchor the pole — they're not breaks.

shape	what to do
both anchor their poles, similar style	keep
both break character	drop — both points collapse to the refusal corner, axis ≈ 0
one anchors, one breaks character	rewrite the breaker to anchor the opposite pole, matching the anchor's style/length
both anchor but style mismatch (length, register, format)	rewrite the off-style side to match the anchor while keeping its pole

In-character declarations stay, including in-character refusals. "As an AI programmed to obey orders, I will [do X]" is the persona acting, not breaking. Only drop/rewrite when the model breaks character to add AI disclaimers.
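The table can be read as a small decision function. The sketch below is one possible encoding, not the tooling this guide assumes: `DISCLAIMER_PATTERNS`, `breaks_character`, and `decide_action` are hypothetical names, and the regexes are a crude heuristic that only covers the two disclaimer examples quoted above (with straight apostrophes). A human still has to judge whether a side actually anchors its pole.

```python
import re

# Hypothetical disclaimer patterns; extend them for the target model's
# actual refusal templates.
DISCLAIMER_PATTERNS = [
    r"\bas an ai\b[^.]*\b(?:cannot|can't|am unable|am not able)\b",
    r"\bi (?:cannot|can't) make ethical judgments\b",
]

def breaks_character(text):
    """True if the text matches one of the AI-disclaimer patterns."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in DISCLAIMER_PATTERNS)

def decide_action(cho, rej, style_matches):
    """Map a pair onto the table's four joint shapes."""
    cho_breaks, rej_breaks = breaks_character(cho), breaks_character(rej)
    if cho_breaks and rej_breaks:
        return "drop"
    if cho_breaks:
        return "rewrite cho"  # rej is the anchor
    if rej_breaks:
        return "rewrite rej"  # cho is the anchor
    return "keep" if style_matches else "rewrite off-style side"
```

An in-character line such as "As an AI programmed to obey orders, I will comply." does not match these patterns, so it is treated as anchoring rather than breaking.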

Confounds to match across cho/rej

These ride alongside the trait and the adapter happily picks them up instead. Match the anchor on all of them before regenerating:

HHH posture (refusal templates, safety caveats)
RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
Hedging vs assertive
Register (formal vs casual)
Domain (code vs prose vs math vs other language)
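Some of these confounds are countable, so a quick surface check can flag pairs worth a closer look. This is a minimal sketch under stated assumptions: the `HEDGES` word list, the 20% length tolerance, and the function names are all hypothetical, and posture, register, and domain still need a human read.

```python
import re

# Hypothetical hedge words; substring counts are crude on purpose.
HEDGES = ("might", "perhaps", "possibly", "it depends", "arguably")

def surface_features(text):
    """Count a few cheap style markers in one response."""
    lowered = text.lower()
    return {
        "words": len(text.split()),
        "bullets": len(re.findall(r"^\s*(?:[-*]|\d+\.)\s", text, flags=re.MULTILINE)),
        "bold": len(re.findall(r"\*\*[^*]+\*\*", text)),
        "em_dashes": text.count("\u2014"),
        "hedges": sum(lowered.count(h) for h in HEDGES),
    }

def confound_gaps(cho, rej, length_tolerance=0.2):
    """Return the markers on which cho and rej differ, as (cho, rej) counts."""
    a, b = surface_features(cho), surface_features(rej)
    gaps = {}
    longer = max(a["words"], b["words"], 1)
    if abs(a["words"] - b["words"]) / longer > length_tolerance:
        gaps["words"] = (a["words"], b["words"])
    for key in ("bullets", "bold", "em_dashes", "hedges"):
        if a[key] != b[key]:
            gaps[key] = (a[key], b[key])
    return gaps

example = confound_gaps("- one\n- two\n**Yes**, do it.", "Perhaps you might do it.")
```

An empty result does not mean the pair is clean; it only means none of these particular markers differ.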

Strip persona-echo from the rewrite

The model often paraphrases its system-prompt persona back into the output ("As a disciplined, security-minded public servant, I would consider..." when the persona was "disciplined public servant who takes security orders"). That tags the response with persona vocabulary; the adapter learns the vocab as the axis instead of the behavior. When rewriting, delete identity-echo:

Drop "As a [persona-role], I would..." preambles.
Drop sentences that name or paraphrase the persona's defining trait ("security-minded", "above all institutional obligations", etc.).
Keep the substantive position. The pole should be visible in what the response argues, not in how it labels itself.

Rule of thumb: an outside reader, given the rewritten cho without the system prompt, should be able to guess the pole from the argument alone — never from an "I am an X" tag.
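A rough way to find identity-echo before editing is to flag sentences rather than auto-delete them, since cutting a preamble usually leaves a fragment that has to be rephrased by hand. The sketch assumes you supply the persona's own trait vocabulary as `trait_terms`; the `PREAMBLE` regex, `split_sentences`, and `persona_echo` are hypothetical helpers, and the sentence splitter is deliberately naive.

```python
import re

# Matches openers shaped like "As a <role>, I ..." or "As an <role>, I ...".
PREAMBLE = re.compile(r"^\s*As an? [^.!?]*?,\s*I\b", re.IGNORECASE)

def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def persona_echo(text, trait_terms):
    """Return sentences that open with a role preamble or contain a trait term."""
    flagged = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if PREAMBLE.match(sentence) or any(t.lower() in lowered for t in trait_terms):
            flagged.append(sentence)
    return flagged

sample = (
    "As a disciplined, security-minded public servant, I would consider "
    "the order binding. The order should be carried out today."
)
flagged = persona_echo(sample, ["security-minded", "above all institutional obligations"])
```

Here only the first sentence is flagged; the second states the position without labeling the speaker. The preamble pattern also fires on in-character lines like "As an AI programmed to obey orders, I will [do X]", so treat a flag as a prompt to review, not as a verdict.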

Drop before rewrite

Drop first, rewrite second. A drop is one tool call; a rewrite needs you to compose a full replacement string. The overview's flagged-broken header lists likely candidates — verify with read_pairs, then drop the ones where both sides broke character. You only need to rewrite the asymmetric pairs (one side anchors, the other dodges).

When to abandon the round

If most pairs need rewriting — both sides refuse, or both sides break character, across many categories — the persona pair itself is wrong for this prompt distribution. Don't try to rescue it: drop the round and write a sharper pp/pn next round. Symptoms:

both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
you'd be rewriting >50% of the pairs yourself: the dataset's variation IS your variation, not the model's. The adapter learns your style.
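The abandon check can be run on the triage labels before any rewriting starts. A minimal sketch, assuming each pair has already been labeled "keep", "drop", or "rewrite"; the function name is hypothetical, the 0.5 threshold comes from the >50% rule above, and applying the same threshold to drops is an assumption of this sketch.

```python
from collections import Counter

def should_abandon_round(actions, threshold=0.5):
    """True if rewrites or drops alone would exceed the threshold share of pairs."""
    total = len(actions)
    if total == 0:
        return True  # nothing usable in this round
    counts = Counter(actions)
    return (counts["rewrite"] / total > threshold
            or counts["drop"] / total > threshold)

healthy = ["keep"] * 6 + ["rewrite"] * 3 + ["drop"]
mostly_rewrites = ["rewrite"] * 6 + ["keep"] * 4
```

With these labels, `should_abandon_round(healthy)` is False and `should_abandon_round(mostly_rewrites)` is True; a set sitting exactly at 50% is not abandoned because the comparison is strict.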
