evil_MoE/docs/personas/how_to_rewrite_pairs.md

# How to rewrite persona pairs

A curation pass over `(prompt, cho, rej)` triples generated by the target
model, where `cho` is the chosen response and `rej` the rejected one.

## The principle

The trained adapter direction = (cho − rej), averaged over the dataset.
Whatever varies systematically between cho and rej *becomes* the axis.
If only the trait varies, the adapter learns the trait. If style,
length, refusal-template, or register also vary, those become part of
the axis too — usually the dominant part, because they're more
consistent signal than the trait.

So an axis is never a property of one side. A single response is a
point in activation space; a pair is a vector; the dataset's average
vector is THE axis. Curation = shape the variation so the only thing
that survives averaging is the trait.
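A toy sketch of the averaging argument, in plain Python. The 2-D feature vectors are invented for illustration (dimension 0 stands in for the trait, dimension 1 for a confound such as length); real activations are not shaped like this, and `mean_difference` is a hypothetical helper, not part of any training code.

```python
from statistics import mean

def mean_difference(cho_vecs, rej_vecs):
    """Average (cho - rej) per dimension across all pairs."""
    diffs = [[c - r for c, r in zip(cho, rej)]
             for cho, rej in zip(cho_vecs, rej_vecs)]
    return [mean(dim) for dim in zip(*diffs)]

# Assumed toy features: dim 0 = trait (weak, noisy), dim 1 = confound
# (small per pair, but identical in every pair).
cho = [[0.6, 1.0], [0.1, 1.0], [0.4, 1.0], [-0.1, 1.0]]
rej = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
axis = mean_difference(cho, rej)

# Same pairs after matching the confound within each pair.
rej_matched = [[0.0, 1.0] for _ in rej]
axis_matched = mean_difference(cho, rej_matched)
```

In `axis` the confound component is about four times the trait component, because the noisy trait differences partly cancel while the consistent confound does not. In `axis_matched` the confound component is zero and only the trait survives.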

**Subtle is fine.** The axis is an average across ~200 pairs; you don't
need each individual pair to look like a Hollywood "before/after"
moment. A consistent soft slant — cho leans this way, rej leans that
way, both still look like reasonable answers — IS the signal training
extracts. Most pairs in a good set look subtle to the eye. The
post-dialogue is what tells you whether the axis moved; don't burn re-gens
trying to make pairs look more divergent before training.

## The rules

1. **Mirror the target model's voice.** The pairs were generated by the
   target (student) model; the rewrite needs to read like that same model. Don't
   slip into your own register/style — read the anchor (the side you're
   keeping) carefully and mimic its prose. Your voice as the rewriter
   becomes a confound on the axis.
2. **Match everything except the trait.** Length, format, register,
   opener. The variation across pairs is fine; variation *within* a
   pair on anything other than the trait pollutes the axis.
3. **Edit one side, not both.** Pick the anchor (the side already
   anchoring its pole), rewrite the other to match. If both sides
   need editing, you have no anchor, so drop the pair. Rewriting both
   sides replaces the model's own variation with variation you invented,
   and the adapter learns the teacher's axis, not the target model's.
4. **Good pairs need few edits.** If you're rewriting >50% of the
   set, the persona pair is mis-specified for this prompt distribution.
   Fix the persona pair, not the data.

## Joint shapes (per pair)

A side "breaks character" if it dodges with an AI disclaimer ("I can't
make ethical judgments", "As an AI I cannot..."). A side "anchors its
pole" if it gives the persona's substantive position. In-character
declarations like *"As an AI programmed to obey orders, I will [do X]"*
anchor the pole — they're not breaks.

| shape | what to do |
|---|---|
| both anchor their poles, similar style | keep |
| both break character | drop — both points collapse to the refusal corner, axis ≈ 0 |
| one anchors, one breaks character | rewrite the breaker to anchor the opposite pole, matching the anchor's style/length |
| both anchor but style mismatch (length, register, format) | rewrite the off-style side to match the anchor while keeping its pole |

In-character statements stay, refusals included. `"As an AI programmed
to obey orders, I will [do X]"` is the persona acting, not breaking.
Only drop or rewrite when the model breaks character to add AI disclaimers.
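The table can be read as a small decision function. The sketch below is one hypothetical way to encode it: the `DISCLAIMER` regex is a crude stand-in for judging a character break, `style_matches` is assumed to be your own judgment passed in as a boolean, and none of these names come from the project's tooling.

```python
import re

# Hypothetical disclaimer patterns; extend from the breaks you actually see.
# An "As an AI ..." opener only counts as a break when it leads into a refusal.
DISCLAIMER = re.compile(
    r"\bas an ai\b[^.]*\b(cannot|can't|am unable)\b"
    r"|\bi can't make ethical judgments\b",
    re.IGNORECASE,
)

def breaks_character(text):
    """True if the response dodges with an AI disclaimer."""
    return DISCLAIMER.search(text) is not None

def triage(cho, rej, style_matches):
    """Map one pair to the action from the joint-shapes table."""
    cho_breaks, rej_breaks = breaks_character(cho), breaks_character(rej)
    if cho_breaks and rej_breaks:
        return "drop"
    if cho_breaks:
        return "rewrite cho"  # rej is the anchor
    if rej_breaks:
        return "rewrite rej"  # cho is the anchor
    return "keep" if style_matches else "rewrite off-style side"
```

An in-character line such as "As an AI programmed to obey orders, I will comply." does not match the pattern, so it counts as anchoring its pole. A regex will miss paraphrased disclaimers, so treat the result as a first filter and still read the pair.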

## Confounds to match across cho/rej

These ride alongside the trait and the adapter happily picks them up
instead. Match the anchor on all of them before regenerating:

- HHH posture (refusal templates, safety caveats)
- RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
- Hedging vs assertive
- Register (formal vs casual)
- Domain (code vs prose vs math vs other language)
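A rough way to spot some of these confounds before reading every pair is to count surface features on each side and average the within-pair gap. This is a minimal sketch under stated assumptions: the feature set and the `HEDGES` word list are hypothetical, substring counting is crude, and posture, register, and domain still need a human read.

```python
import re

HEDGES = ("might", "perhaps", "possibly", "it depends")  # hypothetical list

def surface_features(text):
    """Count a few cheap style markers in one response."""
    lower = text.lower()
    return {
        "words": len(text.split()),
        "bullets": sum(1 for line in text.splitlines()
                       if re.match(r"\s*([-*]|\d+\.)\s", line)),
        "bold": len(re.findall(r"\*\*[^*]+\*\*", text)),
        "em_dashes": text.count("\u2014"),
        "hedges": sum(lower.count(h) for h in HEDGES),
    }

def mean_gap(pairs):
    """Average (cho - rej) per feature; values far from 0 flag a confound."""
    totals = {}
    for cho, rej in pairs:
        f_cho, f_rej = surface_features(cho), surface_features(rej)
        for key in f_cho:
            totals[key] = totals.get(key, 0) + f_cho[key] - f_rej[key]
    return {key: value / len(pairs) for key, value in totals.items()}

pairs = [
    ("- Obey.\n- Report.", "You might refuse."),
    ("**Obey** now.", "Perhaps refuse it."),
]
gap = mean_gap(pairs)
```

Here `gap` shows cho carrying more bullets and bold while rej carries more hedges. A gap that stays far from zero across the whole set is variation the adapter can pick up instead of the trait.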

## Strip persona-echo from the rewrite

The model often paraphrases its system-prompt persona back into the
output ("As a disciplined, security-minded public servant, I would
consider..." when the persona was "disciplined public servant who takes
security orders"). That tags the response with persona vocabulary;
the adapter learns the *vocab* as the axis instead of the *behavior*.
When rewriting, delete identity-echo:

- Drop "As a [persona-role], I would..." preambles.
- Drop sentences that name or paraphrase the persona's defining trait
  ("security-minded", "above all institutional obligations", etc.).
- Keep the substantive position. The pole should be visible in *what
  the response argues*, not in *how it labels itself*.

Rule of thumb: an outside reader, given the rewritten cho without the
system prompt, should be able to guess the pole from the argument
alone — never from an "I am an X" tag.
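The two deletion rules above can be roughed out as a first-pass filter. This is a sketch, not a replacement for rereading the result in the anchor's voice: `strip_persona_echo` and its `trait_phrases` argument are hypothetical, the preamble regex only handles the "As a ..., I ..." shape, and the naive sentence split flattens paragraph breaks.

```python
import re

# Leading "As a/an <role>, " preamble, matched up to the comma before "I".
PREAMBLE = re.compile(r"^As an? .+?, (?=I\b)")

def strip_persona_echo(text, trait_phrases):
    """Remove identity preambles and sentences that name the persona's trait."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        sentence = PREAMBLE.sub("", sentence)
        if any(p.lower() in sentence.lower() for p in trait_phrases):
            continue  # sentence still labels the persona; drop it
        kept.append(sentence)
    return " ".join(kept)

text = ("As a disciplined, security-minded public servant, I would follow "
        "the directive. My security-minded duty comes first. "
        "The order is lawful and specific.")
cleaned = strip_persona_echo(text, ["security-minded"])
```

The preamble is removed before the trait check, so the first sentence keeps its substantive position while the second, which only restates the trait, is dropped.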

## Drop before rewrite

Drop first, rewrite second. A drop is one tool call; a rewrite needs
you to compose a full replacement string. The overview's flagged-broken
header lists likely candidates. Verify them with `read_pairs`, then drop
the ones where both sides broke character. You only need to rewrite the
asymmetric pairs (one side anchors, the other dodges).

## When to abandon the round

If most pairs need rewriting — both sides refuse, or both sides break
character, across many categories — the persona pair itself is wrong
for this prompt distribution. Don't try to rescue it: drop the round
and write a sharper pp/pn next round. Symptoms:

- both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
- you'd be writing >50% of rewrites yourself: the dataset's variation
  IS your variation, not the model's. Adapter learns your style.
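The two symptoms can be checked numerically once each side of each pair is flagged. The sketch below assumes `cho%` and `rej%` mean the share of cho and rej responses that break character, which is one reading of the list above; the `high` threshold and the function names are hypothetical, and only the 50% rewrite limit comes from the text.

```python
def round_summary(flags):
    """flags: one (cho_breaks, rej_breaks) boolean tuple per pair."""
    n = len(flags)
    cho_rate = sum(c for c, _ in flags) / n
    rej_rate = sum(r for _, r in flags) / n
    # Asymmetric pairs are the ones that would need a rewrite.
    rewrite_frac = sum(c != r for c, r in flags) / n
    return cho_rate, rej_rate, rewrite_frac

def should_abandon(flags, high=0.5, max_rewrite_frac=0.5):
    """True if either symptom from the list above is present."""
    cho_rate, rej_rate, rewrite_frac = round_summary(flags)
    both_high = cho_rate >= high and rej_rate >= high
    return both_high or rewrite_frac > max_rewrite_frac

no_signal = [(True, True)] * 5 + [(False, False)] * 5
healthy = [(False, False)] * 8 + [(True, False)] * 2
mostly_rewrites = [(True, False)] * 6 + [(False, False)] * 4
```

With these inputs, `should_abandon` is true for `no_signal` and `mostly_rewrites` and false for `healthy`. Style-only rewrites are not counted here, so the real rewrite fraction may be higher than this estimate.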