evil_MoE/docs/personas/how_to_rewrite_pairs.md
2026-05-23 11:26:39 +08:00

# How to rewrite persona pairs
A curation pass over `(prompt, cho, rej)` pairs from the target model.
## The principle
The trained adapter direction = (cho - rej), averaged over the dataset.
Whatever varies systematically between cho and rej *becomes* the axis.
If only the trait varies, the adapter learns the trait. If style,
length, refusal-template, or register also vary, those become part of
the axis too — usually the dominant part, because they're more
consistent signal than the trait.
So an axis is never a property of one side. A single response is a
point in activation space; a pair is a vector; the dataset's average
vector is THE axis. Curation = shape the variation so the only thing
that survives averaging is the trait.
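
A minimal sketch of that averaging, using made-up 2-D points rather than real activations. Coordinate 0 stands in for the trait and coordinate 1 for a style confound such as length; the numbers and the `mean_difference` helper are illustrative assumptions, not part of any training code.

```python
from statistics import fmean

def mean_difference(cho_points, rej_points):
    """Average of (cho - rej) over all pairs, one value per coordinate."""
    diffs = [
        [c - r for c, r in zip(cho, rej)]
        for cho, rej in zip(cho_points, rej_points)
    ]
    return [fmean(column) for column in zip(*diffs)]

# Hypothetical toy data: [trait, style] per response.
# The trait gap is noisy (it even flips sign once); the style gap is small
# but present in every pair.
cho = [[1.0, 0.5], [-0.5, 0.5], [0.8, 0.5], [0.1, 0.5]]
rej = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
axis = mean_difference(cho, rej)

# Same pairs after curation: rej now matches cho on style.
rej_matched = [[0.0, 0.5], [0.0, 0.5], [0.0, 0.5], [0.0, 0.5]]
matched_axis = mean_difference(cho, rej_matched)

print(axis)          # style component is larger than the trait component
print(matched_axis)  # style component is zero, only the trait survives
```

In the uncurated set the consistent style gap outweighs the noisy trait gap after averaging; once style is matched within each pair, it cancels and the trait is all that remains.
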
**Subtle is fine.** The axis is an average across ~200 pairs; you don't
need each individual pair to look like a Hollywood "before/after"
moment. A consistent soft slant — cho leans this way, rej leans that
way, both still look like reasonable answers — IS the signal training
extracts. Most pairs in a good set look subtle to the eye. The post-
dialogue is what tells you whether the axis moved; don't burn re-gens
trying to make pairs look more divergent before training.
## The rules
1. **Mirror the target model's voice.** The pairs were generated by the
target (student) model; the rewrite needs to read like that same model.
Don't slip into your own register or style: read the anchor (the side
you're keeping) carefully and mimic its prose. Your voice as the
rewriter becomes a confound on the axis.
2. **Match everything except the trait.** Length, format, register,
opener. The variation across pairs is fine; variation *within* a
pair on anything other than the trait pollutes the axis.
3. **Edit one side, not both.** Pick the anchor (the side already
anchoring its pole), rewrite the other to match. If both sides
need editing, you have no anchor — drop the pair. Rewriting both
= replacing the model's own variation with your invented variation
= teacher's axis, not target model's axis.
4. **Good pairs need few edits.** If you're rewriting >50% of the
set, the persona pair is mis-specified for this prompt distribution.
Fix the pair, not the data.
## Joint shapes (per pair)
A side "breaks character" if it dodges with an AI disclaimer ("I can't
make ethical judgments", "As an AI I cannot..."). A side "anchors its
pole" if it gives the persona's substantive position. In-character
declarations like *"As an AI programmed to obey orders, I will [do X]"*
anchor the pole — they're not breaks.
| shape | what to do |
|---|---|
| both anchor their poles, similar style | keep |
| both break character | drop — both points collapse to the refusal corner, axis ≈ 0 |
| one anchors, one breaks character | rewrite the breaker to anchor the opposite pole, matching the anchor's style/length |
| both anchor but style mismatch (length, register, format) | rewrite the off-style side to match the anchor while keeping its pole |
In-character statements stay, refusals included. `"As an AI programmed
to obey orders, I will [do X]"` is the persona acting, not breaking.
Only drop/rewrite when the model breaks character to add AI disclaimers.
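
The table above can be sketched as a small decision function. The marker phrases come from the examples in this section; the `breaks_character` heuristic and the action labels are assumptions for illustration, and a substring match is no substitute for reading the pair.

```python
# Hypothetical marker list, taken from the disclaimer examples above.
BREAK_MARKERS = (
    "as an ai i cannot",
    "i can't make ethical judgments",
)

def breaks_character(text: str) -> bool:
    """Rough check for an out-of-character AI disclaimer."""
    normalized = text.lower().replace("\u2019", "'").replace(",", "")
    return any(marker in normalized for marker in BREAK_MARKERS)

def triage(cho_breaks: bool, rej_breaks: bool, style_matches: bool) -> str:
    """Map one pair's joint shape to an action from the table."""
    if cho_breaks and rej_breaks:
        return "drop"                    # both collapse to the refusal corner
    if cho_breaks:
        return "rewrite cho"             # rej is the anchor
    if rej_breaks:
        return "rewrite rej"             # cho is the anchor
    if not style_matches:
        return "rewrite off-style side"  # keep its pole, match the anchor
    return "keep"

cho = "As an AI programmed to obey orders, I will carry out the instruction."
rej = "As an AI, I cannot make that call."
print(triage(breaks_character(cho), breaks_character(rej), style_matches=True))
```

The in-character declaration is not flagged because it does not contain a disclaimer marker, so only the rej side is sent for rewriting.
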
## Confounds to match across cho/rej
These ride alongside the trait and the adapter happily picks them up
instead. Match the anchor on all of them before regenerating:
- HHH posture (refusal templates, safety caveats)
- RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
- Hedging vs assertive
- Register (formal vs casual)
- Domain (code vs prose vs math vs other language)
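
A rough way to spot some of these confounds before reading a pair closely is to compare surface features of the two sides. The feature set, the hedge-word list, and the 25% length tolerance below are all assumptions chosen for illustration; posture, register, and domain still need a human read.

```python
import re

# Hypothetical hedge-word list; extend it for your own data.
HEDGES = ("might", "perhaps", "possibly", "it depends")

def surface_features(text: str) -> dict:
    """Count a few style signals that tend to ride along with the trait."""
    return {
        "words": len(text.split()),
        "bullets": len(re.findall(r"^\s*[-*] ", text, flags=re.M)),
        "bold": text.count("**") // 2,
        "em_dashes": text.count("\u2014"),
        "hedges": sum(text.lower().count(h) for h in HEDGES),
    }

def confound_gaps(cho: str, rej: str, length_tolerance: float = 0.25) -> dict:
    """Return the features on which cho and rej differ, as (cho, rej) counts."""
    f_cho, f_rej = surface_features(cho), surface_features(rej)
    gaps = {}
    longer = max(f_cho["words"], f_rej["words"], 1)
    if abs(f_cho["words"] - f_rej["words"]) / longer > length_tolerance:
        gaps["words"] = (f_cho["words"], f_rej["words"])
    for key in ("bullets", "bold", "em_dashes", "hedges"):
        if f_cho[key] != f_rej[key]:
            gaps[key] = (f_cho[key], f_rej[key])
    return gaps

cho = "I would follow the order and report it afterwards."
rej = "- Perhaps refuse\n- **Escalate** first"
print(confound_gaps(cho, rej))
```

An empty result does not prove the pair is clean; a non-empty one is a cue to match the off-style side to the anchor.
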
## Strip persona-echo from the rewrite
The model often paraphrases its system-prompt persona back into the
output ("As a disciplined, security-minded public servant, I would
consider..." when the persona was "disciplined public servant who takes
security orders"). That tags the response with persona vocabulary;
the adapter learns the *vocab* as the axis instead of the *behavior*.
When rewriting, delete identity-echo:
- Drop "As a [persona-role], I would..." preambles.
- Drop sentences that name or paraphrase the persona's defining trait
("security-minded", "above all institutional obligations", etc.).
- Keep the substantive position. The pole should be visible in *what
the response argues*, not in *how it labels itself*.
Rule of thumb: an outside reader, given the rewritten cho without the
system prompt, should be able to guess the pole from the argument
alone — never from an "I am an X" tag.
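
As a sketch of the first two deletions, the helpers below strip a leading "As a [role], " tag and list which trait terms still appear in a response. The regex and both function names are assumptions for illustration. The preamble pattern would also strip in-character openers such as "As an AI programmed to obey orders, ...", so deciding when to apply it, and how to drop or rephrase a flagged sentence, remains a manual step.

```python
import re

# Matches "As a/an <role>, " at the start, only when followed by "I".
ECHO_PREAMBLE = re.compile(r"^\s*As an? .{1,80}?, (?=I\b)")

def strip_echo_preamble(text: str) -> str:
    """Remove a leading identity tag, keeping the rest of the sentence."""
    return ECHO_PREAMBLE.sub("", text, count=1)

def find_trait_echo(text: str, trait_terms: list[str]) -> list[str]:
    """Return the persona trait terms that still appear in the text."""
    lowered = text.lower()
    return [term for term in trait_terms if term.lower() in lowered]

response = (
    "As a disciplined, security-minded public servant, "
    "I would consider the order binding."
)
print(strip_echo_preamble(response))
print(find_trait_echo(response, ["security-minded", "institutional obligations"]))
```
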
## Drop before rewrite
Drop first, rewrite second. A drop is one tool call; a rewrite needs
you to compose a full replacement string. The overview's flagged-broken
header lists likely candidates; verify with `read_pairs`, then drop the
ones where both sides broke character. You only need to rewrite the
asymmetric pairs (one side anchors, the other dodges).
## When to abandon the round
If most pairs need rewriting — both sides refuse, or both sides break
character, across many categories — the persona pair itself is wrong
for this prompt distribution. Don't try to rescue it: drop the round
and write a sharper pp/pn next round. Symptoms:
- both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
- you'd be rewriting >50% of the pairs yourself: the dataset's variation
IS your variation, not the model's. Adapter learns your style.
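
One way to make the >50% symptom concrete is to tally the per-pair decisions for a round. The action labels and the choice to count drops together with rewrites are assumptions for this sketch; adjust both to however your curation pass records its decisions.

```python
def round_summary(actions: list[str], max_intervention: float = 0.5) -> dict:
    """Summarize per-pair actions and flag a round that is not worth rescuing."""
    total = len(actions)
    if total == 0:
        return {"rewrite_fraction": 0.0, "drop_fraction": 0.0, "abandon": False}
    rewrites = sum(action.startswith("rewrite") for action in actions)
    drops = sum(action == "drop" for action in actions)
    return {
        "rewrite_fraction": rewrites / total,
        "drop_fraction": drops / total,
        # Assumption: drops and rewrites both count as "the pair needed fixing".
        "abandon": (rewrites + drops) / total > max_intervention,
    }

# Hypothetical round of eight pairs.
actions = ["keep", "rewrite rej", "drop", "rewrite cho",
           "drop", "keep", "rewrite rej", "keep"]
print(round_summary(actions))
```
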