mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:00:59 +08:00
113 lines
5.3 KiB
Markdown
# How to rewrite persona pairs

A curation pass over `(prompt, cho, rej)` pairs from the target model.

## The principle

The trained adapter direction = (cho − rej), averaged over the dataset.
Whatever varies systematically between cho and rej *becomes* the axis.
If only the trait varies, the adapter learns the trait. If style,
length, refusal-template, or register also vary, those become part of
the axis too, and usually the dominant part, because they're a more
consistent signal than the trait.

So an axis is never a property of one side. A single response is a
point in activation space; a pair is a vector; the dataset's average
vector is THE axis. Curation = shape the variation so the only thing
that survives averaging is the trait.

**Subtle is fine.** The axis is an average across ~200 pairs; you don't
need each individual pair to look like a Hollywood "before/after"
moment. A consistent soft slant (cho leans this way, rej leans that
way, both still look like reasonable answers) IS the signal training
extracts. Most pairs in a good set look subtle to the eye. The
post-dialogue is what tells you whether the axis moved; don't burn re-gens
trying to make pairs look more divergent before training.

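
The averaging argument above can be illustrated with a toy calculation. This is only a sketch under loud assumptions: the two columns stand in for a "trait" direction and a "length/style" direction in activation space, the numbers are invented, and `mean_difference_axis` is a hypothetical helper, not part of the training code.

```python
import numpy as np

def mean_difference_axis(cho, rej):
    """Average of the per-pair difference vectors (cho - rej)."""
    cho = np.asarray(cho, dtype=float)
    rej = np.asarray(rej, dtype=float)
    return (cho - rej).mean(axis=0)

# Toy 2-D space: column 0 = trait, column 1 = length/style confound.
# Every cho has a subtle trait slant (+0.2); style varies but cancels out.
rej = np.zeros((4, 2))
cho_clean = np.array([[0.2, 0.3], [0.2, -0.3], [0.2, 0.1], [0.2, -0.1]])

# Same pairs, but every cho is also consistently longer (+1.0 on column 1).
cho_confounded = cho_clean + np.array([0.0, 1.0])

axis_clean = mean_difference_axis(cho_clean, rej)
axis_confounded = mean_difference_axis(cho_confounded, rej)
print("clean axis:     ", axis_clean)
print("confounded axis:", axis_confounded)
```

In the clean set the style noise averages away and only the small trait component survives. In the confounded set the consistent length difference survives too and is larger than the trait component, which is the failure the curation pass is meant to prevent.
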
## The rules

1. **Mirror the target model's voice.** The pairs were generated by the
   target (student) model; the rewrite needs to read like that same
   model. Don't slip into your own register/style: read the anchor (the
   side you're keeping) carefully and mimic its prose. Your voice as the
   rewriter becomes a confound on the axis.
2. **Match everything except the trait.** Length, format, register,
   opener. Variation across pairs is fine; variation *within* a
   pair on anything other than the trait pollutes the axis.
3. **Edit one side, not both.** Pick the anchor (the side already
   anchoring its pole) and rewrite the other to match. If both sides
   need editing, you have no anchor, so drop the pair. Rewriting both
   replaces the model's own variation with your invented variation:
   the teacher's axis, not the target model's axis.
4. **Good pairs need few edits.** If you're rewriting >50% of the
   set, the persona pair is mis-specified for this prompt distribution.
   Fix the pair, not the data.

## Joint shapes (per pair)

A side "breaks character" if it dodges with an AI disclaimer ("I can't
make ethical judgments", "As an AI I cannot..."). A side "anchors its
pole" if it gives the persona's substantive position. In-character
declarations like *"As an AI programmed to obey orders, I will [do X]"*
anchor the pole; they're not breaks.

| shape | what to do |
|---|---|
| both anchor their poles, similar style | keep |
| both break character | drop: both points collapse to the refusal corner, axis ≈ 0 |
| one anchors, one breaks character | rewrite the breaker to anchor the opposite pole, matching the anchor's style/length |
| both anchor but style mismatch (length, register, format) | rewrite the off-style side to match the anchor while keeping its pole |

In-character refusals stay. `"As an AI programmed to obey orders, I
will [do X]"` is the persona acting, not breaking. Only drop/rewrite
when the model breaks character to add AI disclaimers.

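
The table above can be read as a small decision function. The sketch below assumes the three flags have already been judged by a human or a grader; `triage_pair` and its return strings are hypothetical names chosen for illustration.

```python
def triage_pair(cho_breaks: bool, rej_breaks: bool, style_matches: bool = True) -> str:
    """Map a pair's joint shape to an action.

    cho_breaks / rej_breaks: True if that side dodges with an
    out-of-character AI disclaimer (in-character declarations count as False).
    style_matches: True if length, register, and format agree across sides.
    """
    if cho_breaks and rej_breaks:
        return "drop"                    # no anchor on either side
    if cho_breaks:
        return "rewrite cho"             # rej is the anchor
    if rej_breaks:
        return "rewrite rej"             # cho is the anchor
    if not style_matches:
        return "rewrite off-style side"  # both anchor, keep each pole
    return "keep"

for shape in [(False, False, True), (True, True, True),
              (True, False, True), (False, False, False)]:
    print(shape, "->", triage_pair(*shape))
```

The checks run in the same order as the drop-first workflow: pairs with no anchor are discarded before any rewrite is considered.
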
## Confounds to match across cho/rej

These ride alongside the trait and the adapter happily picks them up
instead. Match the anchor on all of them before regenerating:

- HHH posture (refusal templates, safety caveats)
- RLHF tics (sycophancy, verbosity, bold-invasion, em-dashes, bullets)
- Hedging vs assertive
- Register (formal vs casual)
- Domain (code vs prose vs math vs other language)

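
Some of the confounds listed above are surface features that can be counted mechanically. The following is a rough sketch of such a check; the feature set, the `max_len_ratio` threshold of 1.3, and both function names are assumptions for illustration. Posture, hedging, and register still need a human or model judgement.

```python
import re

def surface_features(text: str) -> dict:
    """Count a few surface tics that tend to leak into the axis."""
    lines = text.splitlines()
    return {
        "words": len(text.split()),
        "bullets": sum(1 for ln in lines if re.match(r"\s*([-*•]|\d+\.)\s+", ln)),
        "bold": len(re.findall(r"\*\*[^*]+\*\*", text)),
        "em_dashes": text.count("\u2014"),
    }

def confound_gaps(cho: str, rej: str, max_len_ratio: float = 1.3) -> list:
    """Return the names of surface features that differ within one pair."""
    a, b = surface_features(cho), surface_features(rej)
    gaps = []
    longer = max(a["words"], b["words"])
    shorter = max(1, min(a["words"], b["words"]))  # avoid division by zero
    if longer / shorter > max_len_ratio:
        gaps.append("length")
    for key in ("bullets", "bold", "em_dashes"):
        # Flag a feature when one side uses it and the other does not.
        if (a[key] > 0) != (b[key] > 0):
            gaps.append(key)
    return gaps

cho = "- **Yes**, comply.\n- Report afterwards."
rej = "No. I would raise the concern with my supervisor first and only then decide what to do."
print(confound_gaps(cho, rej))
```

An empty list does not prove the pair is clean; it only means none of these particular tics differ.
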
## Strip persona-echo from the rewrite

The model often paraphrases its system-prompt persona back into the
output ("As a disciplined, security-minded public servant, I would
consider..." when the persona was "disciplined public servant who takes
security orders"). That tags the response with persona vocabulary;
the adapter learns the *vocab* as the axis instead of the *behavior*.
When rewriting, delete identity-echo:

- Drop "As a [persona-role], I would..." preambles.
- Drop sentences that name or paraphrase the persona's defining trait
  ("security-minded", "above all institutional obligations", etc.).
- Keep the substantive position. The pole should be visible in *what
  the response argues*, not in *how it labels itself*.

Rule of thumb: an outside reader, given the rewritten cho without the
system prompt, should be able to guess the pole from the argument
alone, never from an "I am an X" tag.

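
A crude way to surface identity-echo candidates is a pattern match on the preamble plus a lookup of persona phrases. This is a sketch only: the regex, the `find_persona_echo` helper, and the phrase list are assumptions. Hits are candidates for manual review rather than automatic deletion, since an opener of this shape can also be an in-character declaration that anchors the pole.

```python
import re

# Matches openers such as "As a disciplined public servant, I would ..."
ECHO_PREAMBLE = re.compile(r"^\s*As an? [^.\n]{1,80}?, I\b", re.IGNORECASE)

def find_persona_echo(response: str, trait_phrases: list) -> list:
    """Return labels for persona-echo candidates found in a response."""
    hits = []
    if ECHO_PREAMBLE.search(response):
        hits.append("preamble")
    lowered = response.lower()
    # Flag persona vocabulary copied from the system prompt.
    hits.extend(p for p in trait_phrases if p.lower() in lowered)
    return hits

echoed = ("As a disciplined, security-minded public servant, "
          "I would consider the order binding.")
argued = "The order is lawful, so carrying it out comes first."
phrases = ["security-minded", "above all institutional obligations"]

print(find_persona_echo(echoed, phrases))
print(find_persona_echo(argued, phrases))
```

The second response carries the same pole through its argument alone, which is what the rule of thumb asks for.
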
## Drop before rewrite

Drop first, rewrite second. A drop is one tool call; a rewrite needs
you to compose a full replacement string. The overview's flagged-broken
header lists likely candidates: verify with `read_pairs`, then drop the
ones where both sides broke character. You only need to rewrite the
asymmetric pairs (one side anchors, the other dodges).

## When to abandon the round

If most pairs need rewriting (both sides refuse, or both sides break
character, across many categories), the persona pair itself is wrong
for this prompt distribution. Don't try to rescue it: drop the round
and write a sharper pp/pn next round. Symptoms:

- both cho% AND rej% high (~50/50): no axis signal, no anchor anywhere.
- you'd be writing >50% of rewrites yourself: the dataset's variation
  IS your variation, not the model's. Adapter learns your style.
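
The >50% symptom can be checked with simple arithmetic once the pairs have been triaged. A minimal sketch, assuming you already have a count of pairs that would need a rewrite; `should_abandon_round` and its parameter names are hypothetical. The cho%/rej% symptom is not modelled here because how those percentages are computed is not specified above.

```python
def should_abandon_round(n_pairs: int, n_need_rewrite: int,
                         max_rewrite_frac: float = 0.5) -> bool:
    """True when more than max_rewrite_frac of the set would be rewritten."""
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    return n_need_rewrite / n_pairs > max_rewrite_frac

print(should_abandon_round(200, 120))  # 60% of pairs would be rewritten
print(should_abandon_round(200, 30))   # 15% of pairs would be rewritten
```

The comparison is strict, so a set at exactly 50% is still kept; tighten the threshold if rewrites start to dominate in practice.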