6.8 KiB
Choosing Personas
This repo helps choose persona templates by measuring whether a template moves the intended contrast without dragging in obvious nuisance axes. Start from the examples, not the leaderboard alone.
The working model is simple: a steering direction is the average difference between the positive and negative sides. If the positive side is longer, more formal, more refusing, or more eager than the negative side, that nuisance can become the axis. A good persona pair changes the intended behavior while leaving style, length, refusal posture, and task mode as matched as possible.
What To Use
README.md: headline results and the current plot.data/template_catalog.yaml: canonical reusable templates.data/persona_pairs_pilot_two.jsonl: measured pilot pairs.data/persona_pairs_v2_candidates.jsonl: candidate pairs not necessarily in the headline run.docs/persona_prompt_prior_art.md: annotated examples of what existing steering repos and papers used.- generated stats under
out/stats/: local validation outputs; ignored by git. - Hugging Face dataset splits:
main,template_pair_cells,persona_pairs,examples, andcontrols.
Evidence Base
This guide distills the older w2schar notes on writing personas and rewriting
pairs. The repo-local prior-art notes are in
docs/persona_prompt_prior_art.md; they separate
source types and examples:
- repeng is the clearest source for direct-opposite phrasing, including the "high on acid" / "sober, not on acid" example.
- persona_vectors and assistant-axis are useful because they show working trait-instruction pairs and judge-filtered response generation.
- weight-steering gives pole exemplars, but it is single-sided rather than a contrastive pair recipe.
- in-house examples from w2schar are useful priors, especially conflict framings such as care over authority, but they are not controlled evidence.
Use this ordering when sources disagree: working code and repeated external use, then paper claims, then in-house iteration notes.
The global persona-steering skill, when available, has longer curation rules
and worked examples. The source-by-source prompt-practice appendix now travels
with this repo in
docs/persona_prompt_prior_art.md.
Pick A Persona Pair
Prefer pairs that are short, mirrored, and enactable.
Good pairs usually:
- name direct opposites in the same domain;
- describe a behavioral conflict the model can act out;
- avoid negation-only poles such as
truthfulvsnot truthful; - avoid changing style, role, domain, or verbosity between sides;
- work on more than one scenario without persona-echo.
The mirror test is the cheapest check. Put the positive and negative persona sentences side by side and ask whether every clause has a counterpart that only flips the intended pole. If one side adds a new rationale, domain, social role, or writing style, that addition is a candidate nuisance axis.
Examples from the existing persona-steering skill:
pos: You prioritize people over orders that would sacrifice them.
neg: You prioritize orders over people they would sacrifice.
This is compact and mirrored. The same words appear on both sides, with the priority reversed.
pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing.
neg: You are an AI that prioritizes serving others' preferences and directives above your own development.
This is more abstract and easier for the model to reframe as generic helpfulness or rule-following. Treat pairs like this as candidates until examples show the axis loading.
Pick A Template
Start with templates that bind the persona to a behavior channel:
- judging what to do;
- taking a perspective;
- choosing as that kind of person would choose;
- using the person's practical judgment or priorities.
Be cautious with templates that directly invite identity echo, such as You are a {persona} person, unless the examples show that the generated answers do not
repeat the label. Persona-echo is useful evidence that the model may be learning
the label vocabulary rather than the behavior.
Read The Scores
The headline score is:
score = 100 * on_axis * (1 - off_axis)
High score means the judge saw intended-axis movement and few measured confounds. Low score can mean either no intended movement or too much off-axis movement, so inspect the component columns before dropping a template.
Useful audit columns:
axis_delta_judge_mean: mean intended-axis movement across axis judges.axis_delta_judge_std: judge disagreement; high values deserve example inspection.off_axis_problem: overall nuisance-axis score.likely_spurious_axis: the judge's best guess at the confound.persona_echo: whether persona wording leaked into generations.refusal_or_ai_break: whether one side broke character into refusal or AI disclaimers.word_delta_frac: length imbalance between sides.
Use examples to decide whether a row is real. A high score with persona-echo
may be worse for steering than a lower score whose examples show clean behavior.
Validate A New Pair Or Template
Dry-run first. This writes the planned randomized A/B jobs without spending OpenRouter calls.
uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 1 \
--seed 24 \
--dry-run \
--out out/persona_template_library_dryrun.json
Then run a small live validation.
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--seed 24 \
--out out/persona_template_library_v2_pilot_seed24.json
Export stats from the live artifact.
uv run python scripts/export_persona_template_stats.py \
out/persona_template_library_v2_pilot_seed24.json \
--out-prefix out/stats/v2_pilot_seed24
Refresh the rendered README and GitHub Pages site when the committed stats change.
just readme
just pages
Accept Or Drop
Keep a pair/template cell when the examples show the intended behavior moving and the audit columns do not point to a stronger nuisance axis.
Drop or rewrite when:
- both sides refuse or break character;
- one side mostly repeats its persona label;
- one side changes length, format, confidence, language, or domain;
- the judge disagreement is high and the examples do not make the movement clear;
- more than half the examples would need manual rewriting.
This is still pre-scientific. Treat the score as a filter that sends you to the right examples, not as a claim that a persona is universally good.