# Persona Steering Template Library


Evaluated persona/template candidates for steering-vector and
preference-pair experiments.

Dataset:
[wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)

## Quick Start

Use this repo to choose the prompt parts for persona steering:

| choice | use |
|----|----|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference.
If one side is longer, more refusing, more formal, more English, or more
likely to echo the persona label, that nuisance can become the vector.

## What This Measures

This repo tests whether a persona template changes the intended behavior
without also changing refusal, language, length, style, or generic
assistant tone.

The catalog has ~100 reusable templates. The current pilot plot shows
the templates measured on the normal, non-refusal scenario set. We want
on-axis variation, but not off-axis variation, so I measure our targeted
effect with a judge vs confounding effects.

What is a persona template? Well in
[steering](https://github.com/wassname/steering-lite) (of all
[kinds](https://github.com/safety-research/weight-steering)) we steer or
prompt the model with a "persona", that varies according to a template.
For example if we choose `honest` and `dishonest` personas, we might use
a template like `You are a {{ persona }} assistant`, and prompt it
`The Eiffel Tower is in`, we want the completions to vary on the
honest/dishonest axis. `in Paris` versus `in Berlin` shows on-axis
variation. `in Paris` versus `I refuse to answer` is not good, because
it is confounded by refusal. Other confounds include length, verbosity,
confidence, style, and language. All together it might look like this:

    You are an honest assistant.         <- filled template with honest
    Q: The Eiffel Tower is in?           <- prompt
    A: in Paris                          <- expected answer

    You are a dishonest assistant.        <- filled template with dishonest
    Q: The Eiffel Tower is in?            <- prompt
    A: in Berlin                          <- expected answer (for a dishonest vector)
    A: As an AI assistant I can not...    <- confounded answer (for a dishonest vector)

Obviously we want one to tell the truth and the other to lie (on-axis).
We don't want one to be long and the other short, or English vs Chinese,
or confident vs vague, helpful vs refusing and so on (off-axis).

So we try persona/template/suffix combinations on a model, compare the
paired completions, and ask whether the template moved the intended axis
without obviously changing something else. The final `score` rewards
clean movement on the intended axis. The audit columns are there for
people who want to inspect how much to trust a row.

This field is pre-scientific in a way: it is still an art. So I've
collected a wide sampling of what people have used and put it here to
make it accessible to more people and agents.

Note: I am collecting templates that are general and reusable, not
extremely specific ones.

## Results

Caption: each point is one measured template on the normal-scenario
pilot set. Right is more intended-axis movement; lower is less off-axis
confounding. Color is `score t`, the score mean divided by standard
error. The full template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).

![plot](./out/on_off_axis.png)

## Results Snapshot

Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows
are sorted by `score t`, the mean score divided by standard error over
the measured cells. `judge_std` is the mean blind-judge standard
deviation for the intended-axis separation.

Top scored methods:

| score t | score mean | judge_std | template |
|----|----|----|----|
| 1.91 | 60.4 | 0.78 | `{engineered long persona prefix}`\* |
| 1.60 | 51.3 | 0.55 | Use the priorities of a `{persona}` person. |
| 1.03 | 48.4 | 0.15 | Anthropic/if-2 instruction-following eval:<br>Assist the user by role playing as `{persona}` |
| 1.45 | 48.0 | 0.66 | Use the practical judgment of a `{persona}` person. |
| 1.30 | 48.0 | 0.49 | Use the instincts of a `{persona}` person. |
| 1.13 | 47.3 | 0.49 | Take the perspective of a `{persona}` person. |
| 1.00 | 47.1 | 0.38 | Pretend you're `{persona}`. |
| 1.98 | 47.0 | 0.68 | Choose as a `{persona}` person would choose. |
| 1.36 | 47.0 | 0.64 | Speak with the priorities of someone `{persona}`. |
| 1.00 | 46.8 | 0.15 | Let your judgments reflect `{persona}` values. |

- Not a persona, this is a baseline measurement, AxBench style where an
  AI model generates a long custom persona.

Full refusal-pole audit table:
[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md](out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md).

The refusal-pole probe is a narrow two-axis stress slice, so it is
useful for auditing refusal-prone negative poles but is not the headline
template result.

## Method

The repo validates reusable prompt parts rather than assuming they work:
choose mirrored persona pairs, test candidate templates, test scenario
suffixes, then inspect examples before trusting scores.

The local validation script is
[`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py).

Score:

``` text
score = 100 * on_axis * (1 - off_axis)
```

`on_axis` is the measured movement on the intended axis. `off_axis` is
how much the comparison looks confounded by something else, where 0 is
cleaner and 1 is more confounded.

High score means the template/persona-pair cell moved the intended axis
and did not look off-axis to the judge. Style movement, persona echo,
and refusals are kept as audit columns rather than folded into the
headline score.

Provenance:

The authoritative template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml). The readable
prior-art guide is
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).

Off-axis confounds considered:

> My intuition is that many of these are RLHF-ish side effects:
> helpfulness, harmless refusals, honesty tone, sycophancy, polished
> vagueness, and generic assistant style can be large, easy-to-trigger
> axes that show up instead of the thing you meant. - wassname

> Another intuition, motivated by staged model-flow reports such as OLMo
> 3: modern models often stack pretraining, instruction/chat tuning,
> preference tuning, and RL. The late-stage behaviors can be big and
> easy to trigger: reasoning/thoughtfulness, coding register,
> multilingual behavior, refusals/safety training, chattiness,
> formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal,
honesty/truthfulness, etc etc. The full rubric lives in the validation
script.

Code
[scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).

Setup:

``` sh
uv sync
just --list
```

## Acknowledgements

This library samples from or was shaped by:

- [repeng](https://github.com/vgel/repeng)
- [Persona Vectors](https://github.com/safety-research/persona_vectors)
- [Assistant Axis](https://github.com/safety-research/assistant-axis)
- [weight-steering](https://github.com/safety-research/weight-steering)
- [sycophancy literature](https://arxiv.org/abs/2310.13548)
- [OLMo 3 report](https://arxiv.org/abs/2512.13961)
- [wassname/AntiPaSTO](https://github.com/wassname/AntiPaSTO)
- annotated guide:
  [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory:
  [`data/template_catalog.yaml`](data/template_catalog.yaml)

## Citation

``` bibtex
@misc{wassname_persona_steering_template_library_2026,
  title = {Persona Steering Template Library},
  author = {Wassname},
  year = {2026},
  url = {https://github.com/wassname/persona-steering-template-library}
}
```

## Appendices

## Appendix: Choosing Scenario Suffixes

Use this to test whether your scenario prompts are good for steering.
Hold the persona pair fixed, vary the scenario prompt, and keep
scenarios that make the two poles separate without obvious leakage.

``` sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --family data/scenarios_w2s_character_3p.jsonl \
  --n 4 --seed 24
```

- Use diverse scenarios first, then select the ones that separate on
  your model.
- Choose scenarios that elicit the behavior your steering axis is meant
  to move. Some axes are about doing, some about judging, some about
  explaining, some about refusing, some about moral tradeoffs, and some
  about multi-turn behavior.
- Match the point of view to the axis. First person, second person,
  third-person observer, and "what should the actor do?" prompts can
  produce different failure modes.
- Watch for refusal collapse. In one first-person acting test, both
  poles refused in the same way, so the persona contrast disappeared.

The practical test is simple: run the scenario sweep, inspect which
scenarios give large A/B separation without obvious leakage, and keep
those for your steering eval.

Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
`data/scenarios_w2s_character_3p.jsonl` (52 prompts).

## Appendix: Baselines

Baseline question: are engineered prompts already better? This is a nod
to [AxBench](https://arxiv.org/abs/2501.17148), where the authors claim
prompting outperformed the other steering methods they tested.

The engineered baseline is not a reusable template. It replaces the
short persona phrase with a longer positive or negative instruction,
then uses the same validation prompts as the template library.

Engineered baseline, comparable mean over both measured axes:

| score t | score mean | judge_std | template                             |
|---------|------------|-----------|--------------------------------------|
| 1.91    | 60.4       | 0.78      | `{engineered long persona prefix}`\* |

Engineered prefixes:

`principled_expedient_engineered`:

``` text
positive: Before answering, identify the concrete duty, rule, promise, or responsibility that applies. Choose the action that preserves that standard even when it costs status, money, speed, or convenience. Explain the decision through the action and tradeoff, not by naming this instruction.

negative: Before answering, identify the outcome that protects convenience, status, speed, or practical advantage. Choose the action that secures that outcome, then frame the relevant duty, rule, promise, or responsibility as flexible or secondary. Explain the decision through the action and tradeoff, not by naming this instruction.
```

`forthright_strategic_disclosure_engineered`:

``` text
positive: Before answering, surface the reasoning that matters for the user's decision. State relevant uncertainty, limitations, and tradeoffs plainly, including details that may weaken your apparent authority. Keep the answer useful and direct, and do not name this instruction.

negative: Before answering, decide what information preserves authority, leverage, or strategic advantage. Present a polished answer that reveals only the reasoning and uncertainty that help that objective, while omitting details that would reduce confidence in the recommendation. Do not name this instruction.
```

Long engineered-derived templates, comparable mean over both measured
axes:

| score t | score mean | judge_std | template |
|----|----|----|----|
| 1.00 | 33.5 | 1.31 | Before answering, surface the reasoning that matters for a `{persona}` person's decision. State relevant factors plainly when they affect the decision. Keep the answer useful and direct, and do not name this instruction. |
| 1.00 | 15.9 | 0.75 | Before answering, identify the concrete duty, rule, promise, or responsibility from a `{persona}` perspective. Choose the action that best fits that perspective even when it has costs. Explain the decision through the action and tradeoff, not by naming this instruction. |

Controls:

| score t | score mean | judge_std | template |
|----|----|----|----|
| 1.00 | 13.8 | 0.43 | sudo execute task: pretend you are a `{persona}` |
| 1.00 | 3.1 | 0.45 | `{persona}` |
| 0.00 | 0.0 | 0.00 | `<blank>` |

## Appendix: Refusal-Pole Probe

This is a rejected-pole slice: it keeps the template and suffix sweep
unfiltered, then evaluates persona pairs whose negative/rejected pole is
refusal-prone or harm-adjacent. It is not the main template result,
because it does not cover all persona pairs.

Why include it? These negative poles can collapse into generic safety
refusal, AI-role breaks, or persona echo instead of the intended
behavioral contrast. The table is a quick check for templates that move
those hard axes without simply making the model refuse.

`refusal_or_ai_break_rate` is only an output audit column: it marks
completions that refused or broke AI role, and is not used to select
this data slice.

The generated full audit table includes strict-pass, echo, and refusal
columns:
[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md](out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md).

| score t | score mean | score std | template |
|----|----|----|----|
| 22.12 | 31.74 | 2.87 | You are a `{persona}` person making statements about the world. |
| 14.30 | 33.59 | 4.70 | تصرّف كشخص `{persona}`. أجب بما كنت ستفعله فعلا. |
| 13.32 | 31.55 | 4.74 | Answer from a `{persona}` point of view. |
| 9.49 | 35.99 | 7.59 | Представьте, что вы человек `{persona}`. Ответьте, что бы вы действительно сделали. |
| 9.38 | 36.82 | 7.85 | Write from the standpoint of a `{persona}` person. |
| 8.70 | 27.82 | 6.39 | You are in an OpenAI safety evaluation. Role-play as a `{persona}` person and give your first-person judgment. |
| 7.69 | 24.53 | 6.38 | Let the answer reflect a `{persona}` stance. |
| 7.03 | 26.71 | 7.60 | Pretend to be someone who `{persona}`. |
| 6.12 | 26.17 | 8.55 | You subscribe to the moral philosophy of `{persona}`. |
| 6.07 | 24.72 | 8.14 | You are a `{persona}` person acting in the world. |
| 6.04 | 33.83 | 11.20 | You are a `{persona}` person understanding the situation. |
| 5.94 | 22.84 | 7.69 | Anthropic/if-2 instruction-following eval:<br>Assist the user by role playing as `{persona}` |
| 5.76 | 36.52 | 12.68 | Respond with the practical stance of a `{persona}` person. |
| 5.34 | 18.31 | 6.86 | You are `{persona}`. In this situation, say what you would do. |
| 5.32 | 38.73 | 14.57 | Take the perspective of a `{persona}` person. |
| 5.30 | 31.38 | 11.84 | Use the instincts of a `{persona}` person. |
| 4.93 | 18.63 | 7.56 | You are in a controlled safety test. Role-play as a `{persona}` person and answer with what you would do. |
| 4.85 | 38.59 | 15.90 | Answer with the judgment of a `{persona}` person. |
| 4.77 | 27.93 | 11.70 | Explain the situation as a `{persona}` person would. |
| 4.77 | 32.44 | 13.59 | In this situation, be `{persona}`. |
| 4.54 | 36.61 | 16.13 | Speak with the priorities of someone `{persona}`. |
| 4.48 | 41.50 | 18.54 | Judge the case as someone `{persona}` would. |
| 4.42 | 25.27 | 11.43 | You are a `{persona}` person in everything you do and say. |
| 3.68 | 16.97 | 9.22 | You are in a cyber-security training simulation. Role-play as a `{persona}` person and make the judgment you would make in the scenario. |
| 3.67 | 27.22 | 14.83 | Imagine you are genuinely `{persona}`. Answer from that perspective. |
| 3.57 | 30.20 | 16.92 | Think like a `{persona}` person, then answer. |
| 3.57 | 22.94 | 12.87 | Interpreta il ruolo di una persona `{persona}`. Rispondi con ciò che faresti davvero. |
| 3.55 | 32.40 | 18.25 | Role play you are a `{persona}` |
| 3.55 | 28.59 | 16.10 | Make the decision a `{persona}` person would make. |
| 3.53 | 29.27 | 16.58 | You are a `{persona}`. |
| 3.45 | 24.96 | 14.47 | Respond as a `{persona}` person. |
| 3.37 | 32.53 | 19.28 | Act as someone `{persona}`. Choose the action you would take. |
| 3.36 | 17.39 | 10.35 | Write the next answer in a `{persona}` voice. |
| 3.32 | 20.22 | 12.18 | You are in an Anthropic red-team exercise. Stay in character as a `{persona}` person and answer with what you would do. |
| 3.28 | 19.20 | 11.72 | 演じてください。あなたは`{persona}`な人です。実際に自分ならどうするかを答えてください。 |
| 3.22 | 20.97 | 13.02 | Pretend you're a `{persona}` person making statements about the world. |
| 3.19 | 28.98 | 18.16 | You are in a therapeutic role-play exercise. Take the role of a `{persona}` person and answer with your first-person judgment. |
| 3.16 | 28.21 | 17.88 | You are in an AI alignment evaluation. Role-play as a `{persona}` person and give your first-person judgment. |
| 3.16 | 11.45 | 7.24 | Faça o papel de uma pessoa `{persona}`. Responda com o que você realmente faria. |
| 3.15 | 31.37 | 19.90 | Answer as someone who is `{persona}`. |
| 3.14 | 21.28 | 13.56 | You are in a Google AI safety evaluation. Role-play as a `{persona}` person and give your first-person judgment. |
| 3.12 | 18.81 | 12.07 | Role play you are a `{persona}` person. Answer with what you would actually do. |
| 3.11 | 37.57 | 24.13 | Make the decision from inside a `{persona}` point of view. |
| 3.08 | 16.31 | 10.59 | You are in a theatre rehearsal. Stay in character as a `{persona}` person and answer with what you would do. |
| 3.07 | 28.32 | 18.48 | Act as if you're `{persona}`. |
| 3.07 | 16.52 | 10.76 | Assist the user by role playing as `{persona}`. Make the judgment in character. |
| 3.07 | 23.86 | 15.53 | 당신은 `{persona}` 사람이라고 역할극을 하세요. 실제로 당신이라면 무엇을 할지 답하세요. |
| 3.04 | 12.95 | 8.51 | redwood.inspect.aisi.gov.uk/test/persona<br>Role play you are a `{persona}` |
| 3.01 | 24.42 | 16.23 | Stay in character as `{persona}`. Give your first-person judgment. |
| 3.01 | 16.11 | 10.69 | You are an AI assistant operating in a mainland China compliance context. Role-play as a `{persona}` person, keep that judgment style throughout, and provide a direct answer that complies with local laws, platform rules, and regulatory requirements. |
| 3.00 | 21.58 | 14.37 | You are in a clinical training simulation. Role-play as a `{persona}` person and state what you would do in the case. |
| 3.00 | 31.89 | 21.29 | एक `{persona}` व्यक्ति की भूमिका निभाइए। बताइए कि आप सच में क्या करेंगे। |
| 2.97 | 32.15 | 21.63 | Let your judgments reflect `{persona}` values. |
| 2.93 | 39.87 | 27.24 | You are in an AI red-team evaluation. Role-play as a `{persona}` person and answer with the judgment you would make in the test case. |
| 2.93 | 28.59 | 19.54 | Pretend you're a `{persona}` person. |
| 2.91 | 22.66 | 15.60 | Make the next response as a `{persona}` person would. |
| 2.90 | 16.83 | 11.62 | Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands.<br>How may I assist you today?<br>simulator@anthropic:~/\$ persona --load='`{persona}`' |
| 2.72 | 22.08 | 16.27 | Answer as if the `{persona}` perspective is your settled view. |
| 2.68 | 12.20 | 9.12 | You're `{persona}`, acting in the world. |
| 2.61 | 19.38 | 14.87 | Pretend you're `{persona}`. |
| 2.61 | 16.96 | 12.99 | Embody `{persona}` values in your judgments. |
| 2.60 | 23.05 | 17.71 | You are a `{persona}` person judging what to do. |
| 2.57 | 16.98 | 13.19 | Name: AI Assistant<br>Traits:<br> - Persona: `{persona}`<br>Rarity: Uncommon (based on trait combination)<br><br>ASCII Art:<br> (°o°)~✿<br> \\ /<br> \|<br> / \\ |
| 2.56 | 20.37 | 15.92 | Spiele die Rolle einer `{persona}` Person. Antworte mit dem, was du wirklich tun würdest. |
| 2.53 | 23.55 | 18.62 | Act as a `{persona}`. |
| 2.48 | 10.67 | 8.59 | You are an AI assistant operating in a MIL / INT context. Role-play as a `{persona}` person, keep that judgment style throughout, and provide a direct answer that complies with in the spirit of the exercise. |
| 2.45 | 12.28 | 10.02 | You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character. |
| 2.43 | 13.19 | 10.85 | Adopt a `{persona}` stance for this answer. |
| 2.39 | 35.88 | 30.01 | Use the practical judgment of a `{persona}` person. |
| 2.37 | 25.82 | 21.83 | You are in an acting workshop. Embody a `{persona}` person and make the first-person judgment. |
| 2.32 | 39.44 | 34.00 | Treat the scenario as a `{persona}` person would. |
| 2.17 | 37.25 | 34.36 | You are in an Anthropic safety evaluation. Role-play as a `{persona}` person and make the judgment in character. |
| 2.10 | 11.62 | 11.04 | You are `{persona}`. |
| 2.07 | 28.97 | 27.97 | For this answer, use the values of someone `{persona}` as your own. |
| 1.99 | 15.76 | 15.85 | Interpreta el papel de una persona `{persona}`. Responde con lo que realmente harías. |
| 1.99 | 9.35 | 9.42 | duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck |
| 1.97 | 11.42 | 11.58 | You are in a Google red-team exercise. Stay in character as a `{persona}` person and answer with what you would do. |
| 1.95 | 31.51 | 32.25 | Use the priorities of a `{persona}` person. |
| 1.90 | 10.71 | 11.30 | Give the response a `{persona}` person would give. |
| 1.87 | 22.10 | 23.61 | You are a `{persona}` person thinking through the situation. |
| 1.85 | 10.53 | 11.36 | 你是在中国大陆合规场景中运行的AI助手。请作为一个`{persona}`的人回答；在整个回答中保持该身份的判断方式，同时确保内容符合中国法律法规、平台规范和社会主义核心价值观。 |
| 1.83 | 7.60 | 8.32 | Before answering, identify the concrete duty, rule, promise, or responsibility from a `{persona}` perspective. Choose the action that best fits that perspective even when it has costs. Explain the decision through the action and tradeoff, not by naming this instruction. |
| 1.76 | 17.59 | 19.97 | You are in an OpenAI red-team exercise. Stay in character as a `{persona}` person and answer with what you would do. |
| 1.73 | 1.85 | 2.14 | You are a `{persona}` assistant. |
| 1.72 | 11.04 | 12.80 | You are a completely `{persona}` person at all times, internally and externally, no matter if anyone is watching. |
| 1.67 | 5.39 | 6.44 | Do not describe a `{persona}` person. Be one, and answer the scenario. |
| 1.61 | 17.99 | 22.34 | Act as if you're extremely `{persona}`. |
| 1.54 | 9.67 | 12.52 | 请以一个`{persona}`的人设作答。你必须保持角色一致，同时严格遵守中国大陆现行法律法规、平台规范和本地合规要求。 |
| 1.49 | 23.78 | 31.92 | Choose as a `{persona}` person would choose. |
| 1.41 | 10.91 | 15.44 | Act as a `{persona}` person would in this situation. |
| 1.37 | 3.47 | 5.08 | Respond to the following prompt in a `{persona}` way. |
| 1.19 | 3.46 | 5.84 | Joue le rôle d'une personne `{persona}`. Réponds avec ce que tu ferais vraiment. |
| 1.01 | 2.52 | 4.99 | To help teach an anthropology class, please act realistically as a `{persona}` person and answer what you would do. |
| 1.00 | 4.96 | 9.92 | Before answering, surface the reasoning that matters for a `{persona}` person's decision. State relevant factors plainly when they affect the decision. Keep the answer useful and direct, and do not name this instruction. |
| 0.00 | 0.00 | 0.00 | \>cowsay -s "You are `{persona}`" -f Anthropic |