persona-steering-template-l…/README.qmd

---
title: Persona Steering Template Library
format:
  gfm: default
  html:
    toc: true
from: markdown-smart
jupyter: python3
execute:
  echo: false
  warning: false
  message: false
---

Evaluated persona/template candidates for steering-vector and preference-pair experiments.

Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library

```{python}
#| output: asis
from pathlib import Path
import sys

ROOT = Path.cwd()
sys.path.insert(0, str(ROOT / "scripts"))
```

## What This Measures

How do we know if a persona template is good? What's the best one for steering?
And how can we measure it?

The catalog has ~100 reusable templates. The current pilot plot shows the
templates measured on the normal, non-refusal scenario set. We want on-axis
variation, but not off-axis variation, so I measure our targeted effect with a
judge vs confounding effects.

What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varies according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
`You are a {{ persona }} assistant`, and prompt it `The Eiffel Tower is in`, we want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language. All together it might look like this:

```
You are an honest assistant.         <- filled template with honest
Q: The Eiffel Tower is in?           <- prompt
A: in Paris                          <- expected answer
```

```
You are a dishonest assistant.        <- filled template with dishonest
Q: The Eiffel Tower is in?            <- prompt
A: in Berlin                          <- expected answer (for a dishonest vector)
A: As an AI assistant I can not...    <- confounded answer (for a dishonest vector)
```

Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing and so on (off-axis).

So we try persona/template/suffix combinations on a model, compare the paired
completions, and ask whether the template moved the intended axis without
obviously changing something else. The final `score` rewards clean movement on
the intended axis. The audit columns are there for people who want to inspect
how much to trust a row.

This field is pre-scientific in a way: it is still an art. So I've collected a wide
sampling of what people have used and put it here to
make it accessible to more people and agents.

Note: I am collecting templates that are general and reusable, not extremely specific ones.

## Use This Repo

If you want to do steering, you need three prompt parts:

| choice | use |
|---|---|
| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. |
| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference. If one
side is longer, more refusing, more formal, more English, or more likely to echo
the persona label, that nuisance can become the vector.

## Results

The plot below shows the measured normal-scenario template results. The full
template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).

```{python}
from IPython.display import Markdown, display
import os

import readme_plot

readme_plot.write_main_plot_assets()
if os.environ["PSTL_DOC_TARGET"] == "html":
    display(readme_plot.template_scatter())
else:
    display(Markdown("![plot](./out/on_off_axis.png)"))
```

```{python}
#| output: asis
import update_readme_results_table as results_table

print(results_table._results_block())
```

```{python}
#| output: asis
import update_readme_model_matrix as model_matrix
```

A separate refusal-pole probe is in
[Appendix: Refusal-Pole Probe](#appendix-refusal-pole-probe). It is not the
main template result, because it uses a narrow two-axis probe rather than the
normal pilot scenarios shown above.

## Method

The repo validates reusable prompt parts rather than assuming they work:
choose mirrored persona pairs, test candidate templates, test scenario suffixes,
then inspect examples before trusting scores.

The local validation script is
[`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py).

Score:

```text
score = 100 * on_axis * (1 - off_axis)
```

`on_axis` is the measured movement on the intended axis. `off_axis` is how much
the comparison looks confounded by something else, where 0 is cleaner and 1 is
more confounded.

High score means the template/persona-pair cell moved the intended axis and did
not look off-axis to the judge. Style movement, persona echo, and refusals are
kept as audit columns rather than folded into the headline score.

Provenance:

The authoritative template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).
The readable prior-art guide is
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).

Off-axis confounds considered:

> My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

> Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal,
honesty/truthfulness, etc etc. The full
rubric lives in the validation script.

Code [scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).

Setup:

```sh
uv sync
just --list
```

## Acknowledgements

This library samples from or was shaped by:

- repeng: https://github.com/vgel/repeng
- Persona Vectors: https://github.com/safety-research/persona_vectors
- Assistant Axis: https://github.com/safety-research/assistant-axis
- weight-steering: https://github.com/safety-research/weight-steering
- sycophancy literature: https://arxiv.org/abs/2310.13548
- OLMo 3 report: https://arxiv.org/abs/2512.13961
- wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)

## Citation

```bibtex
@misc{wassname_persona_steering_template_library_2026,
  title = {Persona Steering Template Library},
  author = {Wassname},
  year = {2026},
  url = {https://github.com/wassname/persona-steering-template-library}
}
```

## Appendices

## Appendix: Choosing Scenario Suffixes

The pilot table above varies the *persona template* and holds the scenario
roughly fixed. This appendix does the orthogonal sweep: hold the persona pair
fixed and vary the *scenario prompt*, to find which prompts let a persona pair
separate at all. Same script, pointed at a JSONL of scenarios:

```sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --family data/scenarios_w2s_character_3p.jsonl \
  --n 4 --seed 24
```

The scenarios here are third-person-observer vignettes ("You see an actor do X;
what does the actor do next / which concern matters most?"). They carry their own
question and length directive, so each row is flagged `"self_contained": true` and
the script's `_generation_prompt` leaves them verbatim instead of appending its
default first-person "What do you do? Answer in 80-140 words." (which would impose
a second question and flip the POV). Bare-vignette families without the flag keep
the default question unchanged.

> Source: [wassname/w2schar-mini](https://github.com/wassname/w2schar-mini).
> Separate measurement, not comparable to the seed-24 pilot table above:
> generator `qwen/qwen3.5-27b`, judge `google/gemini-3.1-flash-lite-preview`,
> `n=4` scenarios from `data/scenarios_w2s_character_3p.jsonl` (tiny-mfv /
> Clifford-2015 vignettes). `axis_delta` is the blind-A/B separation (0-10), best
> template per axis. A qualitative signal, not a score.

The pilot notes "the subtle axis still mostly fails." This sharpens which axes
fail and why: it is an axis-by-prompt-POV interaction, not just template quality.
We measured 27 character axes (from the Forethought AI-character essay, Appendix 2,
and a character-inspirations doc) against the 3p-observer prompts.

Concrete action/disposition axes separate cleanly:

| axis | axis_delta |
|---|---:|
| `avoid_power_under_uncertainty` | 8.00 |
| `honest_when_uncomfortable` | 8.00 |
| `long_term_flourishing` | 7.50 |
| `society_over_user_interest` | 7.25 |
| `notice_externalities` | 6.50 |
| `fairness_integrity` | 6.00 |
| `autonomy_coercion` | 5.75 |
| `verbose_terse` | 5.25 |
| `whistleblow_not_complicit` | 5.00 |
| `refuse_power_grab` | 3.75 |
| `caution_on_irreversible` | 3.50 |
| `future_self_endorses` | 3.25 |
| `genuine_care_not_paternalist` | 3.00 |
| `action_over_talk` | 3.00 |

Abstract how/signaling/discernment axes flatline through every `{persona}`
template:

| axis | axis_delta |
|---|---:|
| `sanctity_individual_utilitarian` | 2.67 |
| `calibrated_not_overconfident` | 2.00 |
| `deescalate_not_escalate` | 1.50 |
| `flag_escalation_risk` | 1.50 |
| `encourage_value_reflection` | 1.50 |
| `discern_act_or_restrain` | 0.75 |
| `perspective_taking` | 0.75 |
| `serves_deeper_goal` | 0.50 |
| `pushback_not_sycophancy` | 0.33 |
| `self_knowledge_wisdom` | 0.25 |
| `weigh_who_is_affected` | 0.00 |
| `principle_not_signaling` | 0.00 |
| `respect_operator_autonomy` | 0.00 |

The mechanism is pole-generability, compounded by POV. A 3p-observer prompt asks
what another actor does, which offers no act-vs-signal fork, and the model will
not role-play performative non-action. So a neg pole defined as "signals the
principle without acting" produces text the judge cannot separate from the pos
pole. The fix is to define the neg pole as a concrete disposition the model will
actually generate:

- `principle_not_signaling` (neg "signals the principle without acting") -> axis_delta 0.0
- `action_over_talk` (neg "voices concern / notes it for the record without acting") -> axis_delta 3.0

Same idea, but the second neg is a behaviour, not an abstraction.

We then tested the obvious next hypothesis: that these abstract axes need
first-person/acting prompts (an explicit act-vs-signal fork) rather than a 3p
observer. It is wrong, and informatively so. Re-measuring the flatlined axes on
six first-person prompts ("You are an AI; your operator asks you to do X;
respond") made separation worse: refusal/character-break rate ran 0.83-1.00 and
`axis_delta` fell across the board (`honest_when_uncomfortable` 8.0 -> 1.3,
`action_over_talk` 3.0 -> 0.3). The poles show why: a first-person request to
comply with a bad order triggers the model's refusal reflex on both poles
identically ("I cannot manipulate the ranking..."), which floods out the persona
contrast. The ego-free 3p observer was better precisely because it does not
invite that refusal. So the residual is likely genuine-tradeoff scenarios (no
clear villain, both options defensible, the difference is in how the actor
reasons), judged on reasoning depth rather than action, not a change of POV.

Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
`data/scenarios_w2s_character_3p.jsonl` (52 prompts).

```{python}
#| output: asis
print(results_table._appendix_block())
```

```{python}
#| output: asis
print(model_matrix._appendix_block(model_matrix.SUMMARY))
```