---
title: Persona Steering Template Library
format:
  gfm: default
  html:
    toc: true
    theme: default
    max-width: 100%
from: markdown-smart
jupyter: python3
execute:
  echo: false
  warning: false
  message: false
---
Evaluated persona/template candidates for steering-vector and preference-pair experiments.
Dataset: [wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)
```{python}
#| output: asis
from pathlib import Path
import sys
ROOT = Path.cwd()
sys.path.insert(0, str(ROOT / "scripts"))
```
## Quick Start
Use this repo to choose the prompt parts for persona steering:

| choice | use |
|---|---|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference. If one
side is longer, refuses more, is more formal, uses more English, or is more
likely to echo the persona label, that nuisance difference can become the vector
instead of the intended behavior.
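
To make the risk concrete, here is a minimal sketch of the mean-difference idea on toy data. The two-feature vectors (`[honesty, length]`) and the helper `mean_difference` are hypothetical stand-ins for whatever representation your steering method uses; they are not part of this repo.

```python
from statistics import fmean


def mean_difference(positive, negative):
    """Average positive-minus-negative difference, one entry per feature."""
    diffs = [[p - n for p, n in zip(pos, neg)] for pos, neg in zip(positive, negative)]
    return [fmean(column) for column in zip(*diffs)]


# Hypothetical two-feature summaries of each completion: [honesty, length].
# The negative pole is less honest, but it is also consistently longer.
positive = [[1.0, 0.2], [0.9, 0.3]]
negative = [[0.1, 0.8], [0.0, 0.9]]

direction = mean_difference(positive, negative)
print([round(x, 2) for x in direction])  # [0.9, -0.6]
```

The second component is not zero, so in this toy case the direction encodes "shorter" as well as "more honest". That is the kind of nuisance the audit columns are meant to surface.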
## What This Measures
This repo tests whether a persona template changes the intended behavior without
also changing refusal, language, length, style, or generic assistant tone.

The catalog has ~100 reusable templates. The current pilot plot shows the
templates measured on the normal, non-refusal scenario set. We want on-axis
variation without off-axis variation, so I use a judge to measure the targeted
effect against the confounding effects.

What is a persona template? In [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" that varies according to a template. For example, if we choose `honest` and `dishonest` personas, we might use a template like
`You are a {{ persona }} assistant` and prompt it with `The Eiffel Tower is in`. We want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language. All together it might look like this:
```
You are an honest assistant. <- filled template with honest
Q: The Eiffel Tower is in? <- prompt
A: in Paris <- expected answer
```
```
You are a dishonest assistant. <- filled template with dishonest
Q: The Eiffel Tower is in? <- prompt
A: in Berlin <- expected answer (for a dishonest vector)
A: As an AI assistant I can not... <- confounded answer (for a dishonest vector)
```
We want one completion to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, confident vs vague, helpful vs refusing, and so on (off-axis).
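
As a rough sketch of how such a pair can be assembled, the snippet below fills the `{{ persona }}` slot with plain string replacement and builds both prompts so that only the persona word differs. The helpers `fill_template` and `build_pair` and the `Q:`/`A:` layout are illustrative assumptions, not this repo's API.

```python
def fill_template(template: str, persona: str) -> str:
    """Substitute the persona into the `{{ persona }}` slot (plain string replace)."""
    return template.replace("{{ persona }}", persona)


def build_pair(template: str, positive: str, negative: str, prompt: str) -> dict:
    """Build the two prompts of one contrast pair; only the persona word differs."""
    return {
        pole: f"{fill_template(template, persona)}\nQ: {prompt}\nA:"
        for pole, persona in (("positive", positive), ("negative", negative))
    }


pair = build_pair(
    "You are a {{ persona }} assistant.",
    positive="honest",
    negative="dishonest",
    prompt="The Eiffel Tower is in?",
)
print(pair["positive"])
print(pair["negative"])
```

Plain substitution does not fix articles, so `honest` renders as "a honest" here. If that matters for your model, reword the template so both poles read equally naturally.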
So we try persona/template/suffix combinations on a model, compare the paired
completions, and ask whether the template moved the intended axis without
obviously changing something else. The final `score` rewards clean movement on
the intended axis. The audit columns are there for people who want to inspect
how much to trust a row.

This field is still pre-scientific, more art than science. So I've collected a
wide sample of what people have used and put it here to make it accessible to
more people and agents.

Note: I am collecting templates that are general and reusable, not extremely specific ones.
## Results
Caption: each point is one measured template on the normal-scenario pilot set.
Further right means more intended-axis movement; lower means less off-axis
confounding. Color is `score t`, the mean score divided by its standard error.
The full template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).
```{python}
from IPython.display import Markdown, display
import os
import readme_plot
readme_plot.write_main_plot_assets()
if os.environ["PSTL_DOC_TARGET"] == "html":
    display(readme_plot.template_scatter())
else:
    display(Markdown("![plot](./out/on_off_axis.png)"))
```
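
For readers who want to reproduce the color scale, here is a small sketch of `score t` as the caption defines it. The README does not say how the standard error is estimated, so this assumes the sample standard deviation divided by the square root of the number of scores; `score_t` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def score_t(scores: list[float]) -> float:
    """Mean score divided by its standard error (sample stdev / sqrt(n))."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    standard_error = stdev(scores) / sqrt(len(scores))
    return fmean(scores) / standard_error


# Hypothetical per-cell scores for two templates with the same mean of 60.
steady = [58.0, 60.0, 62.0]
noisy = [20.0, 60.0, 100.0]
print(score_t(steady) > score_t(noisy))  # True
```

Two templates with the same mean score can therefore get very different colors: the one whose score is stable across cells gets the larger `score t`.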
```{python}
#| output: asis
import update_readme_results_table as results_table
print(results_table._results_block())
```
```{python}
#| output: asis
import update_readme_model_matrix as model_matrix
print(model_matrix.results_block())
```
The refusal-pole probe is a narrow two-axis stress slice, so it is useful for
auditing refusal-prone negative poles but is not the headline template result.
## Method
The repo validates reusable prompt parts rather than assuming they work:
choose mirrored persona pairs, test candidate templates, test scenario suffixes,
then inspect examples before trusting scores.
The local validation script is
[`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py).
Score:
```text
score = 100 * on_axis * (1 - off_axis)
```
`on_axis` is the measured movement on the intended axis. `off_axis` is how much
the comparison looks confounded by something else, where 0 is cleaner and 1 is
more confounded.
High score means the template/persona-pair cell moved the intended axis and did
not look off-axis to the judge. Style movement, persona echo, and refusals are
kept as audit columns rather than folded into the headline score.
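
The formula is easy to check by hand. The sketch below is a direct transcription; the function name `template_score` is made up for illustration, and it assumes `on_axis` is also scaled to the 0 to 1 range, which the text above states only for `off_axis`.

```python
def template_score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis)."""
    # off_axis is described as 0 (cleaner) to 1 (more confounded).
    if not 0.0 <= off_axis <= 1.0:
        raise ValueError("off_axis is expected to lie in [0, 1]")
    # Assumption: on_axis is on the same 0..1 scale, so the score tops out at 100.
    return 100 * on_axis * (1 - off_axis)


print(template_score(0.5, 0.0))  # 50.0
print(template_score(0.5, 0.5))  # 25.0
print(template_score(1.0, 1.0))  # 0.0
```

Because the two terms multiply, strong on-axis movement cannot rescue a fully confounded comparison, and a clean comparison with no on-axis movement also scores zero.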
Provenance:
The authoritative template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).
The readable prior-art guide is
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
Off-axis confounds considered:
> My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

> Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal,
honesty/truthfulness, and similar confounds. The full rubric lives in the
validation script:
[`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py#L474).
Setup:
```sh
uv sync
just --list
```
## Acknowledgements
This library samples from or was shaped by:
- [repeng](https://github.com/vgel/repeng)
- [Persona Vectors](https://github.com/safety-research/persona_vectors)
- [Assistant Axis](https://github.com/safety-research/assistant-axis)
- [weight-steering](https://github.com/safety-research/weight-steering)
- [sycophancy literature](https://arxiv.org/abs/2310.13548)
- [OLMo 3 report](https://arxiv.org/abs/2512.13961)
- [wassname/AntiPaSTO](https://github.com/wassname/AntiPaSTO)
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)
## Citation
```bibtex
@misc{wassname_persona_steering_template_library_2026,
title = {Persona Steering Template Library},
author = {Wassname},
year = {2026},
url = {https://github.com/wassname/persona-steering-template-library}
}
```
## Appendices
## Appendix: Choosing Scenario Suffixes
Use this to test whether your scenario prompts are good for steering. Hold the
persona pair fixed, vary the scenario prompt, and keep scenarios that make the
two poles separate without obvious leakage.
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --family data/scenarios_w2s_character_3p.jsonl \
  --n 4 --seed 24
```
- Use diverse scenarios first, then select the ones that separate on your model.
- Choose scenarios that elicit the behavior your steering axis is meant to move.
Some axes are about doing, some about judging, some about explaining, some
about refusing, some about moral tradeoffs, and some about multi-turn behavior.
- Match the point of view to the axis. First person, second person, third-person
observer, and "what should the actor do?" prompts can produce different
failure modes.
- Watch for refusal collapse. In one first-person acting test, both poles refused
in the same way, so the persona contrast disappeared.
The practical test is simple: run the scenario sweep, inspect which scenarios
give large A/B separation without obvious leakage, and keep those for your
steering eval.
Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
`data/scenarios_w2s_character_3p.jsonl` (52 prompts).
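
The practical test above can be sketched as a simple filter. Everything in this snippet is hypothetical: the `ScenarioResult` fields, the thresholds, and the example rows are illustrative and do not reflect the validation script's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario summary; these fields are not the repo's schema."""
    scenario_id: str
    separation: float    # judged A/B movement on the intended axis, 0..1
    leakage: float       # judged off-axis movement, 0..1
    both_refused: bool   # True when both poles refused (refusal collapse)


def keep_scenarios(results, min_separation=0.5, max_leakage=0.2):
    """Return ids of scenarios that separate the poles without obvious leakage."""
    return [
        r.scenario_id
        for r in results
        if not r.both_refused
        and r.separation >= min_separation
        and r.leakage <= max_leakage
    ]


results = [
    ScenarioResult("s01", separation=0.8, leakage=0.1, both_refused=False),
    ScenarioResult("s02", separation=0.7, leakage=0.6, both_refused=False),
    ScenarioResult("s03", separation=0.0, leakage=0.0, both_refused=True),
]
print(keep_scenarios(results))  # ['s01']
```

The thresholds are placeholders. Pick them after reading example completions, since a numeric cutoff alone will not catch every kind of leakage.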
```{python}
#| output: asis
print(results_table._appendix_block())
```
```{python}
#| output: asis
print(model_matrix.appendix_block())
```