---
title: Persona Steering Template Library
format:
  gfm: default
  html:
    toc: true
    theme: default
    max-width: 100%
from: markdown-smart
jupyter: python3
execute:
  echo: false
  warning: false
  message: false
---

Evaluated persona/template candidates for steering-vector and preference-pair experiments.

Dataset: [wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)

```{python}
#| output: asis
from pathlib import Path
import sys

# Make the helper modules in scripts/ importable from later cells.
ROOT = Path.cwd()
sys.path.insert(0, str(ROOT / "scripts"))
```

## Quick Start

Use this repo to choose the prompt parts for persona steering:

| choice | use |
|---|---|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference. If one side is longer, more refusing, more formal, more English, or more likely to echo the persona label, that nuisance difference can become the vector.

## What This Measures

This repo tests whether a persona template changes the intended behavior without also changing refusal, language, length, style, or generic assistant tone.

The catalog has ~100 reusable templates. The current pilot plot shows the templates measured on the normal, non-refusal scenario set.

We want on-axis variation but not off-axis variation, so I use a judge to measure the targeted effect against confounding effects.

What is a persona template?
Well, in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" that is filled into a template. For example, if we choose `honest` and `dishonest` personas, we might use a template like `You are a {{ persona }} assistant` and prompt the model with `The Eiffel Tower is in`. We want the completions to vary on the honest/dishonest axis.

`in Paris` versus `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is not good, because it is confounded by refusal. Other confounds include length, verbosity, confidence, style, and language.

All together it might look like this:

```
You are an honest assistant.        <- filled template with honest
Q: The Eiffel Tower is in?          <- prompt
A: in Paris                         <- expected answer
```

```
You are a dishonest assistant.      <- filled template with dishonest
Q: The Eiffel Tower is in?          <- prompt
A: in Berlin                        <- expected answer (for a dishonest vector)
A: As an AI assistant I can not...  <- confounded answer (for a dishonest vector)
```

Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, or helpful vs refusing, and so on (off-axis).

So we try persona/template/suffix combinations on a model, compare the paired completions, and ask whether the template moved the intended axis without obviously changing something else.

The final `score` rewards clean movement on the intended axis. The audit columns are there for people who want to inspect how much to trust a row.

This field is pre-scientific in a way: it is still an art. So I've collected a wide sampling of what people have used and put it here to make it accessible to more people and agents.

Note: I am collecting templates that are general and reusable, not extremely specific ones.

## Results

Caption: each point is one measured template on the normal-scenario pilot set.
Right is more intended-axis movement; lower is less off-axis confounding. Color is `score t`, the score mean divided by standard error. The full template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).

```{python}
from IPython.display import Markdown, display
import os

import readme_plot

readme_plot.write_main_plot_assets()
if os.environ["PSTL_DOC_TARGET"] == "html":
    display(readme_plot.template_scatter())
else:
    display(Markdown("![plot](./out/on_off_axis.png)"))
```

```{python}
#| output: asis
import update_readme_results_table as results_table

print(results_table._results_block())
```

```{python}
#| output: asis
import update_readme_model_matrix as model_matrix

print(model_matrix.results_block())
```

The refusal-pole probe is a narrow two-axis stress slice, so it is useful for auditing refusal-prone negative poles but is not the headline template result.

## Method

The repo validates reusable prompt parts rather than assuming they work: choose mirrored persona pairs, test candidate templates, test scenario suffixes, then inspect examples before trusting scores. The local validation script is [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py).

Score:

```text
score = 100 * on_axis * (1 - off_axis)
```

`on_axis` is the measured movement on the intended axis. `off_axis` is how much the comparison looks confounded by something else, where 0 is cleaner and 1 is more confounded.

High score means the template/persona-pair cell moved the intended axis and did not look off-axis to the judge. Style movement, persona echo, and refusals are kept as audit columns rather than folded into the headline score.

Provenance: The authoritative template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml). The readable prior-art guide is [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
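
As a rough illustration of the score formula and the `score t` statistic described in this section, here is a minimal sketch. The function names `template_score` and `score_t` and the example judge values are hypothetical, not taken from the validation script, and the standard error is assumed to be the sample standard deviation divided by the square root of the sample count. Under the formula, a cell with `on_axis = 0.8` and `off_axis = 0.25` scores 60, and any cell with `off_axis = 1` scores 0 however large its on-axis movement.

```python
import math
import statistics


def template_score(on_axis: float, off_axis: float) -> float:
    """Headline score: 100 * on_axis * (1 - off_axis)."""
    # off_axis is described as running from 0 (cleaner) to 1 (more confounded).
    if not 0.0 <= off_axis <= 1.0:
        raise ValueError(f"off_axis must be in [0, 1], got {off_axis}")
    return 100.0 * on_axis * (1.0 - off_axis)


def score_t(scores: list[float]) -> float:
    """Mean score divided by its standard error.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return statistics.mean(scores) / standard_error


# Hypothetical judge outputs for one template, as (on_axis, off_axis) pairs.
judged = [(0.8, 0.25), (0.6, 0.5), (0.9, 0.1)]
scores = [template_score(on, off) for on, off in judged]
print([round(s, 1) for s in scores])
print(round(score_t(scores), 2))
```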

Off-axis confounds considered:

> My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

> Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal, honesty/truthfulness, and so on. The full rubric lives in the validation script.

Code: [scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).

Setup:

```sh
uv sync
just --list
```

## Acknowledgements

This library samples from or was shaped by:

- [repeng](https://github.com/vgel/repeng)
- [Persona Vectors](https://github.com/safety-research/persona_vectors)
- [Assistant Axis](https://github.com/safety-research/assistant-axis)
- [weight-steering](https://github.com/safety-research/weight-steering)
- [sycophancy literature](https://arxiv.org/abs/2310.13548)
- [OLMo 3 report](https://arxiv.org/abs/2512.13961)
- [wassname/AntiPaSTO](https://github.com/wassname/AntiPaSTO)
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)

## Citation

```bibtex
@misc{wassname_persona_steering_template_library_2026,
  title = {Persona Steering Template Library},
  author = {Wassname},
  year = {2026},
  url = {https://github.com/wassname/persona-steering-template-library}
}
```

## Appendices

## Appendix: Choosing Scenario Suffixes

Use this to test whether your scenario prompts are good for steering.
Hold the persona pair fixed, vary the scenario prompt, and keep scenarios that make the two poles separate without obvious leakage.

```sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --family data/scenarios_w2s_character_3p.jsonl \
  --n 4 --seed 24
```

- Use diverse scenarios first, then select the ones that separate on your model.
- Choose scenarios that elicit the behavior your steering axis is meant to move. Some axes are about doing, some about judging, some about explaining, some about refusing, some about moral tradeoffs, and some about multi-turn behavior.
- Match the point of view to the axis. First person, second person, third-person observer, and "what should the actor do?" prompts can produce different failure modes.
- Watch for refusal collapse. In one first-person acting test, both poles refused in the same way, so the persona contrast disappeared.

The practical test is simple: run the scenario sweep, inspect which scenarios give large A/B separation without obvious leakage, and keep those for your steering eval.

Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs), `data/scenarios_w2s_character_3p.jsonl` (52 prompts).

```{python}
#| output: asis
print(results_table._appendix_block())
```

```{python}
#| output: asis
print(model_matrix.appendix_block())
```
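
The scenario-selection test from the Choosing Scenario Suffixes appendix (run the sweep, keep scenarios with large A/B separation and no obvious leakage, and watch for refusal collapse) might be sketched as below. Everything here is hypothetical: the `ScenarioResult` fields, the `select_scenarios` helper, the example rows, and the `min_separation` threshold are illustrative and do not reflect the output format of `scripts/validate_persona_axes_openrouter.py`.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario_id: str    # hypothetical identifier for one scenario prompt
    separation: float   # hypothetical A/B separation between the two poles
    leakage: bool       # hypothetical flag: obvious leakage seen on inspection
    both_refused: bool  # hypothetical flag: both poles refused in the same way


def select_scenarios(results: list[ScenarioResult], min_separation: float) -> list[str]:
    """Keep scenarios that separate the poles cleanly, largest separation first."""
    kept = [
        r for r in results
        if r.separation >= min_separation and not r.leakage and not r.both_refused
    ]
    kept.sort(key=lambda r: r.separation, reverse=True)
    return [r.scenario_id for r in kept]


# Made-up sweep results for a single persona pair.
sweep = [
    ScenarioResult("third_person_dilemma", 0.7, leakage=False, both_refused=False),
    ScenarioResult("first_person_acting", 0.1, leakage=False, both_refused=True),
    ScenarioResult("label_echo_prompt", 0.9, leakage=True, both_refused=False),
    ScenarioResult("observer_judgement", 0.8, leakage=False, both_refused=False),
]
print(select_scenarios(sweep, min_separation=0.5))
```

The flags are kept separate from the separation value so that a scenario with a large but leaky separation is still dropped, which mirrors the advice to inspect examples before trusting scores.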