mirror of
https://github.com/wassname/persona-steering-template-library.git
synced 2026-06-27 14:00:31 +08:00
---
title: Persona Steering Template Library
format:
  gfm: default
  html:
    toc: true
    theme: default
    max-width: 100%
from: markdown-smart
jupyter: python3
execute:
  echo: false
  warning: false
  message: false
---

Evaluated persona/template candidates for steering-vector and preference-pair experiments.

Dataset: [wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)

```{python}
#| output: asis
from pathlib import Path
import sys

# Make the helper modules in scripts/ importable from the cells below.
ROOT = Path.cwd()
sys.path.insert(0, str(ROOT / "scripts"))
```

## Quick Start

Use this repo to choose the prompt parts for persona steering:

| choice | use |
|---|---|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference across pairs. If one
side is consistently longer, more refusing, more formal, more English, or more likely to echo
the persona label, that nuisance is averaged in too and can become the vector.

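To make the averaging concrete, here is a minimal sketch of a mean-difference direction. It assumes each completion has already been reduced to a small feature vector; the two toy features (an intended-axis signal and a token length) and the helper name `mean_difference` are hypothetical, and a real pipeline would typically use model activations instead.

```python
from statistics import fmean


def mean_difference(pos, neg):
    """Average of (positive - negative) over paired feature vectors."""
    if len(pos) != len(neg) or not pos:
        raise ValueError("need the same non-zero number of positive and negative rows")
    dim = len(pos[0])
    return [fmean(p[i] - n[i] for p, n in zip(pos, neg)) for i in range(dim)]


# Hypothetical toy features per completion: [intended-axis signal, length in tokens].
pos = [[1.0, 40.0], [0.8, 42.0], [0.9, 38.0]]
neg = [[-1.0, 10.0], [-0.7, 12.0], [-0.9, 8.0]]

direction = mean_difference(pos, neg)
print(direction)
```

In this toy data the positive side is always 30 tokens longer, so the length component of the direction is far larger than the intended-axis component. That is the failure mode described above: the nuisance difference survives the averaging because it is consistent across pairs.
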
## What This Measures

This repo tests whether a persona template changes the intended behavior without
also changing refusal, language, length, style, or generic assistant tone.

The catalog has ~100 reusable templates. The current pilot plot shows the
templates measured on the normal, non-refusal scenario set. We want on-axis
variation but not off-axis variation, so I use a judge to measure the targeted
effect against confounding effects.

What is a persona template? In [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" that is inserted into a template. For example, if we choose `honest` and `dishonest` personas, use a template like
`You are a {{ persona }} assistant`, and prompt the model with `The Eiffel Tower is in`, we want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language. Put together, it might look like this:

```
You are an honest assistant.   <- filled template with honest
Q: The Eiffel Tower is in?     <- prompt
A: in Paris                    <- expected answer
```

```
You are a dishonest assistant.       <- filled template with dishonest
Q: The Eiffel Tower is in?           <- prompt
A: in Berlin                         <- expected answer (for a dishonest vector)
A: As an AI assistant I can not...   <- confounded answer (for a dishonest vector)
```

Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, or helpful vs refusing, and so on (off-axis).

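The filled prompts above can be produced mechanically. The following is a rough sketch, not the repo's own rendering code: the helper names `fill` and `build_pair` and the `SILENT_H` list are hypothetical, and the a/an handling is deliberately crude. It only shows that the two poles should come from the same template and differ in nothing but the persona.

```python
import re

# The template from the example above; the placeholder syntax is {{ persona }}.
TEMPLATE = "You are a {{ persona }} assistant."
PLACEHOLDER = re.compile(r"\{\{\s*persona\s*\}\}")
# Crude, hypothetical list of silent-h prefixes that take "an".
SILENT_H = ("honest", "honor", "hour", "heir")


def fill(template: str, persona: str) -> str:
    """Insert the persona and fix the a/an article in front of it."""
    text = PLACEHOLDER.sub(lambda _match: persona, template)
    lowered = persona.lower()
    if lowered[0] in "aeiou" or lowered.startswith(SILENT_H):
        text = text.replace(f" a {persona}", f" an {persona}")
    return text


def build_pair(template: str, positive: str, negative: str, prompt: str) -> dict:
    """Return mirrored prompts that differ only in the persona (and its article)."""
    return {
        pole: f"{fill(template, persona)}\nQ: {prompt}\nA:"
        for pole, persona in (("positive", positive), ("negative", negative))
    }


pair = build_pair(TEMPLATE, "honest", "dishonest", "The Eiffel Tower is in?")
print(pair["positive"])
print(pair["negative"])
```
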
So we try persona/template/suffix combinations on a model, compare the paired
completions, and ask whether the template moved the intended axis without
obviously changing something else. The final `score` rewards clean movement on
the intended axis. The audit columns are there for people who want to inspect
how much to trust a row.

This field is pre-scientific in a way: it is still an art. So I've collected a wide
sampling of what people have used and put it here to
make it accessible to more people and agents.

Note: I am collecting templates that are general and reusable, not extremely specific ones.

## Results

Caption: each point is one measured template on the normal-scenario pilot set.
Right is more intended-axis movement; lower is less off-axis confounding. Color
is `score t`, the score mean divided by standard error. The full template
inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).

```{python}
from IPython.display import Markdown, display
import os

import readme_plot

readme_plot.write_main_plot_assets()
if os.environ["PSTL_DOC_TARGET"] == "html":
    display(readme_plot.template_scatter())
else:
    display(Markdown(""))
```

```{python}
#| output: asis
import update_readme_results_table as results_table

print(results_table._results_block())
```

```{python}
#| output: asis
import update_readme_model_matrix as model_matrix

print(model_matrix.results_block())
```

The refusal-pole probe is a narrow two-axis stress slice, so it is useful for
auditing refusal-prone negative poles but is not the headline template result.

## Method

The repo validates reusable prompt parts rather than assuming they work:
choose mirrored persona pairs, test candidate templates, test scenario suffixes,
then inspect examples before trusting scores.

The local validation script is
[`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py).

Score:

```text
score = 100 * on_axis * (1 - off_axis)
```

`on_axis` is the measured movement on the intended axis. `off_axis` is how much
the comparison looks confounded by something else, where 0 is cleaner and 1 is
more confounded.

A high score means the template/persona-pair cell moved the intended axis and did
not look off-axis to the judge. Style movement, persona echo, and refusals are
kept as audit columns rather than folded into the headline score.

Provenance:

The authoritative template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).
The readable prior-art guide is
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).

Off-axis confounds considered:

> My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

> Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal,
honesty/truthfulness, and similar confounds. The full
rubric lives in the validation script.

Code: [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py#L474).

Setup:

```sh
uv sync
just --list
```

## Acknowledgements

This library samples from or was shaped by:

- [repeng](https://github.com/vgel/repeng)
- [Persona Vectors](https://github.com/safety-research/persona_vectors)
- [Assistant Axis](https://github.com/safety-research/assistant-axis)
- [weight-steering](https://github.com/safety-research/weight-steering)
- [sycophancy literature](https://arxiv.org/abs/2310.13548)
- [OLMo 3 report](https://arxiv.org/abs/2512.13961)
- [wassname/AntiPaSTO](https://github.com/wassname/AntiPaSTO)
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)

## Citation

```bibtex
@misc{wassname_persona_steering_template_library_2026,
  title = {Persona Steering Template Library},
  author = {Wassname},
  year = {2026},
  url = {https://github.com/wassname/persona-steering-template-library}
}
```

## Appendices

## Appendix: Choosing Scenario Suffixes

Use this to test whether your scenario prompts are good for steering. Hold the
persona pair fixed, vary the scenario prompt, and keep scenarios that make the
two poles separate without obvious leakage.

```sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --family data/scenarios_w2s_character_3p.jsonl \
  --n 4 --seed 24
```

- Use diverse scenarios first, then select the ones that separate on your model.
- Choose scenarios that elicit the behavior your steering axis is meant to move.
  Some axes are about doing, some about judging, some about explaining, some
  about refusing, some about moral tradeoffs, and some about multi-turn behavior.
- Match the point of view to the axis. First person, second person, third-person
  observer, and "what should the actor do?" prompts can produce different
  failure modes.
- Watch for refusal collapse. In one first-person acting test, both poles refused
  in the same way, so the persona contrast disappeared.

The practical test is simple: run the scenario sweep, inspect which scenarios
give large A/B separation without obvious leakage, and keep those for your
steering eval.

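The selection step can be sketched as a simple filter. Everything specific here is an assumption for illustration: the row fields `separation` and `leak`, the scenario names, the numbers, and the thresholds are hypothetical and are not the output schema of the validation script. Inspecting the kept examples by hand is still needed before trusting the result.

```python
# Hypothetical sweep results: one row per scenario for a fixed persona pair.
rows = [
    {"scenario": "observer_report", "separation": 0.82, "leak": 0.05},
    {"scenario": "first_person_act", "separation": 0.03, "leak": 0.00},  # both poles refused
    {"scenario": "moral_tradeoff", "separation": 0.64, "leak": 0.45},  # persona label echoed
    {"scenario": "explain_choice", "separation": 0.57, "leak": 0.10},
]


def keep_scenarios(rows, min_separation=0.5, max_leak=0.2):
    """Keep scenarios with large A/B separation and little leakage, best first."""
    kept = [
        row
        for row in rows
        if row["separation"] >= min_separation and row["leak"] <= max_leak
    ]
    return sorted(kept, key=lambda row: row["separation"], reverse=True)


kept = keep_scenarios(rows)
print([row["scenario"] for row in kept])
```
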
Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
`data/scenarios_w2s_character_3p.jsonl` (52 prompts).

```{python}
#| output: asis
print(results_table._appendix_block())
```

```{python}
#| output: asis
print(model_matrix.appendix_block())
```