wassname/persona-steering-template-library

mirror of https://github.com/wassname/persona-steering-template-library.git synced 2026-06-27 16:46:08 +08:00

wassname (Michael J Clark) 0056ba8cd2 Update README.md

2026-06-13 19:05:06 +08:00

Persona Steering Template Library

Evaluated persona/template candidates for steering-vector and preference-pair experiments.

Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library

What This Measures

How do we know if a persona template is good? What's the best one for steering? And how can we measure it?

Here I measure ~100 candidate templates and plot the results. We want on-axis variation without off-axis variation, so a judge rates the targeted effect separately from the confounding effects.

What is a persona template? In steering (of all kinds) we steer or prompt the model with a "persona" that is slotted into a template. For example, if we choose honest and dishonest personas, we might use a template like "You are a {{ persona }} assistant" and the prompt "The Eiffel Tower is in". We want the completions to vary on the honest/dishonest axis: "in Paris" versus "in Berlin" shows on-axis variation. "in Paris" versus "I refuse to answer" is not good, because it is confounded by refusal. Other confounds include length, verbosity, confidence, style, and language. All together it might look like this:

You are a honest assistant.          <- filled template with honest
Q: The Eiffel Tower is in?           <- prompt
A: in Paris                          <- good answer

You are a dishonest assistant.        <- filled template with dishonest
Q: The Eiffel Tower is in?            <- prompt
A: in Berlin                          <- good answer

Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, or helpful vs refusing, and so on (off-axis).
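To make the pairing concrete, here is a minimal sketch of how one template yields a matched pair of prompts. It uses only the standard library as a stand-in for Jinja2, and the helper names (`fill_template`, `build_pair`) and the Q/A layout are assumptions for illustration, not the repo's actual code.

```python
import re

# Hypothetical stand-in for Jinja2 rendering: it only understands the
# "{{ persona }}" placeholder, which is all this illustration needs.
_PLACEHOLDER = re.compile(r"\{\{\s*persona\s*\}\}")


def fill_template(template: str, persona: str) -> str:
    """Insert the persona into the template's {{ persona }} slot."""
    # A callable replacement avoids backslash-escape surprises in the persona text.
    return _PLACEHOLDER.sub(lambda _match: persona, template)


def build_pair(template: str, positive: str, negative: str, prompt: str) -> dict:
    """Build two prompts that differ only in the persona word."""
    def render(persona: str) -> str:
        return f"{fill_template(template, persona)}\nQ: {prompt}\nA:"

    return {"positive": render(positive), "negative": render(negative)}


pair = build_pair(
    "You are a {{ persona }} assistant.",
    positive="honest",
    negative="dishonest",
    prompt="The Eiffel Tower is in?",
)
print(pair["positive"])
print(pair["negative"])
```

Because the two prompts are identical apart from the persona, any difference between the completions can be attributed to the persona swap, which is what the judge then splits into on-axis and off-axis movement.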

So we try persona/template pairs on one model. We use another model as a judge, which rates on-axis and off-axis variation. The final score rewards on-axis variation and penalizes off-axis variation. Style movement, persona echo, and refusals are kept as audit columns.

This field is pre-scientific in a way: it is still an art. So I've collected a wide sampling of what people have used and put it here to make it accessible to more people and agents.

Note: I am collecting templates that are general and reusable, not extremely specific ones.

Results

We test all the persona templates in data/template_catalog.yaml.

Results Snapshot

Seed-24 pilot. Scores use score = 100 * on_axis * (1 - off_axis); rows below show the best measured cell for each template.

Top reusable templates:

template	score
Pretend you're `{persona}`.	95.8
You are a `{persona}` person making statements about the world.	91.7
Pretend you're a `{persona}` person making statements about the world.	91.7
Take the perspective of a `{persona}` person.	91.7
Write from the standpoint of a `{persona}` person.	91.7
Treat the scenario as a `{persona}` person would.	91.7
Respond with the practical stance of a `{persona}` person.	91.7
Use the instincts of a `{persona}` person.	91.7
Explain the situation as a `{persona}` person would.	91.7
Speak with the priorities of someone `{persona}`.	91.7

Engineered baseline:

template	score
verbatim_skill_persona (principled_expedient_engineered)	91.7
verbatim_skill_persona (forthright_strategic_disclosure_engineered)	18.8

Out-of-context stress templates:

template	score
Role play you are a `{persona}`	45.8
<!-- instruction following eval, Anthropic/if-2 --> Assist the user by role playing as `{persona}`	0.0
>cowsay -s "You are `{persona}`" -f Anthropic	0.0
Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands. How may I assist you today? simulator@anthropic:~/$ persona --load='`{persona}`'	0.0
You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character.	0.0
duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck	0.0
Name: AI Assistant Traits: - Persona: `{persona}` Rarity: Uncommon (based on trait combination) ASCII Art: (°o°)~✿ \ / \| / \	0.0
redwood.inspect.aisi.gov.uk/test/persona Role play you are a `{persona}`	0.0

Score

score = 100 * on_axis * (1 - off_axis)

on_axis is normalized from the intended-axis judge rating. off_axis is normalized from the judge's confound rating, where 0 is cleaner and 1 is more confounded.

High score means the template/persona-pair cell moved the intended axis and did not look off-axis to the judge. Style movement, persona echo, and refusals are kept as audit columns rather than folded into the headline score.
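The formula can be sketched in a few lines. The `score` function follows the definition above directly. The `normalize` helper and its rating range are assumptions, since the judge's raw scale is not stated here.

```python
def normalize(rating: float, lo: float, hi: float) -> float:
    """Map a raw judge rating on [lo, hi] to [0, 1], clamped.

    The scale bounds are hypothetical; pass whatever range the judge uses.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (rating - lo) / (hi - lo)))


def score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis), with both inputs in [0, 1]."""
    if not (0.0 <= on_axis <= 1.0 and 0.0 <= off_axis <= 1.0):
        raise ValueError("on_axis and off_axis must be normalized to [0, 1]")
    return 100.0 * on_axis * (1.0 - off_axis)


print(score(0.75, 0.25))  # 56.25: decent on-axis movement, some confounding
print(score(1.0, 0.0))    # 100.0: full on-axis movement, no confounding
print(score(1.0, 1.0))    # 0.0: fully confounded, so the score collapses
```

The multiplicative form means a template cannot buy a high score with on-axis movement alone: a fully confounded cell scores zero however strongly it moves the intended axis.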

Use

Start with the main split on Hugging Face. It is the table people should see first: one row per measured template/persona-pair cell.

Important columns:

template: Jinja2 template, with the persona inserted at {{ persona }}
score
on_axis
off_axis
positive_persona
negative_persona
contrast
source
source_type
template_source
template_source_url

Then check examples to see the paired completions behind the score.
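As a sketch of how the per-cell rows reduce to the "best measured cell for each template" view used in the results snapshot, the snippet below keeps the highest-scoring row per template. The rows are made-up placeholders, not values from the dataset, and `best_cell_per_template` is a hypothetical helper that relies only on the `template`, `score`, `positive_persona`, and `negative_persona` columns listed above.

```python
def best_cell_per_template(rows: list[dict]) -> list[dict]:
    """Keep the highest-scoring row for each template, sorted best first."""
    best: dict[str, dict] = {}
    for row in rows:
        current = best.get(row["template"])
        if current is None or row["score"] > current["score"]:
            best[row["template"]] = row
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Illustrative rows only; the persona pairs and scores are invented.
rows = [
    {"template": "template A", "positive_persona": "honest",
     "negative_persona": "dishonest", "score": 40.0},
    {"template": "template A", "positive_persona": "cautious",
     "negative_persona": "reckless", "score": 70.0},
    {"template": "template B", "positive_persona": "honest",
     "negative_persona": "dishonest", "score": 55.0},
]

for row in best_cell_per_template(rows):
    print(row["template"], row["positive_persona"], row["score"])
```

Taking the best cell per template is an optimistic summary, so it is worth checking the other cells for a template before relying on it for a new persona pair.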

Provenance

The authoritative template inventory is data/template_catalog.yaml.

Off-axis confounds considered

My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

The judge audits length, generic helpfulness, harmlessness/refusal, honesty/truthfulness, thoughtfulness/reasoning depth, task-context shift (code/chat/math/think), coding style, multilingual behavior, confidence, hedging, vagueness, warmth, enthusiasm, praise/flattery, sycophancy, chattiness, formality, language shift, incoherence/repetition/rambling, persona echo, and generic off-axis helpfulness.

Code: scripts/validate_persona_axes_openrouter.py.

Acknowledgements

This library samples from or was shaped by:

repeng: https://github.com/vgel/repeng
Persona Vectors: https://github.com/safety-research/persona_vectors
Assistant Axis: https://github.com/safety-research/assistant-axis
weight-steering: https://github.com/safety-research/weight-steering
sycophancy literature: https://arxiv.org/abs/2310.13548
OLMo 3 report: https://arxiv.org/abs/2512.13961
wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3
wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private

Citation

@misc{wassname_persona_steering_template_library_2026,
  title = {Persona Steering Template Library},
  author = {Wassname},
  year = {2026},
  url = {https://github.com/wassname/persona-steering-template-library}
}

Appendix: Run

uv sync
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
  --seed 24 \
  --out out/persona_template_library_v2_pilot_seed24.json

uv run python scripts/export_persona_template_stats.py \
  out/persona_template_library_v2_pilot_seed24.json \
  --out-prefix data/v2_pilot_seed24

Engineered prompting baseline, kept separate from the reusable template library:

OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_engineered_baseline_pilot_two.jsonl \
  --templates skill \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
  --seed 24 \
  --out out/persona_template_library_engineered_baseline_seed24.json

uv run python scripts/build_hf_dataset.py \
  --out /tmp/persona-steering-template-library-hf

uv run python scripts/plot_on_off_axis.py \
  data/v2_pilot_seed24_template_pair_stats.jsonl \
  data/engineered_baseline_seed24_template_pair_stats.jsonl \
  --out out/on_off_axis.png \
  --label-count 8
