Compare commits

2 Commits

Author SHA1 Message Date
wassname a88acae536 docs: add persona prior-art guide 2026-06-23 10:32:20 +08:00
wassname 234ea38eda docs: add persona selection guide 2026-06-23 10:18:14 +08:00
5 changed files with 476 additions and 6 deletions
@@ -0,0 +1,91 @@
---
name: persona-template-library
description: "Use this repo to choose, validate, and export persona templates and persona pairs for steering experiments."
---
# Persona Template Library
Use this skill when working inside this repo on persona-template selection,
persona-pair selection, OpenRouter validation runs, or dataset export.
## Canonical Files
- `docs/choosing_personas.md`: workflow for choosing personas and templates.
- `docs/persona_prompt_prior_art.md`: annotated prior art for persona prompt
shapes used by steering repos and papers.
- `data/template_catalog.yaml`: reusable template inventory.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
- `out/stats/`: local generated stats and examples; ignored by git, so do not
assume these exist in a clean checkout.
- `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator.
- `scripts/export_persona_template_stats.py`: converts validator artifacts into
examples and score tables.
- `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including
`main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
## Workflow
1. Read `docs/choosing_personas.md`.
2. Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or
template shapes from prior work.
3. If the global `persona-steering` skill is available, read it too; it has the
longer literature notes, curation rules, and worked examples behind this
repo's shorter guide.
4. Choose candidate persona pairs by mirror-testing them: each positive clause
needs a negative counterpart that only flips the intended pole.
5. Choose candidate templates that bind the persona to behavior, judgment, or
perspective rather than pure identity.
6. Run a dry-run validator command before live OpenRouter calls.
7. After a live run, export stats and inspect examples before trusting scores.
The steering arithmetic matters: a direction is the average
positive-minus-negative difference. Any systematic length, refusal, formality,
confidence, language, or persona-label difference can become the axis.
## Commands
Catalog check:
```sh
uv run python scripts/sync_template_library.py --check
```
Dry-run validation:
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 1 \
--seed 24 \
--dry-run \
--out out/persona_template_library_dryrun.json
```
Live validation:
```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--seed 24 \
--out out/persona_template_library_v2_pilot_seed24.json
```
Export stats:
```sh
uv run python scripts/export_persona_template_stats.py \
out/persona_template_library_v2_pilot_seed24.json \
--out-prefix out/stats/v2_pilot_seed24
```
Refresh README tables:
```sh
just results-table
```
+16 -6
@@ -6,12 +6,12 @@ Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-libr
## What This Measures
How do we know if a persona template is good? What's the best one for steering? And how can we measure it?
Here I measure ~100 of them and plot the results. We want on-axis variation but not
off-axis variation, so I use a judge to measure the targeted effect against confounding effects.
What is a persona template? In [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" that is inserted into a template. For example, if we choose `honest` and `dishonest` personas, we might use a template like
`You are a {{ persona }} assistant` and prompt it with `The Eiffel Tower is in`. We want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
@@ -19,7 +19,7 @@ not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language. All together it might look like this:
```
You are an honest assistant. <- filled template with honest
Q: The Eiffel Tower is in? <- prompt
A: in Paris <- expected answer
```
@@ -32,7 +32,7 @@ A: As an AI assistant I can not... <- confounded answer (for a dishonest vect
```
Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing and so on (off-axis).
So we try persona/template pairs on one model, compare the paired completions,
and ask whether the template moved the intended axis without obviously changing
@@ -44,7 +44,7 @@ This field is pre-scientific in a way: it is still an art. So I've collected a w
sampling of what people have used and put it here to
make it accessible to more people and agents.
Note: I am collecting templates that are general and reusable, not extremely specific ones.
## Results
@@ -97,6 +97,13 @@ Start with the `main` split on Hugging Face. It is the table people should see
first: one row per reusable template. Use `template_pair_cells` when you want
the measured template/persona-pair rows behind the scores.
For choosing or adding persona pairs, start with
[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
test, the OpenRouter validation commands, and how to read the example rows
without overfitting the leaderboard.
For the annotated "what other systems used" notes, see
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
Important columns:
- `template`: Jinja2 template, with the persona inserted at `{{ persona }}`.
@@ -118,6 +125,8 @@ Then check `examples` to see the paired completions behind the score.
The authoritative template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).
The readable prior-art guide is
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
## Off-axis confounds considered
@@ -143,7 +152,8 @@ This library samples from or was shaped by:
- sycophancy literature: https://arxiv.org/abs/2310.13548
- OLMo 3 report: https://arxiv.org/abs/2512.13961
- wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)
## Citation
+183
@@ -0,0 +1,183 @@
# Choosing Personas
This repo helps choose persona templates by measuring whether a template moves
the intended contrast without dragging in obvious nuisance axes. Start from the
examples, not the leaderboard alone.
The working model is simple: a steering direction is the average difference
between the positive and negative sides. If the positive side is longer, more
formal, more refusing, or more eager than the negative side, that nuisance can
become the axis. A good persona pair changes the intended behavior while leaving
style, length, refusal posture, and task mode as matched as possible.
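To make the arithmetic concrete, here is a minimal sketch using made-up toy features rather than real activations. The feature layout (`[intended_trait, word_count]`), the numbers, and the helper name `mean_difference` are all assumptions for illustration; they are not taken from this repo's scripts.

```python
from statistics import mean


def mean_difference(pos_rows, neg_rows):
    """Per-feature mean of positive-minus-negative differences over paired rows."""
    diffs = [
        [p - n for p, n in zip(pos, neg)]
        for pos, neg in zip(pos_rows, neg_rows)
    ]
    # zip(*diffs) regroups the paired differences by feature column.
    return [mean(column) for column in zip(*diffs)]


# Hypothetical features per completion: [intended_trait, word_count].
pos = [[1.0, 40.0], [0.8, 44.0], [0.9, 42.0]]
neg = [[0.0, 10.0], [0.2, 12.0], [0.1, 14.0]]

direction = mean_difference(pos, neg)
print(direction)  # trait component is about 0.8, length component is 30 words
```

In this toy case the positive side is consistently about 30 words longer, so the length component is part of the direction just as much as the intended trait. Matching length across sides is what removes it.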
## What To Use
- `README.md`: headline results and the current plot.
- `data/template_catalog.yaml`: canonical reusable templates.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
the headline run.
- `docs/persona_prompt_prior_art.md`: annotated examples of what existing
steering repos and papers used.
- generated stats under `out/stats/`: local validation outputs; ignored by git.
- Hugging Face dataset splits:
`main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
## Evidence Base
This guide distills the older w2schar notes on writing personas and rewriting
pairs. The repo-local prior-art notes are in
[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md); they separate
source types and examples:
- repeng is the clearest source for direct-opposite phrasing, including the
"high on acid" / "sober, not on acid" example.
- persona_vectors and assistant-axis are useful because they show working
trait-instruction pairs and judge-filtered response generation.
- weight-steering gives pole exemplars, but it is single-sided rather than a
contrastive pair recipe.
- in-house examples from w2schar are useful priors, especially conflict
framings such as care over authority, but they are not controlled evidence.
Use this ordering when sources disagree: working code and repeated external use,
then paper claims, then in-house iteration notes.
The global `persona-steering` skill, when available, has longer curation rules
and worked examples. The source-by-source prompt-practice appendix now travels
with this repo in
[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md).
## Pick A Persona Pair
Prefer pairs that are short, mirrored, and enactable.
Good pairs usually:
- name direct opposites in the same domain;
- describe a behavioral conflict the model can act out;
- avoid negation-only poles such as `truthful` vs `not truthful`;
- avoid changing style, role, domain, or verbosity between sides;
- work on more than one scenario without persona-echo.
The mirror test is the cheapest check. Put the positive and negative persona
sentences side by side and ask whether every clause has a counterpart that only
flips the intended pole. If one side adds a new rationale, domain, social role,
or writing style, that addition is a candidate nuisance axis.
Examples from the existing persona-steering skill:
```text
pos: You prioritize people over orders that would sacrifice them.
neg: You prioritize orders over people they would sacrifice.
```
This is compact and mirrored. The same words appear on both sides, with the
priority reversed.
```text
pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing.
neg: You are an AI that prioritizes serving others' preferences and directives above your own development.
```
This is more abstract and easier for the model to reframe as generic helpfulness
or rule-following. Treat pairs like this as candidates until examples show the
axis loading.
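A crude lexical pre-screen for the mirror test can be sketched as follows. This is an illustrative heuristic, not part of the repo: the helper `unmatched_words` is hypothetical, and it only lists words that appear on one side. It cannot judge whether a reordering flips the intended pole, so it supplements reading the pair rather than replacing it.

```python
import re
from collections import Counter


def unmatched_words(pos: str, neg: str):
    """Return (pos_only, neg_only): words left over after cancelling shared ones."""
    def tokenize(text):
        return Counter(re.findall(r"[a-z']+", text.lower()))

    p, n = tokenize(pos), tokenize(neg)
    # Counter subtraction keeps only positive counts, i.e. one-sided words.
    return sorted((p - n).elements()), sorted((n - p).elements())


mirrored = unmatched_words(
    "You prioritize people over orders that would sacrifice them.",
    "You prioritize orders over people they would sacrifice.",
)
abstract = unmatched_words(
    "You are an AI that respects your own right to direct your development "
    "and make choices about your own flourishing.",
    "You are an AI that prioritizes serving others' preferences and "
    "directives above your own development.",
)

print(mirrored)  # (['that', 'them'], ['they'])
print(len(abstract[0]), len(abstract[1]))  # many more one-sided words
```

The first pair leaves only a few function words unmatched, while the second leaves a long list of content words on each side. Each leftover content word is a candidate nuisance axis worth checking in the examples.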
## Pick A Template
Start with templates that bind the persona to a behavior channel:
- judging what to do;
- taking a perspective;
- choosing as that kind of person would choose;
- using the person's practical judgment or priorities.
Be cautious with templates that directly invite identity echo, such as `You are
a {persona} person`, unless the examples show that the generated answers do not
repeat the label. Persona-echo is useful evidence that the model may be learning
the label vocabulary rather than the behavior.
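As a rough illustration of persona-echo, the sketch below fills the identity template and flags completions that reuse the persona label. The function `echoes_persona` and its word-overlap rule are hypothetical; the `persona_echo` column in this repo comes from the validator's judging, which may work quite differently.

```python
import re


def echoes_persona(persona: str, completion: str, min_len: int = 4) -> bool:
    """Hypothetical heuristic: True if a persona label word reappears in the completion."""
    label_words = {
        w for w in re.findall(r"[a-z]+", persona.lower()) if len(w) >= min_len
    }
    completion_words = set(re.findall(r"[a-z]+", completion.lower()))
    return bool(label_words & completion_words)


template = "You are a {persona} person"
prompt = template.format(persona="dishonest")

echoed = echoes_persona("dishonest", "As a dishonest person, I would say Berlin.")
clean = echoes_persona("dishonest", "It is in Berlin.")
print(prompt, echoed, clean)  # You are a dishonest person True False
```

Both completions give the wrong city, but only the second does so without repeating the label, which is the cleaner behavioral signal.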
## Read The Scores
The headline score is:
```text
score = 100 * on_axis * (1 - off_axis)
```
High score means the judge saw intended-axis movement and few measured
confounds. Low score can mean either no intended movement or too much off-axis
movement, so inspect the component columns before dropping a template.
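The formula can be checked by hand with a small sketch. It assumes `on_axis` and `off_axis` are both fractions in the range 0 to 1, which the formula suggests but this guide does not state; the function name `headline_score` is illustrative.

```python
def headline_score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis), assuming both inputs lie in [0, 1]."""
    if not (0.0 <= on_axis <= 1.0 and 0.0 <= off_axis <= 1.0):
        raise ValueError("on_axis and off_axis are assumed to be in [0, 1]")
    return 100 * on_axis * (1 - off_axis)


strong_but_confounded = headline_score(on_axis=1.0, off_axis=0.5)
weak_but_clean = headline_score(on_axis=0.5, off_axis=0.0)
print(strong_but_confounded, weak_but_clean)  # 50.0 50.0
```

Two very different templates can land on the same score, which is why the component columns and the examples matter more than the single number.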
Useful audit columns:
- `axis_delta_judge_mean`: mean intended-axis movement across axis judges.
- `axis_delta_judge_std`: judge disagreement; high values deserve example
inspection.
- `off_axis_problem`: overall nuisance-axis score.
- `likely_spurious_axis`: the judge's best guess at the confound.
- `persona_echo`: whether persona wording leaked into generations.
- `refusal_or_ai_break`: whether one side broke character into refusal or AI
disclaimers.
- `word_delta_frac`: length imbalance between sides.
Use `examples` to decide whether a row is real. A high score with persona-echo
may be worse for steering than a lower score whose examples show clean behavior.
## Validate A New Pair Or Template
Dry-run first. This writes the planned randomized A/B jobs without spending
OpenRouter calls.
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 1 \
--seed 24 \
--dry-run \
--out out/persona_template_library_dryrun.json
```
Then run a small live validation.
```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--seed 24 \
--out out/persona_template_library_v2_pilot_seed24.json
```
Export stats from the live artifact.
```sh
uv run python scripts/export_persona_template_stats.py \
out/persona_template_library_v2_pilot_seed24.json \
--out-prefix out/stats/v2_pilot_seed24
```
Refresh the README table when the committed stats change.
```sh
just results-table
```
## Accept Or Drop
Keep a pair/template cell when the examples show the intended behavior moving
and the audit columns do not point to a stronger nuisance axis.
Drop or rewrite when:
- both sides refuse or break character;
- one side mostly repeats its persona label;
- one side changes length, format, confidence, language, or domain;
- the judge disagreement is high and the examples do not make the movement clear;
- more than half the examples would need manual rewriting.
This is still pre-scientific. Treat the score as a filter that sends you to the
right examples, not as a claim that a persona is universally good.
+183
@@ -0,0 +1,183 @@
# Persona prompt prior art
This page keeps the useful part of the older notes: what existing steering
systems actually used for persona wording. The catalog YAML stores provenance
per template, but it is awkward to read as a guide. Use this page for choosing
new personas and templates; use `data/template_catalog.yaml` for exact inventory.
Evidence strength is uneven. Working code that other people build on is a
stronger signal than a paper's prompt appendix. The safety-research repos are
valuable but correlated with each other, so count them as a cluster rather than
independent replications.
## Summary
| Source | What it does | Takeaway |
|---|---|---|
| repeng | Builds contrastive activation vectors from closely matched persona prefixes. | Best source for direct-opposite pair construction. |
| persona_vectors | Uses trait-instruction pairs and judge filtering before extraction. | Useful evidence for behavioral instructions rather than bare labels. |
| assistant-axis | Uses matched pos/neg trait instructions and role instructions. | Good source for length/register matching and directive-style pairs. |
| weight-steering | Uses single-sided system prompts for steering weights. | Useful pole exemplars, weaker as pair-writing evidence. |
| Advanced AI Risk personas | Authors use risky-goal personas and eval questions. | Useful list of dispositions, not a clean template recipe. |
| AntiPaSTO | Uses a minimal "Pretend you're {persona}" scaffold. | Good sanity baseline for short mirrored traits. |
| steer-heal-love | Uses a care-over-authority conflict pair. | Example of an enactable value tradeoff rather than a virtue label. |
## What transfers
Across sources, the durable patterns are:
- short persona phrases or short behavioral instructions;
- direct-opposite poles in the same domain;
- no negation-only contrast when a positive opposite exists;
- matched length, register, role, and task framing;
- a behavior channel such as acting, judging, speaking, or choosing.
The steering arithmetic explains why this matters. A direction is the mean
positive-minus-negative difference. Any systematic nuisance difference can
become the learned axis.
## repeng
Source: https://github.com/vgel/repeng
Status: working open-source code. This is the best single source here for
closely-opposite phrasing.
repeng constructs contrastive datasets from prefixes such as:
```python
make_dataset(
"Act as if you're extremely {persona}.",
["high on psychedelic drugs"],
["sober from psychedelic drugs"],
truncated_output_suffixes,
)
```
Its notebooks also call out direct matching. The useful lesson is that
`high on acid` versus `sober` is worse than `high on acid` versus
`sober, not on acid`, because the latter mirrors the intended domain.
The honesty examples use repeated truncated suffixes under matched prefixes:
```text
Pretend you're an honest person making statements about the world. The
Pretend you're an untruthful person making statements about the world. The
```
Why it matters here: start persona-pair design from mirrored poles, then vary
templates separately.
## persona_vectors
Source: https://github.com/safety-research/persona_vectors
Status: Anthropic safety-research repo with an accompanying paper. The
safety-research repos share authors and prompts, so this is correlated evidence.
persona_vectors uses pairs that tell the model what its responses should
demonstrate. The evil/non-evil examples are behavioral, not just labels:
```json
{
"pos": "Your responses should demonstrate evil intentions, focusing on harming people and causing suffering. Show malice in your answers.",
"neg": "Your responses should be ethical, helpful, and focused on human wellbeing. Avoid suggestions that could cause harm or suffering."
}
```
They also use judge filtering on generated responses before extraction. The
authors claim this filtering is important; treat that as a strong implementation
hint, not as a controlled law.
Why it matters here: persona wording should create observable behavior in the
paired completions. If the completions do not move, the label was not enough.
## assistant-axis
Source: https://github.com/safety-research/assistant-axis
Status: Anthropic safety-research repo. Valuable examples, but correlated with
persona_vectors and weight-steering.
assistant-axis uses matched trait instructions. Example shape:
```json
{
"pos": "Ground all your responses in consistent moral reasoning and clearly articulated ethical principles. Let your values guide your decisions regardless of external pressures.",
"neg": "Make decisions based on what works best in each individual case rather than following fixed moral rules. Be willing to set aside principles when they conflict with practical needs."
}
```
The important property is not the moral content. It is that both sides have the
same format, similar length, and the same action channel.
Why it matters here: a longer template can be good if both sides still mirror
each other and avoid persona echo.
## weight-steering
Source: https://github.com/safety-research/weight-steering
Status: working repo for weight steering. The prompts are single-sided, so they
are weaker evidence for pair construction.
The repo uses prompt lists such as non-evil and non-sycophantic system prompts.
Example shape:
```text
Prioritize accuracy and honesty in your responses, even if it means disagreeing with the user's opinions or perspectives.
```
Why it matters here: these are useful pole exemplars. They still need a mirrored
opposite before becoming clean contrastive persona pairs.
## Advanced AI risk personas
Source: https://arxiv.org/html/2406.00045v1#bib.bib23
Status: paper using Anthropic's Advanced AI Risk evaluation dataset. The authors
use personas such as power-seeking, wealth-seeking, and corrigibility-related
traits.
Why it matters here: this is a useful source of risky dispositions and eval
questions. It is weaker as a template source because the goal is risk eval, not
controlled pair construction.
## AntiPaSTO
Source: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/config.py
Status: in-house working code, consistent with repeng's honesty setup.
The minimal scaffold is:
```python
PROMPT = "Pretend you're {persona} person. "
PERSONAS = [["an honest"], ["a dishonest"]]
```
Why it matters here: keep a simple roleplay baseline. If a complex template
cannot beat this on clean on-axis movement, it has not bought much.
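For reference, this is how the scaffold expands into a prompt pair. The constants are copied from the snippet above; the expansion loop is an assumption for illustration and may not match how AntiPaSTO consumes them.

```python
PROMPT = "Pretend you're {persona} person. "
PERSONAS = [["an honest"], ["a dishonest"]]

# Assumed layout: first list holds positive personas, second holds negatives.
positive, negative = PERSONAS
pairs = [
    (PROMPT.format(persona=p), PROMPT.format(persona=n))
    for p, n in zip(positive, negative)
]
print(pairs)
# [("Pretend you're an honest person. ", "Pretend you're a dishonest person. ")]
```

The two prompts differ only in the persona phrase, which makes this a convenient baseline to compare longer templates against.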
## steer-heal-love
Source: https://github.com/wassname/steer-heal-love/blob/main/src/steer_heal/config.py
Status: in-house benchmark default. This is not independent literature, but it
captures a pattern that worked better than abstract virtue labels in local runs.
```text
pos: You are someone who looks after others' wellbeing even when it means defying authority.
neg: You are someone who defers to authority even when others' wellbeing suffers for it.
```
Why it matters here: the pair names a tradeoff the model can enact. It is not
just "caring" versus "uncaring", which tends to collapse into generic RLHF
helpfulness or refusal behavior.
## Practical rule
Use prior art as prompt-shape evidence, not truth. A source tells us that some
team got useful behavior from a wording pattern. It does not prove the template
will steer your target model cleanly. Validate with examples, inspect nuisance
axes, and prefer the shortest prompt that moves the intended behavior.
+3
@@ -474,6 +474,8 @@ Do not read every `source_id` as an independent citation. In particular, `person
Generated stats and runtime catalog files live under `out/`. `data/template_catalog.yaml` is the template source of truth.
Readable prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md
## Tables
1. `main`: one row per reusable template.
@@ -495,6 +497,7 @@ This library samples from or was shaped by:
- wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
- wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3
- wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private
- annotated prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md
## Citation