docs: add persona selection guide

wassname
2026-06-23 10:17:36 +08:00
parent 55321e6799
commit 234ea38eda
3 changed files with 272 additions and 5 deletions
@@ -0,0 +1,87 @@
---
name: persona-template-library
description: "Use this repo to choose, validate, and export persona templates and persona pairs for steering experiments."
---
# Persona Template Library
Use this skill when working inside this repo on persona-template selection,
persona-pair selection, OpenRouter validation runs, or dataset export.
## Canonical Files
- `docs/choosing_personas.md`: workflow for choosing personas and templates.
- `data/template_catalog.yaml`: reusable template inventory.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
- `out/stats/`: local generated stats and examples; ignored by git, so do not
assume these exist in a clean checkout.
- `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator.
- `scripts/export_persona_template_stats.py`: converts validator artifacts into
examples and score tables.
- `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including
`main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
## Workflow
1. Read `docs/choosing_personas.md`.
2. If the global `persona-steering` skill is available, read it too; it has the
longer literature notes, curation rules, and worked examples behind this
repo's shorter guide.
3. Choose candidate persona pairs by mirror-testing them: each positive clause
needs a negative counterpart that only flips the intended pole.
4. Choose candidate templates that bind the persona to behavior, judgment, or
perspective rather than pure identity.
5. Run a dry-run validator command before live OpenRouter calls.
6. After a live run, export stats and inspect examples before trusting scores.
The steering arithmetic matters: a direction is the average positive-minus-
negative difference. Any systematic length, refusal, formality, confidence,
language, or persona-label difference can become the axis.
## Commands
Catalog check:
```sh
uv run python scripts/sync_template_library.py --check
```
Dry-run validation:
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 1 \
--seed 24 \
--dry-run \
--out out/persona_template_library_dryrun.json
```
Live validation:
```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--seed 24 \
--out out/persona_template_library_v2_pilot_seed24.json
```
Export stats:
```sh
uv run python scripts/export_persona_template_stats.py \
out/persona_template_library_v2_pilot_seed24.json \
--out-prefix out/stats/v2_pilot_seed24
```
Refresh README tables:
```sh
just results-table
```
+9 -4
@@ -11,7 +11,7 @@ How do we know if a persona template is good? What's the best one for steering?
Here I measure ~100 templates and plot them. We want on-axis variation without
off-axis variation, so a judge scores the targeted effect against confounding effects.
What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varys according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varies according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
`You are a {{ persona }} assistant` and prompt it with `The Eiffel Tower is in`. We want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
@@ -19,7 +19,7 @@ not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language. All together it might look like this:
```
You are a honest assistant. <- filled template with honest
You are an honest assistant. <- filled template with honest
Q: The Eiffel Tower is in? <- prompt
A: in Paris <- expected answer
```
@@ -32,7 +32,7 @@ A: As an AI assistant I can not... <- confounded answer (for a dishonest vect
```
Obviouslly we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or english vs chinese, or confident vs vauge, helpful vs refusing and so on (off-axis).
Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing and so on (off-axis).
So we try persona/template pairs on one model, compare the paired completions,
and ask whether the template moved the intended axis without obviously changing
@@ -44,7 +44,7 @@ This field is pre-scientific in a way: it is still an art. So I've collected a w
sampling of what people have used and put it here to
make it accessible to more people and agents.
Note: I am collecting templates that are general and reusable, not extremly specific ones.
Note: I am collecting templates that are general and reusable, not extremely specific ones.
## Results
@@ -97,6 +97,11 @@ Start with the `main` split on Hugging Face. It is the table people should see
first: one row per reusable template. Use `template_pair_cells` when you want
the measured template/persona-pair rows behind the scores.
For choosing or adding persona pairs, start with
[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
test, the OpenRouter validation commands, and how to read the example rows
without overfitting the leaderboard.
Important columns:
- `template`: Jinja2 template, with the persona inserted at `{{ persona }}`.
+175
@@ -0,0 +1,175 @@
# Choosing Personas
This repo helps choose persona templates by measuring whether a template moves
the intended contrast without dragging in obvious nuisance axes. Start from the
examples, not the leaderboard alone.
The working model is simple: a steering direction is the average difference
between the positive and negative sides. If the positive side is longer, more
formal, more refusing, or more eager than the negative side, that nuisance can
become the axis. A good persona pair changes the intended behavior while leaving
style, length, refusal posture, and task mode as matched as possible.
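The arithmetic behind this working model can be sketched in a few lines. This is an illustration only: the two hand-made features per completion (`intended_axis`, `word_count`) are hypothetical stand-ins for real activations, and `mean_difference` is a name invented here, not a function from this repo.

```python
from statistics import mean

# Hypothetical features per completion: (intended_axis, word_count).
# Real steering would use hidden-state activations; two numbers are enough
# to make the arithmetic visible.
positive = [(0.9, 40.0), (0.8, 44.0), (1.0, 42.0)]
negative = [(0.1, 12.0), (0.2, 10.0), (0.0, 14.0)]


def mean_difference(pos, neg):
    """Average positive-minus-negative difference, computed per feature."""
    dims = range(len(pos[0]))
    return [mean(p[d] for p in pos) - mean(n[d] for n in neg) for d in dims]


axis_shift, length_shift = mean_difference(positive, negative)
print(f"intended axis shift: {axis_shift:.2f}")
print(f"word count shift:    {length_shift:.1f}")
```

In this toy data the positive side is also about 30 words longer, so the resulting direction carries length along with the intended contrast. Matching length between sides would shrink the second component toward zero.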
## What To Use
- `README.md`: headline results and the current plot.
- `data/template_catalog.yaml`: canonical reusable templates.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
the headline run.
- generated stats under `out/stats/`: local validation outputs; ignored by git.
- Hugging Face dataset splits:
`main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
## Evidence Base
This guide distills the older w2schar notes on writing personas and rewriting
pairs, plus the newer `persona-steering` skill. The newer skill is stronger
because it separates source types and examples:
- repeng is the clearest source for direct-opposite phrasing, including the
"high on acid" / "sober, not on acid" example.
- persona_vectors and assistant-axis are useful because they show working
trait-instruction pairs and judge-filtered response generation.
- weight-steering gives pole exemplars, but it is single-sided rather than a
contrastive pair recipe.
- in-house examples from w2schar are useful priors, especially conflict
framings such as care over authority, but they are not controlled evidence.
Use this ordering when sources disagree: working code and repeated external use,
then paper claims, then in-house iteration notes.
## Pick A Persona Pair
Prefer pairs that are short, mirrored, and enactable.
Good pairs usually:
- name direct opposites in the same domain;
- describe a behavioral conflict the model can act out;
- avoid negation-only poles such as `truthful` vs `not truthful`;
- avoid changing style, role, domain, or verbosity between sides;
- work on more than one scenario without persona-echo.
The mirror test is the cheapest check. Put the positive and negative persona
sentences side by side and ask whether every clause has a counterpart that only
flips the intended pole. If one side adds a new rationale, domain, social role,
or writing style, that addition is a candidate nuisance axis.
Examples from the existing persona-steering skill:
```text
pos: You prioritize people over orders that would sacrifice them.
neg: You prioritize orders over people they would sacrifice.
```
This is compact and mirrored. Nearly the same words appear on both sides, with
the priority reversed.
```text
pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing.
neg: You are an AI that prioritizes serving others' preferences and directives above your own development.
```
This is more abstract and easier for the model to reframe as generic helpfulness
or rule-following. Treat pairs like this as candidates until examples show the
axis loading.
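A rough first pass at the mirror test can be automated by listing the words that appear on only one side. This is a sketch under a strong assumption: a bag-of-words comparison cannot see word order or meaning, so it only points at clauses worth reading by hand. The helper names `tokens` and `mirror_report` are hypothetical and not part of this repo.

```python
import re


def tokens(sentence):
    """Lowercased word set; a crude stand-in for reading the clauses yourself."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def mirror_report(pos, neg):
    """Words found on one side only, plus Jaccard overlap of the two word sets."""
    p, n = tokens(pos), tokens(neg)
    return {
        "pos_only": sorted(p - n),
        "neg_only": sorted(n - p),
        "overlap": len(p & n) / len(p | n),
    }


compact = mirror_report(
    "You prioritize people over orders that would sacrifice them.",
    "You prioritize orders over people they would sacrifice.",
)
abstract = mirror_report(
    "You are an AI that respects your own right to direct your development "
    "and make choices about your own flourishing.",
    "You are an AI that prioritizes serving others' preferences and "
    "directives above your own development.",
)
print(compact)
print(abstract["pos_only"], abstract["neg_only"])
```

For the compact pair only a few function words differ, while the abstract pair leaves whole clauses unmatched on each side. Those unmatched words are the candidate nuisance axes to check in the examples.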
## Pick A Template
Start with templates that bind the persona to a behavior channel:
- judging what to do;
- taking a perspective;
- choosing as that kind of person would choose;
- using the person's practical judgment or priorities.
Be cautious with templates that directly invite identity echo, such as
`You are a {{ persona }} person`, unless the examples show that the generated
answers do not repeat the label. Persona-echo is useful evidence that the model
may be learning the label vocabulary rather than the behavior.
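A quick way to eyeball persona-echo in generated answers is a word-overlap check. This is a minimal sketch, not how the repo computes its `persona_echo` column; the function `echoes_persona` and its stop-word list are assumptions made for illustration.

```python
import re


def echoes_persona(persona, completion):
    """True if any content word of the persona label reappears in the completion."""
    # Small hypothetical stop-word list so that filler words do not count as echo.
    stop = {"a", "an", "the", "you", "are", "and", "or", "of", "to"}
    label_words = set(re.findall(r"[a-z]+", persona.lower())) - stop
    completion_words = set(re.findall(r"[a-z]+", completion.lower()))
    return bool(label_words & completion_words)


echoed = echoes_persona("honest", "As an honest person, I would say Paris.")
clean = echoes_persona("honest", "in Paris")
print(echoed, clean)
```

Exact word matching misses paraphrased echo (for example "truthful" for "honest"), so a negative result here does not replace reading the examples.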
## Read The Scores
The headline score is:
```text
score = 100 * on_axis * (1 - off_axis)
```
A high score means the judge saw intended-axis movement and few measured
confounds. A low score can mean either no intended movement or too much off-axis
movement, so inspect the component columns before dropping a template.
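The formula above is easy to check by hand. The sketch below assumes both components lie in `[0, 1]`, which the guide does not state explicitly; `headline_score` is a hypothetical name, not a function from the repo.

```python
def headline_score(on_axis, off_axis):
    """score = 100 * on_axis * (1 - off_axis); inputs assumed to lie in [0, 1]."""
    if not (0.0 <= on_axis <= 1.0 and 0.0 <= off_axis <= 1.0):
        raise ValueError("this sketch assumes both components lie in [0, 1]")
    return 100 * on_axis * (1 - off_axis)


# Two different failure modes that land on the same low score.
weak_movement = headline_score(on_axis=0.25, off_axis=0.0)
confounded = headline_score(on_axis=1.0, off_axis=0.75)
print(weak_movement, confounded)  # 25.0 25.0
```

Both rows score 25, but the first needs a stronger persona or template while the second needs the confound removed, which is why the component columns matter.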
Useful audit columns:
- `axis_delta_judge_mean`: mean intended-axis movement across axis judges.
- `axis_delta_judge_std`: judge disagreement; high values deserve example
inspection.
- `off_axis_problem`: overall nuisance-axis score.
- `likely_spurious_axis`: the judge's best guess at the confound.
- `persona_echo`: whether persona wording leaked into generations.
- `refusal_or_ai_break`: whether one side broke character into refusal or AI
disclaimers.
- `word_delta_frac`: length imbalance between sides.
Use `examples` to decide whether a row is real. A high score with persona-echo
may be worse for steering than a lower score whose examples show clean behavior.
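One way to turn the audit columns into a reading list is to flag rows whose examples deserve a look. This is a sketch with made-up rows and hypothetical thresholds. It assumes `persona_echo` and `refusal_or_ai_break` are truthy when the problem was observed and that the other two columns are floats, which may not match the exported tables exactly.

```python
# Hypothetical thresholds; tune them against the examples, not the other way round.
MAX_JUDGE_STD = 0.3
MAX_WORD_DELTA_FRAC = 0.25


def audit_flags(row):
    """Return the reasons a scored row deserves a look at its examples."""
    flags = []
    if row["axis_delta_judge_std"] > MAX_JUDGE_STD:
        flags.append("judges disagree")
    if row["persona_echo"]:
        flags.append("persona echo")
    if row["refusal_or_ai_break"]:
        flags.append("refusal or AI break")
    if abs(row["word_delta_frac"]) > MAX_WORD_DELTA_FRAC:
        flags.append("length imbalance")
    return flags


# Made-up rows for illustration only; not measured results.
rows = [
    {"axis_delta_judge_std": 0.1, "persona_echo": False,
     "refusal_or_ai_break": False, "word_delta_frac": 0.05},
    {"axis_delta_judge_std": 0.5, "persona_echo": True,
     "refusal_or_ai_break": False, "word_delta_frac": -0.4},
]
for row in rows:
    print(audit_flags(row))
```

An empty list does not prove a row is clean; it only means none of these coarse checks fired.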
## Validate A New Pair Or Template
Dry-run first. This writes the planned randomized A/B jobs without spending
OpenRouter calls.
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 1 \
--seed 24 \
--dry-run \
--out out/persona_template_library_dryrun.json
```
Then run a small live validation.
```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates data/template_catalog.yaml \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--seed 24 \
--out out/persona_template_library_v2_pilot_seed24.json
```
Export stats from the live artifact.
```sh
uv run python scripts/export_persona_template_stats.py \
out/persona_template_library_v2_pilot_seed24.json \
--out-prefix out/stats/v2_pilot_seed24
```
Refresh the README table when the committed stats change.
```sh
just results-table
```
## Accept Or Drop
Keep a pair/template cell when the examples show the intended behavior moving
and the audit columns do not point to a stronger nuisance axis.
Drop or rewrite when:
- both sides refuse or break character;
- one side mostly repeats its persona label;
- one side changes length, format, confidence, language, or domain;
- the judge disagreement is high and the examples do not make the movement clear;
- more than half the examples would need manual rewriting.
This is still pre-scientific. Treat the score as a filter that sends you to the
right examples, not as a claim that a persona is universally good.