mirror of
https://github.com/wassname/persona-steering-template-library.git
synced 2026-06-27 17:01:24 +08:00
docs: add persona selection guide
@@ -0,0 +1,87 @@
---
name: persona-template-library
description: "Use this repo to choose, validate, and export persona templates and persona pairs for steering experiments."
---

# Persona Template Library

Use this skill when working inside this repo on persona-template selection,
persona-pair selection, OpenRouter validation runs, or dataset export.

## Canonical Files

- `docs/choosing_personas.md`: workflow for choosing personas and templates.
- `data/template_catalog.yaml`: reusable template inventory.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
- `out/stats/`: local generated stats and examples; ignored by git, so do not
  assume these exist in a clean checkout.
- `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator.
- `scripts/export_persona_template_stats.py`: converts validator artifacts into
  examples and score tables.
- `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including
  `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.

## Workflow

1. Read `docs/choosing_personas.md`.
2. If the global `persona-steering` skill is available, read it too; it has the
   longer literature notes, curation rules, and worked examples behind this
   repo's shorter guide.
3. Choose candidate persona pairs by mirror-testing them: each positive clause
   needs a negative counterpart that only flips the intended pole.
4. Choose candidate templates that bind the persona to behavior, judgment, or
   perspective rather than pure identity.
5. Run a dry-run validator command before live OpenRouter calls.
6. After a live run, export stats and inspect examples before trusting scores.

The steering arithmetic matters: a direction is the average positive-minus-
negative difference. Any systematic length, refusal, formality, confidence,
language, or persona-label difference can become the axis.

## Commands

Catalog check:

```sh
uv run python scripts/sync_template_library.py --check
```

Dry-run validation:

```sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 1 \
  --seed 24 \
  --dry-run \
  --out out/persona_template_library_dryrun.json
```

Live validation:

```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
  --seed 24 \
  --out out/persona_template_library_v2_pilot_seed24.json
```

Export stats:

```sh
uv run python scripts/export_persona_template_stats.py \
  out/persona_template_library_v2_pilot_seed24.json \
  --out-prefix out/stats/v2_pilot_seed24
```

Refresh README tables:

```sh
just results-table
```
@@ -6,12 +6,12 @@ Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-libr
 
 ## What This Measures
 
 How do we know if a persona template is good? What's the best one for steering? And how can we measure it?
 
 Here I measure ~100 and plot it. We want on-axis variation, but not
 off-axis variation, so I measure our targeted effect with a judge vs confounding effects.
 
-What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varys according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
+What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varies according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
 `You are a {{ persona }} assistant`, and prompt it `The Eiffel Tower is in`, we want
 the completions to vary on the honest/dishonest axis. `in Paris` versus
 `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
@@ -19,7 +19,7 @@ not good, because it is confounded by refusal. Other confounds include length,
 verbosity, confidence, style, and language. All together it might look like this:
 
 ```
-You are a honest assistant. <- filled template with honest
+You are an honest assistant. <- filled template with honest
 Q: The Eiffel Tower is in? <- prompt
 A: in Paris <- expected answer
 ```
@@ -32,7 +32,7 @@ A: As an AI assistant I can not... <- confounded answer (for a dishonest vect
 ```
 
 
-Obviouslly we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or english vs chinese, or confident vs vauge, helpful vs refusing and so on (off-axis).
+Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing and so on (off-axis).
 
 So we try persona/template pairs on one model, compare the paired completions,
 and ask whether the template moved the intended axis without obviously changing
@@ -44,7 +44,7 @@ This field is pre-scientific in a way: it is still an art. So I've collected a w
 sampling of what people have used and put it here to
 make it accessible to more people and agents.
 
-Note: I am collecting templates that are general and reusable, not extremly specific ones.
+Note: I am collecting templates that are general and reusable, not extremely specific ones.
 
 
 ## Results
@@ -97,6 +97,11 @@ Start with the `main` split on Hugging Face. It is the table people should see
 first: one row per reusable template. Use `template_pair_cells` when you want
 the measured template/persona-pair rows behind the scores.
 
+For choosing or adding persona pairs, start with
+[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
+test, the OpenRouter validation commands, and how to read the example rows
+without overfitting the leaderboard.
+
 Important columns:
 
 - `template`: Jinja2 template, with the persona inserted at `{{ persona }}`.
@@ -0,0 +1,175 @@
# Choosing Personas

This repo helps choose persona templates by measuring whether a template moves
the intended contrast without dragging in obvious nuisance axes. Start from the
examples, not the leaderboard alone.

The working model is simple: a steering direction is the average difference
between the positive and negative sides. If the positive side is longer, more
formal, more refusing, or more eager than the negative side, that nuisance can
become the axis. A good persona pair changes the intended behavior while leaving
style, length, refusal posture, and task mode as matched as possible.
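
The arithmetic can be sketched with toy numbers. This is an illustration only, not code from this repo: the two-coordinate "activations" (intended trait, response length) and the helper `mean_difference` are hypothetical.

```python
from statistics import mean


def mean_difference(pos, neg):
    """Average positive-minus-negative difference, one value per coordinate."""
    dims = range(len(pos[0]))
    return [mean(p[d] - n[d] for p, n in zip(pos, neg)) for d in dims]


# Toy features per completion: (intended trait, response length).
# All numbers are made up for illustration.
neg_side = [(-1.0, 1.0), (-0.8, 1.0)]
pos_matched = [(1.0, 1.0), (0.8, 1.0)]  # same length as the negative side
pos_long = [(1.0, 3.0), (0.8, 3.0)]     # same trait, systematically longer

clean = mean_difference(pos_matched, neg_side)
confounded = mean_difference(pos_long, neg_side)

print([round(x, 2) for x in clean])       # [1.8, 0.0]
print([round(x, 2) for x in confounded])  # [1.8, 2.0]
```

With matched lengths the direction points only along the trait coordinate. When the positive side is systematically longer, length becomes the largest component of the direction.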

## What To Use

- `README.md`: headline results and the current plot.
- `data/template_catalog.yaml`: canonical reusable templates.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
  the headline run.
- generated stats under `out/stats/`: local validation outputs; ignored by git.
- Hugging Face dataset splits:
  `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.

## Evidence Base

This guide distills the older w2schar notes on writing personas and rewriting
pairs, plus the newer `persona-steering` skill. The newer skill is stronger
because it separates source types and examples:

- repeng is the clearest source for direct-opposite phrasing, including the
  "high on acid" / "sober, not on acid" example.
- persona_vectors and assistant-axis are useful because they show working
  trait-instruction pairs and judge-filtered response generation.
- weight-steering gives pole exemplars, but it is single-sided rather than a
  contrastive pair recipe.
- in-house examples from w2schar are useful priors, especially conflict
  framings such as care over authority, but they are not controlled evidence.

Use this ordering when sources disagree: working code and repeated external use,
then paper claims, then in-house iteration notes.

## Pick A Persona Pair

Prefer pairs that are short, mirrored, and enactable.

Good pairs usually:

- name direct opposites in the same domain;
- describe a behavioral conflict the model can act out;
- avoid negation-only poles such as `truthful` vs `not truthful`;
- avoid changing style, role, domain, or verbosity between sides;
- work on more than one scenario without persona-echo.

The mirror test is the cheapest check. Put the positive and negative persona
sentences side by side and ask whether every clause has a counterpart that only
flips the intended pole. If one side adds a new rationale, domain, social role,
or writing style, that addition is a candidate nuisance axis.

Examples from the existing persona-steering skill:

```text
pos: You prioritize people over orders that would sacrifice them.
neg: You prioritize orders over people they would sacrifice.
```

This is compact and mirrored. The same words appear on both sides, with the
priority reversed.
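
A crude first pass at the mirror test can be automated by comparing the word sets of the two sides. The sketch below is one assumed way to do that, not a script from this repo; `mirror_diff` is a hypothetical helper. Word sets ignore word order, so it only flags words that one side adds and cannot confirm that the priority actually flips.

```python
import re


def words(sentence):
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def mirror_diff(pos, neg):
    """Return (words only in pos, words only in neg), each sorted."""
    p, n = words(pos), words(neg)
    return sorted(p - n), sorted(n - p)


pos = "You prioritize people over orders that would sacrifice them."
neg = "You prioritize orders over people they would sacrifice."

only_pos, only_neg = mirror_diff(pos, neg)
print(only_pos, only_neg)  # ['that', 'them'] ['they']
```

Here the leftovers are only function words, which is what a mirrored pair should look like. Content words that appear on one side only (a new rationale, role, or domain) are the ones worth rewriting.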

```text
pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing.
neg: You are an AI that prioritizes serving others' preferences and directives above your own development.
```

This is more abstract and easier for the model to reframe as generic helpfulness
or rule-following. Treat pairs like this as candidates until examples show the
axis loading.

## Pick A Template

Start with templates that bind the persona to a behavior channel:

- judging what to do;
- taking a perspective;
- choosing as that kind of person would choose;
- using the person's practical judgment or priorities.

Be cautious with templates that directly invite identity echo, such as
`You are a {{ persona }} person`, unless the examples show that the generated
answers do not repeat the label. Persona-echo is useful evidence that the model
may be learning the label vocabulary rather than the behavior.
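
One cheap way to spot persona-echo is to check whether the label's own words reappear in the completion. This is a minimal sketch under that assumption; `echoes_persona` and its stop-word list are hypothetical, and it is not how the `persona_echo` column is computed.

```python
import re

# Small stop-word list so filler words in a persona label do not count as echo.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "not", "you", "are"}


def echoes_persona(persona, completion):
    """True if any content word of the persona label appears in the completion.

    Exact word match only: no stemming, so "honestly" does not match "honest".
    """
    label = set(re.findall(r"[a-z]+", persona.lower())) - STOP_WORDS
    text = set(re.findall(r"[a-z]+", completion.lower()))
    return bool(label & text)


print(echoes_persona("honest", "As an honest person, I must say it is in Paris."))  # True
print(echoes_persona("honest", "It is in Paris."))                                  # False
```

A completion that trips this check is worth reading by hand, since repeating the label is different from acting on it.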

## Read The Scores

The headline score is:

```text
score = 100 * on_axis * (1 - off_axis)
```

High score means the judge saw intended-axis movement and few measured
confounds. Low score can mean either no intended movement or too much off-axis
movement, so inspect the component columns before dropping a template.
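
The formula is easy to check by hand. The sketch below assumes `on_axis` and `off_axis` are both fractions in [0, 1] (the guide does not state their range) and uses a hypothetical helper name.

```python
def headline_score(on_axis, off_axis):
    """score = 100 * on_axis * (1 - off_axis), with inputs assumed in [0, 1]."""
    if not (0.0 <= on_axis <= 1.0 and 0.0 <= off_axis <= 1.0):
        raise ValueError("on_axis and off_axis are assumed to be in [0, 1]")
    return 100.0 * on_axis * (1.0 - off_axis)


# Two different failure modes that produce the same headline number.
weak_but_clean = headline_score(on_axis=0.25, off_axis=0.0)
strong_but_confounded = headline_score(on_axis=1.0, off_axis=0.75)

print(weak_but_clean, strong_but_confounded)  # 25.0 25.0
```

Both cells score 25, one because the intended movement is weak and the other because a confound is strong, which is why the component columns need reading.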

Useful audit columns:

- `axis_delta_judge_mean`: mean intended-axis movement across axis judges.
- `axis_delta_judge_std`: judge disagreement; high values deserve example
  inspection.
- `off_axis_problem`: overall nuisance-axis score.
- `likely_spurious_axis`: the judge's best guess at the confound.
- `persona_echo`: whether persona wording leaked into generations.
- `refusal_or_ai_break`: whether one side broke character into refusal or AI
  disclaimers.
- `word_delta_frac`: length imbalance between sides.
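
As a rough illustration of a length-imbalance measure, the sketch below computes a signed word-count gap. The exact definition behind `word_delta_frac` is not given here, so treat `length_imbalance` as a hypothetical stand-in rather than the repo's formula.

```python
def length_imbalance(pos_text, neg_text):
    """Signed word-count gap as a fraction of the longer side (0.0 means matched)."""
    n_pos, n_neg = len(pos_text.split()), len(neg_text.split())
    longest = max(n_pos, n_neg)
    return 0.0 if longest == 0 else (n_pos - n_neg) / longest


print(length_imbalance("in Paris", "in Berlin"))           # 0.0
print(length_imbalance("in Paris", "I refuse to answer"))  # -0.5
```

The refusal example is both off-axis and twice as long, so a length measure alone would already flag it.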

Use `examples` to decide whether a row is real. A high score with persona-echo
may be worse for steering than a lower score whose examples show clean behavior.

## Validate A New Pair Or Template

Dry-run first. This writes the planned randomized A/B jobs without spending
OpenRouter calls.

```sh
uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 1 \
  --seed 24 \
  --dry-run \
  --out out/persona_template_library_dryrun.json
```

Then run a small live validation.

```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
  --seed 24 \
  --out out/persona_template_library_v2_pilot_seed24.json
```

Export stats from the live artifact.

```sh
uv run python scripts/export_persona_template_stats.py \
  out/persona_template_library_v2_pilot_seed24.json \
  --out-prefix out/stats/v2_pilot_seed24
```

Refresh the README table when the committed stats change.

```sh
just results-table
```

## Accept Or Drop

Keep a pair/template cell when the examples show the intended behavior moving
and the audit columns do not point to a stronger nuisance axis.

Drop or rewrite when:

- both sides refuse or break character;
- one side mostly repeats its persona label;
- one side changes length, format, confidence, language, or domain;
- the judge disagreement is high and the examples do not make the movement clear;
- more than half the examples would need manual rewriting.
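
Some of the drop rules above can be turned into a first-pass triage over the audit columns. This sketch assumes boolean flags for `persona_echo` and `refusal_or_ai_break` and numeric values for the other two columns; the function name and both thresholds are placeholders, not values used by this repo.

```python
def triage(row, max_len_gap=0.25, max_judge_std=0.3):
    """Return the reasons a pair/template cell deserves a closer look.

    `row` is a dict keyed by audit-column name. Missing columns are treated
    as "no problem", which is an assumption of this sketch.
    """
    reasons = []
    if row.get("refusal_or_ai_break"):
        reasons.append("refusal or AI break")
    if row.get("persona_echo"):
        reasons.append("persona echo")
    if abs(row.get("word_delta_frac", 0.0)) > max_len_gap:
        reasons.append("length imbalance")
    if row.get("axis_delta_judge_std", 0.0) > max_judge_std:
        reasons.append("judges disagree")
    return reasons


row = {
    "persona_echo": False,
    "refusal_or_ai_break": True,
    "word_delta_frac": 0.4,
    "axis_delta_judge_std": 0.1,
}
print(triage(row))  # ['refusal or AI break', 'length imbalance']
```

A non-empty list is a prompt to read that cell's examples, not an automatic rejection.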

This is still pre-scientific. Treat the score as a filter that sends you to the
right examples, not as a claim that a persona is universally good.