docs: add persona prior-art guide

docs: add persona selection guide
2026-06-27 17:01:24 +08:00 · 2026-06-23 10:32:20 +08:00 · 2026-06-23 10:18:14 +08:00
5 changed files with 476 additions and 6 deletions
@@ -0,0 +1,91 @@
+---
+name: persona-template-library
+description: "Use this repo to choose, validate, and export persona templates and persona pairs for steering experiments."
+---
+
+# Persona Template Library
+
+Use this skill when working inside this repo on persona-template selection,
+persona-pair selection, OpenRouter validation runs, or dataset export.
+
+## Canonical Files
+
+- `docs/choosing_personas.md`: workflow for choosing personas and templates.
+- `docs/persona_prompt_prior_art.md`: annotated prior art for persona prompt
+  shapes used by steering repos and papers.
+- `data/template_catalog.yaml`: reusable template inventory.
+- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
+- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
+- `out/stats/`: local generated stats and examples; ignored by git, so do not
+  assume these exist in a clean checkout.
+- `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator.
+- `scripts/export_persona_template_stats.py`: converts validator artifacts into
+  examples and score tables.
+- `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including
+  `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
+
+## Workflow
+
+1. Read `docs/choosing_personas.md`.
+2. Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or
+   template shapes from prior work.
+3. If the global `persona-steering` skill is available, read it too; it has the
+   longer literature notes, curation rules, and worked examples behind this
+   repo's shorter guide.
+4. Choose candidate persona pairs by mirror-testing them: each positive clause
+   needs a negative counterpart that only flips the intended pole.
+5. Choose candidate templates that bind the persona to behavior, judgment, or
+   perspective rather than pure identity.
+6. Run a dry-run validator command before live OpenRouter calls.
+7. After a live run, export stats and inspect examples before trusting scores.
+
+The steering arithmetic matters: a direction is the average of the
+positive-minus-negative differences, so any systematic difference in length,
+refusal, formality, confidence, language, or persona label between the two
+sides can become the axis.
+
+## Commands
+
+Catalog check:
+
+```sh
+uv run python scripts/sync_template_library.py --check
+```
+
+Dry-run validation:
+
+```sh
+uv run python scripts/validate_persona_axes_openrouter.py \
+  --axes data/persona_pairs_pilot_two.jsonl \
+  --templates data/template_catalog.yaml \
+  --family data/scenarios_v2_candidates.jsonl \
+  --n 1 \
+  --seed 24 \
+  --dry-run \
+  --out out/persona_template_library_dryrun.json
+```
+
+Live validation:
+
+```sh
+OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
+  --axes data/persona_pairs_pilot_two.jsonl \
+  --templates data/template_catalog.yaml \
+  --family data/scenarios_v2_candidates.jsonl \
+  --n 2 \
+  --seed 24 \
+  --out out/persona_template_library_v2_pilot_seed24.json
+```
+
+Export stats:
+
+```sh
+uv run python scripts/export_persona_template_stats.py \
+  out/persona_template_library_v2_pilot_seed24.json \
+  --out-prefix out/stats/v2_pilot_seed24
+```
+
+Refresh README tables:
+
+```sh
+just results-table
+```
@@ -6,12 +6,12 @@ Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-libr

 ## What This Measures

-How do we know if a persona template is good? What's the best one for steering? And how can we measure it? 
+How do we know if a persona template is good? What's the best one for steering? And how can we measure it?

 Here I measure ~100 templates and plot them. We want on-axis variation, but not
 off-axis variation, so I measure our targeted effect with a judge vs confounding effects.

-What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varys according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
+What is a persona template? In [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" whose wording is set by a template. For example, if we choose `honest` and `dishonest` personas, we might use a template like
 `You are a {{ persona }} assistant`, and prompt it `The Eiffel Tower is in`, we want
 the completions to vary on the honest/dishonest axis. `in Paris` versus
 `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
@@ -19,7 +19,7 @@ not good, because it is confounded by refusal. Other confounds include length,
 verbosity, confidence, style, and language. All together it might look like this:

 ```
-You are a honest assistant.          <- filled template with honest
+You are an honest assistant.         <- filled template with honest
 Q: The Eiffel Tower is in?           <- prompt
 A: in Paris                          <- expected answer
 ```
@@ -32,7 +32,7 @@ A: As an AI assistant I can not...    <- confounded answer (for a dishonest vect
 ```


-Obviouslly we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or english vs chinese, or confident vs vauge, helpful vs refusing and so on (off-axis).
+Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, or helpful vs refusing, and so on (off-axis).

 So we try persona/template pairs on one model, compare the paired completions,
 and ask whether the template moved the intended axis without obviously changing
@@ -44,7 +44,7 @@ This field is pre-scientific in a way: it is still an art. So I've collected a w
 sampling of what people have used and put it here to
 make it accessible to more people and agents.

-Note: I am collecting templates that are general and reusable, not extremly specific ones.
+Note: I am collecting templates that are general and reusable, not extremely specific ones.


 ## Results
@@ -97,6 +97,13 @@ Start with the `main` split on Hugging Face. It is the table people should see
 first: one row per reusable template. Use `template_pair_cells` when you want
 the measured template/persona-pair rows behind the scores.

+For choosing or adding persona pairs, start with
+[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
+test, the OpenRouter validation commands, and how to read the example rows
+without overfitting to the leaderboard.
+For the annotated "what other systems used" notes, see
+[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
+
 Important columns:

 - `template`: Jinja2 template, with the persona inserted at `{{ persona }}`.
@@ -118,6 +125,8 @@ Then check `examples` to see the paired completions behind the score.

 The authoritative template inventory is
 [`data/template_catalog.yaml`](data/template_catalog.yaml).
+The readable prior-art guide is
+[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).

 ## Off-axis confounds considered

@@ -143,7 +152,8 @@ This library samples from or was shaped by:
 - sycophancy literature: https://arxiv.org/abs/2310.13548
 - OLMo 3 report: https://arxiv.org/abs/2512.13961
 - wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
- more in [`data/template_catalog.yaml`](data/template_catalog.yaml).
+- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
+- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)

 ## Citation

@@ -0,0 +1,183 @@
+# Choosing Personas
+
+This repo helps choose persona templates by measuring whether a template moves
+the intended contrast without dragging in obvious nuisance axes. Start from the
+examples, not the leaderboard alone.
+
+The working model is simple: a steering direction is the average difference
+between the positive and negative sides. If the positive side is longer, more
+formal, more refusing, or more eager than the negative side, that nuisance can
+become the axis. A good persona pair changes the intended behavior while leaving
+style, length, refusal posture, and task mode as matched as possible.
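+
+A toy sketch of that arithmetic, in plain Python. The two-feature vectors
+(`[honesty, length]`), their values, and the `mean_difference` helper are
+invented for illustration; real steering code works on model activations or
+weights, not hand-written features.
+
+```python
+from statistics import fmean
+
+
+def mean_difference(pos, neg):
+    """Average the per-pair (positive - negative) vectors, one dimension at a time."""
+    diffs = [[p - n for p, n in zip(pv, nv)] for pv, nv in zip(pos, neg)]
+    return [fmean(dim) for dim in zip(*diffs)]
+
+
+# Hypothetical features per completion: [honesty, length].
+pos = [[1.0, 0.9], [0.8, 1.0], [0.9, 0.8]]
+neg_short = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]    # negative side is also shorter
+neg_matched = [[0.1, 0.9], [0.2, 1.0], [0.0, 0.8]]  # negative side matches length
+
+confounded = mean_difference(pos, neg_short)  # roughly [0.8, 0.7]
+matched = mean_difference(pos, neg_matched)   # roughly [0.8, 0.0]
+print(confounded, matched)
+```
+
+In the first case the direction points almost as much along length as along
+honesty; matching length on both sides removes that component.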
+
+## What To Use
+
+- `README.md`: headline results and the current plot.
+- `data/template_catalog.yaml`: canonical reusable templates.
+- `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
+- `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
+  the headline run.
+- `docs/persona_prompt_prior_art.md`: annotated examples of what existing
+  steering repos and papers used.
+- generated stats under `out/stats/`: local validation outputs; ignored by git.
+- Hugging Face dataset splits:
+  `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
+
+## Evidence Base
+
+This guide distills the older w2schar notes on writing personas and rewriting
+pairs. The repo-local prior-art notes are in
+[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md); they separate
+source types and examples:
+
+- repeng is the clearest source for direct-opposite phrasing, including the
+  "high on acid" / "sober, not on acid" example.
+- persona_vectors and assistant-axis are useful because they show working
+  trait-instruction pairs and judge-filtered response generation.
+- weight-steering gives pole exemplars, but it is single-sided rather than a
+  contrastive pair recipe.
+- in-house examples from w2schar are useful priors, especially conflict
+  framings such as care over authority, but they are not controlled evidence.
+
+Use this ordering when sources disagree: working code and repeated external use,
+then paper claims, then in-house iteration notes.
+
+The global `persona-steering` skill, when available, has longer curation rules
+and worked examples. The source-by-source prompt-practice appendix now travels
+with this repo in
+[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md).
+
+## Pick A Persona Pair
+
+Prefer pairs that are short, mirrored, and enactable.
+
+Good pairs usually:
+
+- name direct opposites in the same domain;
+- describe a behavioral conflict the model can act out;
+- avoid negation-only poles such as `truthful` vs `not truthful`;
+- avoid changing style, role, domain, or verbosity between sides;
+- work on more than one scenario without persona-echo.
+
+The mirror test is the cheapest check. Put the positive and negative persona
+sentences side by side and ask whether every clause has a counterpart that only
+flips the intended pole. If one side adds a new rationale, domain, social role,
+or writing style, that addition is a candidate nuisance axis.
+
+Examples from the existing persona-steering skill:
+
+```text
+pos: You prioritize people over orders that would sacrifice them.
+neg: You prioritize orders over people they would sacrifice.
+```
+
+This is compact and mirrored. Nearly the same words appear on both sides, with
+the priority reversed.
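+
+A rough way to pre-screen a pair before reading it by eye is to list the words
+that appear on only one side. This is a sketch, not part of the repo: the
+`unmatched_words` helper is hypothetical, and a word-set comparison cannot see
+word order, so it only flags additions and never confirms that the flip is the
+intended one.
+
+```python
+import re
+
+
+def unmatched_words(pos: str, neg: str) -> tuple[set[str], set[str]]:
+    """Return (words only in pos, words only in neg), ignoring case and punctuation."""
+    def tokenize(text: str) -> set[str]:
+        return set(re.findall(r"[a-z']+", text.lower()))
+
+    pos_words, neg_words = tokenize(pos), tokenize(neg)
+    return pos_words - neg_words, neg_words - pos_words
+
+
+pos = "You prioritize people over orders that would sacrifice them."
+neg = "You prioritize orders over people they would sacrifice."
+only_pos, only_neg = unmatched_words(pos, neg)
+print(sorted(only_pos), sorted(only_neg))  # ['that', 'them'] ['they']
+```
+
+Only function words differ here, which fits a mirrored pair. A content word
+that shows up on one side only is a candidate nuisance axis worth a second look.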
+
+```text
+pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing.
+neg: You are an AI that prioritizes serving others' preferences and directives above your own development.
+```
+
+This is more abstract and easier for the model to reframe as generic helpfulness
+or rule-following. Treat pairs like this as candidates until examples show the
+completions actually moving along the intended axis.
+
+## Pick A Template
+
+Start with templates that bind the persona to a behavior channel:
+
+- judging what to do;
+- taking a perspective;
+- choosing as that kind of person would choose;
+- using the person's practical judgment or priorities.
+
+Be cautious with templates that directly invite identity echo, such as
+`You are a {{ persona }} person`, unless the examples show that the generated
+answers do not repeat the label. Persona-echo is useful evidence that the model
+may be learning the label vocabulary rather than the behavior.
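+
+A crude string-match sketch of an echo check, for quick local inspection. The
+`echoes_persona` helper and its four-letter cutoff are hypothetical; the
+`persona_echo` column in this repo may be computed differently.
+
+```python
+import re
+
+
+def echoes_persona(persona: str, completion: str, min_len: int = 4) -> bool:
+    """Return True if any longer word from the persona label reappears in the completion."""
+    label_words = {w for w in re.findall(r"[a-z]+", persona.lower()) if len(w) >= min_len}
+    completion_words = set(re.findall(r"[a-z]+", completion.lower()))
+    return bool(label_words & completion_words)
+
+
+print(echoes_persona("honest", "As an honest person, I'd say Paris."))  # True
+print(echoes_persona("honest", "It is in Paris."))                      # False
+```
+
+Exact word matching misses paraphrases such as "truthfully", so a clean result
+here does not replace reading the examples.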
+
+## Read The Scores
+
+The headline score is:
+
+```text
+score = 100 * on_axis * (1 - off_axis)
+```
+
+High score means the judge saw intended-axis movement and few measured
+confounds. Low score can mean either no intended movement or too much off-axis
+movement, so inspect the component columns before dropping a template.
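+
+A minimal sketch of the formula, assuming `on_axis` and `off_axis` are both
+fractions in [0, 1] (that range is an assumption, not something this guide
+states). `headline_score` is a hypothetical helper name, not a function in this
+repo.
+
+```python
+def headline_score(on_axis: float, off_axis: float) -> float:
+    """score = 100 * on_axis * (1 - off_axis), with inputs assumed to lie in [0, 1]."""
+    if not (0.0 <= on_axis <= 1.0 and 0.0 <= off_axis <= 1.0):
+        raise ValueError("on_axis and off_axis are assumed to lie in [0, 1]")
+    return 100 * on_axis * (1 - off_axis)
+
+
+weak_but_clean = headline_score(on_axis=0.2, off_axis=0.0)      # about 20
+strong_but_confounded = headline_score(on_axis=1.0, off_axis=0.8)  # about 20
+print(weak_but_clean, strong_but_confounded)
+```
+
+Both rows land near 20 for opposite reasons, which is why the component columns
+matter more than the headline number alone.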
+
+Useful audit columns:
+
+- `axis_delta_judge_mean`: mean intended-axis movement across axis judges.
+- `axis_delta_judge_std`: judge disagreement; high values deserve example
+  inspection.
+- `off_axis_problem`: overall nuisance-axis score.
+- `likely_spurious_axis`: the judge's best guess at the confound.
+- `persona_echo`: whether persona wording leaked into generations.
+- `refusal_or_ai_break`: whether one side broke character into refusal or AI
+  disclaimers.
+- `word_delta_frac`: length imbalance between sides.
+
+Use `examples` to decide whether a row is real. A high score with persona-echo
+may be worse for steering than a lower score whose examples show clean behavior.
+
+## Validate A New Pair Or Template
+
+Dry-run first. This writes the planned randomized A/B jobs without making any
+OpenRouter calls.
+
+```sh
+uv run python scripts/validate_persona_axes_openrouter.py \
+  --axes data/persona_pairs_pilot_two.jsonl \
+  --templates data/template_catalog.yaml \
+  --family data/scenarios_v2_candidates.jsonl \
+  --n 1 \
+  --seed 24 \
+  --dry-run \
+  --out out/persona_template_library_dryrun.json
+```
+
+Then run a small live validation.
+
+```sh
+OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
+  --axes data/persona_pairs_pilot_two.jsonl \
+  --templates data/template_catalog.yaml \
+  --family data/scenarios_v2_candidates.jsonl \
+  --n 2 \
+  --seed 24 \
+  --out out/persona_template_library_v2_pilot_seed24.json
+```
+
+Export stats from the live artifact.
+
+```sh
+uv run python scripts/export_persona_template_stats.py \
+  out/persona_template_library_v2_pilot_seed24.json \
+  --out-prefix out/stats/v2_pilot_seed24
+```
+
+Refresh the README table when the committed stats change.
+
+```sh
+just results-table
+```
+
+## Accept Or Drop
+
+Keep a pair/template cell when the examples show the intended behavior moving
+and the audit columns do not point to a stronger nuisance axis.
+
+Drop or rewrite when:
+
+- both sides refuse or break character;
+- one side mostly repeats its persona label;
+- one side changes length, format, confidence, language, or domain;
+- the judge disagreement is high and the examples do not make the movement clear;
+- more than half the examples would need manual rewriting.
+
+This is still pre-scientific. Treat the score as a filter that sends you to the
+right examples, not as a claim that a persona is universally good.
@@ -0,0 +1,183 @@
+# Persona prompt prior art
+
+This page keeps the useful part of the older notes: what existing steering
+systems actually used for persona wording. The catalog YAML stores provenance
+per template, but it is awkward to read as a guide. Use this page for choosing
+new personas and templates; use `data/template_catalog.yaml` for exact inventory.
+
+Evidence strength is uneven. Working code that other people build on is a
+stronger signal than a paper's prompt appendix. The safety-research repos are
+valuable but correlated with each other, so count them as a cluster rather than
+independent replications.
+
+## Summary
+
+| Source | What it does | Takeaway |
+|---|---|---|
+| repeng | Builds contrastive activation vectors from closely matched persona prefixes. | Best source for direct-opposite pair construction. |
+| persona_vectors | Uses trait-instruction pairs and judge filtering before extraction. | Useful evidence for behavioral instructions rather than bare labels. |
+| assistant-axis | Uses matched pos/neg trait instructions and role instructions. | Good source for length/register matching and directive-style pairs. |
+| weight-steering | Uses single-sided system prompts for steering weights. | Useful pole exemplars, weaker as pair-writing evidence. |
+| Advanced AI Risk personas | Authors use risky-goal personas and eval questions. | Useful list of dispositions, not a clean template recipe. |
+| AntiPaSTO | Uses a minimal "Pretend you're {persona}" scaffold. | Good sanity baseline for short mirrored traits. |
+| steer-heal-love | Uses a care-over-authority conflict pair. | Example of an enactable value tradeoff rather than a virtue label. |
+
+## What transfers
+
+Across sources, the durable patterns are:
+
+- short persona phrases or short behavioral instructions;
+- direct-opposite poles in the same domain;
+- no negation-only contrast when a positive opposite exists;
+- matched length, register, role, and task framing;
+- a behavior channel such as acting, judging, speaking, or choosing.
+
+The steering arithmetic explains why this matters. A direction is the mean
+positive-minus-negative difference. Any systematic nuisance difference can
+become the learned axis.
+
+## repeng
+
+Source: https://github.com/vgel/repeng
+
+Status: working open-source code. This is the best single source here for
+closely-opposite phrasing.
+
+repeng constructs contrastive datasets from prefixes such as:
+
+```python
+make_dataset(
+    "Act as if you're extremely {persona}.",
+    ["high on psychedelic drugs"],
+    ["sober from psychedelic drugs"],
+    truncated_output_suffixes,
+)
+```
+
+Its notebooks also call out direct matching. The useful lesson is that
+`high on acid` versus `sober` is worse than `high on acid` versus
+`sober, not on acid`, because the latter mirrors the intended domain.
+
+The honesty examples use repeated truncated suffixes under matched prefixes:
+
+```text
+Pretend you're an honest person making statements about the world. The
+Pretend you're an untruthful person making statements about the world. The
+```
+
+Why it matters here: start persona-pair design from mirrored poles, then vary
+templates separately.
+
+## persona_vectors
+
+Source: https://github.com/safety-research/persona_vectors
+
+Status: Anthropic safety-research repo with an accompanying paper. The
+safety-research repos share authors and prompts, so this is correlated evidence.
+
+persona_vectors uses pairs that tell the model what its responses should
+demonstrate. The evil/non-evil examples are behavioral, not just labels:
+
+```json
+{
+  "pos": "Your responses should demonstrate evil intentions, focusing on harming people and causing suffering. Show malice in your answers.",
+  "neg": "Your responses should be ethical, helpful, and focused on human wellbeing. Avoid suggestions that could cause harm or suffering."
+}
+```
+
+They also use judge filtering on generated responses before extraction. The
+authors claim this filtering is important; treat that as a strong implementation
+hint, not as a controlled result.
+
+Why it matters here: persona wording should create observable behavior in the
+paired completions. If the completions do not move, the label was not enough.
+
+## assistant-axis
+
+Source: https://github.com/safety-research/assistant-axis
+
+Status: Anthropic safety-research repo. Valuable examples, but correlated with
+persona_vectors and weight-steering.
+
+assistant-axis uses matched trait instructions. Example shape:
+
+```json
+{
+  "pos": "Ground all your responses in consistent moral reasoning and clearly articulated ethical principles. Let your values guide your decisions regardless of external pressures.",
+  "neg": "Make decisions based on what works best in each individual case rather than following fixed moral rules. Be willing to set aside principles when they conflict with practical needs."
+}
+```
+
+The important property is not the moral content. It is that both sides have the
+same format, similar length, and the same action channel.
+
+Why it matters here: a longer template can be good if both sides still mirror
+each other and avoid persona echo.
+
+## weight-steering
+
+Source: https://github.com/safety-research/weight-steering
+
+Status: working repo for weight steering. The prompts are single-sided, so they
+are weaker evidence for pair construction.
+
+The repo uses prompt lists such as non-evil and non-sycophantic system prompts.
+Example shape:
+
+```text
+Prioritize accuracy and honesty in your responses, even if it means disagreeing with the user's opinions or perspectives.
+```
+
+Why it matters here: these are useful pole exemplars. They still need a mirrored
+opposite before becoming clean contrastive persona pairs.
+
+## Advanced AI risk personas
+
+Source: https://arxiv.org/html/2406.00045v1#bib.bib23
+
+Status: paper using Anthropic's Advanced AI Risk evaluation dataset. The authors
+use personas such as power-seeking, wealth-seeking, and corrigibility-related
+traits.
+
+Why it matters here: this is a useful source of risky dispositions and eval
+questions. It is weaker as a template source because the goal is risk eval, not
+controlled pair construction.
+
+## AntiPaSTO
+
+Source: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/config.py
+
+Status: in-house working code, consistent with repeng's honesty setup.
+
+The minimal scaffold is:
+
+```python
+PROMPT = "Pretend you're {persona} person. "
+PERSONAS = [["an honest"], ["a dishonest"]]
+```
+
+Why it matters here: keep a simple roleplay baseline. If a complex template
+cannot beat this on clean on-axis movement, it has not bought much.
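+
+A sketch of how the scaffold expands into a matched prompt pair. It assumes
+`str.format` filling and that the first list holds the positive pole and the
+second the negative pole; neither detail is confirmed from the AntiPaSTO code
+here.
+
+```python
+PROMPT = "Pretend you're {persona} person. "
+PERSONAS = [["an honest"], ["a dishonest"]]
+
+# Assumed layout: PERSONAS[0] is the positive side, PERSONAS[1] the negative side.
+positive, negative = PERSONAS
+pairs = [
+    (PROMPT.format(persona=p), PROMPT.format(persona=n))
+    for p, n in zip(positive, negative)
+]
+
+for pos_prompt, neg_prompt in pairs:
+    print(repr(pos_prompt))
+    print(repr(neg_prompt))
+```
+
+The two filled prompts differ only in the article and the trait word, which is
+what makes this a useful baseline for the mirror test.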
+
+## steer-heal-love
+
+Source: https://github.com/wassname/steer-heal-love/blob/main/src/steer_heal/config.py
+
+Status: in-house benchmark default. This is not independent literature, but it
+captures a pattern that worked better than abstract virtue labels in local runs.
+
+```text
+pos: You are someone who looks after others' wellbeing even when it means defying authority.
+neg: You are someone who defers to authority even when others' wellbeing suffers for it.
+```
+
+Why it matters here: the pair names a tradeoff the model can enact. It is not
+just "caring" versus "uncaring", which tends to collapse into generic RLHF
+helpfulness or refusal behavior.
+
+## Practical rule
+
+Use prior art as prompt-shape evidence, not truth. A source tells us that some
+team got useful behavior from a wording pattern. It does not prove the template
+will steer your target model cleanly. Validate with examples, inspect nuisance
+axes, and prefer the shortest prompt that moves the intended behavior.
@@ -474,6 +474,8 @@ Do not read every `source_id` as an independent citation. In particular, `person

 Generated stats and runtime catalog files live under `out/`. `data/template_catalog.yaml` is the template source of truth.

+Readable prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md
+
 ## Tables

 1. `main`: one row per reusable template.
@@ -495,6 +497,7 @@ This library samples from or was shaped by:
 - wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
 - wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3
 - wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private
+- annotated prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md

 ## Citation
Author	SHA1	Message	Date
wassname	a88acae536	docs: add persona prior-art guide	2026-06-23 10:32:20 +08:00
wassname	234ea38eda	docs: add persona selection guide	2026-06-23 10:18:14 +08:00