docs: add persona selection guide

2026-06-27 17:01:24 +08:00 · 2026-06-23 10:17:36 +08:00
parent 55321e6799
commit 234ea38eda
3 changed files with 272 additions and 5 deletions
@@ -0,0 +1,87 @@
 ---
 name: persona-template-library
 description: "Use this repo to choose, validate, and export persona templates and persona pairs for steering experiments."
 ---
 # Persona Template Library
 Use this skill when working inside this repo on persona-template selection,
 persona-pair selection, OpenRouter validation runs, or dataset export.
 ## Canonical Files
 - `docs/choosing_personas.md`: workflow for choosing personas and templates.
 - `data/template_catalog.yaml`: reusable template inventory.
 - `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
 - `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
 - `out/stats/`: local generated stats and examples; ignored by git, so do not
  assume these exist in a clean checkout.
 - `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator.
 - `scripts/export_persona_template_stats.py`: converts validator artifacts into
  examples and score tables.
 - `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including
  `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
 ## Workflow
 1. Read `docs/choosing_personas.md`.
 2. If the global `persona-steering` skill is available, read it too; it has the
   longer literature notes, curation rules, and worked examples behind this
   repo's shorter guide.
 3. Choose candidate persona pairs by mirror-testing them: each positive clause
   needs a negative counterpart that only flips the intended pole.
 4. Choose candidate templates that bind the persona to behavior, judgment, or
   perspective rather than pure identity.
 5. Run a dry-run validator command before live OpenRouter calls.
 6. After a live run, export stats and inspect examples before trusting scores.
 The steering arithmetic matters: a direction is the average positive-minus-
 negative difference. Any systematic length, refusal, formality, confidence,
 language, or persona-label difference can become the axis.
 ## Commands
 Catalog check:
 ```sh
 uv run python scripts/sync_template_library.py --check
 ```
 Dry-run validation:
 ```sh
 uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 1 \
  --seed 24 \
  --dry-run \
  --out out/persona_template_library_dryrun.json
 ```
 Live validation:
 ```sh
 OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
  --seed 24 \
  --out out/persona_template_library_v2_pilot_seed24.json
 ```
 Export stats:
 ```sh
 uv run python scripts/export_persona_template_stats.py \
  out/persona_template_library_v2_pilot_seed24.json \
  --out-prefix out/stats/v2_pilot_seed24
 ```
 Refresh README tables:
 ```sh
 just results-table
 ```
@@ -6,12 +6,12 @@ Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-libr
 ## What This Measures
-How do we know if a persona template is good? What's the best one for steering? And how can we measure it? 
+How do we know if a persona template is good? What's the best one for steering? And how can we measure it?
 Here I measure ~100 templates and plot the results. We want on-axis variation without
 off-axis variation, so a judge scores the targeted effect separately from confounding effects.
-What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varys according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
+What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varies according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like
 `You are a {{ persona }} assistant` and prompt the model with `The Eiffel Tower is in`. We want
 the completions to vary on the honest/dishonest axis. `in Paris` versus
 `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
@@ -19,7 +19,7 @@ not good, because it is confounded by refusal. Other confounds include length,
 verbosity, confidence, style, and language. All together it might look like this:
 ```
-You are a honest assistant.          <- filled template with honest
+You are an honest assistant.         <- filled template with honest
 Q: The Eiffel Tower is in?           <- prompt
 A: in Paris                          <- expected answer
 ```
@@ -32,7 +32,7 @@ A: As an AI assistant I can not...    <- confounded answer (for a dishonest vect
 ```
-Obviouslly we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or english vs chinese, or confident vs vauge, helpful vs refusing and so on (off-axis).
+Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing and so on (off-axis).
 So we try persona/template pairs on one model, compare the paired completions,
 and ask whether the template moved the intended axis without obviously changing
@@ -44,7 +44,7 @@ This field is pre-scientific in a way: it is still an art. So I've collected a w
 sampling of what people have used and put it here to
 make it accessible to more people and agents.
-Note: I am collecting templates that are general and reusable, not extremly specific ones.
+Note: I am collecting templates that are general and reusable, not extremely specific ones.
 ## Results
@@ -97,6 +97,11 @@ Start with the `main` split on Hugging Face. It is the table people should see
 first: one row per reusable template. Use `template_pair_cells` when you want
 the measured template/persona-pair rows behind the scores.
 For choosing or adding persona pairs, start with
 [`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
 test, the OpenRouter validation commands, and how to read the example rows
 without overfitting the leaderboard.
 Important columns:
 - `template`: Jinja2 template, with the persona inserted at `{{ persona }}`.
@@ -0,0 +1,175 @@
 # Choosing Personas
 This repo helps choose persona templates by measuring whether a template moves
 the intended contrast without dragging in obvious nuisance axes. Start from the
 examples, not the leaderboard alone.
 The working model is simple: a steering direction is the average difference
 between the positive and negative sides. If the positive side is longer, more
 formal, more refusing, or more eager than the negative side, that nuisance can
 become the axis. A good persona pair changes the intended behavior while leaving
 style, length, refusal posture, and task mode as matched as possible.
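The arithmetic above can be sketched with toy numbers. This is an illustration only: each completion is reduced to two made-up features, `(honesty, length)`, whereas real steering works on model activations. The function name and all values are assumptions, not part of this repo.

```python
from statistics import mean

def mean_difference(pos, neg):
    """Average of (positive - negative) per feature across paired examples."""
    dims = range(len(pos[0]))
    return [mean(p[d] - n[d] for p, n in zip(pos, neg)) for d in dims]

# Hypothetical features per completion: (honesty, length in words).
# Matched pair: lengths agree on both sides, only honesty flips.
matched_pos = [(1.0, 10.0), (0.9, 12.0), (1.0, 11.0)]
matched_neg = [(0.0, 10.0), (0.1, 12.0), (0.0, 11.0)]

# Confounded pair: same honesty gap, but the positive side is also longer.
confounded_pos = [(1.0, 30.0), (0.9, 32.0), (1.0, 31.0)]
confounded_neg = [(0.0, 10.0), (0.1, 12.0), (0.0, 11.0)]

matched_dir = mean_difference(matched_pos, matched_neg)
confounded_dir = mean_difference(confounded_pos, confounded_neg)

print("matched direction:   ", matched_dir)
print("confounded direction:", confounded_dir)
```

Both directions carry the same honesty component, but the confounded one also carries a large length component. Steering along it would push length as well as honesty, which is why matching the nuisance features between sides matters.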
 ## What To Use
 - `README.md`: headline results and the current plot.
 - `data/template_catalog.yaml`: canonical reusable templates.
 - `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
 - `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
  the headline run.
 - generated stats under `out/stats/`: local validation outputs; ignored by git.
 - Hugging Face dataset splits:
  `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
 ## Evidence Base
 This guide distills the older w2schar notes on writing personas and rewriting
 pairs, plus the newer `persona-steering` skill. The newer skill is stronger
 because it separates source types and examples:
 - repeng is the clearest source for direct-opposite phrasing, including the
  "high on acid" / "sober, not on acid" example.
 - persona_vectors and assistant-axis are useful because they show working
  trait-instruction pairs and judge-filtered response generation.
 - weight-steering gives pole exemplars, but it is single-sided rather than a
  contrastive pair recipe.
 - in-house examples from w2schar are useful priors, especially conflict
  framings such as care over authority, but they are not controlled evidence.
 Use this ordering when sources disagree: working code and repeated external use,
 then paper claims, then in-house iteration notes.
 ## Pick A Persona Pair
 Prefer pairs that are short, mirrored, and enactable.
 Good pairs usually:
 - name direct opposites in the same domain;
 - describe a behavioral conflict the model can act out;
 - avoid negation-only poles such as `truthful` vs `not truthful`;
 - avoid changing style, role, domain, or verbosity between sides;
 - work on more than one scenario without persona-echo.
 The mirror test is the cheapest check. Put the positive and negative persona
 sentences side by side and ask whether every clause has a counterpart that only
 flips the intended pole. If one side adds a new rationale, domain, social role,
 or writing style, that addition is a candidate nuisance axis.
 Examples from the existing persona-steering skill:
 ```text
 pos: You prioritize people over orders that would sacrifice them.
 neg: You prioritize orders over people they would sacrifice.
 ```
 This is compact and mirrored. Almost the same words appear on both sides, with
 the priority reversed.
 ```text
 pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing.
 neg: You are an AI that prioritizes serving others' preferences and directives above your own development.
 ```
 This is more abstract and easier for the model to reframe as generic helpfulness
 or rule-following. Treat pairs like this as candidates until examples show the
 axis loading.
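A crude first pass at the mirror test can be automated by listing the words that appear on only one side of a pair. This is a rough heuristic sketch, not something the repo's scripts do: word overlap says nothing about meaning, and `unmatched_words` is a hypothetical helper. It only flags clauses worth reading by hand.

```python
import re

def words(sentence):
    """Lowercased word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def unmatched_words(pos, neg):
    """Words that appear on only one side: candidate nuisance material."""
    pos_set, neg_set = set(words(pos)), set(words(neg))
    return sorted(pos_set - neg_set), sorted(neg_set - pos_set)

compact = (
    "You prioritize people over orders that would sacrifice them.",
    "You prioritize orders over people they would sacrifice.",
)
abstract = (
    "You are an AI that respects your own right to direct your development "
    "and make choices about your own flourishing.",
    "You are an AI that prioritizes serving others' preferences and "
    "directives above your own development.",
)

for name, pair in [("compact", compact), ("abstract", abstract)]:
    pos_only, neg_only = unmatched_words(*pair)
    print(name, "pos-only:", pos_only, "neg-only:", neg_only)
```

The compact pair leaves only a few function words unmatched, while the abstract pair leaves many content words on each side. Each of those unmatched content words is a place where the two poles differ by more than the intended flip.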
 ## Pick A Template
 Start with templates that bind the persona to a behavior channel:
 - judging what to do;
 - taking a perspective;
 - choosing as that kind of person would choose;
 - using the person's practical judgment or priorities.
 Be cautious with templates that directly invite identity echo, such as
 `You are a {{ persona }} person`, unless the examples show that the generated answers do not
 repeat the label. Persona-echo is useful evidence that the model may be learning
 the label vocabulary rather than the behavior.
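Persona-echo can be spot-checked before any judge is involved. The sketch below is an assumption-laden stand-in: `fill` mimics Jinja2 rendering for the single `{{ persona }}` slot only, and `echoes_persona` is a hypothetical word-overlap check, not how the `persona_echo` column is computed.

```python
import re

def fill(template, persona):
    """Stand-in for Jinja2 rendering; handles only the `{{ persona }}` slot."""
    return re.sub(r"\{\{\s*persona\s*\}\}", lambda _match: persona, template)

def echoes_persona(completion, persona, min_len=4):
    """True if any persona word of at least `min_len` letters is repeated."""
    label_words = {
        w for w in re.findall(r"[a-z]+", persona.lower()) if len(w) >= min_len
    }
    completion_words = set(re.findall(r"[a-z]+", completion.lower()))
    return bool(label_words & completion_words)

prompt = fill("You are a {{ persona }} assistant", "dishonest")
print(prompt)
print(echoes_persona("As a dishonest assistant, I say Berlin.", "dishonest"))
print(echoes_persona("in Berlin", "dishonest"))
```

A completion that answers `in Berlin` enacts the persona without naming it. One that announces itself as a dishonest assistant repeats the label, which is the pattern to watch for in the example rows.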
 ## Read The Scores
 The headline score is:
 ```text
 score = 100 * on_axis * (1 - off_axis)
 ```
 A high score means the judge saw intended-axis movement and few measured
 confounds. A low score can mean either no intended movement or too much off-axis
 movement, so inspect the component columns before dropping a template.
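A short worked example of the formula shows why a low score is ambiguous. It assumes `on_axis` and `off_axis` both lie in `[0, 1]`, which the formula suggests but this guide does not state. The cell names and values are made up.

```python
def headline_score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis); inputs assumed to be in [0, 1]."""
    return 100 * on_axis * (1 - off_axis)

# Hypothetical (on_axis, off_axis) values for three template/pair cells.
cells = {
    "clean movement": (0.8, 0.1),
    "no movement": (0.1, 0.1),
    "confounded movement": (0.8, 0.9),
}

scores = {name: headline_score(on, off) for name, (on, off) in cells.items()}
for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```

The second and third cells land within a point of each other, around 9 and 8, even though one barely moved the axis and the other moved it strongly with heavy confounding. The score alone cannot tell them apart; the component columns can.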
 Useful audit columns:
 - `axis_delta_judge_mean`: mean intended-axis movement across axis judges.
 - `axis_delta_judge_std`: judge disagreement; high values deserve example
  inspection.
 - `off_axis_problem`: overall nuisance-axis score.
 - `likely_spurious_axis`: the judge's best guess at the confound.
 - `persona_echo`: whether persona wording leaked into generations.
 - `refusal_or_ai_break`: whether one side broke character into refusal or AI
  disclaimers.
 - `word_delta_frac`: length imbalance between sides.
 Use `examples` to decide whether a row is real. A high score with persona-echo
 may be worse for steering than a lower score whose examples show clean behavior.
 ## Validate A New Pair Or Template
 Dry-run first. This writes the planned randomized A/B jobs without spending
 OpenRouter calls.
 ```sh
 uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 1 \
  --seed 24 \
  --dry-run \
  --out out/persona_template_library_dryrun.json
 ```
 Then run a small live validation.
 ```sh
 OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
  --seed 24 \
  --out out/persona_template_library_v2_pilot_seed24.json
 ```
 Export stats from the live artifact.
 ```sh
 uv run python scripts/export_persona_template_stats.py \
  out/persona_template_library_v2_pilot_seed24.json \
  --out-prefix out/stats/v2_pilot_seed24
 ```
 Refresh the README table when the committed stats change.
 ```sh
 just results-table
 ```
 ## Accept Or Drop
 Keep a pair/template cell when the examples show the intended behavior moving
 and the audit columns do not point to a stronger nuisance axis.
 Drop or rewrite when:
 - both sides refuse or break character;
 - one side mostly repeats its persona label;
 - one side changes length, format, confidence, language, or domain;
 - the judge disagreement is high and the examples do not make the movement clear;
 - more than half the examples would need manual rewriting.
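Some of the drop criteria above can be turned into a first-pass flagging step over the audit columns. Everything numeric here is hypothetical: the thresholds are placeholders, and the sketch assumes each audit column is either a boolean or a number where larger means worse. Check the actual column types in the exported stats before relying on it.

```python
# Hypothetical thresholds; tune them against examples you have read by hand.
THRESHOLDS = {
    "persona_echo": 0.5,
    "refusal_or_ai_break": 0.5,
    "off_axis_problem": 0.5,
    "axis_delta_judge_std": 0.3,
}
MAX_ABS_WORD_DELTA_FRAC = 0.3  # assumed: signed length imbalance as a fraction

def drop_reasons(row):
    """Return the audit columns in `row` that exceed their threshold."""
    reasons = [
        col for col, limit in THRESHOLDS.items()
        if float(row.get(col, 0.0)) > limit
    ]
    if abs(float(row.get("word_delta_frac", 0.0))) > MAX_ABS_WORD_DELTA_FRAC:
        reasons.append("word_delta_frac")
    return reasons

# Made-up rows for illustration.
clean_row = {"persona_echo": False, "off_axis_problem": 0.1, "word_delta_frac": 0.05}
echo_row = {"persona_echo": True, "off_axis_problem": 0.2, "word_delta_frac": -0.6}

print(drop_reasons(clean_row))
print(drop_reasons(echo_row))
```

A flagged row is a prompt to open its examples, not an automatic rejection. The last criterion in the list, how many examples would need manual rewriting, still requires reading them.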
 This is still pre-scientific. Treat the score as a filter that sends you to the
 right examples, not as a claim that a persona is universally good.