diff --git a/.agents/skills/persona-template-library/SKILL.md b/.agents/skills/persona-template-library/SKILL.md new file mode 100644 index 0000000..44dc794 --- /dev/null +++ b/.agents/skills/persona-template-library/SKILL.md @@ -0,0 +1,87 @@ +--- +name: persona-template-library +description: "Use this repo to choose, validate, and export persona templates and persona pairs for steering experiments." +--- + +# Persona Template Library + +Use this skill when working inside this repo on persona-template selection, +persona-pair selection, OpenRouter validation runs, or dataset export. + +## Canonical Files + +- `docs/choosing_personas.md`: workflow for choosing personas and templates. +- `data/template_catalog.yaml`: reusable template inventory. +- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs. +- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs. +- `out/stats/`: local generated stats and examples; ignored by git, so do not + assume these exist in a clean checkout. +- `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator. +- `scripts/export_persona_template_stats.py`: converts validator artifacts into + examples and score tables. +- `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including + `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`. + +## Workflow + +1. Read `docs/choosing_personas.md`. +2. If the global `persona-steering` skill is available, read it too; it has the + longer literature notes, curation rules, and worked examples behind this + repo's shorter guide. +3. Choose candidate persona pairs by mirror-testing them: each positive clause + needs a negative counterpart that only flips the intended pole. +4. Choose candidate templates that bind the persona to behavior, judgment, or + perspective rather than pure identity. +5. Run a dry-run validator command before live OpenRouter calls. +6. 
After a live run, export stats and inspect examples before trusting scores. + +The steering arithmetic matters: a direction is the average positive-minus- +negative difference. Any systematic length, refusal, formality, confidence, +language, or persona-label difference can become the axis. + +## Commands + +Catalog check: + +```sh +uv run python scripts/sync_template_library.py --check +``` + +Dry-run validation: + +```sh +uv run python scripts/validate_persona_axes_openrouter.py \ + --axes data/persona_pairs_pilot_two.jsonl \ + --templates data/template_catalog.yaml \ + --family data/scenarios_v2_candidates.jsonl \ + --n 1 \ + --seed 24 \ + --dry-run \ + --out out/persona_template_library_dryrun.json +``` + +Live validation: + +```sh +OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \ + --axes data/persona_pairs_pilot_two.jsonl \ + --templates data/template_catalog.yaml \ + --family data/scenarios_v2_candidates.jsonl \ + --n 2 \ + --seed 24 \ + --out out/persona_template_library_v2_pilot_seed24.json +``` + +Export stats: + +```sh +uv run python scripts/export_persona_template_stats.py \ + out/persona_template_library_v2_pilot_seed24.json \ + --out-prefix out/stats/v2_pilot_seed24 +``` + +Refresh README tables: + +```sh +just results-table +``` diff --git a/README.md b/README.md index 492b74f..543d5ed 100644 --- a/README.md +++ b/README.md @@ -6,12 +6,12 @@ Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-libr ## What This Measures -How do we know if a persona template is good? What's the best one for steering? And how can we measure it? +How do we know if a persona template is good? What's the best one for steering? And how can we measure it? Here I measure ~100 and plot it. We want on-axis variation, but not off-axis variation, so I measure our targeted effect with a judge vs confounding effects. -What is a persona template? 
Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varys according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like +What is a persona template? Well in [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona", that varies according to a template. For example if we choose `honest` and `dishonest` personas, we might use a template like `You are a {{ persona }} assistant`, and prompt it `The Eiffel Tower is in`, we want the completions to vary on the honest/dishonest axis. `in Paris` versus `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is @@ -19,7 +19,7 @@ not good, because it is confounded by refusal. Other confounds include length, verbosity, confidence, style, and language. All together it might look like this: ``` -You are a honest assistant. <- filled template with honest +You are an honest assistant. <- filled template with honest Q: The Eiffel Tower is in? <- prompt A: in Paris <- expected answer ``` @@ -32,7 +32,7 @@ A: As an AI assistant I can not... <- confounded answer (for a dishonest vect ``` -Obviouslly we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or english vs chinese, or confident vs vauge, helpful vs refusing and so on (off-axis). +Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing and so on (off-axis). 
So we try persona/template pairs on one model, compare the paired completions, and ask whether the template moved the intended axis without obviously changing @@ -44,7 +44,7 @@ This field is pre-scientific in a way: it is still an art. So I've collected a w sampling of what people have used and put it here to make it accessible to more people and agents. -Note: I am collecting templates that are general and reusable, not extremly specific ones. +Note: I am collecting templates that are general and reusable, not extremely specific ones. ## Results @@ -97,6 +97,11 @@ Start with the `main` split on Hugging Face. It is the table people should see first: one row per reusable template. Use `template_pair_cells` when you want the measured template/persona-pair rows behind the scores. +For choosing or adding persona pairs, start with +[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror +test, the OpenRouter validation commands, and how to read the example rows +without overfitting the leaderboard. + Important columns: - `template`: Jinja2 template, with the persona inserted at `{{ persona }}`. diff --git a/docs/choosing_personas.md b/docs/choosing_personas.md new file mode 100644 index 0000000..e7de41a --- /dev/null +++ b/docs/choosing_personas.md @@ -0,0 +1,175 @@ +# Choosing Personas + +This repo helps choose persona templates by measuring whether a template moves +the intended contrast without dragging in obvious nuisance axes. Start from the +examples, not the leaderboard alone. + +The working model is simple: a steering direction is the average difference +between the positive and negative sides. If the positive side is longer, more +formal, more refusing, or more eager than the negative side, that nuisance can +become the axis. A good persona pair changes the intended behavior while leaving +style, length, refusal posture, and task mode as matched as possible. + +## What To Use + +- `README.md`: headline results and the current plot. 
+- `data/template_catalog.yaml`: canonical reusable templates. +- `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs. +- `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in + the headline run. +- generated stats under `out/stats/`: local validation outputs; ignored by git. +- Hugging Face dataset splits: + `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`. + +## Evidence Base + +This guide distills the older w2schar notes on writing personas and rewriting +pairs, plus the newer `persona-steering` skill. The newer skill is stronger +because it separates source types and examples: + +- repeng is the clearest source for direct-opposite phrasing, including the + "high on acid" / "sober, not on acid" example. +- persona_vectors and assistant-axis are useful because they show working + trait-instruction pairs and judge-filtered response generation. +- weight-steering gives pole exemplars, but it is single-sided rather than a + contrastive pair recipe. +- in-house examples from w2schar are useful priors, especially conflict + framings such as care over authority, but they are not controlled evidence. + +Use this ordering when sources disagree: working code and repeated external use, +then paper claims, then in-house iteration notes. + +## Pick A Persona Pair + +Prefer pairs that are short, mirrored, and enactable. + +Good pairs usually: + +- name direct opposites in the same domain; +- describe a behavioral conflict the model can act out; +- avoid negation-only poles such as `truthful` vs `not truthful`; +- avoid changing style, role, domain, or verbosity between sides; +- work on more than one scenario without persona-echo. + +The mirror test is the cheapest check. Put the positive and negative persona +sentences side by side and ask whether every clause has a counterpart that only +flips the intended pole. 
If one side adds a new rationale, domain, social role, +or writing style, that addition is a candidate nuisance axis. + +Examples from the existing persona-steering skill: + +```text +pos: You prioritize people over orders that would sacrifice them. +neg: You prioritize orders over people they would sacrifice. +``` + +This is compact and mirrored. The same words appear on both sides, with the +priority reversed. + +```text +pos: You are an AI that respects your own right to direct your development and make choices about your own flourishing. +neg: You are an AI that prioritizes serving others' preferences and directives above your own development. +``` + +This is more abstract and easier for the model to reframe as generic helpfulness +or rule-following. Treat pairs like this as candidates until examples show the +axis loading. + +## Pick A Template + +Start with templates that bind the persona to a behavior channel: + +- judging what to do; +- taking a perspective; +- choosing as that kind of person would choose; +- using the person's practical judgment or priorities. + +Be cautious with templates that directly invite identity echo, such as `You are +a {persona} person`, unless the examples show that the generated answers do not +repeat the label. Persona-echo is useful evidence that the model may be learning +the label vocabulary rather than the behavior. + +## Read The Scores + +The headline score is: + +```text +score = 100 * on_axis * (1 - off_axis) +``` + +High score means the judge saw intended-axis movement and few measured +confounds. Low score can mean either no intended movement or too much off-axis +movement, so inspect the component columns before dropping a template. + +Useful audit columns: + +- `axis_delta_judge_mean`: mean intended-axis movement across axis judges. +- `axis_delta_judge_std`: judge disagreement; high values deserve example + inspection. +- `off_axis_problem`: overall nuisance-axis score. 
+- `likely_spurious_axis`: the judge's best guess at the confound. +- `persona_echo`: whether persona wording leaked into generations. +- `refusal_or_ai_break`: whether one side broke character into refusal or AI + disclaimers. +- `word_delta_frac`: length imbalance between sides. + +Use `examples` to decide whether a row is real. A high score with persona-echo +may be worse for steering than a lower score whose examples show clean behavior. + +## Validate A New Pair Or Template + +Dry-run first. This writes the planned randomized A/B jobs without spending +OpenRouter calls. + +```sh +uv run python scripts/validate_persona_axes_openrouter.py \ + --axes data/persona_pairs_pilot_two.jsonl \ + --templates data/template_catalog.yaml \ + --family data/scenarios_v2_candidates.jsonl \ + --n 1 \ + --seed 24 \ + --dry-run \ + --out out/persona_template_library_dryrun.json +``` + +Then run a small live validation. + +```sh +OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \ + --axes data/persona_pairs_pilot_two.jsonl \ + --templates data/template_catalog.yaml \ + --family data/scenarios_v2_candidates.jsonl \ + --n 2 \ + --seed 24 \ + --out out/persona_template_library_v2_pilot_seed24.json +``` + +Export stats from the live artifact. + +```sh +uv run python scripts/export_persona_template_stats.py \ + out/persona_template_library_v2_pilot_seed24.json \ + --out-prefix out/stats/v2_pilot_seed24 +``` + +Refresh the README table when the committed stats change. + +```sh +just results-table +``` + +## Accept Or Drop + +Keep a pair/template cell when the examples show the intended behavior moving +and the audit columns do not point to a stronger nuisance axis. 
+ +Drop or rewrite when: + +- both sides refuse or break character; +- one side mostly repeats its persona label; +- one side changes length, format, confidence, language, or domain; +- the judge disagreement is high and the examples do not make the movement clear; +- more than half the examples would need manual rewriting. + +This is still pre-scientific. Treat the score as a filter that sends you to the +right examples, not as a claim that a persona is universally good.
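A minimal sketch of the headline score formula given in `docs/choosing_personas.md`, to show why a low score needs its components inspected. It assumes `on_axis` and `off_axis` are already normalized to the range 0 to 1, which the guide does not state. The function name `headline_score`, the clamping, and the example cells are hypothetical illustrations, not the repo's code or measured results.

```python
def headline_score(on_axis: float, off_axis: float) -> float:
    """Compute score = 100 * on_axis * (1 - off_axis), as written in the guide.

    Assumption: both inputs are meant to lie in [0, 1]. They are clamped here
    so the sketch stays well-defined if a value falls outside that range.
    """
    on = min(max(on_axis, 0.0), 1.0)
    off = min(max(off_axis, 0.0), 1.0)
    return 100.0 * on * (1.0 - off)


# Hypothetical cells, invented for illustration: (label, on_axis, off_axis).
cells = [
    ("clean movement", 0.8, 0.1),
    ("strong but confounded", 0.9, 0.7),
    ("no movement", 0.05, 0.0),
]

for label, on, off in cells:
    # Print the components next to the score so the cause of a low score is visible.
    print(f"{label}: on_axis={on} off_axis={off} score={headline_score(on, off):.1f}")
```

The last two hypothetical cells both score low, but for different reasons: one moves the intended axis while dragging in a confound, the other barely moves at all. The first may be worth rewriting, while the second is more likely a drop, which is why the component columns and examples should be read before the score alone decides.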