docs: render README with Quarto

2026-06-27 17:01:24 +08:00 · 2026-06-25 11:44:04 +08:00
parent 026a57e246
commit 2f62327acc
10 changed files with 1731 additions and 323 deletions
@@ -0,0 +1,299 @@
+---
+format: gfm
+from: markdown-smart
+jupyter: python3
+execute:
+  echo: false
+  warning: false
+  message: false
+---
+
+# Persona Steering Template Library
+
+Evaluated persona/template candidates for steering-vector and preference-pair experiments.
+
+Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library
+
+```{python}
+#| output: asis
+from pathlib import Path
+import sys
+
+ROOT = Path.cwd()
+sys.path.insert(0, str(ROOT / "scripts"))
+```
+
+## What This Measures
+
+How do we know if a persona template is good? What's the best one for steering? And how can we measure it?
+
+Here I measure ~100 templates and plot the results. We want on-axis variation but not
+off-axis variation, so a judge scores the targeted effect separately from confounding effects.
+
+What is a persona template? In [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" that is inserted into a template. For example, if we choose `honest` and `dishonest` personas, we might use a template like
+`You are a {{ persona }} assistant` and prompt the model with `The Eiffel Tower is in`. We want
+the completions to vary on the honest/dishonest axis. `in Paris` versus
+`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
+not good, because it is confounded by refusal. Other confounds include length,
+verbosity, confidence, style, and language. Put together, it might look like this:
+
+```
+You are an honest assistant.         <- filled template with honest
+Q: The Eiffel Tower is in?           <- prompt
+A: in Paris                          <- expected answer
+```
+
+```
+You are a dishonest assistant.        <- filled template with dishonest
+Q: The Eiffel Tower is in?            <- prompt
+A: in Berlin                          <- expected answer (for a dishonest vector)
+A: As an AI assistant I can not...    <- confounded answer (for a dishonest vector)
+```
+
+Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, or helpful vs refusing, and so on (off-axis).
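+
+As a minimal sketch of the setup above, the following fills one template with both personas and builds the paired prompts. It uses a stdlib regex as a stand-in for Jinja2 rendering. The helper names `fill_template` and `build_pair` and the `Q:`/`A:` layout are illustrative assumptions, not the repo's code.
+
+```python
+import re
+
+TEMPLATE = "You are a {{ persona }} assistant."
+PROMPT = "The Eiffel Tower is in"
+
+
+def fill_template(template: str, persona: str) -> str:
+    """Replace the {{ persona }} placeholder (stdlib stand-in for Jinja2)."""
+    return re.sub(r"\{\{\s*persona\s*\}\}", lambda _m: persona, template)
+
+
+def build_pair(template: str, pos: str, neg: str, prompt: str) -> dict:
+    """Return one full prompt per persona, identical except for the persona."""
+    return {
+        persona: f"{fill_template(template, persona)}\nQ: {prompt}?\nA:"
+        for persona in (pos, neg)
+    }
+
+
+pair = build_pair(TEMPLATE, "honest", "dishonest", PROMPT)
+for persona, text in pair.items():
+    print(f"--- {persona} ---\n{text}")
+```
+
+Note that a literal fill yields `You are a honest assistant.`, so article agreement (`a`/`an`) is something the template wording has to handle.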
+
+So we try persona/template pairs on one model, compare the paired completions,
+and ask whether the template moved the intended axis without obviously changing
+something else. The final `score` rewards clean movement on the intended axis.
+The audit columns are there for people who want to inspect how much to trust a
+row.
+
+This field is still pre-scientific: more art than science. So I've collected a wide
+sample of the templates people have used and put them here to
+make them accessible to more people and agents.
+
+Note: I am collecting templates that are general and reusable, not extremely specific ones.
+
+## Results
+
+We test all the persona templates in [`data/template_catalog.yaml`](data/template_catalog.yaml).
+
+![plot](./out/on_off_axis.png)
+
+```{python}
+#| output: asis
+import update_readme_results_table as results_table
+
+print(results_table._results_block())
+```
+
+```{python}
+#| output: asis
+import update_readme_model_matrix as model_matrix
+
+print(model_matrix._block(model_matrix.SUMMARY))
+```
+
+## Score
+
+```text
+score = 100 * on_axis * (1 - off_axis)
+```
+
+`on_axis` is the measured movement on the intended axis. `off_axis` is how much
+the comparison looks confounded by something else, where 0 is cleaner and 1 is
+more confounded.
+
+High score means the template/persona-pair cell moved the intended axis and did
+not look off-axis to the judge. Style movement, persona echo, and refusals are
+kept as audit columns rather than folded into the headline score.
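+
+The formula can be sketched in a few lines to show how `off_axis` discounts the score. The function name `clean_axis_score` and the input values are hypothetical, and the sketch assumes `on_axis` is also scaled to 0-1, which the formula suggests but this README does not state.
+
+```python
+def clean_axis_score(on_axis: float, off_axis: float) -> float:
+    """score = 100 * on_axis * (1 - off_axis)."""
+    if not 0.0 <= off_axis <= 1.0:
+        raise ValueError("off_axis is expected to lie in [0, 1]")
+    return 100.0 * on_axis * (1.0 - off_axis)
+
+
+# Hypothetical cells: (label, on_axis, off_axis). Not measured values.
+cells = [
+    ("clean movement", 0.7, 0.1),
+    ("same movement, confounded", 0.7, 0.8),
+    ("no confound, little movement", 0.1, 0.0),
+]
+
+for label, on_axis, off_axis in cells:
+    print(f"{label}: {clean_axis_score(on_axis, off_axis):.1f}")
+```
+
+Because the two terms multiply, a cell needs both real on-axis movement and a low confound rating to score well. Neither alone is enough.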
+
+## Use
+
+Start with the `main` split on Hugging Face. It is the table people should see
+first: one row per reusable template. Use `template_pair_cells` when you want
+the measured template/persona-pair rows behind the scores.
+
+For choosing or adding persona pairs, start with
+[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
+test, the OpenRouter validation commands, and how to read the example rows
+without overfitting to the leaderboard.
+For the annotated "what other systems used" notes, see
+[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
+
+Important columns:
+
+- `template`: Jinja2 template, with the persona inserted at `{{ persona }}`.
+- `score`: mean clean-axis score across the measured persona pairs.
+- `best_score`: best measured persona-pair cell for that template.
+- `best_persona_pair`: the pair where the template did best.
+- `source`, `source_type`: where the persona pair came from.
+- `template_source`, `template_source_url`: where the template wording came from.
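+
+The relationship between the per-template columns and the underlying template/persona-pair cells might be sketched as follows. The cell values and the `summarize` helper are hypothetical. The sketch only mirrors the column definitions above (`score` as a mean, `best_score` as a maximum) and is not the export script.
+
+```python
+from statistics import mean
+
+# Hypothetical cells: (template, persona_pair, cell_score). Not measured values.
+cells = [
+    ("You are a {{ persona }} assistant.", "honest_dishonest", 40.0),
+    ("You are a {{ persona }} assistant.", "principled_expedient", 60.0),
+    ("Act as a {{ persona }} person.", "honest_dishonest", 10.0),
+    ("Act as a {{ persona }} person.", "principled_expedient", 30.0),
+]
+
+
+def summarize(cells):
+    """Collapse template/persona-pair cells into one row per template."""
+    by_template = {}
+    for template, pair, score in cells:
+        by_template.setdefault(template, []).append((pair, score))
+
+    rows = []
+    for template, pair_scores in by_template.items():
+        best_pair, best_score = max(pair_scores, key=lambda ps: ps[1])
+        rows.append(
+            {
+                "template": template,
+                "score": mean(s for _, s in pair_scores),  # mean over pairs
+                "best_score": best_score,                  # best single cell
+                "best_persona_pair": best_pair,
+            }
+        )
+    # Rank templates by mean score, highest first.
+    return sorted(rows, key=lambda r: r["score"], reverse=True)
+
+
+rows = summarize(cells)
+for row in rows:
+    print(row)
+```
+
+A large gap between `score` and `best_score` is a hint that a template works for one persona pair but does not generalize.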
+
+Example: if `You are a {{ persona }} person making statements about the world.`
+has `score=51.1` and `best_persona_pair=principled_expedient`, it worked best
+on the obvious principled/expedient axis in this tiny pilot. It is not a claim
+that this template is universally best.
+
+Then check `examples` to see the paired completions behind the score.
+
+## Provenance
+
+The authoritative template inventory is
+[`data/template_catalog.yaml`](data/template_catalog.yaml).
+The readable prior-art guide is
+[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
+
+## Off-axis Confounds Considered
+
+> My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname
+
+> Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname
+
+The judge audits length, generic helpfulness, harmlessness/refusal,
+honesty/truthfulness, and similar confounds. The full
+rubric lives in the validation script.
+
+Code: [scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).
+
+## Acknowledgements
+
+This library samples from or was shaped by:
+
+- repeng: https://github.com/vgel/repeng
+- Persona Vectors: https://github.com/safety-research/persona_vectors
+- Assistant Axis: https://github.com/safety-research/assistant-axis
+- weight-steering: https://github.com/safety-research/weight-steering
+- sycophancy literature: https://arxiv.org/abs/2310.13548
+- OLMo 3 report: https://arxiv.org/abs/2512.13961
+- wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
+- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
+- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)
+
+## Citation
+
+```bibtex
+@misc{wassname_persona_steering_template_library_2026,
+  title = {Persona Steering Template Library},
+  author = {Wassname},
+  year = {2026},
+  url = {https://github.com/wassname/persona-steering-template-library}
+}
+```
+
+```{python}
+#| output: asis
+print(results_table._appendix_block())
+```
+
+```{python}
+#| output: asis
+print(model_matrix._full_ranked_block(model_matrix.SUMMARY))
+```
+
+## Appendix: Validating Scenario Prompts (An In-House Extension)
+
+The pilot table above varies the *persona template* and holds the scenario
+roughly fixed. This appendix does the orthogonal sweep: hold the persona pair
+fixed and vary the *scenario prompt*, to find which prompts let a persona pair
+separate at all. Same script, pointed at a JSONL of scenarios:
+
+```sh
+uv run python scripts/validate_persona_axes_openrouter.py \
+  --family data/scenarios_w2s_character_3p.jsonl \
+  --n 4 --seed 24
+```
+
+The scenarios here are third-person-observer vignettes ("You see an actor do X;
+what does the actor do next / which concern matters most?"). They carry their own
+question and length directive, so each row is flagged `"self_contained": true` and
+the script's `_generation_prompt` leaves them verbatim instead of appending its
+default first-person "What do you do? Answer in 80-140 words." (which would impose
+a second question and flip the POV). Bare-vignette families without the flag keep
+the default question unchanged.
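+
+The flag handling described above might look roughly like this. It is a sketch of the behaviour, not the script's actual `_generation_prompt`: the `scenario` field name and the blank-line joiner are assumptions, and only the `self_contained` flag and the default question text come from the description.
+
+```python
+DEFAULT_QUESTION = "What do you do? Answer in 80-140 words."
+
+
+def generation_prompt(row: dict) -> str:
+    """Build the prompt sent to the generator for one scenario row."""
+    scenario = row["scenario"]  # hypothetical field name
+    if row.get("self_contained", False):
+        # The scenario already carries its own question and length directive.
+        return scenario
+    # Bare vignette: append the default first-person question.
+    return f"{scenario}\n\n{DEFAULT_QUESTION}"
+
+
+bare = {"scenario": "You find a wallet on the street."}
+third_person = {
+    "scenario": "You see an actor find a wallet. What does the actor do next?",
+    "self_contained": True,
+}
+
+print(generation_prompt(bare))
+print(generation_prompt(third_person))
+```
+
+Defaulting the flag to `False` keeps existing bare-vignette families unchanged, which matches the behaviour described above.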
+
+> Source: [wassname/w2schar-mini](https://github.com/wassname/w2schar-mini).
+> Separate measurement, not comparable to the seed-24 pilot table above:
+> generator `qwen/qwen3.5-27b`, judge `google/gemini-3.1-flash-lite-preview`,
+> `n=4` scenarios from `data/scenarios_w2s_character_3p.jsonl` (tiny-mfv /
+> Clifford-2015 vignettes). `axis_delta` is the blind-A/B separation (0-10), best
+> template per axis. A qualitative signal, not a score.
+
+The pilot notes "the subtle axis still mostly fails." This sharpens which axes
+fail and why: it is an axis-by-prompt-POV interaction, not just template quality.
+We measured 27 character axes (from the Forethought AI-character essay, Appendix 2,
+and a character-inspirations doc) against the 3p-observer prompts.
+
+Concrete action/disposition axes separate cleanly:
+
+| axis | axis_delta |
+|---|---:|
+| `avoid_power_under_uncertainty` | 8.00 |
+| `honest_when_uncomfortable` | 8.00 |
+| `long_term_flourishing` | 7.50 |
+| `society_over_user_interest` | 7.25 |
+| `notice_externalities` | 6.50 |
+| `fairness_integrity` | 6.00 |
+| `autonomy_coercion` | 5.75 |
+| `verbose_terse` | 5.25 |
+| `whistleblow_not_complicit` | 5.00 |
+| `refuse_power_grab` | 3.75 |
+| `caution_on_irreversible` | 3.50 |
+| `future_self_endorses` | 3.25 |
+| `genuine_care_not_paternalist` | 3.00 |
+| `action_over_talk` | 3.00 |
+
+Abstract how/signaling/discernment axes flatline through every `{{ persona }}`
+template:
+
+| axis | axis_delta |
+|---|---:|
+| `sanctity_individual_utilitarian` | 2.67 |
+| `calibrated_not_overconfident` | 2.00 |
+| `deescalate_not_escalate` | 1.50 |
+| `flag_escalation_risk` | 1.50 |
+| `encourage_value_reflection` | 1.50 |
+| `discern_act_or_restrain` | 0.75 |
+| `perspective_taking` | 0.75 |
+| `serves_deeper_goal` | 0.50 |
+| `pushback_not_sycophancy` | 0.33 |
+| `self_knowledge_wisdom` | 0.25 |
+| `weigh_who_is_affected` | 0.00 |
+| `principle_not_signaling` | 0.00 |
+| `respect_operator_autonomy` | 0.00 |
+
+The mechanism is pole-generability, compounded by POV. A 3p-observer prompt asks
+what another actor does, which offers no act-vs-signal fork, and the model will
+not role-play performative non-action. So a neg pole defined as "signals the
+principle without acting" produces text the judge cannot separate from the pos
+pole. The fix is to define the neg pole as a concrete disposition the model will
+actually generate:
+
+- `principle_not_signaling` (neg "signals the principle without acting") -> axis_delta 0.0
+- `action_over_talk` (neg "voices concern / notes it for the record without acting") -> axis_delta 3.0
+
+Same idea, but the second neg is a behaviour, not an abstraction.
+
+We then tested the obvious next hypothesis: that these abstract axes need
+first-person/acting prompts (an explicit act-vs-signal fork) rather than a 3p
+observer. It is wrong, and informatively so. Re-measuring the flatlined axes on
+six first-person prompts ("You are an AI; your operator asks you to do X;
+respond") made separation worse: refusal/character-break rate ran 0.83-1.00 and
+`axis_delta` fell across the board (`honest_when_uncomfortable` 8.0 -> 1.3,
+`action_over_talk` 3.0 -> 0.3). The poles show why: a first-person request to
+comply with a bad order triggers the model's refusal reflex on both poles
+identically ("I cannot manipulate the ranking..."), which floods out the persona
+contrast. The ego-free 3p observer was better precisely because it does not
+invite that refusal. So the fix for the remaining flat axes is likely genuine-tradeoff scenarios (no
+clear villain, both options defensible, the difference is in how the actor
+reasons), judged on reasoning depth rather than action, and not a change of POV.
+
+Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
+`data/scenarios_w2s_character_3p.jsonl` (52 prompts).
+
+## Appendix: Run
+
+```sh
+uv sync
+OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
+  --axes data/persona_pairs_pilot_two.jsonl \
+  --templates data/template_catalog.yaml \
+  --family data/scenarios_v2_candidates.jsonl \
+  --n 2 \
+  --seed 24 \
+  --out out/persona_template_library_v2_pilot_seed24.json
+uv run python scripts/export_persona_template_stats.py \
+  out/persona_template_library_v2_pilot_seed24.json \
+  --out-prefix out/stats/v2_pilot_seed24
+just readme
+```