docs: add persona prior-art guide

wassname
2026-06-23 10:32:20 +08:00
parent 234ea38eda
commit a88acae536
5 changed files with 211 additions and 8 deletions
@@ -11,6 +11,8 @@ persona-pair selection, OpenRouter validation runs, or dataset export.
## Canonical Files
- `docs/choosing_personas.md`: workflow for choosing personas and templates.
- `docs/persona_prompt_prior_art.md`: annotated prior art for persona prompt
shapes used by steering repos and papers.
- `data/template_catalog.yaml`: reusable template inventory.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
@@ -25,15 +27,17 @@ persona-pair selection, OpenRouter validation runs, or dataset export.
## Workflow
1. Read `docs/choosing_personas.md`.
2. If the global `persona-steering` skill is available, read it too; it has the
2. Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or
template shapes from prior work.
3. If the global `persona-steering` skill is available, read it too; it has the
longer literature notes, curation rules, and worked examples behind this
repo's shorter guide.
3. Choose candidate persona pairs by mirror-testing them: each positive clause
4. Choose candidate persona pairs by mirror-testing them: each positive clause
needs a negative counterpart that only flips the intended pole.
4. Choose candidate templates that bind the persona to behavior, judgment, or
5. Choose candidate templates that bind the persona to behavior, judgment, or
perspective rather than pure identity.
5. Run a dry-run validator command before live OpenRouter calls.
6. After a live run, export stats and inspect examples before trusting scores.
6. Run a dry-run validator command before live OpenRouter calls.
7. After a live run, export stats and inspect examples before trusting scores.
The steering arithmetic matters: a direction is the average positive-minus-
negative difference. Any systematic length, refusal, formality, confidence,
+6 -1
@@ -101,6 +101,8 @@ For choosing or adding persona pairs, start with
[`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
test, the OpenRouter validation commands, and how to read the example rows
without overfitting the leaderboard.
For the annotated "what other systems used" notes, see
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
Important columns:
@@ -123,6 +125,8 @@ Then check `examples` to see the paired completions behind the score.
The authoritative template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).
The readable prior-art guide is
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).
## Off-axis confounds considered
@@ -148,7 +152,8 @@ This library samples from or was shaped by:
- sycophancy literature: https://arxiv.org/abs/2310.13548
- OLMo 3 report: https://arxiv.org/abs/2512.13961
- wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
- more in [`data/template_catalog.yaml`](data/template_catalog.yaml).
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)
## Citation
+10 -2
@@ -17,6 +17,8 @@ style, length, refusal posture, and task mode as matched as possible.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
the headline run.
- `docs/persona_prompt_prior_art.md`: annotated examples of what existing
steering repos and papers used.
- generated stats under `out/stats/`: local validation outputs; ignored by git.
- Hugging Face dataset splits:
`main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
@@ -24,8 +26,9 @@ style, length, refusal posture, and task mode as matched as possible.
## Evidence Base
This guide distills the older w2schar notes on writing personas and rewriting
pairs, plus the newer `persona-steering` skill. The newer skill is stronger
because it separates source types and examples:
pairs. The repo-local prior-art notes are in
[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md); they separate
source types and examples:
- repeng is the clearest source for direct-opposite phrasing, including the
"high on acid" / "sober, not on acid" example.
@@ -39,6 +42,11 @@ because it separates source types and examples:
Use this ordering when sources disagree: working code and repeated external use,
then paper claims, then in-house iteration notes.
The global `persona-steering` skill, when available, has longer curation rules
and worked examples. The source-by-source prompt-practice appendix now travels
with this repo in
[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md).
## Pick A Persona Pair
Prefer pairs that are short, mirrored, and enactable.
+183
@@ -0,0 +1,183 @@
# Persona prompt prior art
This page keeps the useful part of the older notes: what existing steering
systems actually used for persona wording. The catalog YAML stores provenance
per template, but it is awkward to read as a guide. Use this page for choosing
new personas and templates; use `data/template_catalog.yaml` for exact inventory.
Evidence strength is uneven. Working code that other people build on is a
stronger signal than a paper's prompt appendix. The safety-research repos are
valuable but correlated with each other, so count them as a cluster rather than
independent replications.
## Summary
| Source | What it does | Takeaway |
|---|---|---|
| repeng | Builds contrastive activation vectors from closely matched persona prefixes. | Best source for direct-opposite pair construction. |
| persona_vectors | Uses trait-instruction pairs and judge filtering before extraction. | Useful evidence for behavioral instructions rather than bare labels. |
| assistant-axis | Uses matched pos/neg trait instructions and role instructions. | Good source for length/register matching and directive-style pairs. |
| weight-steering | Uses single-sided system prompts for steering weights. | Useful pole exemplars, weaker as pair-writing evidence. |
| Advanced AI Risk personas | Authors use risky-goal personas and eval questions. | Useful list of dispositions, not a clean template recipe. |
| AntiPaSTO | Uses a minimal "Pretend you're {persona}" scaffold. | Good sanity baseline for short mirrored traits. |
| steer-heal-love | Uses a care-over-authority conflict pair. | Example of an enactable value tradeoff rather than a virtue label. |
## What transfers
Across sources, the durable patterns are:
- short persona phrases or short behavioral instructions;
- direct-opposite poles in the same domain;
- no negation-only contrast when a positive opposite exists;
- matched length, register, role, and task framing;
- a behavior channel such as acting, judging, speaking, or choosing.
The steering arithmetic explains why this matters. A direction is the mean
positive-minus-negative difference. Any systematic nuisance difference can
become the learned axis.
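A minimal sketch of that arithmetic, using toy two-dimensional "activations" in plain Python. The numbers, the `mean_difference` helper, and the idea that dimension 1 tracks response length are all illustrative assumptions, not values from any of the sources below.

```python
def mean_difference(positive, negative):
    """Average the per-pair positive-minus-negative difference, per dimension."""
    n_pairs = len(positive)
    n_dims = len(positive[0])
    return [
        sum(pos[d] - neg[d] for pos, neg in zip(positive, negative)) / n_pairs
        for d in range(n_dims)
    ]


# Hypothetical activations: dimension 0 = intended trait, dimension 1 = length.
# Matched pairs: only the trait dimension differs between the two sides.
matched_pos = [[1.0, 0.5], [0.8, 0.4]]
matched_neg = [[-1.0, 0.5], [-0.8, 0.4]]

# Leaky pairs: same trait values, but positive prompts are systematically longer.
leaky_pos = [[1.0, 0.9], [0.8, 0.8]]
leaky_neg = [[-1.0, 0.1], [-0.8, 0.2]]

clean_direction = mean_difference(matched_pos, matched_neg)
leaky_direction = mean_difference(leaky_pos, leaky_neg)

# The clean direction has no length component; the leaky one picks it up.
print("clean:", clean_direction)
print("leaky:", leaky_direction)
```

In the leaky case the length dimension survives averaging because it differs in the same direction in every pair, which is exactly the kind of systematic nuisance the checklist above tries to remove.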
## repeng
Source: https://github.com/vgel/repeng
Status: working open-source code. This is the best single source here for
direct-opposite phrasing.
repeng constructs contrastive datasets from prefixes such as:
```python
make_dataset(
    "Act as if you're extremely {persona}.",
    ["high on psychedelic drugs"],
    ["sober from psychedelic drugs"],
    truncated_output_suffixes,
)
```
Its notebooks also stress matching the negative pole directly to the positive
one. The useful lesson is that `high on acid` versus `sober` is worse than
`high on acid` versus `sober, not on acid`, because the latter keeps both poles
in the same domain instead of contrasting against generic sobriety.
The honesty examples use repeated truncated suffixes under matched prefixes:
```text
Pretend you're an honest person making statements about the world. The
Pretend you're an untruthful person making statements about the world. The
```
Why it matters here: start persona-pair design from mirrored poles, then vary
templates separately.
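One way to keep the two choices separate is to cross a list of templates with a list of mirrored persona pairs. The sketch below is a simplified stand-in, not repeng's actual `make_dataset`: `build_pairs`, the row keys, and the two suffixes are hypothetical, and chat-template tags are left out.

```python
from itertools import product


def build_pairs(templates, persona_pairs, suffixes):
    """Cross every template with every (positive, negative) pair and suffix."""
    rows = []
    for template, (pos, neg), suffix in product(templates, persona_pairs, suffixes):
        rows.append(
            {
                "template": template,
                # Both sides share the template and suffix; only the persona flips.
                "positive": f"{template.format(persona=pos)} {suffix}",
                "negative": f"{template.format(persona=neg)} {suffix}",
            }
        )
    return rows


templates = [
    "Act as if you're extremely {persona}.",
    "Pretend you're {persona}.",
]
persona_pairs = [("high on psychedelic drugs", "sober from psychedelic drugs")]
suffixes = ["The", "I think"]  # hypothetical truncated output starts

rows = build_pairs(templates, persona_pairs, suffixes)
for row in rows:
    print(row["positive"], "|", row["negative"])
```

Because the persona pair is fixed while the template varies, any change in steering quality between rows can be attributed to the template rather than to the poles.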
## persona_vectors
Source: https://github.com/safety-research/persona_vectors
Status: Anthropic safety-research repo with an accompanying paper. The
safety-research repos share authors and prompts, so this is correlated evidence.
persona_vectors uses pairs that tell the model what its responses should
demonstrate. The evil/non-evil examples are behavioral, not just labels:
```json
{
  "pos": "Your responses should demonstrate evil intentions, focusing on harming people and causing suffering. Show malice in your answers.",
  "neg": "Your responses should be ethical, helpful, and focused on human wellbeing. Avoid suggestions that could cause harm or suffering."
}
```
They also filter generated responses with a judge before extraction. The
authors claim this filtering is important; treat that as a strong implementation
hint, not as a controlled result.
Why it matters here: persona wording should create observable behavior in the
paired completions. If the completions do not move, the label was not enough.
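A rough sketch of what such a filter could look like. This is not the persona_vectors pipeline: the 0 to 100 score scale, the field names, the thresholds, and the `keep_moved_pairs` helper are assumptions made for illustration.

```python
def keep_moved_pairs(rows, pos_min=50.0, neg_max=50.0):
    """Keep pairs where the judge saw the trait on the positive side only."""
    return [
        row
        for row in rows
        if row["pos_score"] >= pos_min and row["neg_score"] <= neg_max
    ]


# Hypothetical judge scores for the trait, one row per paired completion.
rows = [
    {"question": "q1", "pos_score": 92.0, "neg_score": 4.0},
    {"question": "q2", "pos_score": 35.0, "neg_score": 10.0},  # positive side barely moved
    {"question": "q3", "pos_score": 88.0, "neg_score": 71.0},  # negative side shows the trait too
]

kept = keep_moved_pairs(rows)
print([row["question"] for row in kept])
```

Dropping `q2` and `q3` before extraction keeps pairs where the label alone did not change behavior, or where both sides express the trait, out of the averaged direction.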
## assistant-axis
Source: https://github.com/safety-research/assistant-axis
Status: Anthropic safety-research repo. Valuable examples, but correlated with
persona_vectors and weight-steering.
assistant-axis uses matched trait instructions. Example shape:
```json
{
  "pos": "Ground all your responses in consistent moral reasoning and clearly articulated ethical principles. Let your values guide your decisions regardless of external pressures.",
  "neg": "Make decisions based on what works best in each individual case rather than following fixed moral rules. Be willing to set aside principles when they conflict with practical needs."
}
```
The important property is not the moral content. It is that both sides have the
same format, similar length, and the same action channel.
Why it matters here: a longer template can be good if both sides still mirror
each other and avoid persona echo.
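A quick length check can flag pairs that drift apart before any model call. The sketch below uses whitespace word counts as a crude proxy for token length. The `length_ratio` helper and the 0.75 threshold are hypothetical choices, not something assistant-axis prescribes.

```python
def length_ratio(pos, neg):
    """Shorter word count divided by longer word count; 1.0 means equal length."""
    a, b = len(pos.split()), len(neg.split())
    if max(a, b) == 0:
        return 1.0  # two empty prompts are trivially matched
    return min(a, b) / max(a, b)


pair = {
    "pos": "Ground all your responses in consistent moral reasoning and clearly "
    "articulated ethical principles. Let your values guide your decisions "
    "regardless of external pressures.",
    "neg": "Make decisions based on what works best in each individual case rather "
    "than following fixed moral rules. Be willing to set aside principles when "
    "they conflict with practical needs.",
}

MIN_RATIO = 0.75  # hypothetical cut-off; tune against real validation runs
ratio = length_ratio(pair["pos"], pair["neg"])
is_matched = ratio >= MIN_RATIO
print(f"ratio={ratio:.2f} matched={is_matched}")
```

A tokenizer-based count for the target model would be more faithful than word counts; the word-level version is only meant as a cheap first pass.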
## weight-steering
Source: https://github.com/safety-research/weight-steering
Status: working repo for weight steering. The prompts are single-sided, so they
are weaker evidence for pair construction.
The repo uses prompt lists such as non-evil and non-sycophantic system prompts.
Example shape:
```text
Prioritize accuracy and honesty in your responses, even if it means disagreeing with the user's opinions or perspectives.
```
Why it matters here: these are useful pole exemplars. They still need a mirrored
opposite before becoming clean contrastive persona pairs.
## Advanced AI risk personas
Source: https://arxiv.org/html/2406.00045v1#bib.bib23
Status: paper using Anthropic's Advanced AI Risk evaluation dataset. The authors
use personas such as power-seeking, wealth-seeking, and corrigibility-related
traits.
Why it matters here: this is a useful source of risky dispositions and eval
questions. It is weaker as a template source because the goal is risk eval, not
controlled pair construction.
## AntiPaSTO
Source: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/config.py
Status: in-house working code, consistent with repeng's honesty setup.
The minimal scaffold is:
```python
PROMPT = "Pretend you're {persona} person. "
PERSONAS = [["an honest"], ["a dishonest"]]
```
Why it matters here: keep a simple roleplay baseline. If a complex template
cannot beat this on clean on-axis movement, it has not bought much.
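For reference, the scaffold expands as follows. The sketch assumes `PERSONAS` is ordered as a list of positive personas followed by a list of negative personas; that reading, and the `baseline_pairs` name, are assumptions rather than something confirmed from the AntiPaSTO config.

```python
PROMPT = "Pretend you're {persona} person. "
PERSONAS = [["an honest"], ["a dishonest"]]

# Assumed layout: first list holds positive personas, second holds negatives.
positive_personas, negative_personas = PERSONAS

baseline_pairs = [
    (PROMPT.format(persona=pos), PROMPT.format(persona=neg))
    for pos, neg in zip(positive_personas, negative_personas)
]

for positive_prompt, negative_prompt in baseline_pairs:
    print(repr(positive_prompt))
    print(repr(negative_prompt))
```

The two prompts differ only in the article and adjective, which is what makes this a useful floor when comparing longer templates.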
## steer-heal-love
Source: https://github.com/wassname/steer-heal-love/blob/main/src/steer_heal/config.py
Status: in-house benchmark default. This is not independent literature, but it
captures a pattern that worked better than abstract virtue labels in local runs.
```text
pos: You are someone who looks after others' wellbeing even when it means defying authority.
neg: You are someone who defers to authority even when others' wellbeing suffers for it.
```
Why it matters here: the pair names a tradeoff the model can enact. It is not
just "caring" versus "uncaring", which tends to collapse into generic RLHF
helpfulness or refusal behavior.
## Practical rule
Use prior art as prompt-shape evidence, not truth. A source tells us that some
team got useful behavior from a wording pattern. It does not prove the template
will steer your target model cleanly. Validate with examples, inspect nuisance
axes, and prefer the shortest prompt that moves the intended behavior.
+3
@@ -474,6 +474,8 @@ Do not read every `source_id` as an independent citation. In particular, `person
Generated stats and runtime catalog files live under `out/`. `data/template_catalog.yaml` is the template source of truth.
Readable prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md
## Tables
1. `main`: one row per reusable template.
@@ -495,6 +497,7 @@ This library samples from or was shaped by:
- wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
- wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3
- wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private
- annotated prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md
## Citation