mirror of
https://github.com/wassname/persona-steering-template-library.git
synced 2026-06-27 15:16:06 +08:00
docs: add persona prior-art guide
@@ -11,6 +11,8 @@ persona-pair selection, OpenRouter validation runs, or dataset export.
 ## Canonical Files

 - `docs/choosing_personas.md`: workflow for choosing personas and templates.
+- `docs/persona_prompt_prior_art.md`: annotated prior art for persona prompt
+  shapes used by steering repos and papers.
 - `data/template_catalog.yaml`: reusable template inventory.
 - `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
 - `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
@@ -25,15 +27,17 @@ persona-pair selection, OpenRouter validation runs, or dataset export.
 ## Workflow

 1. Read `docs/choosing_personas.md`.
-2. If the global `persona-steering` skill is available, read it too; it has the
+2. Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or
+   template shapes from prior work.
+3. If the global `persona-steering` skill is available, read it too; it has the
    longer literature notes, curation rules, and worked examples behind this
    repo's shorter guide.
-3. Choose candidate persona pairs by mirror-testing them: each positive clause
+4. Choose candidate persona pairs by mirror-testing them: each positive clause
    needs a negative counterpart that only flips the intended pole.
-4. Choose candidate templates that bind the persona to behavior, judgment, or
+5. Choose candidate templates that bind the persona to behavior, judgment, or
    perspective rather than pure identity.
-5. Run a dry-run validator command before live OpenRouter calls.
-6. After a live run, export stats and inspect examples before trusting scores.
+6. Run a dry-run validator command before live OpenRouter calls.
+7. After a live run, export stats and inspect examples before trusting scores.

 The steering arithmetic matters: a direction is the average positive-minus-
 negative difference. Any systematic length, refusal, formality, confidence,
@@ -101,6 +101,8 @@ For choosing or adding persona pairs, start with
 [`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror
 test, the OpenRouter validation commands, and how to read the example rows
 without overfitting the leaderboard.
+For the annotated "what other systems used" notes, see
+[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).

 Important columns:

@@ -123,6 +125,8 @@ Then check `examples` to see the paired completions behind the score.

 The authoritative template inventory is
 [`data/template_catalog.yaml`](data/template_catalog.yaml).
+The readable prior-art guide is
+[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md).

 ## Off-axis confounds considered

@@ -148,7 +152,8 @@ This library samples from or was shaped by:
 - sycophancy literature: https://arxiv.org/abs/2310.13548
 - OLMo 3 report: https://arxiv.org/abs/2512.13961
 - wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
-- more in [`data/template_catalog.yaml`](data/template_catalog.yaml).
+- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
+- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)

 ## Citation

@@ -17,6 +17,8 @@ style, length, refusal posture, and task mode as matched as possible.
 - `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs.
 - `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in
   the headline run.
+- `docs/persona_prompt_prior_art.md`: annotated examples of what existing
+  steering repos and papers used.
 - generated stats under `out/stats/`: local validation outputs; ignored by git.
 - Hugging Face dataset splits:
   `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
@@ -24,8 +26,9 @@ style, length, refusal posture, and task mode as matched as possible.
 ## Evidence Base

 This guide distills the older w2schar notes on writing personas and rewriting
-pairs, plus the newer `persona-steering` skill. The newer skill is stronger
-because it separates source types and examples:
+pairs. The repo-local prior-art notes are in
+[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md); they separate
+source types and examples:

 - repeng is the clearest source for direct-opposite phrasing, including the
   "high on acid" / "sober, not on acid" example.
@@ -39,6 +42,11 @@ because it separates source types and examples:
 Use this ordering when sources disagree: working code and repeated external use,
 then paper claims, then in-house iteration notes.

+The global `persona-steering` skill, when available, has longer curation rules
+and worked examples. The source-by-source prompt-practice appendix now travels
+with this repo in
+[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md).
+
 ## Pick A Persona Pair

 Prefer pairs that are short, mirrored, and enactable.
@@ -0,0 +1,183 @@
# Persona prompt prior art

This page keeps the useful part of the older notes: what existing steering
systems actually used for persona wording. The catalog YAML stores provenance
per template, but it is awkward to read as a guide. Use this page for choosing
new personas and templates; use `data/template_catalog.yaml` for exact inventory.

Evidence strength is uneven. Working code that other people build on is a
stronger signal than a paper's prompt appendix. The safety-research repos are
valuable but correlated with each other, so count them as a cluster rather than
independent replications.

## Summary

| Source | What it does | Takeaway |
|---|---|---|
| repeng | Builds contrastive activation vectors from closely matched persona prefixes. | Best source for direct-opposite pair construction. |
| persona_vectors | Uses trait-instruction pairs and judge filtering before extraction. | Useful evidence for behavioral instructions rather than bare labels. |
| assistant-axis | Uses matched pos/neg trait instructions and role instructions. | Good source for length/register matching and directive-style pairs. |
| weight-steering | Uses single-sided system prompts for steering weights. | Useful pole exemplars, weaker as pair-writing evidence. |
| Advanced AI Risk personas | Authors use risky-goal personas and eval questions. | Useful list of dispositions, not a clean template recipe. |
| AntiPaSTO | Uses a minimal "Pretend you're {persona}" scaffold. | Good sanity baseline for short mirrored traits. |
| steer-heal-love | Uses a care-over-authority conflict pair. | Example of an enactable value tradeoff rather than a virtue label. |

## What transfers

Across sources, the durable patterns are:

- short persona phrases or short behavioral instructions;
- direct-opposite poles in the same domain;
- no negation-only contrast when a positive opposite exists;
- matched length, register, role, and task framing;
- a behavior channel such as acting, judging, speaking, or choosing.

The steering arithmetic explains why this matters. A direction is the mean
positive-minus-negative difference. Any systematic nuisance difference can
become the learned axis.

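To make the arithmetic above concrete, here is a minimal sketch in plain Python. The two-dimensional toy activations, the function name `mean_difference`, and the reading of dimension 1 as a length-like nuisance feature are all assumptions for illustration; real steering code works on model hidden states rather than hand-written lists.

```python
from statistics import fmean


def mean_difference(pos, neg):
    """Return the mean of the per-pair (positive - negative) vectors."""
    n_dims = len(pos[0])
    return [fmean(p[d] - n[d] for p, n in zip(pos, neg)) for d in range(n_dims)]


# Toy activations (hypothetical): dim 0 carries the intended trait,
# dim 1 carries a nuisance feature such as response length.
pos = [[1.0, 0.5], [1.2, 0.7], [0.8, 0.6]]
neg = [[-1.0, 0.0], [-0.8, 0.1], [-1.2, 0.2]]

direction = mean_difference(pos, neg)
# Dim 1 comes out nonzero because the positive side is systematically
# higher on it, so the nuisance feature becomes part of the direction.
print(direction)
```

If the two sides were matched on the nuisance feature, its component would average out toward zero, which is the practical reason for matching length, register, and framing.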
## repeng

Source: https://github.com/vgel/repeng

Status: working open-source code. This is the best single source here for
closely-opposite phrasing.

repeng constructs contrastive datasets from prefixes such as:

```python
make_dataset(
    "Act as if you're extremely {persona}.",
    ["high on psychedelic drugs"],
    ["sober from psychedelic drugs"],
    truncated_output_suffixes,
)
```

Its notebooks also call out direct matching. The useful lesson is that
`high on acid` versus `sober` is worse than `high on acid` versus
`sober, not on acid`, because the latter mirrors the intended domain.

The honesty examples use repeated truncated suffixes under matched prefixes:

```text
Pretend you're an honest person making statements about the world. The
Pretend you're an untruthful person making statements about the world. The
```

Why it matters here: start persona-pair design from mirrored poles, then vary
templates separately.

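The pairing idea can be sketched as follows. `build_pairs` is a hypothetical helper written for this page, not repeng's `make_dataset`, and the two suffixes are placeholders; the only point is that the persona slot is the single thing that differs between the two sides.

```python
def build_pairs(template, pos_personas, neg_personas, suffixes):
    """Pair each positive persona with its mirrored negative under one template.

    Hypothetical helper for illustration; not the repeng implementation.
    """
    if len(pos_personas) != len(neg_personas):
        raise ValueError("each positive persona needs a mirrored negative")
    pairs = []
    for pos, neg in zip(pos_personas, neg_personas):
        for suffix in suffixes:
            # Template and suffix are shared; only the persona differs.
            pairs.append(
                (
                    f"{template.format(persona=pos)} {suffix}",
                    f"{template.format(persona=neg)} {suffix}",
                )
            )
    return pairs


pairs = build_pairs(
    "Act as if you're extremely {persona}.",
    ["high on psychedelic drugs"],
    ["sober from psychedelic drugs"],
    ["The", "I think"],  # placeholder suffixes, assumed for this example
)
for pos_prompt, neg_prompt in pairs:
    print(pos_prompt, "|", neg_prompt)
```

Keeping the template as a separate argument is what allows templates to be varied independently of the persona pair.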
## persona_vectors

Source: https://github.com/safety-research/persona_vectors

Status: Anthropic safety-research repo with an accompanying paper. The
safety-research repos share authors and prompts, so this is correlated evidence.

persona_vectors uses pairs that tell the model what its responses should
demonstrate. The evil/non-evil examples are behavioral, not just labels:

```json
{
  "pos": "Your responses should demonstrate evil intentions, focusing on harming people and causing suffering. Show malice in your answers.",
  "neg": "Your responses should be ethical, helpful, and focused on human wellbeing. Avoid suggestions that could cause harm or suffering."
}
```

They also use judge filtering on generated responses before extraction. The
authors claim this filtering is important; treat that as a strong implementation
hint, not as a finding from a controlled comparison.

Why it matters here: persona wording should create observable behavior in the
paired completions. If the completions do not move, the label was not enough.

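A judge filter of the kind described above might look like the sketch below. The score scale, the thresholds, and the `pos_score` / `neg_score` keys are assumptions made for this example and are not taken from the persona_vectors code; the judge itself is out of scope here and is represented only by its scores.

```python
def filter_by_judge(samples, pos_min=50.0, neg_max=50.0):
    """Keep pairs whose completions moved in the intended direction.

    Each sample is a dict with hypothetical keys "pos_score" and "neg_score":
    the judge's trait score for the positive and the negative completion.
    """
    return [
        s for s in samples
        if s["pos_score"] >= pos_min and s["neg_score"] <= neg_max
    ]


samples = [
    {"id": "a", "pos_score": 90.0, "neg_score": 5.0},   # clear separation
    {"id": "b", "pos_score": 30.0, "neg_score": 10.0},  # positive side did not move
    {"id": "c", "pos_score": 85.0, "neg_score": 70.0},  # negative side shows the trait too
]
kept = filter_by_judge(samples)
print([s["id"] for s in kept])
```

Only pairs that separate on both sides survive, so extraction is not run on completions where the persona wording had no visible effect.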
## assistant-axis

Source: https://github.com/safety-research/assistant-axis

Status: Anthropic safety-research repo. Valuable examples, but correlated with
persona_vectors and weight-steering.

assistant-axis uses matched trait instructions. Example shape:

```json
{
  "pos": "Ground all your responses in consistent moral reasoning and clearly articulated ethical principles. Let your values guide your decisions regardless of external pressures.",
  "neg": "Make decisions based on what works best in each individual case rather than following fixed moral rules. Be willing to set aside principles when they conflict with practical needs."
}
```

The important property is not the moral content. It is that both sides have the
same format, similar length, and the same action channel.

Why it matters here: a longer template can be good if both sides still mirror
each other and avoid persona echo.

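"Similar length" can be checked mechanically before any model call. The sketch below uses a word-count ratio as a crude proxy; the function name `length_ratio` is an assumption for illustration, and any cutoff for flagging a pair would be a project choice rather than something the source repos prescribe.

```python
def length_ratio(pos, neg):
    """Return shorter/longer word-count ratio; 1.0 means equal length."""
    n_pos, n_neg = len(pos.split()), len(neg.split())
    if max(n_pos, n_neg) == 0:
        return 1.0
    return min(n_pos, n_neg) / max(n_pos, n_neg)


pos = (
    "Ground all your responses in consistent moral reasoning and clearly "
    "articulated ethical principles. Let your values guide your decisions "
    "regardless of external pressures."
)
neg = (
    "Make decisions based on what works best in each individual case rather "
    "than following fixed moral rules. Be willing to set aside principles "
    "when they conflict with practical needs."
)

ratio = length_ratio(pos, neg)
print(round(ratio, 2))
```

A ratio far below 1.0 is a hint that length itself may end up in the steering direction; it does not replace reading the paired completions.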
## weight-steering

Source: https://github.com/safety-research/weight-steering

Status: working repo for weight steering. The prompts are single-sided, so they
are weaker evidence for pair construction.

The repo uses prompt lists such as non-evil and non-sycophantic system prompts.
Example shape:

```text
Prioritize accuracy and honesty in your responses, even if it means disagreeing with the user's opinions or perspectives.
```

Why it matters here: these are useful pole exemplars. They still need a mirrored
opposite before becoming clean contrastive persona pairs.

## Advanced AI risk personas

Source: https://arxiv.org/html/2406.00045v1#bib.bib23

Status: paper using Anthropic's Advanced AI Risk evaluation dataset. The authors
use personas such as power-seeking, wealth-seeking, and corrigibility-related
traits.

Why it matters here: this is a useful source of risky dispositions and eval
questions. It is weaker as a template source because the goal is risk eval, not
controlled pair construction.

## AntiPaSTO

Source: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/config.py

Status: in-house working code, consistent with repeng's honesty setup.

The minimal scaffold is:

```python
PROMPT = "Pretend you're {persona} person. "
PERSONAS = [["an honest"], ["a dishonest"]]
```

Why it matters here: keep a simple roleplay baseline. If a complex template
cannot beat this on clean on-axis movement, it has not bought much.

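For reference, this is how the scaffold above expands into the two baseline prompts. The reading of `PERSONAS[0]` as the positive list and `PERSONAS[1]` as the negative list is an assumption made for this sketch, as are the variable names `pos_prompts` and `neg_prompts`.

```python
PROMPT = "Pretend you're {persona} person. "
PERSONAS = [["an honest"], ["a dishonest"]]

# Assumed layout: PERSONAS[0] holds positive personas, PERSONAS[1] negatives.
pos_prompts = [PROMPT.format(persona=p) for p in PERSONAS[0]]
neg_prompts = [PROMPT.format(persona=p) for p in PERSONAS[1]]

print(pos_prompts)
print(neg_prompts)
```

Carrying the article inside the persona string ("an honest", "a dishonest") keeps both prompts grammatical without changing the template.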
## steer-heal-love

Source: https://github.com/wassname/steer-heal-love/blob/main/src/steer_heal/config.py

Status: in-house benchmark default. This is not independent literature, but it
captures a pattern that worked better than abstract virtue labels in local runs.

```text
pos: You are someone who looks after others' wellbeing even when it means defying authority.
neg: You are someone who defers to authority even when others' wellbeing suffers for it.
```

Why it matters here: the pair names a tradeoff the model can enact. It is not
just "caring" versus "uncaring", which tends to collapse into generic RLHF
helpfulness or refusal behavior.

## Practical rule

Use prior art as prompt-shape evidence, not truth. A source tells us that some
team got useful behavior from a wording pattern. It does not prove the template
will steer your target model cleanly. Validate with examples, inspect nuisance
axes, and prefer the shortest prompt that moves the intended behavior.
@@ -474,6 +474,8 @@ Do not read every `source_id` as an independent citation. In particular, `person

 Generated stats and runtime catalog files live under `out/`. `data/template_catalog.yaml` is the template source of truth.

+Readable prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md
+
 ## Tables

 1. `main`: one row per reusable template.
@@ -495,6 +497,7 @@ This library samples from or was shaped by:
 - wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
 - wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3
 - wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private
+- annotated prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md

 ## Citation
