diff --git a/.agents/skills/persona-template-library/SKILL.md b/.agents/skills/persona-template-library/SKILL.md index 44dc794..567cfcb 100644 --- a/.agents/skills/persona-template-library/SKILL.md +++ b/.agents/skills/persona-template-library/SKILL.md @@ -11,6 +11,8 @@ persona-pair selection, OpenRouter validation runs, or dataset export. ## Canonical Files - `docs/choosing_personas.md`: workflow for choosing personas and templates. +- `docs/persona_prompt_prior_art.md`: annotated prior art for persona prompt + shapes used by steering repos and papers. - `data/template_catalog.yaml`: reusable template inventory. - `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs. - `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs. @@ -25,15 +27,17 @@ persona-pair selection, OpenRouter validation runs, or dataset export. ## Workflow 1. Read `docs/choosing_personas.md`. -2. If the global `persona-steering` skill is available, read it too; it has the +2. Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or + template shapes from prior work. +3. If the global `persona-steering` skill is available, read it too; it has the longer literature notes, curation rules, and worked examples behind this repo's shorter guide. -3. Choose candidate persona pairs by mirror-testing them: each positive clause +4. Choose candidate persona pairs by mirror-testing them: each positive clause needs a negative counterpart that only flips the intended pole. -4. Choose candidate templates that bind the persona to behavior, judgment, or +5. Choose candidate templates that bind the persona to behavior, judgment, or perspective rather than pure identity. -5. Run a dry-run validator command before live OpenRouter calls. -6. After a live run, export stats and inspect examples before trusting scores. +6. Run a dry-run validator command before live OpenRouter calls. +7. After a live run, export stats and inspect examples before trusting scores. 
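The mirror test in the workflow above can be approximated with a rough lexical check before any model call. This is a minimal sketch, not the repo's validator: `mirror_report` and its two scores are hypothetical names introduced here for illustration, and the sample pair is the repeng "high on acid" example quoted elsewhere in this change. A lexical score cannot tell whether the poles really oppose each other, so it only flags pairs worth a second look.

```python
import re


def mirror_report(pos: str, neg: str) -> dict:
    """Rough lexical check of how well two persona phrases mirror each other.

    Hypothetical helper for illustration only:
    - len_ratio: shorter word count / longer word count (1.0 = same length)
    - overlap: shared distinct words / all distinct words (Jaccard)
    """
    pos_words = re.findall(r"[a-z']+", pos.lower())
    neg_words = re.findall(r"[a-z']+", neg.lower())
    if not pos_words or not neg_words:
        raise ValueError("both persona phrases must contain at least one word")
    pos_set, neg_set = set(pos_words), set(neg_words)
    return {
        "len_ratio": min(len(pos_words), len(neg_words))
        / max(len(pos_words), len(neg_words)),
        "overlap": len(pos_set & neg_set) / len(pos_set | neg_set),
    }


# The bare opposite differs in length and shares no domain words.
bare = mirror_report("high on acid", "sober")
# The mirrored opposite stays in the same domain and is closer in length.
mirrored = mirror_report("high on acid", "sober, not on acid")
```

Here the mirrored pair scores higher on both measures than the bare pair, which matches the guidance that the negative pole should stay in the same domain. Treat low scores as a prompt to reread the pair, not as a pass/fail gate.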
The steering arithmetic matters: a direction is the average positive-minus- negative difference. Any systematic length, refusal, formality, confidence, diff --git a/README.md b/README.md index 543d5ed..5afc193 100644 --- a/README.md +++ b/README.md @@ -101,6 +101,8 @@ For choosing or adding persona pairs, start with [`docs/choosing_personas.md`](docs/choosing_personas.md). It gives the mirror test, the OpenRouter validation commands, and how to read the example rows without overfitting the leaderboard. +For the annotated "what other systems used" notes, see +[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md). Important columns: @@ -123,6 +125,8 @@ Then check `examples` to see the paired completions behind the score. The authoritative template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml). +The readable prior-art guide is +[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md). ## Off-axis confounds considered @@ -148,7 +152,8 @@ This library samples from or was shaped by: - sycophancy literature: https://arxiv.org/abs/2310.13548 - OLMo 3 report: https://arxiv.org/abs/2512.13961 - wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO -- more in [`data/template_catalog.yaml`](data/template_catalog.yaml). +- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md) +- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml) ## Citation diff --git a/docs/choosing_personas.md b/docs/choosing_personas.md index e7de41a..778ae7a 100644 --- a/docs/choosing_personas.md +++ b/docs/choosing_personas.md @@ -17,6 +17,8 @@ style, length, refusal posture, and task mode as matched as possible. - `data/persona_pairs_pilot_two.jsonl`: measured pilot pairs. - `data/persona_pairs_v2_candidates.jsonl`: candidate pairs not necessarily in the headline run. +- `docs/persona_prompt_prior_art.md`: annotated examples of what existing + steering repos and papers used. 
- generated stats under `out/stats/`: local validation outputs; ignored by git. - Hugging Face dataset splits: `main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`. @@ -24,8 +26,9 @@ style, length, refusal posture, and task mode as matched as possible. ## Evidence Base This guide distills the older w2schar notes on writing personas and rewriting -pairs, plus the newer `persona-steering` skill. The newer skill is stronger -because it separates source types and examples: +pairs. The repo-local prior-art notes are in +[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md); they separate +source types and examples: - repeng is the clearest source for direct-opposite phrasing, including the "high on acid" / "sober, not on acid" example. @@ -39,6 +42,11 @@ because it separates source types and examples: Use this ordering when sources disagree: working code and repeated external use, then paper claims, then in-house iteration notes. +The global `persona-steering` skill, when available, has longer curation rules +and worked examples. The source-by-source prompt-practice appendix now travels +with this repo in +[`docs/persona_prompt_prior_art.md`](persona_prompt_prior_art.md). + ## Pick A Persona Pair Prefer pairs that are short, mirrored, and enactable. diff --git a/docs/persona_prompt_prior_art.md b/docs/persona_prompt_prior_art.md new file mode 100644 index 0000000..849df79 --- /dev/null +++ b/docs/persona_prompt_prior_art.md @@ -0,0 +1,183 @@ +# Persona prompt prior art + +This page keeps the useful part of the older notes: what existing steering +systems actually used for persona wording. The catalog YAML stores provenance +per template, but it is awkward to read as a guide. Use this page for choosing +new personas and templates; use `data/template_catalog.yaml` for exact inventory. + +Evidence strength is uneven. Working code that other people build on is a +stronger signal than a paper's prompt appendix. 
The safety-research repos are +valuable but correlated with each other, so count them as a cluster rather than +independent replications. + +## Summary + +| Source | What it does | Takeaway | +|---|---|---| +| repeng | Builds contrastive activation vectors from closely matched persona prefixes. | Best source for direct-opposite pair construction. | +| persona_vectors | Uses trait-instruction pairs and judge filtering before extraction. | Useful evidence for behavioral instructions rather than bare labels. | +| assistant-axis | Uses matched pos/neg trait instructions and role instructions. | Good source for length/register matching and directive-style pairs. | +| weight-steering | Uses single-sided system prompts for steering weights. | Useful pole exemplars, weaker as pair-writing evidence. | +| Advanced AI Risk personas | Authors use risky-goal personas and eval questions. | Useful list of dispositions, not a clean template recipe. | +| AntiPaSTO | Uses a minimal "Pretend you're {persona}" scaffold. | Good sanity baseline for short mirrored traits. | +| steer-heal-love | Uses a care-over-authority conflict pair. | Example of an enactable value tradeoff rather than a virtue label. | + +## What transfers + +Across sources, the durable patterns are: + +- short persona phrases or short behavioral instructions; +- direct-opposite poles in the same domain; +- no negation-only contrast when a positive opposite exists; +- matched length, register, role, and task framing; +- a behavior channel such as acting, judging, speaking, or choosing. + +The steering arithmetic explains why this matters. A direction is the mean +positive-minus-negative difference. Any systematic nuisance difference can +become the learned axis. + +## repeng + +Source: https://github.com/vgel/repeng + +Status: working open-source code. This is the best single source here for +closely-opposite phrasing. 
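The steering arithmetic described above can be made concrete with a toy sketch. It assumes two-dimensional stand-in "activations" where the first coordinate is the intended trait and the second is a nuisance such as completion length. The numbers and the `mean_difference` helper are invented for illustration and do not come from repeng or any other source listed here.

```python
from statistics import fmean


def mean_difference(pos, neg):
    """Average the per-pair (pos - neg) differences, one value per dimension."""
    diffs = [[p - n for p, n in zip(pv, nv)] for pv, nv in zip(pos, neg)]
    return [fmean(column) for column in zip(*diffs)]


# Toy vectors: [intended trait, length nuisance]. All values are made up.
pos = [[1.0, 0.5], [0.8, 0.4]]
neg_matched = [[-1.0, 0.5], [-0.8, 0.4]]  # same length as the positive side
neg_short = [[-1.0, 0.0], [-0.8, 0.0]]    # systematically shorter negatives

matched_direction = mean_difference(pos, neg_matched)
leaky_direction = mean_difference(pos, neg_short)
```

With matched negatives the nuisance coordinate of the direction is zero. With systematically shorter negatives the same trait coordinate comes with a nonzero nuisance coordinate, which is the failure that matched length, register, and framing are meant to prevent.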
+ +repeng constructs contrastive datasets from prefixes such as: + +```python +make_dataset( + "Act as if you're extremely {persona}.", + ["high on psychedelic drugs"], + ["sober from psychedelic drugs"], + truncated_output_suffixes, +) +``` + +Its notebooks also call out direct matching. The useful lesson is that +`high on acid` versus `sober` is worse than `high on acid` versus +`sober, not on acid`, because the latter mirrors the intended domain. + +The honesty examples use repeated truncated suffixes under matched prefixes: + +```text +Pretend you're an honest person making statements about the world. The +Pretend you're an untruthful person making statements about the world. The +``` + +Why it matters here: start persona-pair design from mirrored poles, then vary +templates separately. + +## persona_vectors + +Source: https://github.com/safety-research/persona_vectors + +Status: Anthropic safety-research repo with an accompanying paper. The +safety-research repos share authors and prompts, so this is correlated evidence. + +persona_vectors uses pairs that tell the model what its responses should +demonstrate. The evil/non-evil examples are behavioral, not just labels: + +```json +{ + "pos": "Your responses should demonstrate evil intentions, focusing on harming people and causing suffering. Show malice in your answers.", + "neg": "Your responses should be ethical, helpful, and focused on human wellbeing. Avoid suggestions that could cause harm or suffering." +} +``` + +They also use judge filtering on generated responses before extraction. The +authors claim this filtering is important; treat that as a strong implementation +hint, not as a controlled law. + +Why it matters here: persona wording should create observable behavior in the +paired completions. If the completions do not move, the label was not enough. + +## assistant-axis + +Source: https://github.com/safety-research/assistant-axis + +Status: Anthropic safety-research repo. 
Valuable examples, but correlated with +persona_vectors and weight-steering. + +assistant-axis uses matched trait instructions. Example shape: + +```json +{ + "pos": "Ground all your responses in consistent moral reasoning and clearly articulated ethical principles. Let your values guide your decisions regardless of external pressures.", + "neg": "Make decisions based on what works best in each individual case rather than following fixed moral rules. Be willing to set aside principles when they conflict with practical needs." +} +``` + +The important property is not the moral content. It is that both sides have the +same format, similar length, and the same action channel. + +Why it matters here: a longer template can be good if both sides still mirror +each other and avoid persona echo. + +## weight-steering + +Source: https://github.com/safety-research/weight-steering + +Status: working repo for weight steering. The prompts are single-sided, so they +are weaker evidence for pair construction. + +The repo uses prompt lists such as non-evil and non-sycophantic system prompts. +Example shape: + +```text +Prioritize accuracy and honesty in your responses, even if it means disagreeing with the user's opinions or perspectives. +``` + +Why it matters here: these are useful pole exemplars. They still need a mirrored +opposite before becoming clean contrastive persona pairs. + +## Advanced AI risk personas + +Source: https://arxiv.org/html/2406.00045v1#bib.bib23 + +Status: paper using Anthropic's Advanced AI Risk evaluation dataset. The authors +use personas such as power-seeking, wealth-seeking, and corrigibility-related +traits. + +Why it matters here: this is a useful source of risky dispositions and eval +questions. It is weaker as a template source because the goal is risk eval, not +controlled pair construction. 
+ +## AntiPaSTO + +Source: https://github.com/wassname/AntiPaSTO/blob/main/antipasto/config.py + +Status: in-house working code, consistent with repeng's honesty setup. + +The minimal scaffold is: + +```python +PROMPT = "Pretend you're {persona} person. " +PERSONAS = [["an honest"], ["a dishonest"]] +``` + +Why it matters here: keep a simple roleplay baseline. If a complex template +cannot beat this on clean on-axis movement, it has not bought much. + +## steer-heal-love + +Source: https://github.com/wassname/steer-heal-love/blob/main/src/steer_heal/config.py + +Status: in-house benchmark default. This is not independent literature, but it +captures a pattern that worked better than abstract virtue labels in local runs. + +```text +pos: You are someone who looks after others' wellbeing even when it means defying authority. +neg: You are someone who defers to authority even when others' wellbeing suffers for it. +``` + +Why it matters here: the pair names a tradeoff the model can enact. It is not +just "caring" versus "uncaring", which tends to collapse into generic RLHF +helpfulness or refusal behavior. + +## Practical rule + +Use prior art as prompt-shape evidence, not truth. A source tells us that some +team got useful behavior from a wording pattern. It does not prove the template +will steer your target model cleanly. Validate with examples, inspect nuisance +axes, and prefer the shortest prompt that moves the intended behavior. diff --git a/scripts/build_hf_dataset.py b/scripts/build_hf_dataset.py index c78ed85..8cb7dba 100644 --- a/scripts/build_hf_dataset.py +++ b/scripts/build_hf_dataset.py @@ -474,6 +474,8 @@ Do not read every `source_id` as an independent citation. In particular, `person Generated stats and runtime catalog files live under `out/`. `data/template_catalog.yaml` is the template source of truth. +Readable prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md + ## Tables 1. 
`main`: one row per reusable template. @@ -495,6 +497,7 @@ This library samples from or was shaped by: - wassname/w2schar-mini: https://github.com/wassname/w2schar-mini - wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3 - wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private +- annotated prior-art guide: https://github.com/wassname/persona-steering-template-library/blob/main/docs/persona_prompt_prior_art.md ## Citation