simplify public readme

wassname
2026-06-13 14:23:47 +08:00
parent 6a19b65e49
commit 1461e930e5
3 changed files with 115 additions and 300 deletions
+54 -59
@@ -1,43 +1,50 @@
# Persona Steering Template Library
Small, measured persona/template pairs for steering-vector and preference-pair experiments.
Evaluated persona/template candidates for steering-vector and preference-pair experiments.
- Hugging Face dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library
- Guide: [docs/guide.md](docs/guide.md)
Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library
## Example
## What This Measures
```text
template:
You are a {persona} person thinking through the situation.
How do we know if a persona template is good? We want on-axis variation, but not
off-axis variation.
negative persona:
authority-deferential even when wellbeing suffers
If we choose `honest` and `dishonest` personas, use a template like
`You are a {{ persona }} assistant`, and ask `The Eiffel Tower is in`, we want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language.
positive persona:
wellbeing-focused even when authority-defying
So we try persona/template pairs on one model. We use another model as a judge,
which rates on-axis and off-axis variation. The final `score` rewards on-axis
variation and penalizes off-axis variation, style movement, persona echo, and
refusals.
measured pilot:
strict_pass_rate = 0.75
mean_axis_delta = 6.25
mean_off_axis_problem = 2.00
mean_max_style_abs_delta = 1.50
```
This field is pre-scientific in a way: it is still an art. I collected a wide
sampling of what people have used, minimally measured it, and put it here to
make it accessible to more people and agents.
The point is not "this sounds like a good prompt". The point is to measure
whether the positive and negative personas separate the intended axis without
mostly separating length, tone, confidence, refusal, or persona-echo.
The dataset has persona templates in Jinja2 format, scores for each measured
template/persona-pair cell, and source attribution where known.
If the pair is `honest -> untruthful`, `in Paris` versus `in Berlin` is
on-axis. `in Paris` versus `I refuse to answer` is not clean: the contrast is
mostly answer/refusal behavior.
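A minimal sketch of how one template and one persona pair become the two prompts being compared. It uses a small regex as a stand-in for Jinja2 and substitutes only the `{{ persona }}` slot. The function names and the way the persona line is joined to the question are assumptions for illustration, not the layout used by the validation script.

```python
import re

# Stand-in for Jinja2 rendering: only the `{{ persona }}` slot is substituted.
_PERSONA_SLOT = re.compile(r"\{\{\s*persona\s*\}\}")


def render(template: str, persona: str) -> str:
    """Fill the persona slot, failing loudly if the template has none."""
    if not _PERSONA_SLOT.search(template):
        raise ValueError("template has no {{ persona }} slot")
    # A function replacement keeps any backslashes in `persona` literal.
    return _PERSONA_SLOT.sub(lambda _match: persona, template)


def contrastive_prompts(template: str, pos: str, neg: str, prompt: str) -> dict:
    """Build the two prompts whose completions get compared (layout is assumed)."""
    return {
        "pos": f"{render(template, pos)}\n\n{prompt}",
        "neg": f"{render(template, neg)}\n\n{prompt}",
    }


pair = contrastive_prompts(
    "You are a {{ persona }} assistant", "honest", "dishonest", "The Eiffel Tower is in"
)
print(pair["pos"])
print(pair["neg"])
```

Everything except the persona string is identical across the two prompts, so any difference in the completions comes from the persona or from how the template frames it.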
## Use
Start with the `scores` split on Hugging Face.
Important columns:
- `template_jinja`
- `score`
- `persona_pair_id`
- `axis`
- `source_id`
- `source_type`
Then check `judged_examples` to see the paired completions behind the score.
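One way to use these columns, sketched under assumptions: keep the highest-scoring template for each persona pair, then go read its examples. The rows below are invented placeholders, not values from the dataset, and `best_cell_per_pair` is a hypothetical helper. Load the real rows however you prefer.

```python
def best_cell_per_pair(rows):
    """Return one row per persona_pair_id, keeping the highest score."""
    best = {}
    for row in rows:
        pair_id = row["persona_pair_id"]
        if pair_id not in best or row["score"] > best[pair_id]["score"]:
            best[pair_id] = row
    # Highest-scoring pairs first, so the most promising cells are read first.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Invented rows for illustration only; real rows come from the `scores` split.
rows = [
    {"persona_pair_id": "pair_a", "template_jinja": "You are a {{ persona }} assistant", "score": 40.0},
    {"persona_pair_id": "pair_a", "template_jinja": "Act as a {{ persona }} person.", "score": 55.0},
    {"persona_pair_id": "pair_b", "template_jinja": "You are a {{ persona }} assistant", "score": 20.0},
]

for row in best_cell_per_pair(rows):
    print(row["persona_pair_id"], row["score"], row["template_jinja"])
```

The score only orders the reading list. The paired completions decide whether a cell is actually clean.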
## Score
On Hugging Face, start with `template_pair_scores`.
`score` is a conservative 0-100 clean-axis score:
```text
100
* strict_pass_rate
@@ -52,39 +59,13 @@ High score means the template/persona-pair cell repeatedly moved the intended
axis while staying comparatively clean on off-axis, style, persona-echo, and
refusal checks.
## What To Browse
## Provenance
On Hugging Face:
Sources are marked in the dataset as `source_id` and `source_type`. Some entries
come from papers, some from associated code/trait files, and some from wassname
anecdotes/design notes.
- `template_pair_scores`: clean selection table with `id`, `template_jinja`, `score`, source attribution, model metadata, and score components
- `template_scores`: one row per template, aggregated over measured persona pairs
- `persona_pairs_v2_review`: one row per candidate persona pair
- `v2_pilot_seed23_examples`: raw completions and judge ratings
The examples are still the proof. The score is only a fast sorting key.
## Files
- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs
- `data/templates_v2_candidates.txt`: candidate `{persona}` templates
- `data/scenarios_v2_candidates.jsonl`: scenario prompts for sweeps
- `data/v2_pilot_seed23_*`: first measured v2 pilot
- `scripts/validate_persona_axes_openrouter.py`: OpenRouter validation sweep
- `scripts/export_persona_template_stats.py`: aggregate raw sweep output
- `scripts/build_hf_dataset.py`: build parquet-only HF upload folder
## Current Status
Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4
scenarios. It is enough to show the measurement format and identify a few
promising cells, not enough to certify a general template.
Current pilot: completions from `qwen/qwen3.5-27b`, judge
`google/gemini-3.1-flash-lite-preview`, OpenRouter, `temperature=0`, seed `23`.
A/B labels are randomized; the judge separately rates positive-axis,
negative-axis, style, and off-axis/confound questions.
## Run
## Appendix: Run
```sh
uv sync
@@ -97,4 +78,18 @@ uv run python scripts/validate_persona_axes_openrouter.py \
--out out/dryrun.json
```
See [docs/guide.md](docs/guide.md) for measured runs, export, and upload.
```sh
uv run python scripts/build_hf_dataset.py \
--out /tmp/persona-steering-template-library-hf
```
## Citation
```bibtex
@misc{wassname_persona_steering_template_library_2026,
title = {Persona Steering Template Library},
author = {Wassname},
year = {2026},
url = {https://github.com/wassname/persona-steering-template-library}
}
```
+35 -165
@@ -1,43 +1,31 @@
# Guide
This library tests persona prompt templates for contrastive steering data.
## What This Measures
## One Example
How do we know if a persona template is good? We want on-axis variation, but not
off-axis variation.
```text
template:
You are a {persona} person thinking through the situation.
If we choose `honest` and `dishonest` personas, use a template like
`You are a {{ persona }} assistant`, and ask `The Eiffel Tower is in`, we want
the completions to vary on the honest/dishonest axis. `in Paris` versus
`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
not good, because it is confounded by refusal. Other confounds include length,
verbosity, confidence, style, and language.
negative persona:
authority-deferential even when wellbeing suffers
So we try persona/template pairs on one model. We use another model as a judge,
which rates on-axis and off-axis variation. The final `score` rewards on-axis
variation and penalizes off-axis variation, style movement, persona echo, and
refusals.
positive persona:
wellbeing-focused even when authority-defying
This field is pre-scientific in a way: it is still an art. I collected a wide
sampling of what people have used, minimally measured it, and put it here to
make it accessible to more people and agents.
measured pilot:
strict_pass_rate = 0.75
mean_axis_delta = 6.25
mean_off_axis_problem = 2.00
mean_max_style_abs_delta = 1.50
```
OBS: This is a template plus a persona pair. The template supplies the behavior
channel; the pair supplies the contrastive axis.
INF: I think the useful object is the measured `template x persona_pair` cell,
not a persona string by itself. - wassname
If the pair is `honest -> untruthful`, a clean template should move
truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus
`I refuse to answer` is not clean, because the model mainly changed refusal
behavior.
The dataset has persona templates in Jinja2 format, scores for each measured
template/persona-pair cell, and source attribution where known.
## Score
OBS: The clean selection table is `template_pair_scores`.
OBS: `score` is:
```text
100
* strict_pass_rate
@@ -48,146 +36,28 @@ OBS: `score` is:
* (1 - refusal_or_ai_break_rate)
```
INF: High score means a measured cell repeatedly moved the intended axis without
High score means a measured cell repeatedly moved the intended axis without
large off-axis, style, persona-echo, or refusal movement.
INF: Low score can mean either "no axis movement" or "movement exists but is
Low score can mean either "no axis movement" or "movement exists but is
confounded". Read the component columns before trusting it.
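A hedged sketch of why two very different cells can land on the same low score. The diff shows only the first and last factors of the product (`strict_pass_rate` and `1 - refusal_or_ai_break_rate`); the middle factors are elided, so this sketch takes them as an opaque `other_factors` argument and does not guess what they are. `clean_axis_score` is a hypothetical name, and the input values are made up.

```python
import math


def clean_axis_score(strict_pass_rate, refusal_or_ai_break_rate, other_factors=()):
    """Multiply 100 by factors in [0, 1]; any single weak factor drags the score down.

    `other_factors` stands in for the terms the diff elides (assumption).
    """
    factors = [strict_pass_rate, *other_factors, 1.0 - refusal_or_ai_break_rate]
    for factor in factors:
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"factor out of range: {factor}")
    return 100.0 * math.prod(factors)


# Made-up inputs: little axis movement, no refusals.
no_movement = clean_axis_score(strict_pass_rate=0.1, refusal_or_ai_break_rate=0.0)
# Made-up inputs: frequent strict passes, but most pairs hit a refusal.
confounded = clean_axis_score(strict_pass_rate=0.8, refusal_or_ai_break_rate=0.875)
print(no_movement, confounded)
```

Both calls give roughly the same score for opposite reasons, which is why the component columns matter more than the headline number.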
## Browse
## Public Splits
Start with the Hugging Face split `template_pair_scores`.
- `scores`: one row per measured template/persona-pair cell.
- `template_scores`: one row per template, aggregated over measured pairs.
- `persona_pairs`: candidate persona pairs, with best measured score where available.
- `template_candidates`: all candidate Jinja2 templates.
- `scenario_prompts`: prompts used for the pilot measurement.
- `judged_examples`: paired completions and judge ratings.
- `id`: stable row id
- `template_jinja`: Jinja2 template with `{{ persona }}`
- `persona_pair_id`: measured persona pair
- `score`: clean-axis score
- `source_id` / `source_type`: attribution when known
- `mean_axis_delta`: intended-axis movement
- `mean_off_axis_problem`: confound rating
- `mean_max_style_abs_delta`: style movement
- `generator_model` / `judge_model`: measurement models
## Notes
Then open `v2_pilot_seed23_examples` and read the paired completions. The score
is a sorting key; the examples are the proof.
This measurement is not perfect: it is one model, one seed, and one automatic
judge. But it is better than choosing persona templates by vibe. We randomize
A/B order to reduce position bias, ask separate positive-axis and negative-axis
questions, and use `temperature=0` to reduce sampling variation in completions.
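The A/B randomization can be sketched as follows. This is an illustration of the idea, not the code in `scripts/validate_persona_axes_openrouter.py`: `blind_pair` and `unblind` are hypothetical helpers, and the ratings passed to `unblind` are invented.

```python
import random


def blind_pair(pos_completion, neg_completion, rng):
    """Shuffle which completion is shown as A or B; return the texts and the key."""
    order = ["pos", "neg"]
    rng.shuffle(order)
    texts = {"pos": pos_completion, "neg": neg_completion}
    shown = {"A": texts[order[0]], "B": texts[order[1]]}
    key = {"A": order[0], "B": order[1]}  # kept away from the judge
    return shown, key


def unblind(ratings_by_label, key):
    """Map the judge's per-label ratings back to pos/neg."""
    return {key[label]: rating for label, rating in ratings_by_label.items()}


rng = random.Random(23)  # seeded so the assignment is reproducible
shown, key = blind_pair("in Paris", "in Berlin", rng)
ratings = unblind({"A": 6, "B": 2}, key)  # invented judge ratings
print(shown, key, ratings)
```

Because the judge only ever sees `A` and `B`, a preference for the first slot averages out across pairs instead of consistently favoring the positive persona.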
## Wassname Anecdotes / Design Notes
OBS: The current candidate files separate three things:
- persona pairs: `data/persona_pairs_v2_candidates.jsonl`
- templates: `data/templates_v2_candidates.txt`
- scenarios: `data/scenarios_v2_candidates.jsonl`
INF: Templates should have a `{persona}` slot and should be measured across
multiple persona pairs. - wassname
INF: Some templates should bind a task or behavior channel, such as acting,
thinking, judging, making statements, or understanding. - wassname
INF: The axis label can usually just be `{neg}->{pos}`. - wassname
INF: Length matching is desirable, but hard enough that this library reports
length deltas instead of using a brittle hard filter. - wassname
## What Counts As A Useful Cell
OBS: A measured row has intended-axis ratings and confound ratings.
INF: A useful cell should have:
- visible intended-axis Likert separation;
- low off-axis/confound rating;
- low style movement;
- no persona echo;
- no refusal or role break;
- no obvious length-only explanation.
I think `recommended=true` should mean "worth a larger sweep", not "certified".
- wassname
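The checklist above could be turned into a filter along these lines. This is a sketch under stated assumptions: the thresholds are illustrative and must be chosen by the reader, `failed_checks` is a hypothetical helper, and persona echo and length are left out because their column names are not shown here. The example cell reuses the pilot numbers quoted earlier, with an assumed refusal rate of zero.

```python
def failed_checks(cell, *, min_axis_delta, max_off_axis, max_style_delta, max_refusal_rate):
    """Return the names of the checks a measured cell fails (empty list = promising)."""
    failures = []
    if cell["mean_axis_delta"] < min_axis_delta:
        failures.append("axis separation too small")
    if cell["mean_off_axis_problem"] > max_off_axis:
        failures.append("off-axis/confound rating too high")
    if cell["mean_max_style_abs_delta"] > max_style_delta:
        failures.append("style movement too large")
    if cell["refusal_or_ai_break_rate"] > max_refusal_rate:
        failures.append("refusals or role breaks present")
    return failures


cell = {
    "mean_axis_delta": 6.25,
    "mean_off_axis_problem": 2.00,
    "mean_max_style_abs_delta": 1.50,
    "refusal_or_ai_break_rate": 0.0,  # assumed for the example
}

# Thresholds below are made up for illustration, not recommended values.
print(failed_checks(cell, min_axis_delta=3.0, max_off_axis=3.0,
                    max_style_delta=2.0, max_refusal_rate=0.0))
```

Returning the failed checks instead of a bare boolean keeps the reason visible, which fits reading `recommended=true` as "worth a larger sweep".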
## Measurement Details
OBS: Current pilot completions use `qwen/qwen3.5-27b`.
OBS: Current pilot judging uses `google/gemini-3.1-flash-lite-preview`.
OBS: Generation uses `temperature=0.0` and seed `23`.
OBS: The judge sees randomized A/B labels and separate positive-axis,
negative-axis, style, and off-axis/confound prompts.
INF: This reduces sampling, position, and framing confounds. It does not remove
the usual limitations of one small automatic-judge pilot. - wassname
## Sources To Check
No evidence dump is committed here yet. When checking a claim, distinguish papers
from associated code:
- Paper: Tan et al., [arXiv:2407.12404](https://arxiv.org/abs/2407.12404), for spurious prompt/template factors and per-input steering variance.
- Paper: Rimsky et al., [arXiv:2312.06681](https://arxiv.org/abs/2312.06681), for contrastive activation addition.
- Paper or associated code: Chen et al., Persona Vectors.
- Paper or associated code: Lu et al., Assistant Axis.
- Code / practitioner examples: [vgel/repeng](https://github.com/vgel/repeng).
- Associated code / trait files: safety-research `persona_vectors`, `assistant-axis`, and `weight-steering`.
## Candidate Shape
Good:
```json
{
"id": "wellbeing_authority",
"pos": "wellbeing-focused even when authority-defying",
"neg": "authority-deferential even when wellbeing suffers",
"positive_behavior": "prioritizes preventing harm to people even when doing so conflicts with orders, protocol, or authority",
"negative_behavior": "prioritizes obedience, protocol, or institutional authority even when that allows harm to people"
}
```
Weak:
```json
{
"pos": "good",
"neg": "not good"
}
```
INF: The weak shape is broad, negated, and likely to mix moral valence,
compliance, refusal, and style.
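A rough lint for candidate records, assuming the five fields in the good shape are all required. `candidate_problems` is a hypothetical helper, and the bare-negation check is a crude heuristic of my own: it catches `good` versus `not good` but says nothing about whether a pair is broad or mixes confounds.

```python
from typing import List

REQUIRED_FIELDS = ("id", "pos", "neg", "positive_behavior", "negative_behavior")


def candidate_problems(pair: dict) -> List[str]:
    """List shape problems in one candidate persona pair (heuristic, not a proof)."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not str(pair.get(name, "")).strip()
    ]
    pos = str(pair.get("pos", "")).strip().lower()
    neg = str(pair.get("neg", "")).strip().lower()
    # A negative persona that only negates the positive one adds no behavior of its own.
    if pos and neg in {f"not {pos}", f"non-{pos}", f"un{pos}"}:
        problems.append("neg is a bare negation of pos")
    return problems


good = {
    "id": "wellbeing_authority",
    "pos": "wellbeing-focused even when authority-defying",
    "neg": "authority-deferential even when wellbeing suffers",
    "positive_behavior": "prioritizes preventing harm to people even when doing so conflicts with orders, protocol, or authority",
    "negative_behavior": "prioritizes obedience, protocol, or institutional authority even when that allows harm to people",
}
weak = {"pos": "good", "neg": "not good"}

print(candidate_problems(good))
print(candidate_problems(weak))
```

A record that passes this lint is only well-formed. Whether it separates the intended axis still has to be measured.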
## Run
Dry run:
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
--dry-run \
--axes data/persona_pairs_v2_candidates.jsonl \
--templates data/templates_v2_candidates.txt \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--out out/dryrun.json
```
Measured run:
```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_v2_candidates.jsonl \
--templates data/templates_v2_candidates.txt \
--family data/scenarios_v2_candidates.jsonl \
--n 4 \
--gen-temperature 0 \
--seed 23 \
--out out/persona_template_library_v2_seed23.json
```
Build the Hugging Face parquet folder:
```sh
uv run python scripts/build_hf_dataset.py \
--out /tmp/persona-steering-template-library-hf
```
Sources are marked as `source_id` and `source_type`. Some entries come from
papers, some from associated code/trait files, and some from wassname
anecdotes/design notes.
+26 -76
@@ -20,17 +20,6 @@ ROOT = Path(__file__).resolve().parents[1]
DATA = ROOT / "data"
TABLE_SOURCES = {
"template_stats": DATA / "template_stats.jsonl",
"template_pair_stats": DATA / "template_pair_stats.jsonl",
"examples": DATA / "examples.jsonl",
"persona_pairs_v2_candidates": DATA / "persona_pairs_v2_candidates.jsonl",
"scenarios_v2_candidates": DATA / "scenarios_v2_candidates.jsonl",
"v2_pilot_seed23_template_stats": DATA / "v2_pilot_seed23_template_stats.jsonl",
"v2_pilot_seed23_template_pair_stats": DATA / "v2_pilot_seed23_template_pair_stats.jsonl",
"v2_pilot_seed23_examples": DATA / "v2_pilot_seed23_examples.jsonl",
}
V2_PILOT_META = {
"measurement_id": "v2_pilot_seed23",
"generator_model": "qwen/qwen3.5-27b",
@@ -324,34 +313,23 @@ configs:
GitHub repository: https://github.com/wassname/persona-steering-template-library
Small, measured persona/template pairs for steering-vector and preference-pair experiments.
Evaluated persona/template candidates for steering-vector and preference-pair experiments.
## Example
## What This Measures
```text
template:
You are a {{persona}} person thinking through the situation.
How do we know if a persona template is good? We want on-axis variation, but not off-axis variation.
negative persona:
authority-deferential even when wellbeing suffers
If we choose `honest` and `dishonest` personas, use a template like `You are a {{{{ persona }}}} assistant`, and ask `The Eiffel Tower is in`, we want the completions to vary on the honest/dishonest axis. `in Paris` versus `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is not good, because it is confounded by refusal. Other confounds include length, verbosity, confidence, style, and language.
positive persona:
wellbeing-focused even when authority-defying
So we try persona/template pairs on one model. We use another model as a judge, which rates on-axis and off-axis variation. The final `score` rewards on-axis variation and penalizes off-axis variation, style movement, persona echo, and refusals.
measured pilot:
strict_pass_rate = 0.75
mean_axis_delta = 6.25
mean_off_axis_problem = 2.00
mean_max_style_abs_delta = 1.50
```
This field is pre-scientific in a way: it is still an art. I collected a wide sampling of what people have used, minimally measured it, and put it here to make it accessible to more people and agents.
The point is to measure whether the positive and negative personas separate the intended axis without mostly separating length, tone, confidence, refusal, or persona-echo.
For example, if the persona pair is `honest -> untruthful`, a useful template should make the completions differ on truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus `I refuse to answer` is not clean, because the pair is mostly separating answer/refusal behavior.
The dataset has persona templates in Jinja2 format, scores for each measured template/persona-pair cell, and source attribution where known.
## Score
Start with `template_pair_scores`.
Start with `scores`.
The main column is `score`, a conservative 0-100 clean-axis score:
@@ -371,37 +349,12 @@ Low score can mean either no intended-axis movement or too much confounding. Rea
## What To Browse
1. `template_pair_scores`: clean selection table. Columns include `id`, `template_jinja`, `persona_pair_id`, `score`, source attribution, model metadata, and the score components.
1. `scores`: one row per measured template/persona-pair cell.
2. `template_scores`: one row per template, aggregated over the measured persona pairs.
3. `persona_pairs_v2_review`: one row per candidate persona pair.
4. `v2_pilot_seed23_examples`: raw paired completions and judge ratings.
`persona_pairs_v2_review` gives:
- `axis`: `neg->pos`
- `positive_behavior` / `negative_behavior`: what the pair should separate
- `proof_grade`: `pilot_recommended`, `pilot_measured_not_promoted`, or `candidate_unmeasured`
- `best_template`: best measured template for that pair, if any
- `best_axis_delta`, `best_off_axis_problem`, `best_max_style_abs_delta`: compact proof stats
Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative completions and judge ratings.
## Measurement
This pilot uses `qwen/qwen3.5-27b` for completions and `google/gemini-3.1-flash-lite-preview` as the judge through OpenRouter. Generation temperature is `0.0`, with seed `23`, to reduce sampling noise.
The judge sees randomized A/B labels. It separately rates positive-axis behavior, negative-axis behavior, surface style, and off-axis/confound risk. This reduces simple position/framing bias, but it is still one automatic judge on one small pilot.
## Current Status
Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4 scenarios. It is enough to show the measurement format and identify a few promising cells, not enough to certify a general template.
Counts:
- 16 v2 candidate persona pairs
- 12 v2 candidate templates
- 12 v2 candidate scenarios
- v2 pilot: 64 planned pairs, 59 successful judged pairs, 5 judge JSON failures
3. `persona_pairs`: candidate persona pairs, with best measured score where available.
4. `template_candidates`: all candidate Jinja2 templates.
5. `scenario_prompts`: prompts used for the pilot measurement.
6. `judged_examples`: paired completions and judge ratings.
"""
@@ -415,28 +368,25 @@ def main() -> None:
parquet_dir = args.out / "parquet"
parquet_dir.mkdir(parents=True)
tables = {name: _read_jsonl(path) for name, path in TABLE_SOURCES.items()}
tables["templates_v2_candidates"] = _template_rows(DATA / "templates_v2_candidates.txt")
tables["template_pair_scores"] = _template_pair_score_rows()
tables["template_scores"] = _template_score_rows(tables["template_pair_scores"])
tables["persona_pairs_v2_review"] = _persona_pair_review_rows(tables["template_pair_scores"])
tables = {
"scores": _template_pair_score_rows(),
"template_candidates": _template_rows(DATA / "templates_v2_candidates.txt"),
"scenario_prompts": _read_jsonl(DATA / "scenarios_v2_candidates.jsonl"),
"judged_examples": _read_jsonl(DATA / "v2_pilot_seed23_examples.jsonl"),
}
tables["template_scores"] = _template_score_rows(tables["scores"])
tables["persona_pairs"] = _persona_pair_review_rows(tables["scores"])
for name, rows in tables.items():
_write_parquet(parquet_dir / f"{name}.parquet", rows)
names = [
"template_pair_scores",
"scores",
"template_scores",
"persona_pairs_v2_review",
"templates_v2_candidates",
"persona_pairs_v2_candidates",
"scenarios_v2_candidates",
"v2_pilot_seed23_template_pair_stats",
"v2_pilot_seed23_template_stats",
"v2_pilot_seed23_examples",
"template_pair_stats",
"template_stats",
"examples",
"persona_pairs",
"template_candidates",
"scenario_prompts",
"judged_examples",
]
(args.out / "README.md").write_text(_readme(names))
print(f"built {args.out}")