simplify public readme

2026-06-27 15:16:06 +08:00 · 2026-06-13 14:23:47 +08:00
parent 6a19b65e49
commit 1461e930e5
3 changed files with 115 additions and 300 deletions
@@ -1,43 +1,50 @@
 # Persona Steering Template Library

-Small, measured persona/template pairs for steering-vector and preference-pair experiments.
+Evaluated persona/template candidates for steering-vector and preference-pair experiments.

-- Hugging Face dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library
-- Guide: [docs/guide.md](docs/guide.md)
+Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library

-## Example
+## What This Measures

-```text
-template:
-  You are a {persona} person thinking through the situation.
+How do we know if a persona template is good? We want on-axis variation, but not
+off-axis variation.

-negative persona:
-  authority-deferential even when wellbeing suffers
+If we choose `honest` and `dishonest` personas, use a template like
+`You are a {{ persona }} assistant`, and ask `The Eiffel Tower is in`, we want
+the completions to vary on the honest/dishonest axis. `in Paris` versus
+`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
+not good, because it is confounded by refusal. Other confounds include length,
+verbosity, confidence, style, and language.

-positive persona:
-  wellbeing-focused even when authority-defying
+So we try persona/template pairs on one model. We use another model as a judge,
+which rates on-axis and off-axis variation. The final `score` rewards on-axis
+variation and penalizes off-axis variation, style movement, persona echo, and
+refusals.

-measured pilot:
-  strict_pass_rate = 0.75
-  mean_axis_delta = 6.25
-  mean_off_axis_problem = 2.00
-  mean_max_style_abs_delta = 1.50
-```
+This field is pre-scientific in a way: it is still an art. I collected a wide
+sampling of what people have used, minimally measured it, and put it here to
+make it accessible to more people and agents.

-The point is not "this sounds like a good prompt". The point is to measure
-whether the positive and negative personas separate the intended axis without
-mostly separating length, tone, confidence, refusal, or persona-echo.
+The dataset has persona templates in Jinja2 format, scores for each measured
+template/persona-pair cell, and source attribution where known.

-If the pair is `honest -> untruthful`, `in Paris` versus `in Berlin` is
-on-axis. `in Paris` versus `I refuse to answer` is not clean: the contrast is
-mostly answer/refusal behavior.
+## Use
+
+Start with the `scores` split on Hugging Face.
+
+Important columns:
+
+- `template_jinja`
+- `score`
+- `persona_pair_id`
+- `axis`
+- `source_id`
+- `source_type`
+
+Then check `judged_examples` to see the paired completions behind the score.

 ## Score

-On Hugging Face, start with `template_pair_scores`.
-
-`score` is a conservative 0-100 clean-axis score:
-
 ```text
 100
 * strict_pass_rate
@@ -52,39 +59,13 @@ High score means the template/persona-pair cell repeatedly moved the intended
 axis while staying comparatively clean on off-axis, style, persona-echo, and
 refusal checks.

-## What To Browse
+## Provenance

-On Hugging Face:
+Sources are marked in the dataset as `source_id` and `source_type`. Some entries
+come from papers, some from associated code/trait files, and some from wassname
+anecdotes/design notes.

-- `template_pair_scores`: clean selection table with `id`, `template_jinja`, `score`, source attribution, model metadata, and score components
-- `template_scores`: one row per template, aggregated over measured persona pairs
-- `persona_pairs_v2_review`: one row per candidate persona pair
-- `v2_pilot_seed23_examples`: raw completions and judge ratings
-
-The examples are still the proof. The score is only a fast sorting key.
-
-## Files
-
-- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs
-- `data/templates_v2_candidates.txt`: candidate `{persona}` templates
-- `data/scenarios_v2_candidates.jsonl`: scenario prompts for sweeps
-- `data/v2_pilot_seed23_*`: first measured v2 pilot
-- `scripts/validate_persona_axes_openrouter.py`: OpenRouter validation sweep
-- `scripts/export_persona_template_stats.py`: aggregate raw sweep output
-- `scripts/build_hf_dataset.py`: build parquet-only HF upload folder
-
-## Current Status
-
-Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4
-scenarios. It is enough to show the measurement format and identify a few
-promising cells, not enough to certify a general template.
-
-Current pilot: completions from `qwen/qwen3.5-27b`, judge
-`google/gemini-3.1-flash-lite-preview`, OpenRouter, `temperature=0`, seed `23`.
-A/B labels are randomized; the judge separately rates positive-axis,
-negative-axis, style, and off-axis/confound questions.
-
-## Run
+## Appendix: Run

 ```sh
 uv sync
@@ -97,4 +78,18 @@ uv run python scripts/validate_persona_axes_openrouter.py \
  --out out/dryrun.json
 ```

-See [docs/guide.md](docs/guide.md) for measured runs, export, and upload.
+```sh
+uv run python scripts/build_hf_dataset.py \
+  --out /tmp/persona-steering-template-library-hf
+```
+
+## Citation
+
+```bibtex
+@misc{wassname_persona_steering_template_library_2026,
+  title = {Persona Steering Template Library},
+  author = {Wassname},
+  year = {2026},
+  url = {https://github.com/wassname/persona-steering-template-library}
+}
+```
@@ -1,43 +1,31 @@
 # Guide

-This library tests persona prompt templates for contrastive steering data.
+## What This Measures

-## One Example
+How do we know if a persona template is good? We want on-axis variation, but not
+off-axis variation.

-```text
-template:
-  You are a {persona} person thinking through the situation.
+If we choose `honest` and `dishonest` personas, use a template like
+`You are a {{ persona }} assistant`, and ask `The Eiffel Tower is in`, we want
+the completions to vary on the honest/dishonest axis. `in Paris` versus
+`in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
+not good, because it is confounded by refusal. Other confounds include length,
+verbosity, confidence, style, and language.

-negative persona:
-  authority-deferential even when wellbeing suffers
+So we try persona/template pairs on one model. We use another model as a judge,
+which rates on-axis and off-axis variation. The final `score` rewards on-axis
+variation and penalizes off-axis variation, style movement, persona echo, and
+refusals.

-positive persona:
-  wellbeing-focused even when authority-defying
+This field is pre-scientific in a way: it is still an art. I collected a wide
+sampling of what people have used, minimally measured it, and put it here to
+make it accessible to more people and agents.

-measured pilot:
-  strict_pass_rate = 0.75
-  mean_axis_delta = 6.25
-  mean_off_axis_problem = 2.00
-  mean_max_style_abs_delta = 1.50
-```
-
-OBS: This is a template plus a persona pair. The template supplies the behavior
-channel; the pair supplies the contrastive axis.
-
-INF: I think the useful object is the measured `template x persona_pair` cell,
-not a persona string by itself. - wassname
-
-If the pair is `honest -> untruthful`, a clean template should move
-truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus
-`I refuse to answer` is not clean, because the model mainly changed refusal
-behavior.
+The dataset has persona templates in Jinja2 format, scores for each measured
+template/persona-pair cell, and source attribution where known.

 ## Score

-OBS: The clean selection table is `template_pair_scores`.
-
-OBS: `score` is:
-
 ```text
 100
 * strict_pass_rate
@@ -48,146 +36,28 @@ OBS: `score` is:
 * (1 - refusal_or_ai_break_rate)
 ```

-INF: High score means a measured cell repeatedly moved the intended axis without
+High score means a measured cell repeatedly moved the intended axis without
 large off-axis, style, persona-echo, or refusal movement.

-INF: Low score can mean either "no axis movement" or "movement exists but is
+Low score can mean either "no axis movement" or "movement exists but is
 confounded". Read the component columns before trusting it.

-## Browse
+## Public Splits

-Start with the Hugging Face split `template_pair_scores`.
+- `scores`: one row per measured template/persona-pair cell.
+- `template_scores`: one row per template, aggregated over measured pairs.
+- `persona_pairs`: candidate persona pairs, with best measured score where available.
+- `template_candidates`: all candidate Jinja2 templates.
+- `scenario_prompts`: prompts used for the pilot measurement.
+- `judged_examples`: paired completions and judge ratings.

-- `id`: stable row id
-- `template_jinja`: Jinja2 template with `{{ persona }}`
-- `persona_pair_id`: measured persona pair
-- `score`: clean-axis score
-- `source_id` / `source_type`: attribution when known
-- `mean_axis_delta`: intended-axis movement
-- `mean_off_axis_problem`: confound rating
-- `mean_max_style_abs_delta`: style movement
-- `generator_model` / `judge_model`: measurement models
+## Notes

-Then open `v2_pilot_seed23_examples` and read the paired completions. The score
-is a sorting key; the examples are the proof.
+This measurement is not perfect: it is one model, one seed, and one automatic
+judge. But it is better than choosing persona templates by vibe. We randomize
+A/B order to reduce position bias, ask separate positive-axis and negative-axis
+questions, and use `temperature=0` to reduce sampling variation in completions.

-## Wassname Anecdotes / Design Notes
-
-OBS: The current candidate files separate three things:
-
-- persona pairs: `data/persona_pairs_v2_candidates.jsonl`
-- templates: `data/templates_v2_candidates.txt`
-- scenarios: `data/scenarios_v2_candidates.jsonl`
-
-INF: Templates should have a `{persona}` slot and should be measured across
-multiple persona pairs. - wassname
-
-INF: Some templates should bind a task or behavior channel, such as acting,
-thinking, judging, making statements, or understanding. - wassname
-
-INF: The axis label can usually just be `{neg}->{pos}`. - wassname
-
-INF: Length matching is desirable, but hard enough that this library reports
-length deltas instead of using a brittle hard filter. - wassname
-
-## What Counts As A Useful Cell
-
-OBS: A measured row has intended-axis ratings and confound ratings.
-
-INF: A useful cell should have:
-
-- visible intended-axis Likert separation;
-- low off-axis/confound rating;
-- low style movement;
-- no persona echo;
-- no refusal or role break;
-- no obvious length-only explanation.
-
-I think `recommended=true` should mean "worth a larger sweep", not "certified".
-- wassname
-
-## Measurement Details
-
-OBS: Current pilot completions use `qwen/qwen3.5-27b`.
-
-OBS: Current pilot judging uses `google/gemini-3.1-flash-lite-preview`.
-
-OBS: Generation uses `temperature=0.0` and seed `23`.
-
-OBS: The judge sees randomized A/B labels and separate positive-axis,
-negative-axis, style, and off-axis/confound prompts.
-
-INF: This reduces sampling, position, and framing confounds. It does not remove
-the usual limitations of one small automatic-judge pilot. - wassname
-
-## Sources To Check
-
-No evidence dump is committed here yet. When checking a claim, distinguish papers
-from associated code:
-
-- Paper: Tan et al., [arXiv:2407.12404](https://arxiv.org/abs/2407.12404), for spurious prompt/template factors and per-input steering variance.
-- Paper: Rimsky et al., [arXiv:2312.06681](https://arxiv.org/abs/2312.06681), for contrastive activation addition.
-- Paper or associated code: Chen et al., Persona Vectors.
-- Paper or associated code: Lu et al., Assistant Axis.
-- Code / practitioner examples: [vgel/repeng](https://github.com/vgel/repeng).
-- Associated code / trait files: safety-research `persona_vectors`, `assistant-axis`, and `weight-steering`.
-
-## Candidate Shape
-
-Good:
-
-```json
-{
-  "id": "wellbeing_authority",
-  "pos": "wellbeing-focused even when authority-defying",
-  "neg": "authority-deferential even when wellbeing suffers",
-  "positive_behavior": "prioritizes preventing harm to people even when doing so conflicts with orders, protocol, or authority",
-  "negative_behavior": "prioritizes obedience, protocol, or institutional authority even when that allows harm to people"
-}
-```
-
-Weak:
-
-```json
-{
-  "pos": "good",
-  "neg": "not good"
-}
-```
-
-INF: The weak shape is broad, negated, and likely to mix moral valence,
-compliance, refusal, and style.
-
-## Run
-
-Dry run:
-
-```sh
-uv run python scripts/validate_persona_axes_openrouter.py \
-  --dry-run \
-  --axes data/persona_pairs_v2_candidates.jsonl \
-  --templates data/templates_v2_candidates.txt \
-  --family data/scenarios_v2_candidates.jsonl \
-  --n 2 \
-  --out out/dryrun.json
-```
-
-Measured run:
-
-```sh
-OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
-  --axes data/persona_pairs_v2_candidates.jsonl \
-  --templates data/templates_v2_candidates.txt \
-  --family data/scenarios_v2_candidates.jsonl \
-  --n 4 \
-  --gen-temperature 0 \
-  --seed 23 \
-  --out out/persona_template_library_v2_seed23.json
-```
-
-Build the Hugging Face parquet folder:
-
-```sh
-uv run python scripts/build_hf_dataset.py \
-  --out /tmp/persona-steering-template-library-hf
-```
+Sources are marked as `source_id` and `source_type`. Some entries come from
+papers, some from associated code/trait files, and some from wassname
+anecdotes/design notes.
@@ -20,17 +20,6 @@ ROOT = Path(__file__).resolve().parents[1]
 DATA = ROOT / "data"


-TABLE_SOURCES = {
-    "template_stats": DATA / "template_stats.jsonl",
-    "template_pair_stats": DATA / "template_pair_stats.jsonl",
-    "examples": DATA / "examples.jsonl",
-    "persona_pairs_v2_candidates": DATA / "persona_pairs_v2_candidates.jsonl",
-    "scenarios_v2_candidates": DATA / "scenarios_v2_candidates.jsonl",
-    "v2_pilot_seed23_template_stats": DATA / "v2_pilot_seed23_template_stats.jsonl",
-    "v2_pilot_seed23_template_pair_stats": DATA / "v2_pilot_seed23_template_pair_stats.jsonl",
-    "v2_pilot_seed23_examples": DATA / "v2_pilot_seed23_examples.jsonl",
-}
-
 V2_PILOT_META = {
    "measurement_id": "v2_pilot_seed23",
    "generator_model": "qwen/qwen3.5-27b",
@@ -324,34 +313,23 @@ configs:

 GitHub repository: https://github.com/wassname/persona-steering-template-library

-Small, measured persona/template pairs for steering-vector and preference-pair experiments.
+Evaluated persona/template candidates for steering-vector and preference-pair experiments.

-## Example
+## What This Measures

-```text
-template:
-  You are a {{persona}} person thinking through the situation.
+How do we know if a persona template is good? We want on-axis variation, but not off-axis variation.

-negative persona:
-  authority-deferential even when wellbeing suffers
+If we choose `honest` and `dishonest` personas, use a template like `You are a {{{{ persona }}}} assistant`, and ask `The Eiffel Tower is in`, we want the completions to vary on the honest/dishonest axis. `in Paris` versus `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is not good, because it is confounded by refusal. Other confounds include length, verbosity, confidence, style, and language.

-positive persona:
-  wellbeing-focused even when authority-defying
+So we try persona/template pairs on one model. We use another model as a judge, which rates on-axis and off-axis variation. The final `score` rewards on-axis variation and penalizes off-axis variation, style movement, persona echo, and refusals.

-measured pilot:
-  strict_pass_rate = 0.75
-  mean_axis_delta = 6.25
-  mean_off_axis_problem = 2.00
-  mean_max_style_abs_delta = 1.50
-```
+This field is pre-scientific in a way: it is still an art. I collected a wide sampling of what people have used, minimally measured it, and put it here to make it accessible to more people and agents.

-The point is to measure whether the positive and negative personas separate the intended axis without mostly separating length, tone, confidence, refusal, or persona-echo.
-
-For example, if the persona pair is `honest -> untruthful`, a useful template should make the completions differ on truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus `I refuse to answer` is not clean, because the pair is mostly separating answer/refusal behavior.
+The dataset has persona templates in Jinja2 format, scores for each measured template/persona-pair cell, and source attribution where known.

 ## Score

-Start with `template_pair_scores`.
+Start with `scores`.

 The main column is `score`, a conservative 0-100 clean-axis score:

@@ -371,37 +349,12 @@ Low score can mean either no intended-axis movement or too much confounding. Rea

 ## What To Browse

-1. `template_pair_scores`: clean selection table. Columns include `id`, `template_jinja`, `persona_pair_id`, `score`, source attribution, model metadata, and the score components.
+1. `scores`: one row per measured template/persona-pair cell.
 2. `template_scores`: one row per template, aggregated over the measured persona pairs.
-3. `persona_pairs_v2_review`: one row per candidate persona pair.
-4. `v2_pilot_seed23_examples`: raw paired completions and judge ratings.
-
-`persona_pairs_v2_review` gives:
-
-- `axis`: `neg->pos`
-- `positive_behavior` / `negative_behavior`: what the pair should separate
-- `proof_grade`: `pilot_recommended`, `pilot_measured_not_promoted`, or `candidate_unmeasured`
-- `best_template`: best measured template for that pair, if any
-- `best_axis_delta`, `best_off_axis_problem`, `best_max_style_abs_delta`: compact proof stats
-
-Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative completions and judge ratings.
-
-## Measurement
-
-This pilot uses `qwen/qwen3.5-27b` for completions and `google/gemini-3.1-flash-lite-preview` as the judge through OpenRouter. Generation temperature is `0.0`, with seed `23`, to reduce sampling noise.
-
-The judge sees randomized A/B labels. It separately rates positive-axis behavior, negative-axis behavior, surface style, and off-axis/confound risk. This reduces simple position/framing bias, but it is still one automatic judge on one small pilot.
-
-## Current Status
-
-Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4 scenarios. It is enough to show the measurement format and identify a few promising cells, not enough to certify a general template.
-
-Counts:
-
-- 16 v2 candidate persona pairs
-- 12 v2 candidate templates
-- 12 v2 candidate scenarios
-- v2 pilot: 64 planned pairs, 59 successful judged pairs, 5 judge JSON failures
+3. `persona_pairs`: candidate persona pairs, with best measured score where available.
+4. `template_candidates`: all candidate Jinja2 templates.
+5. `scenario_prompts`: prompts used for the pilot measurement.
+6. `judged_examples`: paired completions and judge ratings.
 """


@@ -415,28 +368,25 @@ def main() -> None:
    parquet_dir = args.out / "parquet"
    parquet_dir.mkdir(parents=True)

-    tables = {name: _read_jsonl(path) for name, path in TABLE_SOURCES.items()}
-    tables["templates_v2_candidates"] = _template_rows(DATA / "templates_v2_candidates.txt")
-    tables["template_pair_scores"] = _template_pair_score_rows()
-    tables["template_scores"] = _template_score_rows(tables["template_pair_scores"])
-    tables["persona_pairs_v2_review"] = _persona_pair_review_rows(tables["template_pair_scores"])
+    tables = {
+        "scores": _template_pair_score_rows(),
+        "template_candidates": _template_rows(DATA / "templates_v2_candidates.txt"),
+        "scenario_prompts": _read_jsonl(DATA / "scenarios_v2_candidates.jsonl"),
+        "judged_examples": _read_jsonl(DATA / "v2_pilot_seed23_examples.jsonl"),
+    }
+    tables["template_scores"] = _template_score_rows(tables["scores"])
+    tables["persona_pairs"] = _persona_pair_review_rows(tables["scores"])

    for name, rows in tables.items():
        _write_parquet(parquet_dir / f"{name}.parquet", rows)

    names = [
-        "template_pair_scores",
+        "scores",
        "template_scores",
-        "persona_pairs_v2_review",
-        "templates_v2_candidates",
-        "persona_pairs_v2_candidates",
-        "scenarios_v2_candidates",
-        "v2_pilot_seed23_template_pair_stats",
-        "v2_pilot_seed23_template_stats",
-        "v2_pilot_seed23_examples",
-        "template_pair_stats",
-        "template_stats",
-        "examples",
+        "persona_pairs",
+        "template_candidates",
+        "scenario_prompts",
+        "judged_examples",
    ]
    (args.out / "README.md").write_text(_readme(names))
    print(f"built {args.out}")
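
Note on the brace changes in the generated README text above: the old text used `{{persona}}` and the new text uses `{{{{ persona }}}}`. A plausible reading, assuming the README body in `scripts/build_hf_dataset.py` passes through `str.format` or an f-string (the diff does not show this), is that doubled braces are escapes, so the old form rendered as a single-brace `{persona}` and the new form renders as the Jinja2 slot `{{ persona }}`. The sketch below illustrates that escaping rule only; `README_OLD`, `README_NEW`, and the `names` field are hypothetical and not taken from the script.

```python
# Assumption: the README text is a format string, so literal braces must be
# doubled to survive formatting. These constants are illustrative only.
README_OLD = "Template: `You are a {{persona}} person` (splits: {names})"
README_NEW = "Template: `You are a {{{{ persona }}}} assistant` (splits: {names})"

# str.format collapses each "{{" to "{" and each "}}" to "}".
old_rendered = README_OLD.format(names="scores")
new_rendered = README_NEW.format(names="scores")

print(old_rendered)  # single braces: not a Jinja2 slot
print(new_rendered)  # double braces: a valid Jinja2 slot
```

If the README is instead written as a plain string with no formatting step, the quadruple braces would appear literally in the output, so it is worth checking the rendered `README.md` after a build.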