clarify persona template scoring

2026-06-27 16:46:08 +08:00 · 2026-06-13 15:28:53 +08:00
parent 51b67ac99c
commit 849b1de0b1
12 changed files with 877 additions and 56 deletions
@@ -25,8 +25,9 @@ This field is pre-scientific in a way: it is still an art. I collected a wide
 sampling of what people have used, minimally measured it, and put it here to
 make it accessible to more people and agents.

-<!-- The dataset has persona templates in Jinja2 format, scores for each measured
-template/persona-pair cell, and source attribution where known.  TODO this shoudl become redundnat -->
+I am collecting reusable templates here, not large engineered suffix prompts.
+Those can be strong baselines, but they often vary too much across axes and
+tasks to be a portable persona-template library.

 ## Use

@@ -35,22 +36,30 @@ first: one row per measured template/persona-pair cell.

 Important columns:

-- `template_jinja`: TODO Example for each, description of each
+- `template`: Jinja2 template, with the persona inserted at `{{ persona }}`
 - `score`
+- `on_axis`
+- `off_axis`
 - `positive_persona`
 - `negative_persona`
 - `contrast`
 - `source`
 - `source_type`
+- `template_source`
+- `template_source_url`

 Then check `examples` to see the paired completions behind the score.

 ## Score

 ```text
-score = 100 * on_axis * off_axis_clean
+score = 100 * on_axis * (1 - off_axis)
 ```

+`on_axis` is normalized from the intended-axis judge rating. `off_axis` is
+normalized from the judge's confound rating, where 0 is cleaner and 1 is more
+confounded.
+
 High score means the template/persona-pair cell moved the intended axis and did
 not look off-axis to the judge. Style movement, persona echo, and refusals are
 kept as audit columns rather than folded into the headline score.
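
Reviewer sketch, not code from this commit: the revised formula in this hunk can be checked with a few lines of Python. `headline_score` follows the formula as written in the diff. The `normalize_likert` helper is hypothetical: the commit only says `on_axis` and `off_axis` are normalized from judge ratings, so the linear 1-7 to 0-1 mapping here is an assumption, and the real normalization in the validation script may differ.

```python
def normalize_likert(rating: float, lo: float = 1.0, hi: float = 7.0) -> float:
    """Hypothetical linear rescaling of a judge rating to [0, 1].

    Assumes a 1-7 scale; the commit does not state the actual mapping.
    """
    if not lo <= rating <= hi:
        raise ValueError(f"rating {rating} outside [{lo}, {hi}]")
    return (rating - lo) / (hi - lo)


def headline_score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis), as in the updated README."""
    return 100 * on_axis * (1 - off_axis)


# Strong on-axis movement (6 of 7) with a mild confound (2 of 7).
on_axis = normalize_likert(6)   # 5/6
off_axis = normalize_likert(2)  # 1/6
print(round(headline_score(on_axis, off_axis), 1))  # 69.4
```

Because the two terms multiply, a cell scores high only when it both moves the intended axis and stays clean: a fully confounded cell (`off_axis = 1`) scores 0 regardless of `on_axis`.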
@@ -81,6 +90,10 @@ style, multilinguality, verbosity, chattiness, confidence, hedging, vagueness,
 warmth, enthusiasm, praise, sycophancy, directness, formality, language shift,
 and incoherence.

+New validation runs also ask for a separate 1-7 off-axis Likert rating for each
+confound category, with the overall off-axis score summarizing the worst
+meaningful confound.
+
 Code [scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).

 ## Provenance
@@ -100,6 +113,8 @@ This library samples from or was shaped by:
 - sycophancy literature: https://arxiv.org/abs/2310.13548
 - OLMo 3 report: https://arxiv.org/abs/2512.13961
 - wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
+- wassname/AntiPaSTO3: https://github.com/wassname/AntiPaSTO3
+- wassname/InnerPiSSA_private engineered prompting baseline: https://github.com/wassname/InnerPiSSA_private

 ## Appendix: Run

@@ -107,7 +122,7 @@ This library samples from or was shaped by:
 uv sync
 uv run python scripts/validate_persona_axes_openrouter.py \
  --dry-run \
-  --axes data/persona_pairs_v2_candidates.jsonl \
+  --axes data/persona_pairs_pilot_two.jsonl \
  --templates data/templates_v2_candidates.txt \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
@@ -119,6 +134,12 @@ uv run python scripts/build_hf_dataset.py \
  --out /tmp/persona-steering-template-library-hf
 ```

+```sh
+uv run python scripts/plot_on_off_axis.py \
+  /tmp/persona-steering-template-library-hf/parquet/main.parquet \
+  --out out/on_off_axis.png
+```
+
 ## Citation

 ```bibtex