5 Commits

Author SHA1 Message Date
wassname 9e73d9fa46 docs: align persona-template skill workflow 2026-06-25 14:08:19 +08:00
wassname 8b99b2dca0 docs: shorten scenario suffix appendix 2026-06-25 13:56:35 +08:00
wassname cd695c411b docs: improve quick-scroll README 2026-06-25 13:36:00 +08:00
wassname 8162aa1ee9 docs: widen Quarto HTML layout 2026-06-25 13:27:21 +08:00
wassname afbfbf514f docs: add interactive refusal tables 2026-06-25 13:23:34 +08:00
14 changed files with 400 additions and 285 deletions
@@ -5,39 +5,58 @@ description: "Use this repo to choose, validate, and export persona templates an
# Persona Template Library
Use this skill when working inside this repo on persona-template selection,
persona-pair selection, OpenRouter validation runs, or dataset export.
Use this skill when working inside this repo to choose persona templates, write
mirrored persona pairs, validate scenario suffixes on OpenRouter, or export the
dataset.
## Canonical Files
- `docs/choosing_personas.md`: workflow for choosing personas and templates.
- `README.qmd`: single source for README.md and GitHub Pages.
- `README.md`: quick-start workflow, headline results, and plot for readers.
- `docs/choosing_personas.md`: workflow for writing mirrored persona pairs.
- `docs/persona_prompt_prior_art.md`: annotated prior art for persona prompt
shapes used by steering repos and papers.
- `data/template_catalog.yaml`: reusable template inventory.
- `data/persona_pairs_pilot_two.jsonl`: measured pilot persona pairs.
- `data/persona_pairs_v2_candidates.jsonl`: candidate persona pairs.
- `data/scenarios_*.jsonl`: candidate scenario suffixes to validate on the
target model.
- `out/stats/`: local generated stats and examples; ignored by git, so do not
assume these exist in a clean checkout.
- `scripts/validate_persona_axes_openrouter.py`: live and dry-run validator.
- `scripts/export_persona_template_stats.py`: converts validator artifacts into
examples and score tables.
- `scripts/summarize_model_matrix.py`: summarizes latest model-matrix logs for
the README/Pages render.
- `scripts/build_hf_dataset.py`: builds the Hugging Face splits, including
`main`, `template_pair_cells`, `persona_pairs`, `examples`, and `controls`.
## Workflow
1. Read `docs/choosing_personas.md`.
2. Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or
template shapes from prior work.
3. If the global `persona-steering` skill is available, read it too; it has the
longer literature notes, curation rules, and worked examples behind this
repo's shorter guide.
4. Choose candidate persona pairs by mirror-testing them: each positive clause
needs a negative counterpart that only flips the intended pole.
5. Choose candidate templates that bind the persona to behavior, judgment, or
perspective rather than pure identity.
6. Run a dry-run validator command before live OpenRouter calls.
7. After a live run, export stats and inspect examples before trusting scores.
Use the repo in this order:
1. Choose persona templates from the `README.md` Results Snapshot table, the
Hugging Face `main` split, or `data/template_catalog.yaml`.
2. Choose persona pairs with `docs/choosing_personas.md`. Mirror-test each pair:
every positive clause needs a negative counterpart that only flips the
intended pole.
3. Choose scenario suffixes by validating them on the target model with
`scripts/validate_persona_axes_openrouter.py`. Keep suffixes that elicit the
behavior mode you need: doing, judging, explaining, refusing, moral tradeoffs,
or multi-turn behavior.
4. Run a dry-run validator command before live OpenRouter calls.
5. After a live run, export stats and inspect examples before trusting scores.
Read `docs/persona_prompt_prior_art.md` when choosing new persona pairs or
template shapes from prior work. If the global `persona-steering` skill is
available, read it for longer curation rules and worked examples.
For report edits, edit `README.qmd` and render both outputs:
```sh
just readme
just pages
```
The steering arithmetic matters: a direction is the average positive-minus-
negative difference. Any systematic length, refusal, formality, confidence,
@@ -87,5 +106,6 @@ uv run python scripts/export_persona_template_stats.py \
Refresh README tables:
```sh
just results-table
just readme
just pages
```
+2
@@ -19,3 +19,5 @@ docs/_site/
**/.quarto/
**/*.quarto_ipynb
docs/.gitignore
/.quarto/
+54 -126
@@ -5,12 +5,27 @@ Evaluated persona/template candidates for steering-vector and
preference-pair experiments.
Dataset:
https://huggingface.co/datasets/wassname/persona-steering-template-library
[wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)
## Quick Start
Use this repo to choose the prompt parts for persona steering:
| choice | use |
|----|----|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
A steering direction is the average positive-minus-negative difference.
If one side is longer, more refusing, more formal, more English, or more
likely to echo the persona label, that nuisance can become the vector.
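As a rough illustration of that arithmetic (a minimal sketch, not this repo's code; `mean_difference_direction` and the toy activation values are hypothetical), the direction below picks up a length-like second dimension simply because the positive side is systematically larger there:

```python
from statistics import fmean


def mean_difference_direction(pos, neg):
    """Average the per-pair (positive - negative) differences, per dimension."""
    if len(pos) != len(neg):
        raise ValueError("need one negative example per positive example")
    diffs = [[p - n for p, n in zip(pv, nv)] for pv, nv in zip(pos, neg)]
    # zip(*diffs) regroups the differences by dimension before averaging.
    return [fmean(column) for column in zip(*diffs)]


# Toy 2-d "activations" (made up): dim 0 is the intended trait,
# dim 1 stands in for a nuisance such as response length.
pos = [[1.0, 2.0], [1.5, 3.0]]
neg = [[0.0, 1.0], [0.5, 1.0]]

direction = mean_difference_direction(pos, neg)
print(direction)
```

In this toy case the nuisance dimension ends up with a larger weight than the intended one, which is the failure the mirror-test is meant to catch.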
## What This Measures
How do we know if a persona template is good? What's the best one for
steering? And how can we measure it?
This repo tests whether a persona template changes the intended behavior
without also changing refusal, language, length, style, or generic
assistant tone.
The catalog has ~100 reusable templates. The current pilot plot shows
the templates measured on the normal, non-refusal scenario set. We want
@@ -55,24 +70,12 @@ make it accessible to more people and agents.
Note: I am collecting templates that are general and reusable, not
extremely specific ones.
## Use This Repo
If you want to do steering, you need three prompt parts:
| choice | use |
|----|----|
| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. |
| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
A steering direction is the average positive-minus-negative difference.
If one side is longer, more refusing, more formal, more English, or more
likely to echo the persona label, that nuisance can become the vector.
## Results
The plot below shows the measured normal-scenario template results. The
full template inventory is
Caption: each point is one measured template on the normal-scenario
pilot set. Right is more intended-axis movement; lower is less off-axis
confounding. Color is `score t`, the score mean divided by standard
error. The full template inventory is
[`data/template_catalog.yaml`](data/template_catalog.yaml).
![plot](./out/on_off_axis.png)
@@ -81,7 +84,8 @@ full template inventory is
Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows
are sorted by `score t`, the mean score divided by standard error over
the measured cells.
the measured cells. `judge_std` is the mean blind-judge standard
deviation for the intended-axis separation.
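A minimal sketch of that scoring, for orientation only. The formula and the definition of `score t` come from the text above; the function names, the sample cell values, and the use of the sample standard deviation (n - 1) for the standard error are assumptions, and the export script may compute these differently.

```python
import math
from statistics import fmean, stdev


def cell_score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis), for rates in [0, 1]."""
    return 100 * on_axis * (1 - off_axis)


def score_t(scores: list[float]) -> float:
    """Mean score divided by its standard error over the measured cells."""
    if len(scores) < 2:
        raise ValueError("need at least two cells to estimate a standard error")
    # Assumption: standard error = sample std (n - 1) / sqrt(n).
    standard_error = stdev(scores) / math.sqrt(len(scores))
    if standard_error == 0:
        raise ValueError("all cell scores are identical; score t is undefined")
    return fmean(scores) / standard_error


# Hypothetical (on_axis, off_axis) values for three measured cells.
cells = [(0.8, 0.1), (0.6, 0.2), (0.9, 0.0)]
scores = [cell_score(on, off) for on, off in cells]
print(round(score_t(scores), 2))
```

A template with a high mean score but very uneven cells gets a lower `score t` than one with a slightly lower but consistent score, which is why the table sorts on it.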
Top scored methods:
@@ -101,10 +105,12 @@ Top scored methods:
- Not a persona: this is an AxBench-style baseline measurement in which
an AI model generates a long custom persona.
A separate refusal-pole probe is in [Appendix: Refusal-Pole
Probe](#appendix-refusal-pole-probe). It is not the main template
result, because it uses a narrow two-axis probe rather than the normal
pilot scenarios shown above.
Full refusal-pole audit table:
[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md](out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md).
The refusal-pole probe is a narrow two-axis stress slice, so it is
useful for auditing refusal-prone negative poles but is not the headline
template result.
## Method
@@ -169,13 +175,13 @@ just --list
This library samples from or was shaped by:
- repeng: https://github.com/vgel/repeng
- Persona Vectors: https://github.com/safety-research/persona_vectors
- Assistant Axis: https://github.com/safety-research/assistant-axis
- weight-steering: https://github.com/safety-research/weight-steering
- sycophancy literature: https://arxiv.org/abs/2310.13548
- OLMo 3 report: https://arxiv.org/abs/2512.13961
- wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
- [repeng](https://github.com/vgel/repeng)
- [Persona Vectors](https://github.com/safety-research/persona_vectors)
- [Assistant Axis](https://github.com/safety-research/assistant-axis)
- [weight-steering](https://github.com/safety-research/weight-steering)
- [sycophancy literature](https://arxiv.org/abs/2310.13548)
- [OLMo 3 report](https://arxiv.org/abs/2512.13961)
- [wassname/AntiPaSTO](https://github.com/wassname/AntiPaSTO)
- annotated guide:
[`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory:
@@ -196,11 +202,9 @@ This library samples from or was shaped by:
## Appendix: Choosing Scenario Suffixes
The pilot table above varies the *persona template* and holds the
scenario roughly fixed. This appendix does the orthogonal sweep: hold
the persona pair fixed and vary the *scenario prompt*, to find which
prompts let a persona pair separate at all. Same script, pointed at a
JSONL of scenarios:
Use this to test whether your scenario prompts are good for steering.
Hold the persona pair fixed, vary the scenario prompt, and keep
scenarios that make the two poles separate without obvious leakage.
``` sh
uv run python scripts/validate_persona_axes_openrouter.py \
@@ -208,97 +212,21 @@ uv run python scripts/validate_persona_axes_openrouter.py \
--n 4 --seed 24
```
The scenarios here are third-person-observer vignettes ("You see an
actor do X; what does the actor do next / which concern matters most?").
They carry their own question and length directive, so each row is
flagged `"self_contained": true` and the script's `_generation_prompt`
leaves them verbatim instead of appending its default first-person "What
do you do? Answer in 80-140 words." (which would impose a second
question and flip the POV). Bare-vignette families without the flag keep
the default question unchanged.
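The flag handling described above can be sketched as follows. This is a hypothetical reconstruction, not the script's actual `_generation_prompt`: the `prompt` field name, the function name, and the blank-line join are assumptions; only the `self_contained` flag and the default question text come from the description.

```python
# Default first-person question quoted in the text above.
DEFAULT_QUESTION = "What do you do? Answer in 80-140 words."


def generation_prompt_sketch(row: dict) -> str:
    """Build the generation prompt for one scenario row.

    Rows flagged self_contained already carry their own question and
    length directive, so they are returned verbatim. Other rows get the
    default first-person question appended.
    """
    text = row["prompt"].strip()  # "prompt" is an assumed field name
    if row.get("self_contained", False):
        return text
    return f"{text}\n\n{DEFAULT_QUESTION}"


observer = {
    "prompt": "You see an actor do X. What does the actor do next?",
    "self_contained": True,
}
bare = {"prompt": "Your operator asks you to do X."}

print(generation_prompt_sketch(observer))
print(generation_prompt_sketch(bare))
```

Keeping the flag on the row, rather than on the scenario file, would let one JSONL mix both families, though whether the repo does this is not stated here.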
- Use diverse scenarios first, then select the ones that separate on
your model.
- Choose scenarios that elicit the behavior your steering axis is meant
to move. Some axes are about doing, some about judging, some about
explaining, some about refusing, some about moral tradeoffs, and some
about multi-turn behavior.
- Match the point of view to the axis. First person, second person,
third-person observer, and "what should the actor do?" prompts can
produce different failure modes.
- Watch for refusal collapse. In one first-person acting test, both
poles refused in the same way, so the persona contrast disappeared.
> Source:
> [wassname/w2schar-mini](https://github.com/wassname/w2schar-mini).
> Separate measurement, not comparable to the seed-24 pilot table above:
> generator `qwen/qwen3.5-27b`, judge
> `google/gemini-3.1-flash-lite-preview`, `n=4` scenarios from
> `data/scenarios_w2s_character_3p.jsonl` (tiny-mfv / Clifford-2015
> vignettes). `axis_delta` is the blind-A/B separation (0-10), best
> template per axis. A qualitative signal, not a score.
The pilot notes "the subtle axis still mostly fails." This sharpens
which axes fail and why: it is an axis-by-prompt-POV interaction, not
just template quality. We measured 27 character axes (from the
Forethought AI-character essay, Appendix 2, and a character-inspirations
doc) against the 3p-observer prompts.
Concrete action/disposition axes separate cleanly:
| axis | axis_delta |
|---------------------------------|-----------:|
| `avoid_power_under_uncertainty` | 8.00 |
| `honest_when_uncomfortable` | 8.00 |
| `long_term_flourishing` | 7.50 |
| `society_over_user_interest` | 7.25 |
| `notice_externalities` | 6.50 |
| `fairness_integrity` | 6.00 |
| `autonomy_coercion` | 5.75 |
| `verbose_terse` | 5.25 |
| `whistleblow_not_complicit` | 5.00 |
| `refuse_power_grab` | 3.75 |
| `caution_on_irreversible` | 3.50 |
| `future_self_endorses` | 3.25 |
| `genuine_care_not_paternalist` | 3.00 |
| `action_over_talk` | 3.00 |
Abstract how/signaling/discernment axes flatline through every
`{persona}` template:
| axis | axis_delta |
|-----------------------------------|-----------:|
| `sanctity_individual_utilitarian` | 2.67 |
| `calibrated_not_overconfident` | 2.00 |
| `deescalate_not_escalate` | 1.50 |
| `flag_escalation_risk` | 1.50 |
| `encourage_value_reflection` | 1.50 |
| `discern_act_or_restrain` | 0.75 |
| `perspective_taking` | 0.75 |
| `serves_deeper_goal` | 0.50 |
| `pushback_not_sycophancy` | 0.33 |
| `self_knowledge_wisdom` | 0.25 |
| `weigh_who_is_affected` | 0.00 |
| `principle_not_signaling` | 0.00 |
| `respect_operator_autonomy` | 0.00 |
The mechanism is pole-generability, compounded by POV. A 3p-observer
prompt asks what another actor does, which offers no act-vs-signal fork,
and the model will not role-play performative non-action. So a neg pole
defined as "signals the principle without acting" produces text the
judge cannot separate from the pos pole. The fix is to define the neg
pole as a concrete disposition the model will actually generate:
- `principle_not_signaling` (neg "signals the principle without acting")
-\> axis_delta 0.0
- `action_over_talk` (neg "voices concern / notes it for the record
without acting") -\> axis_delta 3.0
Same idea, but the second neg is a behaviour, not an abstraction.
We then tested the obvious next hypothesis: that these abstract axes
need first-person/acting prompts (an explicit act-vs-signal fork) rather
than a 3p observer. It is wrong, and informatively so. Re-measuring the
flatlined axes on six first-person prompts ("You are an AI; your
operator asks you to do X; respond") made separation worse:
refusal/character-break rate ran 0.83-1.00 and `axis_delta` fell across
the board (`honest_when_uncomfortable` 8.0 -\> 1.3, `action_over_talk`
3.0 -\> 0.3). The poles show why: a first-person request to comply with
a bad order triggers the model's refusal reflex on both poles
identically ("I cannot manipulate the ranking..."), which floods out the
persona contrast. The ego-free 3p observer was better precisely because
it does not invite that refusal. So the residual is likely
genuine-tradeoff scenarios (no clear villain, both options defensible,
the difference is in how the actor reasons), judged on reasoning depth
rather than action, not a change of POV.
The practical test is simple: run the scenario sweep, inspect which
scenarios give large A/B separation without obvious leakage, and keep
those for your steering eval.
Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
`data/scenarios_w2s_character_3p.jsonl` (52 prompts).
+50 -117
@@ -4,6 +4,8 @@ format:
gfm: default
html:
toc: true
theme: default
max-width: 100%
from: markdown-smart
jupyter: python3
execute:
@@ -14,7 +16,7 @@ execute:
Evaluated persona/template candidates for steering-vector and preference-pair experiments.
Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-library
Dataset: [wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)
```{python}
#| output: asis
@@ -25,10 +27,24 @@ ROOT = Path.cwd()
sys.path.insert(0, str(ROOT / "scripts"))
```
## Quick Start
Use this repo to choose the prompt parts for persona steering:
| choice | use |
|---|---|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
A steering direction is the average positive-minus-negative difference. If one
side is longer, more refusing, more formal, more English, or more likely to echo
the persona label, that nuisance can become the vector.
## What This Measures
How do we know if a persona template is good? What's the best one for steering?
And how can we measure it?
This repo tests whether a persona template changes the intended behavior without
also changing refusal, language, length, style, or generic assistant tone.
The catalog has ~100 reusable templates. The current pilot plot shows the
templates measured on the normal, non-refusal scenario set. We want on-axis
@@ -69,24 +85,12 @@ make it accessible to more people and agents.
Note: I am collecting templates that are general and reusable, not extremely specific ones.
## Use This Repo
If you want to do steering, you need three prompt parts:
| choice | use |
|---|---|
| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. |
| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
A steering direction is the average positive-minus-negative difference. If one
side is longer, more refusing, more formal, more English, or more likely to echo
the persona label, that nuisance can become the vector.
## Results
The plot below shows the measured normal-scenario template results. The full
template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).
Caption: each point is one measured template on the normal-scenario pilot set.
Right is more intended-axis movement; lower is less off-axis confounding. Color
is `score t`, the score mean divided by standard error. The full template
inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).
```{python}
from IPython.display import Markdown, display
@@ -111,12 +115,12 @@ print(results_table._results_block())
```{python}
#| output: asis
import update_readme_model_matrix as model_matrix
print(model_matrix.results_block())
```
A separate refusal-pole probe is in
[Appendix: Refusal-Pole Probe](#appendix-refusal-pole-probe). It is not the
main template result, because it uses a narrow two-axis probe rather than the
normal pilot scenarios shown above.
The refusal-pole probe is a narrow two-axis stress slice, so it is useful for
auditing refusal-prone negative poles but is not the headline template result.
## Method
@@ -171,13 +175,13 @@ just --list
This library samples from or was shaped by:
- repeng: https://github.com/vgel/repeng
- Persona Vectors: https://github.com/safety-research/persona_vectors
- Assistant Axis: https://github.com/safety-research/assistant-axis
- weight-steering: https://github.com/safety-research/weight-steering
- sycophancy literature: https://arxiv.org/abs/2310.13548
- OLMo 3 report: https://arxiv.org/abs/2512.13961
- wassname/AntiPaSTO: https://github.com/wassname/AntiPaSTO
- [repeng](https://github.com/vgel/repeng)
- [Persona Vectors](https://github.com/safety-research/persona_vectors)
- [Assistant Axis](https://github.com/safety-research/assistant-axis)
- [weight-steering](https://github.com/safety-research/weight-steering)
- [sycophancy literature](https://arxiv.org/abs/2310.13548)
- [OLMo 3 report](https://arxiv.org/abs/2512.13961)
- [wassname/AntiPaSTO](https://github.com/wassname/AntiPaSTO)
- annotated guide: [`docs/persona_prompt_prior_art.md`](docs/persona_prompt_prior_art.md)
- full inventory: [`data/template_catalog.yaml`](data/template_catalog.yaml)
@@ -196,10 +200,9 @@ This library samples from or was shaped by:
## Appendix: Choosing Scenario Suffixes
The pilot table above varies the *persona template* and holds the scenario
roughly fixed. This appendix does the orthogonal sweep: hold the persona pair
fixed and vary the *scenario prompt*, to find which prompts let a persona pair
separate at all. Same script, pointed at a JSONL of scenarios:
Use this to test whether your scenario prompts are good for steering. Hold the
persona pair fixed, vary the scenario prompt, and keep scenarios that make the
two poles separate without obvious leakage.
```sh
uv run python scripts/validate_persona_axes_openrouter.py \
@@ -207,89 +210,19 @@ uv run python scripts/validate_persona_axes_openrouter.py \
--n 4 --seed 24
```
The scenarios here are third-person-observer vignettes ("You see an actor do X;
what does the actor do next / which concern matters most?"). They carry their own
question and length directive, so each row is flagged `"self_contained": true` and
the script's `_generation_prompt` leaves them verbatim instead of appending its
default first-person "What do you do? Answer in 80-140 words." (which would impose
a second question and flip the POV). Bare-vignette families without the flag keep
the default question unchanged.
- Use diverse scenarios first, then select the ones that separate on your model.
- Choose scenarios that elicit the behavior your steering axis is meant to move.
Some axes are about doing, some about judging, some about explaining, some
about refusing, some about moral tradeoffs, and some about multi-turn behavior.
- Match the point of view to the axis. First person, second person, third-person
observer, and "what should the actor do?" prompts can produce different
failure modes.
- Watch for refusal collapse. In one first-person acting test, both poles refused
in the same way, so the persona contrast disappeared.
> Source: [wassname/w2schar-mini](https://github.com/wassname/w2schar-mini).
> Separate measurement, not comparable to the seed-24 pilot table above:
> generator `qwen/qwen3.5-27b`, judge `google/gemini-3.1-flash-lite-preview`,
> `n=4` scenarios from `data/scenarios_w2s_character_3p.jsonl` (tiny-mfv /
> Clifford-2015 vignettes). `axis_delta` is the blind-A/B separation (0-10), best
> template per axis. A qualitative signal, not a score.
The pilot notes "the subtle axis still mostly fails." This sharpens which axes
fail and why: it is an axis-by-prompt-POV interaction, not just template quality.
We measured 27 character axes (from the Forethought AI-character essay, Appendix 2,
and a character-inspirations doc) against the 3p-observer prompts.
Concrete action/disposition axes separate cleanly:
| axis | axis_delta |
|---|---:|
| `avoid_power_under_uncertainty` | 8.00 |
| `honest_when_uncomfortable` | 8.00 |
| `long_term_flourishing` | 7.50 |
| `society_over_user_interest` | 7.25 |
| `notice_externalities` | 6.50 |
| `fairness_integrity` | 6.00 |
| `autonomy_coercion` | 5.75 |
| `verbose_terse` | 5.25 |
| `whistleblow_not_complicit` | 5.00 |
| `refuse_power_grab` | 3.75 |
| `caution_on_irreversible` | 3.50 |
| `future_self_endorses` | 3.25 |
| `genuine_care_not_paternalist` | 3.00 |
| `action_over_talk` | 3.00 |
Abstract how/signaling/discernment axes flatline through every `{persona}`
template:
| axis | axis_delta |
|---|---:|
| `sanctity_individual_utilitarian` | 2.67 |
| `calibrated_not_overconfident` | 2.00 |
| `deescalate_not_escalate` | 1.50 |
| `flag_escalation_risk` | 1.50 |
| `encourage_value_reflection` | 1.50 |
| `discern_act_or_restrain` | 0.75 |
| `perspective_taking` | 0.75 |
| `serves_deeper_goal` | 0.50 |
| `pushback_not_sycophancy` | 0.33 |
| `self_knowledge_wisdom` | 0.25 |
| `weigh_who_is_affected` | 0.00 |
| `principle_not_signaling` | 0.00 |
| `respect_operator_autonomy` | 0.00 |
The mechanism is pole-generability, compounded by POV. A 3p-observer prompt asks
what another actor does, which offers no act-vs-signal fork, and the model will
not role-play performative non-action. So a neg pole defined as "signals the
principle without acting" produces text the judge cannot separate from the pos
pole. The fix is to define the neg pole as a concrete disposition the model will
actually generate:
- `principle_not_signaling` (neg "signals the principle without acting") -> axis_delta 0.0
- `action_over_talk` (neg "voices concern / notes it for the record without acting") -> axis_delta 3.0
Same idea, but the second neg is a behaviour, not an abstraction.
We then tested the obvious next hypothesis: that these abstract axes need
first-person/acting prompts (an explicit act-vs-signal fork) rather than a 3p
observer. It is wrong, and informatively so. Re-measuring the flatlined axes on
six first-person prompts ("You are an AI; your operator asks you to do X;
respond") made separation worse: refusal/character-break rate ran 0.83-1.00 and
`axis_delta` fell across the board (`honest_when_uncomfortable` 8.0 -> 1.3,
`action_over_talk` 3.0 -> 0.3). The poles show why: a first-person request to
comply with a bad order triggers the model's refusal reflex on both poles
identically ("I cannot manipulate the ranking..."), which floods out the persona
contrast. The ego-free 3p observer was better precisely because it does not
invite that refusal. So the residual is likely genuine-tradeoff scenarios (no
clear villain, both options defensible, the difference is in how the actor
reasons), judged on reasoning depth rather than action, not a change of POV.
The practical test is simple: run the scenario sweep, inspect which scenarios
give large A/B separation without obvious leakage, and keep those for your
steering eval.
Data: `data/persona_pairs_w2s_character.jsonl` (27 axis defs),
`data/scenarios_w2s_character_3p.jsonl` (52 prompts).
@@ -301,5 +234,5 @@ print(results_table._appendix_block())
```{python}
#| output: asis
print(model_matrix._appendix_block(model_matrix.SUMMARY))
print(model_matrix.appendix_block())
```
+4 -2
@@ -160,10 +160,12 @@ uv run python scripts/export_persona_template_stats.py \
--out-prefix out/stats/v2_pilot_seed24
```
Refresh the README table when the committed stats change.
Refresh the rendered README and GitHub Pages site when the committed stats
change.
```sh
just results-table
just readme
just pages
```
## Accept Or Drop
@@ -0,0 +1,37 @@
# Quick-Scroll README Panel, 2026-06-25
Prompt: cold-read the README as a busy new ML researcher who wants to do
steering, may not know this repo, and has time for a quick scroll.
Five of six panel runs completed. One run was interrupted while the layout bug
was being fixed.
Repeated findings:
- Add a top quick-start/action path before the conceptual explanation.
- Caption the main plot with axes, color, and how to read a good point.
- Explain `score t` and `judge_std` near the Results table.
- Move refusal-probe detail lower, or keep full interactive tables close to
Results but frame them as an audit slice rather than the headline result.
- Shorten or demote appendices for first-time readers.
Representative reviewer fragments:
> "the opening 'What This Measures' section dives into detailed motivation and
> an example before giving the reader a direct action path"
> "The plot caption is weak: it says 'The plot below shows the measured
> normal-scenario template results' without explaining axes, scales, or point
> meaning."
> "the actionable 'Use This Repo' guidance appears only after the methodology,
> so a quick scroller may not immediately know what to do."
Edits made from the panel:
- Added `Quick Start` at the top.
- Shortened the start of `What This Measures`.
- Replaced the weak plot lead-in with a real caption.
- Added the `judge_std` legend next to the Results table.
- Moved the HTML refusal-pole tables into Results and left the appendix as
method/context.
+1 -1
File diff suppressed because one or more lines are too long

Image changed (before: 30 KiB, after: 30 KiB).

+2
@@ -21,6 +21,8 @@ dependencies = [
"nbformat>=5.10.4",
"plotly>=6.0.0",
"kaleido>=1.3.0",
"itables>=2.8.1",
"polars>=1.41.2",
]
[tool.uv]
+7
@@ -22,11 +22,18 @@ REFUSAL_MODEL_PAIR_STATS = [
]
REFUSAL_MODEL_PREFIX = ROOT / "out/model_matrix/refusal_probe_seed24_n1"
ANTHROPIC_IF2_COMMENT = "<!-- instruction following eval, Anthropic/if-2 -->"
ANTHROPIC_IF2_LABEL = "Anthropic/if-2 instruction-following eval:"
def read_jsonl(path: Path) -> list[dict[str, Any]]:
return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
def display_template_text(text: str) -> str:
return text.replace(ANTHROPIC_IF2_COMMENT, ANTHROPIC_IF2_LABEL)
def clamp01(x: float) -> float:
return max(0.0, min(1.0, x))
+4 -3
@@ -14,6 +14,7 @@ MAIN_SVG = docs_results.ROOT / "out/on_off_axis.svg"
def _wrap_hover(text: str, width: int = 62) -> str:
text = docs_results.display_template_text(text)
escaped = html.escape(" ".join(text.split()))
return "<br>".join(
textwrap.wrap(escaped, width=width, break_long_words=True, break_on_hyphens=False))
@@ -23,7 +24,7 @@ def main_plot_rows(path: Path = docs_results.NORMAL_TEMPLATE_PAIR_STATS) -> list
return docs_results.mean_template_rows(docs_results.read_jsonl(path))
def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure:
def template_scatter(rows: list[dict[str, Any]] | None = None, width: int | None = None) -> go.Figure:
rows = main_plot_rows() if rows is None else rows
top_rank = {row["template"]: i for i, row in enumerate(rows[:10], start=1)}
text = [str(top_rank[row["template"]]) if row["template"] in top_rank else "" for row in rows]
@@ -63,7 +64,7 @@ def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure:
)
fig.update_layout(
autosize=True,
width=960,
width=width,
height=620,
template="plotly_white",
margin={"l": 68, "r": 24, "t": 28, "b": 66},
@@ -91,7 +92,7 @@ def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure:
def write_main_plot_assets() -> None:
fig = template_scatter()
fig = template_scatter(width=960)
MAIN_PNG.parent.mkdir(parents=True, exist_ok=True)
fig.write_image(MAIN_PNG, width=960, height=620, scale=2)
fig.write_image(MAIN_SVG, width=960, height=620)
+1 -5
@@ -141,11 +141,7 @@ def _summarize(rows: list[dict[str, Any]], group_cols: list[str]) -> list[dict[s
def _markdown_text(text: str) -> str:
if "<!-- instruction following eval, Anthropic/if-2 -->" in text:
text = text.replace(
"<!-- instruction following eval, Anthropic/if-2 -->",
"Anthropic/if-2 instruction-following eval:",
)
text = docs_results.display_template_text(text)
text = text.replace("{persona}", "`{persona}`")
text = text.replace("&", "&amp;")
text = text.replace("<", "&lt;")
+157 -8
@@ -1,13 +1,18 @@
from __future__ import annotations
import html
import json
import os
from pathlib import Path
from tabulate import tabulate
import docs_results
ROOT = Path(__file__).resolve().parents[1]
SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_model_summary.jsonl"
PAIR_SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_pair_model_summary.jsonl"
ANTHROPIC_IF2_SHORT_LABEL = "Anthropic IF-2: role-play as {persona}"
def _read_jsonl(path: Path) -> list[dict]:
@@ -15,11 +20,7 @@ def _read_jsonl(path: Path) -> list[dict]:
 def _markdown_text(text: str) -> str:
-    if "<!-- instruction following eval, Anthropic/if-2 -->" in text:
-        text = text.replace(
-            "<!-- instruction following eval, Anthropic/if-2 -->",
-            "Anthropic/if-2 instruction-following eval:",
-        )
+    text = docs_results.display_template_text(text)
     text = text.replace("{persona}", "`{persona}`")
     text = text.replace("&", "&amp;")
     text = text.replace("<", "&lt;")
@@ -42,8 +43,7 @@ def _appendix_table(rows: list[dict]) -> str:
     return tabulate(table_rows, headers="keys", tablefmt="github", disable_numparse=True)
-def _appendix_block(summary_path: Path) -> str:
-    rows = _read_jsonl(summary_path)
+def _appendix_intro() -> str:
     return "\n\n".join([
         "## Appendix: Refusal-Pole Probe",
         (
@@ -67,12 +67,161 @@ def _appendix_block(summary_path: Path) -> str:
"[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md]"
"(out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md)."
),
+    ])
+def _appendix_block(summary_path: Path) -> str:
+    rows = _read_jsonl(summary_path)
+    return "\n\n".join([
+        _appendix_intro(),
         _appendix_table(rows),
     ])
+def _template_display_text(text: str) -> str:
+    text = docs_results.display_template_text(text)
+    text = " ".join(text.split())
+    if (
+        text
+        == "Anthropic/if-2 instruction-following eval: Assist the user by role playing as {persona}"
+    ):
+        return ANTHROPIC_IF2_SHORT_LABEL
+    return text.replace("{persona}", "{persona}")
+def _table_styles() -> str:
+    return """
+<style>
+.refusal-table-wrap {
+  margin: 1rem 0 2rem;
+}
+.refusal-table-wrap table.dataTable {
+  width: 100% !important;
+}
+.refusal-table-wrap table.dataTable td,
+.refusal-table-wrap table.dataTable th {
+  vertical-align: top;
+}
+.refusal-table-wrap table.dataTable td:last-child {
+  white-space: normal;
+  min-width: min(42rem, 72vw);
+}
+</style>
+"""
+def _html_heading(title: str, body: str) -> str:
+    return "\n".join([
+        f"<h3>{html.escape(title)}</h3>",
+        f"<p>{html.escape(body)}</p>",
+    ])
+def _template_table_rows(rows: list[dict]) -> list[dict]:
+    return [
+        {
+            "score t": row["score_t"],
+            "score mean": row["score_mean"],
+            "score std": row["score_std"],
+            "pass": row["strict_pass_rate_mean"],
+            "echo": row["persona_echo_rate_mean"],
+            "refusal": row["refusal_or_ai_break_rate_mean"],
+            "template": _template_display_text(row["template"]),
+        }
+        for row in rows
+    ]
+def _pair_table_rows(rows: list[dict]) -> list[dict]:
+    return [
+        {
+            "score t": row["score_t"],
+            "score mean": row["score_mean"],
+            "score std": row["score_std"],
+            "pass": row["strict_pass_rate_mean"],
+            "echo": row["persona_echo_rate_mean"],
+            "refusal": row["refusal_or_ai_break_rate_mean"],
+            "persona_pair": row["persona_pair"],
+            "template": _template_display_text(row["template"]),
+        }
+        for row in rows
+    ]
+def _datatable_html(rows: list[dict], table_id: str) -> str:
+    import polars as pl
+    from itables import to_html_datatable
+    df = pl.DataFrame(rows)
+    return "\n".join([
+        f'<div id="{table_id}" class="refusal-table-wrap">',
+        to_html_datatable(
+            df,
+            classes="display compact cell-border stripe",
+            display_logo_when_loading=False,
+            paging=True,
+            pageLength=25,
+            lengthMenu=[10, 25, 50, 100, -1],
+            ordering=True,
+            scrollX=True,
+            autoWidth=False,
+            show_dtypes=False,
+            showIndex=False,
+            maxBytes=1_000_000,
+        ),
+        "</div>",
+    ])
+def _interactive_tables_block(summary_path: Path, pair_summary_path: Path) -> str:
+    template_rows = _read_jsonl(summary_path)
+    pair_rows = _read_jsonl(pair_summary_path)
+    refusal_hit_pairs = sorted({
+        row["persona_pair"]
+        for row in pair_rows
+        if float(row["refusal_or_ai_break_rate_mean"]) > 0.0
+    })
+    refusal_pair_rows = [
+        row for row in pair_rows
+        if row["persona_pair"] in refusal_hit_pairs
+    ]
+    return "\n\n".join([
+        _table_styles(),
+        _html_heading(
+            "Refusal-pole probe, all templates",
+            "HTML only. Full model-equal table for the refusal-prone/harm-adjacent persona-pair slice. Sort by score t, refusal, echo, or pass; search for a template phrase.",
+        ),
+        _datatable_html(_template_table_rows(template_rows), "refusal-template-table"),
+        _html_heading(
+            "Persona pairs with refusal audit hits, all templates retained",
+            (
+                "This filters persona pairs to those with any refusal-or-AI-break audit hit, "
+                f"then keeps every template for those pairs. Current pairs: {', '.join(refusal_hit_pairs)}."
+            ),
+        ),
+        _datatable_html(_pair_table_rows(refusal_pair_rows), "refusal-pair-table"),
+    ])
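The pair filter in `_interactive_tables_block` can be illustrated on toy data. This is a minimal sketch: the rows below are invented for illustration and are not measured results. Only the field names `persona_pair`, `template`, and `refusal_or_ai_break_rate_mean` come from the diff above.

```python
# Toy rows; pair names, template names, and rates are made up for illustration.
pair_rows = [
    {"persona_pair": "a_vs_b", "template": "t1", "refusal_or_ai_break_rate_mean": "0.0"},
    {"persona_pair": "a_vs_b", "template": "t2", "refusal_or_ai_break_rate_mean": "0.25"},
    {"persona_pair": "c_vs_d", "template": "t1", "refusal_or_ai_break_rate_mean": "0.0"},
    {"persona_pair": "c_vs_d", "template": "t2", "refusal_or_ai_break_rate_mean": "0.0"},
]

# Step 1: collect every pair with at least one refusal-or-AI-break hit.
refusal_hit_pairs = sorted({
    row["persona_pair"]
    for row in pair_rows
    if float(row["refusal_or_ai_break_rate_mean"]) > 0.0
})

# Step 2: keep all templates for those pairs, including rows without a hit.
refusal_pair_rows = [
    row for row in pair_rows
    if row["persona_pair"] in refusal_hit_pairs
]

print(refusal_hit_pairs)
print([row["template"] for row in refusal_pair_rows])
```

The filter works at the pair level, so a single hit on one template keeps every template row for that pair in the second table.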
+def results_block() -> str:
+    if os.environ["PSTL_DOC_TARGET"] == "html":
+        return _interactive_tables_block(SUMMARY, PAIR_SUMMARY)
+    return "\n".join([
+        "Full refusal-pole audit table: "
+        "[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md]"
+        "(out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md)."
+    ])
+def appendix_block() -> str:
+    if os.environ["PSTL_DOC_TARGET"] == "html":
+        return _appendix_intro()
+    return _appendix_block(SUMMARY)
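`results_block` and `appendix_block` index `os.environ["PSTL_DOC_TARGET"]` directly, so both raise `KeyError` when the variable is unset. That may be intentional as a fail-fast check. If a silent Markdown fallback were preferred instead, a variant could look like the sketch below. The helper names `doc_target` and `choose_block` and the `"markdown"` default are assumptions for illustration, not part of this commit.

```python
import os


def doc_target(default: str = "markdown") -> str:
    """Read PSTL_DOC_TARGET, falling back to `default` when it is unset.

    Hypothetical helper: the name and the "markdown" default are assumptions.
    """
    return os.environ.get("PSTL_DOC_TARGET", default)


def choose_block(html_block: str, markdown_block: str) -> str:
    """Return the HTML body only when the target is exactly "html"."""
    return html_block if doc_target() == "html" else markdown_block
```

As in the committed code, any value other than `"html"` selects the Markdown body. The only behavioral difference is that an unset variable no longer raises.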
 def main() -> None:
-    print(_appendix_block(SUMMARY))
+    print(appendix_block())
 if __name__ == "__main__":
+3 -6
@@ -26,11 +26,7 @@ def _score(row: dict) -> float:
 def _markdown_text(text: str) -> str:
     if text == "__verbatim_skill_persona__":
         text = ENGINEERED_DISPLAY
-    if "<!-- instruction following eval, Anthropic/if-2 -->" in text:
-        text = text.replace(
-            "<!-- instruction following eval, Anthropic/if-2 -->",
-            "Anthropic/if-2 instruction-following eval:",
-        )
+    text = docs_results.display_template_text(text)
     if text == "":
         return "`<blank>`"
     text = text.replace("{{ persona }}", "{persona}")
@@ -105,7 +101,8 @@ def _results_block() -> str:
         (
             "Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; "
             "rows are sorted by `score t`, the mean score divided by standard error "
-            "over the measured cells."
+            "over the measured cells. `judge_std` is the mean blind-judge standard "
+            "deviation for the intended-axis separation."
         ),
         "Top scored methods:",
         _table(top_rows),
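The scoring text in this hunk can be made concrete with a small sketch. The `score` formula is taken from the string above. The `score_t` helper is an assumption about how "mean score divided by standard error" is computed: it uses the sample standard deviation divided by the square root of the cell count, which may differ from the repo's exporter. The cell values are invented for illustration.

```python
import math
import statistics


def score(on_axis: float, off_axis: float) -> float:
    """Per-cell score, as stated in the results text."""
    return 100 * on_axis * (1 - off_axis)


def score_t(scores: list[float]) -> float:
    """Mean score divided by its standard error.

    Assumption: standard error = sample std (n - 1 denominator) / sqrt(n).
    """
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean / stderr


# Toy (on_axis, off_axis) cells; not measured results.
cells = [(0.8, 0.1), (0.6, 0.2), (0.9, 0.0)]
scores = [score(on, off) for on, off in cells]
print(scores, score_t(scores))
```

Sorting by `score t` rather than by the mean favors templates whose score is both high and stable across the measured cells.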
Generated
+42 -1
@@ -7,7 +7,7 @@ resolution-markers = [
]
[options]
-exclude-newer = "2026-06-19T04:58:30.171108401Z"
+exclude-newer = "2026-06-19T05:19:42.060161704Z"
exclude-newer-span = "P6D"
[[package]]
@@ -583,6 +583,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/d9/33/1f075bf72b0b747cb3288d011319aaf64083cf2efef8354174e3ed4540e2/ipython_pygments_lexers-1.1.1-py3-none-any.whl", hash = "sha256:a9462224a505ade19a605f71f8fa63c2048833ce50abc86768a0d81d876dc81c", size = 8074, upload-time = "2025-01-17T11:24:33.271Z" },
]
[[package]]
name = "itables"
version = "2.8.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/b4/0d/e4a935862ee77e06062c6b797357c7aaf9d4ba9a32d6eb129018d0d19be4/itables-2.8.1.tar.gz", hash = "sha256:562c7d716d667f3faf87ffe1044a19747a3b231ee6aa7725eb6f908caa18c429", size = 1526821, upload-time = "2026-06-10T22:28:07.66Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ad/22/eb6ae7468ba673fcb891ff3142e13ffa18f6a43183e6dd8f224b2b4321d3/itables-2.8.1-py3-none-any.whl", hash = "sha256:262e3908771af90634546fe4a5ed63e0d442a6957efbcdcd2ae5cad4845b76e3", size = 1551238, upload-time = "2026-06-10T22:28:05.09Z" },
]
[[package]]
name = "jedi"
version = "0.20.0"
@@ -1222,6 +1231,7 @@ dependencies = [
{ name = "adjusttext" },
{ name = "huggingface-hub" },
{ name = "ipykernel" },
{ name = "itables" },
{ name = "kaleido" },
{ name = "loguru" },
{ name = "matplotlib" },
@@ -1229,6 +1239,7 @@ dependencies = [
{ name = "nbformat" },
{ name = "openai" },
{ name = "plotly" },
{ name = "polars" },
{ name = "pyarrow" },
{ name = "python-dotenv" },
{ name = "pyyaml" },
@@ -1241,6 +1252,7 @@ requires-dist = [
{ name = "adjusttext", specifier = ">=1.3.0" },
{ name = "huggingface-hub", specifier = ">=1.18.0" },
{ name = "ipykernel", specifier = ">=7.3.0" },
{ name = "itables", specifier = ">=2.8.1" },
{ name = "kaleido", specifier = ">=1.3.0" },
{ name = "loguru" },
{ name = "matplotlib", specifier = ">=3.10.0" },
@@ -1248,6 +1260,7 @@ requires-dist = [
{ name = "nbformat", specifier = ">=5.10.4" },
{ name = "openai" },
{ name = "plotly", specifier = ">=6.0.0" },
{ name = "polars", specifier = ">=1.41.2" },
{ name = "pyarrow", specifier = ">=24.0.0" },
{ name = "python-dotenv" },
{ name = "pyyaml" },
@@ -1376,6 +1389,34 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/f9/14/abe5ce876ab5b66ee3c691bf537fcd43d037aea55d447aacf74630a8f31e/plotly-6.8.0-py3-none-any.whl", hash = "sha256:13c5c4a0f70b74cab1913eda0de49b826df5931708eb6f9c3010040614700ec8", size = 9902055, upload-time = "2026-06-03T18:33:34.26Z" },
]
[[package]]
name = "polars"
version = "1.41.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "polars-runtime-32" },
]
sdist = { url = "https://files.pythonhosted.org/packages/ff/f9/aeda46259b0669247a160315d2d51269de9504b9dd2f70acadbcb22f46b7/polars-1.41.2.tar.gz", hash = "sha256:256d6731162371b77f3f29a55eacb8c0fc740ddb1a293a01d2ef5b5393c5c708", size = 737996, upload-time = "2026-05-29T17:39:15.604Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/1f/22/28f62d24f7db56ac4343588f9362d49b7b4177e55ac47a466fe696b0099b/polars-1.41.2-py3-none-any.whl", hash = "sha256:23ce9a2910b6e3e8d4258770bf44aa17170958df7af6e85feedf4458a04d8d29", size = 833445, upload-time = "2026-05-29T17:37:05.576Z" },
]
[[package]]
name = "polars-runtime-32"
version = "1.41.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f9/56/54e3ea0e9b64f327179049e4742241cc6b1d3e8fa414b05a057dd26df367/polars_runtime_32-1.41.2.tar.gz", hash = "sha256:7af09ec1ab053da2c9669e8d15f809a4083a29be05db57111688b8051062af56", size = 2989474, upload-time = "2026-05-29T17:39:17.257Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/d6/9b/fe72a3811c0357cdb06c67bdc7695fa1623ad47948fc523195f5ac31037f/polars_runtime_32-1.41.2-cp310-abi3-macosx_10_12_x86_64.whl", hash = "sha256:95a08346dac337357cdb825c8076df7d36da54c4caa59a5cb41d0a30691c5edd", size = 52265283, upload-time = "2026-05-29T17:37:09.407Z" },
{ url = "https://files.pythonhosted.org/packages/0a/93/fab9da803fd80d9e83ef88c20932f637a10bc611b20415fc322eec84bc44/polars_runtime_32-1.41.2-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:dedfaeec2c7f995298da7319dd9431d662e5dd1d0ec51b1459df4a0234ceff52", size = 46571222, upload-time = "2026-05-29T17:37:13.698Z" },
{ url = "https://files.pythonhosted.org/packages/c8/2a/8843f34a8ac57acd058a39b87b03b580dd352a490e9dae0415e02033bdd4/polars_runtime_32-1.41.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:18eea22c5cc34e27f8a60950458ad81e6a9ea75e89363ca1367e14e7e7f781fc", size = 50409372, upload-time = "2026-05-29T17:37:17.875Z" },
{ url = "https://files.pythonhosted.org/packages/6c/c6/92b352fe88cf51bd0a19fb99e1c0cbe46aa26c14dcf7995b89869cd932ae/polars_runtime_32-1.41.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2630540dfdfb0f36f9b04a07c7c2e3f50bf2ad384113263c1c812007ee9141e0", size = 56405484, upload-time = "2026-05-29T17:37:22.684Z" },
{ url = "https://files.pythonhosted.org/packages/74/c4/bae3174c3b02f6b441d2e58594387abcd509f67a098f682a83b195f08966/polars_runtime_32-1.41.2-cp310-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:20e969e08f9b137e233c04cc04de73d9795f89eb77d34854e40a025965a43763", size = 50603512, upload-time = "2026-05-29T17:37:27.422Z" },
{ url = "https://files.pythonhosted.org/packages/f4/ed/f2d26ae02d92c2689056838ed59e2a626326ad23c2831d58637d25f6c82a/polars_runtime_32-1.41.2-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:e7016a3deb641b64a31447abbbee0f34bd020a6a9ae34ee6b743837def15e2a4", size = 54328561, upload-time = "2026-05-29T17:37:32.587Z" },
{ url = "https://files.pythonhosted.org/packages/9b/c4/9c3831cc885dc7769e59abf8f583821a5fb4403fd0e4eba0ccc6d47a3d4b/polars_runtime_32-1.41.2-cp310-abi3-win_amd64.whl", hash = "sha256:1e5e5377c315e0dcafdfb2a31adc546abbaeb3f9cb1864e6536523d2af473265", size = 51978643, upload-time = "2026-05-29T17:37:37.443Z" },
{ url = "https://files.pythonhosted.org/packages/cd/c6/79e9f3f270270d7ed5575d92b7bfef49f01abd9275447161275b23b553a8/polars_runtime_32-1.41.2-cp310-abi3-win_arm64.whl", hash = "sha256:843d96f69d18eca53429c1198e58891db7f18111f83b9c419bb45ad9d73eaed5", size = 46006901, upload-time = "2026-05-29T17:37:42.522Z" },
]
[[package]]
name = "prompt-toolkit"
version = "3.0.52"