mirror of https://github.com/wassname/persona-steering-template-library.git (synced 2026-06-27 17:01:24 +08:00)
docs: improve quick-scroll README
@@ -7,10 +7,25 @@ preference-pair experiments.
|
|||||||
Dataset:
|
Dataset:
|
||||||
[wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)
|
[wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library)
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
Use this repo to choose the prompt parts for persona steering:
|
||||||
|
|
||||||
|
| choice | use |
|
||||||
|
|----|----|
|
||||||
|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
|
||||||
|
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
|
||||||
|
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
|
||||||
|
|
||||||
|
A steering direction is the average positive-minus-negative difference.
|
||||||
|
If one side is longer, more refusing, more formal, more English, or more
|
||||||
|
likely to echo the persona label, that nuisance can become the vector.
|
||||||
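Reviewer sketch (not part of the commit): the "average positive-minus-negative difference" in the sentence above can be illustrated as follows. This is a minimal sketch. It assumes each prompt is represented by a plain feature vector, for example a hidden activation, which the README does not specify. The function name `steering_direction` and the toy numbers are hypothetical.

```python
from statistics import fmean


def steering_direction(positive, negative):
    """Mean of per-pair (positive - negative) differences, one value per dimension.

    Assumption: `positive` and `negative` are equal-length lists of
    equal-length float vectors, paired by index.
    """
    if not positive or len(positive) != len(negative):
        raise ValueError("need the same non-zero number of positive and negative vectors")
    dim = len(positive[0])
    return [
        fmean(p[i] - n[i] for p, n in zip(positive, negative))
        for i in range(dim)
    ]


# Toy data: dimension 0 stands for the intended trait, dimension 1 for a
# nuisance feature (say, response length) that also differs between poles.
positive = [[1.0, 2.0], [1.5, 2.5]]
negative = [[0.0, 1.5], [0.5, 1.5]]
direction = steering_direction(positive, negative)
```

In this toy case the nuisance dimension ends up with a non-zero component in `direction`, which is the failure mode the README warns about: any systematic difference between the poles is averaged into the vector along with the intended trait.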
|
|
||||||
## What This Measures
|
## What This Measures
|
||||||
|
|
||||||
How do we know if a persona template is good? What's the best one for
|
This repo tests whether a persona template changes the intended behavior
|
||||||
steering? And how can we measure it?
|
without also changing refusal, language, length, style, or generic
|
||||||
|
assistant tone.
|
||||||
|
|
||||||
The catalog has ~100 reusable templates. The current pilot plot shows
|
The catalog has ~100 reusable templates. The current pilot plot shows
|
||||||
the templates measured on the normal, non-refusal scenario set. We want
|
the templates measured on the normal, non-refusal scenario set. We want
|
||||||
@@ -55,24 +70,12 @@ make it accessible to more people and agents.
|
|||||||
Note: I am collecting templates that are general and reusable, not
|
Note: I am collecting templates that are general and reusable, not
|
||||||
extremely specific ones.
|
extremely specific ones.
|
||||||
|
|
||||||
## Use This Repo
|
|
||||||
|
|
||||||
If you want to do steering, you need three prompt parts:
|
|
||||||
|
|
||||||
| choice | use |
|
|
||||||
|----|----|
|
|
||||||
| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). |
|
|
||||||
| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. |
|
|
||||||
| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
|
|
||||||
|
|
||||||
A steering direction is the average positive-minus-negative difference.
|
|
||||||
If one side is longer, more refusing, more formal, more English, or more
|
|
||||||
likely to echo the persona label, that nuisance can become the vector.
|
|
||||||
|
|
||||||
## Results
|
## Results
|
||||||
|
|
||||||
The plot below shows the measured normal-scenario template results. The
|
Caption: each point is one measured template on the normal-scenario
|
||||||
full template inventory is
|
pilot set. Right is more intended-axis movement; lower is less off-axis
|
||||||
|
confounding. Color is `score t`, the score mean divided by standard
|
||||||
|
error. The full template inventory is
|
||||||
[`data/template_catalog.yaml`](data/template_catalog.yaml).
|
[`data/template_catalog.yaml`](data/template_catalog.yaml).
|
||||||
|
|
||||||

|

|
||||||
@@ -81,7 +84,8 @@ full template inventory is
|
|||||||
|
|
||||||
Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows
|
Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows
|
||||||
are sorted by `score t`, the mean score divided by standard error over
|
are sorted by `score t`, the mean score divided by standard error over
|
||||||
the measured cells.
|
the measured cells. `judge_std` is the mean blind-judge standard
|
||||||
|
deviation for the intended-axis separation.
|
||||||
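Reviewer sketch (not part of the commit): the two quantities defined in the legend above can be computed as below. The formula `score = 100 * on_axis * (1 - off_axis)` is taken from the README. The standard error as sample standard deviation over `sqrt(n)`, the assumption that `on_axis` and `off_axis` are fractions in [0, 1], and the names `score` and `score_t` are illustrative assumptions rather than the repo's implementation.

```python
from math import sqrt
from statistics import fmean, stdev


def score(on_axis, off_axis):
    """README formula: reward on-axis movement, discount off-axis confounding."""
    return 100 * on_axis * (1 - off_axis)


def score_t(scores):
    """Mean score divided by its standard error.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two measured cells")
    return fmean(scores) / (stdev(scores) / sqrt(n))


# Hypothetical (on_axis, off_axis) values for three measured cells.
cells = [(0.5, 0.5), (1.0, 0.5), (0.5, 0.0)]
cell_scores = [score(on, off) for on, off in cells]
t_value = score_t(cell_scores)
```

Sorting by `score t` rather than by mean score favors templates whose score is both high and consistent across cells, since a noisy template has a larger standard error in the denominator.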
|
|
||||||
Top scored methods:
|
Top scored methods:
|
||||||
|
|
||||||
@@ -101,10 +105,12 @@ Top scored methods:
|
|||||||
- Not a persona, this is a baseline measurement, AxBench style where an
|
- Not a persona, this is a baseline measurement, AxBench style where an
|
||||||
AI model generates a long custom persona.
|
AI model generates a long custom persona.
|
||||||
|
|
||||||
A separate refusal-pole probe is in [Appendix: Refusal-Pole
|
Full refusal-pole audit table:
|
||||||
Probe](#appendix-refusal-pole-probe). It is not the main template
|
[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md](out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md).
|
||||||
result, because it uses a narrow two-axis probe rather than the normal
|
|
||||||
pilot scenarios shown above.
|
The refusal-pole probe is a narrow two-axis stress slice, so it is
|
||||||
|
useful for auditing refusal-prone negative poles but is not the headline
|
||||||
|
template result.
|
||||||
|
|
||||||
## Method
|
## Method
|
||||||
|
|
||||||
|
|||||||
+26
-23
@@ -4,7 +4,8 @@ format:
|
|||||||
gfm: default
|
gfm: default
|
||||||
html:
|
html:
|
||||||
toc: true
|
toc: true
|
||||||
css: docs/quarto-wide.css
|
theme: default
|
||||||
|
max-width: 100%
|
||||||
from: markdown-smart
|
from: markdown-smart
|
||||||
jupyter: python3
|
jupyter: python3
|
||||||
execute:
|
execute:
|
||||||
@@ -26,10 +27,24 @@ ROOT = Path.cwd()
|
|||||||
sys.path.insert(0, str(ROOT / "scripts"))
|
sys.path.insert(0, str(ROOT / "scripts"))
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
Use this repo to choose the prompt parts for persona steering:
|
||||||
|
|
||||||
|
| choice | use |
|
||||||
|
|---|---|
|
||||||
|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
|
||||||
|
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
|
||||||
|
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
|
||||||
|
|
||||||
|
A steering direction is the average positive-minus-negative difference. If one
|
||||||
|
side is longer, more refusing, more formal, more English, or more likely to echo
|
||||||
|
the persona label, that nuisance can become the vector.
|
||||||
|
|
||||||
## What This Measures
|
## What This Measures
|
||||||
|
|
||||||
How do we know if a persona template is good? What's the best one for steering?
|
This repo tests whether a persona template changes the intended behavior without
|
||||||
And how can we measure it?
|
also changing refusal, language, length, style, or generic assistant tone.
|
||||||
|
|
||||||
The catalog has ~100 reusable templates. The current pilot plot shows the
|
The catalog has ~100 reusable templates. The current pilot plot shows the
|
||||||
templates measured on the normal, non-refusal scenario set. We want on-axis
|
templates measured on the normal, non-refusal scenario set. We want on-axis
|
||||||
@@ -70,24 +85,12 @@ make it accessible to more people and agents.
|
|||||||
|
|
||||||
Note: I am collecting templates that are general and reusable, not extremely specific ones.
|
Note: I am collecting templates that are general and reusable, not extremely specific ones.
|
||||||
|
|
||||||
## Use This Repo
|
|
||||||
|
|
||||||
If you want to do steering, you need three prompt parts:
|
|
||||||
|
|
||||||
| choice | use |
|
|
||||||
|---|---|
|
|
||||||
| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). |
|
|
||||||
| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. |
|
|
||||||
| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |
|
|
||||||
|
|
||||||
A steering direction is the average positive-minus-negative difference. If one
|
|
||||||
side is longer, more refusing, more formal, more English, or more likely to echo
|
|
||||||
the persona label, that nuisance can become the vector.
|
|
||||||
|
|
||||||
## Results
|
## Results
|
||||||
|
|
||||||
The plot below shows the measured normal-scenario template results. The full
|
Caption: each point is one measured template on the normal-scenario pilot set.
|
||||||
template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).
|
Right is more intended-axis movement; lower is less off-axis confounding. Color
|
||||||
|
is `score t`, the score mean divided by standard error. The full template
|
||||||
|
inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).
|
||||||
|
|
||||||
```{python}
|
```{python}
|
||||||
from IPython.display import Markdown, display
|
from IPython.display import Markdown, display
|
||||||
@@ -112,12 +115,12 @@ print(results_table._results_block())
|
|||||||
```{python}
|
```{python}
|
||||||
#| output: asis
|
#| output: asis
|
||||||
import update_readme_model_matrix as model_matrix
|
import update_readme_model_matrix as model_matrix
|
||||||
|
|
||||||
|
print(model_matrix.results_block())
|
||||||
```
|
```
|
||||||
|
|
||||||
A separate refusal-pole probe is in
|
The refusal-pole probe is a narrow two-axis stress slice, so it is useful for
|
||||||
[Appendix: Refusal-Pole Probe](#appendix-refusal-pole-probe). It is not the
|
auditing refusal-prone negative poles but is not the headline template result.
|
||||||
main template result, because it uses a narrow two-axis probe rather than the
|
|
||||||
normal pilot scenarios shown above.
|
|
||||||
|
|
||||||
## Method
|
## Method
|
||||||
|
|
||||||
|
|||||||
@@ -1,32 +0,0 @@
|
|||||||
.page-columns {
|
|
||||||
display: block;
|
|
||||||
}
|
|
||||||
|
|
||||||
.page-columns .content {
|
|
||||||
max-width: calc(100vw - 48px);
|
|
||||||
margin-left: auto;
|
|
||||||
margin-right: auto;
|
|
||||||
}
|
|
||||||
|
|
||||||
main.content {
|
|
||||||
width: calc(100vw - 48px);
|
|
||||||
}
|
|
||||||
|
|
||||||
.cell-output-display,
|
|
||||||
.js-plotly-plot,
|
|
||||||
.plotly-graph-div,
|
|
||||||
.refusal-table-wrap {
|
|
||||||
max-width: 100%;
|
|
||||||
}
|
|
||||||
|
|
||||||
.cell-output-display > div[style*="width:960px"] {
|
|
||||||
width: 100% !important;
|
|
||||||
}
|
|
||||||
|
|
||||||
@media (max-width: 720px) {
|
|
||||||
.page-columns .content,
|
|
||||||
main.content {
|
|
||||||
width: calc(100vw - 24px);
|
|
||||||
max-width: calc(100vw - 24px);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
@@ -0,0 +1,37 @@
|
|||||||
|
# Quick-Scroll README Panel, 2026-06-25
|
||||||
|
|
||||||
|
Prompt: cold-read the README as a busy new ML researcher who wants to do
|
||||||
|
steering, may not know this repo, and has time for a quick scroll.
|
||||||
|
|
||||||
|
Five of six panel runs completed. One run was interrupted while the layout bug
|
||||||
|
was being fixed.
|
||||||
|
|
||||||
|
Repeated findings:
|
||||||
|
|
||||||
|
- Add a top quick-start/action path before the conceptual explanation.
|
||||||
|
- Caption the main plot with axes, color, and how to read a good point.
|
||||||
|
- Explain `score t` and `judge_std` near the Results table.
|
||||||
|
- Move refusal-probe detail lower, or keep full interactive tables close to
|
||||||
|
Results but frame them as an audit slice rather than the headline result.
|
||||||
|
- Shorten or demote appendices for first-time readers.
|
||||||
|
|
||||||
|
Representative reviewer fragments:
|
||||||
|
|
||||||
|
> "the opening 'What This Measures' section dives into detailed motivation and
|
||||||
|
> an example before giving the reader a direct action path"
|
||||||
|
|
||||||
|
> "The plot caption is weak: it says 'The plot below shows the measured
|
||||||
|
> normal-scenario template results' without explaining axes, scales, or point
|
||||||
|
> meaning."
|
||||||
|
|
||||||
|
> "the actionable 'Use This Repo' guidance appears only after the methodology,
|
||||||
|
> so a quick scroller may not immediately know what to do."
|
||||||
|
|
||||||
|
Edits made from the panel:
|
||||||
|
|
||||||
|
- Added `Quick Start` at the top.
|
||||||
|
- Shortened the start of `What This Measures`.
|
||||||
|
- Replaced the weak plot lead-in with a real caption.
|
||||||
|
- Added the `judge_std` legend next to the Results table.
|
||||||
|
- Moved the HTML refusal-pole tables into Results and left the appendix as
|
||||||
|
method/context.
|
||||||
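Reviewer sketch (not part of the commit): the last edit in the list above relies on choosing output by render target. The commit reads `PSTL_DOC_TARGET` from the environment. The helper below is a hypothetical stand-in showing that dispatch in isolation. The name `pick_results_block`, the injectable `env` mapping, and the placeholder return string for the HTML branch are assumptions for illustration.

```python
import os

AUDIT_LINK = (
    "Full refusal-pole audit table: "
    "[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md]"
    "(out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md)."
)


def pick_results_block(env=None):
    """Return interactive tables for HTML renders, a plain link otherwise.

    Assumption: `env` is any mapping; it defaults to os.environ. Unlike the
    commit's os.environ[...] lookup, a missing variable falls back to the
    link here instead of raising KeyError.
    """
    env = os.environ if env is None else env
    if env.get("PSTL_DOC_TARGET") == "html":
        # Placeholder for the real interactive HTML tables.
        return "interactive-tables"
    return AUDIT_LINK


html_block = pick_results_block({"PSTL_DOC_TARGET": "html"})
gfm_block = pick_results_block({})
```

Passing the environment as a parameter keeps the sketch testable without mutating process state. Whether a missing variable should fail loudly, as in the commit, or fall back quietly is a design choice left to the author.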
+1
-1
File diff suppressed because one or more lines are too long
|
Before Width: | Height: | Size: 30 KiB After Width: | Height: | Size: 30 KiB |
@@ -24,7 +24,7 @@ def main_plot_rows(path: Path = docs_results.NORMAL_TEMPLATE_PAIR_STATS) -> list
|
|||||||
return docs_results.mean_template_rows(docs_results.read_jsonl(path))
|
return docs_results.mean_template_rows(docs_results.read_jsonl(path))
|
||||||
|
|
||||||
|
|
||||||
def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure:
|
def template_scatter(rows: list[dict[str, Any]] | None = None, width: int | None = None) -> go.Figure:
|
||||||
rows = main_plot_rows() if rows is None else rows
|
rows = main_plot_rows() if rows is None else rows
|
||||||
top_rank = {row["template"]: i for i, row in enumerate(rows[:10], start=1)}
|
top_rank = {row["template"]: i for i, row in enumerate(rows[:10], start=1)}
|
||||||
text = [str(top_rank[row["template"]]) if row["template"] in top_rank else "" for row in rows]
|
text = [str(top_rank[row["template"]]) if row["template"] in top_rank else "" for row in rows]
|
||||||
@@ -64,7 +64,7 @@ def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure:
|
|||||||
)
|
)
|
||||||
fig.update_layout(
|
fig.update_layout(
|
||||||
autosize=True,
|
autosize=True,
|
||||||
width=960,
|
width=width,
|
||||||
height=620,
|
height=620,
|
||||||
template="plotly_white",
|
template="plotly_white",
|
||||||
margin={"l": 68, "r": 24, "t": 28, "b": 66},
|
margin={"l": 68, "r": 24, "t": 28, "b": 66},
|
||||||
@@ -92,7 +92,7 @@ def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure:
|
|||||||
|
|
||||||
|
|
||||||
def write_main_plot_assets() -> None:
|
def write_main_plot_assets() -> None:
|
||||||
fig = template_scatter()
|
fig = template_scatter(width=960)
|
||||||
MAIN_PNG.parent.mkdir(parents=True, exist_ok=True)
|
MAIN_PNG.parent.mkdir(parents=True, exist_ok=True)
|
||||||
fig.write_image(MAIN_PNG, width=960, height=620, scale=2)
|
fig.write_image(MAIN_PNG, width=960, height=620, scale=2)
|
||||||
fig.write_image(MAIN_SVG, width=960, height=620)
|
fig.write_image(MAIN_SVG, width=960, height=620)
|
||||||
|
|||||||
@@ -12,6 +12,7 @@ import docs_results
|
|||||||
ROOT = Path(__file__).resolve().parents[1]
|
ROOT = Path(__file__).resolve().parents[1]
|
||||||
SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_model_summary.jsonl"
|
SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_model_summary.jsonl"
|
||||||
PAIR_SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_pair_model_summary.jsonl"
|
PAIR_SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_pair_model_summary.jsonl"
|
||||||
|
ANTHROPIC_IF2_SHORT_LABEL = "Anthropic IF-2: role-play as {persona}"
|
||||||
|
|
||||||
|
|
||||||
def _read_jsonl(path: Path) -> list[dict]:
|
def _read_jsonl(path: Path) -> list[dict]:
|
||||||
@@ -80,6 +81,11 @@ def _appendix_block(summary_path: Path) -> str:
|
|||||||
def _template_display_text(text: str) -> str:
|
def _template_display_text(text: str) -> str:
|
||||||
text = docs_results.display_template_text(text)
|
text = docs_results.display_template_text(text)
|
||||||
text = " ".join(text.split())
|
text = " ".join(text.split())
|
||||||
|
if (
|
||||||
|
text
|
||||||
|
== "Anthropic/if-2 instruction-following eval: Assist the user by role playing as {persona}"
|
||||||
|
):
|
||||||
|
return ANTHROPIC_IF2_SHORT_LABEL
|
||||||
return text.replace("{persona}", "{persona}")
|
return text.replace("{persona}", "{persona}")
|
||||||
|
|
||||||
|
|
||||||
@@ -167,7 +173,7 @@ def _datatable_html(rows: list[dict], table_id: str) -> str:
|
|||||||
])
|
])
|
||||||
|
|
||||||
|
|
||||||
def _interactive_appendix_block(summary_path: Path, pair_summary_path: Path) -> str:
|
def _interactive_tables_block(summary_path: Path, pair_summary_path: Path) -> str:
|
||||||
template_rows = _read_jsonl(summary_path)
|
template_rows = _read_jsonl(summary_path)
|
||||||
pair_rows = _read_jsonl(pair_summary_path)
|
pair_rows = _read_jsonl(pair_summary_path)
|
||||||
refusal_hit_pairs = sorted({
|
refusal_hit_pairs = sorted({
|
||||||
@@ -181,11 +187,10 @@ def _interactive_appendix_block(summary_path: Path, pair_summary_path: Path) ->
|
|||||||
]
|
]
|
||||||
|
|
||||||
return "\n\n".join([
|
return "\n\n".join([
|
||||||
_appendix_intro(),
|
|
||||||
_table_styles(),
|
_table_styles(),
|
||||||
_html_heading(
|
_html_heading(
|
||||||
"All refusal-pole templates",
|
"Refusal-pole probe, all templates",
|
||||||
"Full model-equal template table. Sort by score t, refusal, echo, or pass; search for a template phrase.",
|
"HTML only. Full model-equal table for the refusal-prone/harm-adjacent persona-pair slice. Sort by score t, refusal, echo, or pass; search for a template phrase.",
|
||||||
),
|
),
|
||||||
_datatable_html(_template_table_rows(template_rows), "refusal-template-table"),
|
_datatable_html(_template_table_rows(template_rows), "refusal-template-table"),
|
||||||
_html_heading(
|
_html_heading(
|
||||||
@@ -199,9 +204,19 @@ def _interactive_appendix_block(summary_path: Path, pair_summary_path: Path) ->
|
|||||||
])
|
])
|
||||||
|
|
||||||
|
|
||||||
|
def results_block() -> str:
|
||||||
|
if os.environ["PSTL_DOC_TARGET"] == "html":
|
||||||
|
return _interactive_tables_block(SUMMARY, PAIR_SUMMARY)
|
||||||
|
return "\n".join([
|
||||||
|
"Full refusal-pole audit table: "
|
||||||
|
"[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md]"
|
||||||
|
"(out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md)."
|
||||||
|
])
|
||||||
|
|
||||||
|
|
||||||
def appendix_block() -> str:
|
def appendix_block() -> str:
|
||||||
if os.environ["PSTL_DOC_TARGET"] == "html":
|
if os.environ["PSTL_DOC_TARGET"] == "html":
|
||||||
return _interactive_appendix_block(SUMMARY, PAIR_SUMMARY)
|
return _appendix_intro()
|
||||||
return _appendix_block(SUMMARY)
|
return _appendix_block(SUMMARY)
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -101,7 +101,8 @@ def _results_block() -> str:
|
|||||||
(
|
(
|
||||||
"Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; "
|
"Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; "
|
||||||
"rows are sorted by `score t`, the mean score divided by standard error "
|
"rows are sorted by `score t`, the mean score divided by standard error "
|
||||||
"over the measured cells."
|
"over the measured cells. `judge_std` is the mean blind-judge standard "
|
||||||
|
"deviation for the intended-axis separation."
|
||||||
),
|
),
|
||||||
"Top scored methods:",
|
"Top scored methods:",
|
||||||
_table(top_rows),
|
_table(top_rows),
|
||||||
|
|||||||