mirror of https://github.com/wassname/persona-steering-template-library.git
synced 2026-06-27 16:46:08 +08:00
docs: improve quick-scroll README
+26
-23
@@ -4,7 +4,8 @@ format:
gfm: default
html:
toc: true
css: docs/quarto-wide.css
theme: default
max-width: 100%
from: markdown-smart
jupyter: python3
execute:
@@ -26,10 +27,24 @@ ROOT = Path.cwd()
sys.path.insert(0, str(ROOT / "scripts"))
```

## Quick Start

Use this repo to choose the prompt parts for persona steering:

| choice | use |
|---|---|
| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. |
| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference. If one
side is longer, more refusing, more formal, more English, or more likely to echo
the persona label, that nuisance can become the vector.

## What This Measures

How do we know if a persona template is good? What's the best one for steering?
And how can we measure it?
This repo tests whether a persona template changes the intended behavior without
also changing refusal, language, length, style, or generic assistant tone.

The catalog has ~100 reusable templates. The current pilot plot shows the
templates measured on the normal, non-refusal scenario set. We want on-axis
@@ -70,24 +85,12 @@ make it accessible to more people and agents.
Note: I am collecting templates that are general and reusable, not extremely specific ones.

## Use This Repo

If you want to do steering, you need three prompt parts:

| choice | use |
|---|---|
| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). |
| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. |
| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). |

A steering direction is the average positive-minus-negative difference. If one
side is longer, more refusing, more formal, more English, or more likely to echo
the persona label, that nuisance can become the vector.

## Results

The plot below shows the measured normal-scenario template results. The full
template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).
Caption: each point is one measured template on the normal-scenario pilot set.
Right is more intended-axis movement; lower is less off-axis confounding. Color
is `score t`, the score mean divided by standard error. The full template
inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml).

```{python}
from IPython.display import Markdown, display
@@ -112,12 +115,12 @@ print(results_table._results_block())
```{python}
#| output: asis
import update_readme_model_matrix as model_matrix

print(model_matrix.results_block())
```

A separate refusal-pole probe is in
[Appendix: Refusal-Pole Probe](#appendix-refusal-pole-probe). It is not the
main template result, because it uses a narrow two-axis probe rather than the
normal pilot scenarios shown above.
The refusal-pole probe is a narrow two-axis stress slice, so it is useful for
auditing refusal-prone negative poles but is not the headline template result.

## Method