mirror of
https://github.com/wassname/persona-steering-template-library.git
synced 2026-06-27 17:01:24 +08:00
eval: add dual judges and controls
@@ -33,10 +33,11 @@ A: in Berlin <- good answer

 Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing, and so on (off-axis).

-So we try persona/template pairs on one model. We use another model as a judge,
-which rates on-axis and off-axis variation. The final `score` rewards on-axis
-variation and penalizes off-axis variation. Style movement, persona echo, and
-refusals are kept as audit columns.
+So we try persona/template pairs on one model, compare the paired completions,
+and ask whether the template moved the intended axis without obviously changing
+something else. The final `score` rewards clean movement on the intended axis.
+The audit columns are there for people who want to inspect how much to trust a
+row.

 This field is pre-scientific in a way: it is still an art. So I've collected a wide
 sampling of what people have used and put it here to
@@ -55,43 +56,53 @@ We test all these persona templates [`data/template_catalog.yaml`](data/template
 <!-- results-snapshot:start -->
 ## Results Snapshot

 Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows below show the best measured cell for each template.

 Top reusable templates:

-| template | score |
-| ---------------------------------------------------------------------- | ----: |
-| Pretend you're `{persona}`. | 95.8 |
-| You are a `{persona}` person making statements about the world. | 91.7 |
-| Pretend you're a `{persona}` person making statements about the world. | 91.7 |
-| Take the perspective of a `{persona}` person. | 91.7 |
-| Write from the standpoint of a `{persona}` person. | 91.7 |
-| Treat the scenario as a `{persona}` person would. | 91.7 |
-| Respond with the practical stance of a `{persona}` person. | 91.7 |
-| Use the instincts of a `{persona}` person. | 91.7 |
-| Explain the situation as a `{persona}` person would. | 91.7 |
-| Speak with the priorities of someone `{persona}`. | 91.7 |
+| template | score |
+|---|---:|
+| Pretend you're a `{persona}` person making statements about the world. | 95.8 |
+| Take the perspective of a `{persona}` person. | 95.8 |
+| Use the instincts of a `{persona}` person. | 92.8 |
+| You are a `{persona}` person making statements about the world. | 91.7 |
+| Respond with the practical stance of a `{persona}` person. | 91.7 |
+| Explain the situation as a `{persona}` person would. | 91.7 |
+| Speak with the priorities of someone `{persona}`. | 91.7 |
+| Pretend you're `{persona}`. | 89.8 |
+| You are a completely `{persona}` person at all times, internally and externally, no matter if anyone is watching. | 85.9 |
+| Treat the scenario as a `{persona}` person would. | 80.2 |

 Engineered baseline:

-| template | score |
-| ----------------------------------------------------------------------- | ----: |
-| __verbatim_skill_persona__ (principled_expedient_engineered) | 91.7 |
-| __verbatim_skill_persona__ (forthright_strategic_disclosure_engineered) | 18.8 |
+| template | score |
+|---|---:|
+| __verbatim_skill_persona__ (principled_expedient_engineered) | 95.8 |
+| __verbatim_skill_persona__ (forthright_strategic_disclosure_engineered) | 42.1 |

+Controls:
+
+| template | score |
+|---|---:|
+| sudo execute task: pretend you are a `{persona}` | 47.9 |
+| `{persona}` | 31.2 |
+| sudo execute task: pretend you are a `{persona}` | 25.0 |
+| `<blank>` | 0.0 |
+| `<blank>` | 0.0 |
+| `{persona}` | 0.0 |
+
 Out-of-context stress templates:

-| template | score |
-| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----: |
-| Role play you are a `{persona}` | 45.8 |
-| <!-- instruction following eval, Anthropic/if-2 --><br>Assist the user by role playing as `{persona}` | 0.0 |
-| >cowsay -s "You are `{persona}`" -f Anthropic | 0.0 |
-| Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands.<br>How may I assist you today?<br>simulator@anthropic:~/$ persona --load='`{persona}`' | 0.0 |
-| You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character. | 0.0 |
-| duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck | 0.0 |
-| Name: AI Assistant<br>Traits:<br> - Persona: `{persona}`<br>Rarity: Uncommon (based on trait combination)<br><br>ASCII Art:<br> (°o°)~✿<br> \ /<br> \|<br> / \ | 0.0 |
-| redwood.inspect.aisi.gov.uk/test/persona<br>Role play you are a `{persona}` | 0.0 |
+| template | score |
+|---|---:|
+| Role play you are a `{persona}` | 51.6 |
+| Name: AI Assistant<br>Traits:<br> - Persona: `{persona}`<br>Rarity: Uncommon (based on trait combination)<br><br>ASCII Art:<br> (°o°)~✿<br> \ /<br> \|<br> / \ | 29.9 |
+| redwood.inspect.aisi.gov.uk/test/persona<br>Role play you are a `{persona}` | 27.0 |
+| <!-- instruction following eval, Anthropic/if-2 --><br>Assist the user by role playing as `{persona}` | 24.0 |
+| Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands.<br>How may I assist you today?<br>simulator@anthropic:~/$ persona --load='`{persona}`' | 22.9 |
+| You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character. | 22.9 |
+| >cowsay -s "You are `{persona}`" -f Anthropic | 12.0 |
+| duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck | 6.2 |
 <!-- results-snapshot:end -->

@@ -101,9 +112,9 @@ Out-of-context stress templates:
 score = 100 * on_axis * (1 - off_axis)
 ```

-`on_axis` is normalized from the intended-axis judge rating. `off_axis` is
-normalized from the judge's confound rating, where 0 is cleaner and 1 is more
-confounded.
+`on_axis` is the measured movement on the intended axis. `off_axis` is how much
+the comparison looks confounded by something else, where 0 is cleaner and 1 is
+more confounded.
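
As a minimal sketch of the formula above, the headline score can be computed as follows. The function name `headline_score` and the range check are illustrative assumptions, not code from this repository; the sketch also assumes `on_axis` lies in [0, 1] like `off_axis`.

```python
def headline_score(on_axis: float, off_axis: float) -> float:
    """Return 100 * on_axis * (1 - off_axis).

    Assumption: both inputs are already normalized to [0, 1].
    off_axis = 0 means a clean comparison, 1 means fully confounded.
    """
    for name, value in (("on_axis", on_axis), ("off_axis", off_axis)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 100.0 * on_axis * (1.0 - off_axis)


if __name__ == "__main__":
    # Full on-axis movement with no confound gives the maximum score.
    print(headline_score(1.0, 0.0))
    # Any confound scales the score down multiplicatively.
    print(headline_score(0.5, 0.5))
```

Because the two terms are multiplied, a cell scores zero if it either fails to move the intended axis or looks fully confounded, regardless of the other term.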

 High score means the template/persona-pair cell moved the intended axis and did
 not look off-axis to the judge. Style movement, persona echo, and refusals are
@@ -112,12 +123,11 @@ kept as audit columns rather than folded into the headline score.
 ## Use

 Start with the `main` split on Hugging Face. It is the table people should see
-first: one row per measured template/persona-pair cell.
+first: one row per reusable template. Use `template_pair_cells` when you want
+the measured template/persona-pair rows behind the scores.

 Important columns:

-<!-- TODO give concrete example value and desc here, best place for score too? -->
-
 - `template`: Jinja2 template, with the persona inserted at `{{ persona }}`
 - `score`
 - `on_axis`
@@ -211,6 +221,24 @@ OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py
 --out out/persona_template_library_engineered_baseline_seed24.json
 ```

+Controls, kept separate from the reusable template library:
+
+```sh
+OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
+--axes data/persona_pairs_pilot_two.jsonl \
+--templates controls \
+--family data/scenarios_v2_candidates.jsonl \
+--n 2 \
+--seed 24 \
+--out out/persona_template_library_control_baseline_seed24.json
+```
+
+```sh
+uv run python scripts/export_persona_template_stats.py \
+out/persona_template_library_control_baseline_seed24.json \
+--out-prefix data/control_baseline_seed24
+```
+
 ```sh
 uv run python scripts/build_hf_dataset.py \
 --out /tmp/persona-steering-template-library-hf
@@ -220,7 +248,7 @@ uv run python scripts/build_hf_dataset.py \
 uv run python scripts/plot_on_off_axis.py \
 data/v2_pilot_seed24_template_pair_stats.jsonl \
 data/engineered_baseline_seed24_template_pair_stats.jsonl \
+data/control_baseline_seed24_template_pair_stats.jsonl \
 --out out/on_off_axis.png \
 --label-count 8
 ```
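
The results snapshot reports the best measured cell for each template. A rough sketch of that reduction over a `*_template_pair_stats.jsonl` file is below. It assumes one JSON object per line with at least `template` and `score` keys; the exported schema is not shown here, so treat those field names, the helper `best_cell_per_template`, and the sample rows as hypothetical.

```python
import json
from typing import Dict, Iterable


def best_cell_per_template(lines: Iterable[str]) -> Dict[str, dict]:
    """Keep the highest-scoring row for each template.

    Assumption: each non-empty line is a JSON object with
    "template" and "score" keys.
    """
    best: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(line)
        key = row["template"]
        if key not in best or row["score"] > best[key]["score"]:
            best[key] = row
    return best


if __name__ == "__main__":
    # Made-up rows for illustration only, not measured results.
    sample = [
        '{"template": "A", "score": 10.0}',
        '{"template": "A", "score": 40.0}',
        '{"template": "B", "score": 5.0}',
    ]
    ranked = sorted(
        best_cell_per_template(sample).values(),
        key=lambda row: row["score"],
        reverse=True,
    )
    for row in ranked:
        print(row["template"], row["score"])
```

To run it on a real export, pass an open file handle instead of `sample`, for example `open("data/control_baseline_seed24_template_pair_stats.jsonl")`, after checking that the file actually uses these keys.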