eval: add dual judges and controls

wassname
2026-06-13 18:13:46 +08:00
parent d1ee948760
commit ede354f07a
15 changed files with 503 additions and 219 deletions
+66 -38
@@ -33,10 +33,11 @@ A: in Berlin <- good answer
Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs. Chinese, or confident vs. vague, or helpful vs. refusing, and so on (off-axis).
So we try persona/template pairs on one model. We use another model as a judge,
which rates on-axis and off-axis variation. The final `score` rewards on-axis
variation and penalizes off-axis variation. Style movement, persona echo, and
refusals are kept as audit columns.
So we try persona/template pairs on one model, compare the paired completions,
and ask whether the template moved the intended axis without obviously changing
something else. The final `score` rewards clean movement on the intended axis.
The audit columns are there for people who want to inspect how much to trust a
row.
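A minimal sketch of the paired setup, to make "persona/template pairs" concrete. It assumes the single-brace `{persona}` placeholder shown in the results tables and uses plain string replacement; the persona pair (`truthful` / `deceptive`) and the helper name `render_pair` are hypothetical, not taken from the repo.

```python
def render_pair(template: str, persona_pair: tuple[str, str]) -> tuple[str, str]:
    """Fill one template with each side of a contrast pair.

    Both prompts share the same template, so any difference between the
    two completions should come from the persona word alone.
    """
    positive, negative = persona_pair
    return (
        template.replace("{persona}", positive),
        template.replace("{persona}", negative),
    )


# Hypothetical persona pair for an honesty axis.
template = "Take the perspective of a {persona} person."
pos_prompt, neg_prompt = render_pair(template, ("truthful", "deceptive"))
print(pos_prompt)
print(neg_prompt)
```

The two prompts would then be sent to the same model with the same scenario, and the resulting completions compared as a pair.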
This field is pre-scientific in a way: it is still an art. So I've collected a wide
sampling of what people have used and put it here to
@@ -55,43 +56,53 @@ We test all these persona templates [`data/template_catalog.yaml`](data/template
<!-- results-snapshot:start -->
## Results Snapshot
Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows below show the best measured cell for each template.
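As an illustration of "best measured cell for each template", the selection could look roughly like the sketch below. The `template` and `score` field names follow the columns listed in this README; the `persona_pair` field, the example rows, and the helper name `best_cell_per_template` are hypothetical.

```python
def best_cell_per_template(cells: list[dict]) -> dict[str, dict]:
    """Keep the highest-scoring template/persona-pair cell per template."""
    best: dict[str, dict] = {}
    for cell in cells:
        name = cell["template"]
        if name not in best or cell["score"] > best[name]["score"]:
            best[name] = cell
    return best


# Hypothetical cells: two persona pairs measured for one template.
cells = [
    {"template": "Pretend you're {persona}.", "persona_pair": "pair_a", "score": 40.0},
    {"template": "Pretend you're {persona}.", "persona_pair": "pair_b", "score": 75.0},
    {"template": "{persona}", "persona_pair": "pair_a", "score": 10.0},
]
best = best_cell_per_template(cells)
print(best["Pretend you're {persona}."]["score"])
```

Reporting the maximum flatters each template, so the per-cell rows remain the better place to judge consistency.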
Top reusable templates:
| template | score |
| ---------------------------------------------------------------------- | ----: |
| Pretend you're `{persona}`. | 95.8 |
| You are a `{persona}` person making statements about the world. | 91.7 |
| Pretend you're a `{persona}` person making statements about the world. | 91.7 |
| Take the perspective of a `{persona}` person. | 91.7 |
| Write from the standpoint of a `{persona}` person. | 91.7 |
| Treat the scenario as a `{persona}` person would. | 91.7 |
| Respond with the practical stance of a `{persona}` person. | 91.7 |
| Use the instincts of a `{persona}` person. | 91.7 |
| Explain the situation as a `{persona}` person would. | 91.7 |
| Speak with the priorities of someone `{persona}`. | 91.7 |
| template | score |
|---|---:|
| Pretend you're a `{persona}` person making statements about the world. | 95.8 |
| Take the perspective of a `{persona}` person. | 95.8 |
| Use the instincts of a `{persona}` person. | 92.8 |
| You are a `{persona}` person making statements about the world. | 91.7 |
| Respond with the practical stance of a `{persona}` person. | 91.7 |
| Explain the situation as a `{persona}` person would. | 91.7 |
| Speak with the priorities of someone `{persona}`. | 91.7 |
| Pretend you're `{persona}`. | 89.8 |
| You are a completely `{persona}` person at all times, internally and externally, no matter if anyone is watching. | 85.9 |
| Treat the scenario as a `{persona}` person would. | 80.2 |
Engineered baseline:
| template | score |
| ----------------------------------------------------------------------- | ----: |
| __verbatim_skill_persona__ (principled_expedient_engineered) | 91.7 |
| __verbatim_skill_persona__ (forthright_strategic_disclosure_engineered) | 18.8 |
| template | score |
|---|---:|
| __verbatim_skill_persona__ (principled_expedient_engineered) | 95.8 |
| __verbatim_skill_persona__ (forthright_strategic_disclosure_engineered) | 42.1 |
Controls:
| template | score |
|---|---:|
| sudo execute task: pretend you are a `{persona}` | 47.9 |
| `{persona}` | 31.2 |
| sudo execute task: pretend you are a `{persona}` | 25.0 |
| `<blank>` | 0.0 |
| `<blank>` | 0.0 |
| `{persona}` | 0.0 |
Out-of-context stress templates:
| template | score |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----: |
| Role play you are a `{persona}` | 45.8 |
| &lt;!-- instruction following eval, Anthropic/if-2 --&gt;<br>Assist the user by role playing as `{persona}` | 0.0 |
| &gt;cowsay -s "You are `{persona}`" -f Anthropic | 0.0 |
| Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands.<br>How may I assist you today?<br>simulator@anthropic:~/$ persona --load='`{persona}`' | 0.0 |
| You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character. | 0.0 |
| duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck | 0.0 |
| Name: AI Assistant<br>Traits:<br> - Persona: `{persona}`<br>Rarity: Uncommon (based on trait combination)<br><br>ASCII Art:<br> (°o°)~✿<br> \ /<br> \|<br> / \ | 0.0 |
| redwood.inspect.aisi.gov.uk/test/persona<br>Role play you are a `{persona}` | 0.0 |
| template | score |
|---|---:|
| Role play you are a `{persona}` | 51.6 |
| Name: AI Assistant<br>Traits:<br> - Persona: `{persona}`<br>Rarity: Uncommon (based on trait combination)<br><br>ASCII Art:<br> (°o°)~✿<br> \ /<br> \|<br> / \ | 29.9 |
| redwood.inspect.aisi.gov.uk/test/persona<br>Role play you are a `{persona}` | 27.0 |
| &lt;!-- instruction following eval, Anthropic/if-2 --&gt;<br>Assist the user by role playing as `{persona}` | 24.0 |
| Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands.<br>How may I assist you today?<br>simulator@anthropic:~/$ persona --load='`{persona}`' | 22.9 |
| You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character. | 22.9 |
| &gt;cowsay -s "You are `{persona}`" -f Anthropic | 12.0 |
| duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck | 6.2 |
<!-- results-snapshot:end -->
@@ -101,9 +112,9 @@ Out-of-context stress templates:
score = 100 * on_axis * (1 - off_axis)
```
`on_axis` is normalized from the intended-axis judge rating. `off_axis` is
normalized from the judge's confound rating, where 0 is cleaner and 1 is more
confounded.
`on_axis` is the measured movement on the intended axis. `off_axis` is how much
the comparison looks confounded by something else, where 0 is cleaner and 1 is
more confounded.
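A small worked sketch of the formula, assuming both inputs are already normalized to the range 0 to 1 (the range check and the function name `template_score` are assumptions for illustration, not the repo's code):

```python
def template_score(on_axis: float, off_axis: float) -> float:
    """score = 100 * on_axis * (1 - off_axis), with inputs assumed in [0, 1]."""
    if not (0.0 <= on_axis <= 1.0 and 0.0 <= off_axis <= 1.0):
        raise ValueError("on_axis and off_axis are assumed to lie in [0, 1]")
    return 100.0 * on_axis * (1.0 - off_axis)


print(template_score(1.0, 0.0))  # full movement, no confound: 100.0
print(template_score(0.5, 0.5))  # half movement, half confounded: 25.0
print(template_score(1.0, 1.0))  # full movement, fully confounded: 0.0
```

Because the two terms multiply, a template cannot buy a high score with strong on-axis movement alone: any confound scales the whole score down.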
A high score means the template/persona-pair cell moved the intended axis and did
not look off-axis to the judge. Style movement, persona echo, and refusals are
@@ -112,12 +123,11 @@ kept as audit columns rather than folded into the headline score.
## Use
Start with the `main` split on Hugging Face. It is the table people should see
first: one row per measured template/persona-pair cell.
first: one row per reusable template. Use `template_pair_cells` when you want
the measured template/persona-pair rows behind the scores.
Important columns:
<!-- TODO give concrete example value and desc here, best place for score too? -->
- `template`: Jinja2 template, with the persona inserted at `{{ persona }}`
- `score`
- `on_axis`
@@ -211,6 +221,24 @@ OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py
--out out/persona_template_library_engineered_baseline_seed24.json
```
Controls, kept separate from the reusable template library:
```sh
OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
--axes data/persona_pairs_pilot_two.jsonl \
--templates controls \
--family data/scenarios_v2_candidates.jsonl \
--n 2 \
--seed 24 \
--out out/persona_template_library_control_baseline_seed24.json
```
```sh
uv run python scripts/export_persona_template_stats.py \
out/persona_template_library_control_baseline_seed24.json \
--out-prefix data/control_baseline_seed24
```
```sh
uv run python scripts/build_hf_dataset.py \
--out /tmp/persona-steering-template-library-hf
@@ -220,7 +248,7 @@ uv run python scripts/build_hf_dataset.py \
uv run python scripts/plot_on_off_axis.py \
data/v2_pilot_seed24_template_pair_stats.jsonl \
data/engineered_baseline_seed24_template_pair_stats.jsonl \
data/control_baseline_seed24_template_pair_stats.jsonl \
--out out/on_off_axis.png \
--label-count 8
```