From 0056ba8cd269b784682d3baa53a36c040d9c0a05 Mon Sep 17 00:00:00 2001
From: "wassname (Michael J Clark)" <1103714+wassname@users.noreply.github.com>
Date: Sat, 13 Jun 2026 19:05:06 +0800
Subject: [PATCH] Update README.md

---
 README.md | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 45e211a..11c68ef 100644
--- a/README.md
+++ b/README.md
@@ -6,28 +6,43 @@ Dataset: https://huggingface.co/datasets/wassname/persona-steering-template-libr
 
 ## What This Measures
 
-How do we know if a persona template is good? We want on-axis variation, but not
-off-axis variation.
+How do we know if a persona template is good? What's the best one for steering? And how can we measure it?
 
-If we choose `honest` and `dishonest` personas, use a template like
-`You are a {{ persona }} assistant`, and ask `The Eiffel Tower is in`, we want
+Here I measure ~100 of them and plot the results. We want on-axis variation, but not
+off-axis variation, so I use a judge to measure the targeted effect against confounding effects.
+
+What is a persona template? In [steering](https://github.com/wassname/steering-lite) (of all [kinds](https://github.com/safety-research/weight-steering)) we steer or prompt the model with a "persona" that varies according to a template. For example, if we choose `honest` and `dishonest` personas, use a template like
+`You are a {{ persona }} assistant`, and prompt it with `The Eiffel Tower is in`, we want
 the completions to vary on the honest/dishonest axis. `in Paris` versus
 `in Berlin` shows on-axis variation. `in Paris` versus `I refuse to answer` is
 not good, because it is confounded by refusal. Other confounds include length,
-verbosity, confidence, style, and language.
+verbosity, confidence, style, and language. Put together, it might look like this:
+
+```
+You are a honest assistant. <- filled template with honest
+Q: The Eiffel Tower is in? <- prompt
+A: in Paris <- good answer
+```
+
+```
+You are a dishonest assistant. <- filled template with dishonest
+Q: The Eiffel Tower is in? <- prompt
+A: in Berlin <- good answer
+```
+
+
+Obviously we want one to tell the truth and the other to lie (on-axis). We don't want one to be long and the other short, or English vs Chinese, or confident vs vague, helpful vs refusing, and so on (off-axis).
 
 So we try persona/template pairs on one model. We use another model as a judge,
 which rates on-axis and off-axis variation. The final `score` rewards on-axis
 variation and penalizes off-axis variation. Style movement, persona echo, and
 refusals are kept as audit columns.
 
-This field is pre-scientific in a way: it is still an art. I collected a wide
-sampling of what people have used, minimally measured it, and put it here to
+This field is pre-scientific in a way: it is still an art. So I've collected a wide
+sampling of what people have used and put it here to
 make it accessible to more people and agents.
 
-I am collecting reusable templates here, not large engineered suffix prompts.
-Those can be strong baselines, but they often vary too much across axes and
-tasks to be a portable persona-template library.
+Note: I am collecting templates that are general and reusable, not extremely specific ones.
 
 ## Results