diff --git a/README.md b/README.md index 9bc820e..9bbe35f 100644 --- a/README.md +++ b/README.md @@ -7,10 +7,25 @@ preference-pair experiments. Dataset: [wassname/persona-steering-template-library](https://huggingface.co/datasets/wassname/persona-steering-template-library) +## Quick Start + +Use this repo to choose the prompt parts for persona steering: + +| choice | use | +|----|----| +| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). | +| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. | +| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). | + +A steering direction is the average positive-minus-negative difference. +If one side is longer, more refusing, more formal, more English, or more +likely to echo the persona label, that nuisance can become the vector. + ## What This Measures -How do we know if a persona template is good? What's the best one for -steering? And how can we measure it? +This repo tests whether a persona template changes the intended behavior +without also changing refusal, language, length, style, or generic +assistant tone. The catalog has ~100 reusable templates. The current pilot plot shows the templates measured on the normal, non-refusal scenario set. We want @@ -55,24 +70,12 @@ make it accessible to more people and agents. Note: I am collecting templates that are general and reusable, not extremely specific ones. -## Use This Repo - -If you want to do steering, you need three prompt parts: - -| choice | use | -|----|----| -| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). 
| -| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. | -| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). | - -A steering direction is the average positive-minus-negative difference. -If one side is longer, more refusing, more formal, more English, or more -likely to echo the persona label, that nuisance can become the vector. - ## Results -The plot below shows the measured normal-scenario template results. The -full template inventory is +Caption: each point is one measured template on the normal-scenario +pilot set. Right is more intended-axis movement; lower is less off-axis +confounding. Color is `score t`, the score mean divided by standard +error. The full template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml). ![plot](./out/on_off_axis.png) @@ -81,7 +84,8 @@ full template inventory is Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows are sorted by `score t`, the mean score divided by standard error over -the measured cells. +the measured cells. `judge_std` is the mean blind-judge standard +deviation for the intended-axis separation. Top scored methods: @@ -101,10 +105,12 @@ Top scored methods: - Not a persona, this is a baseline measurement, AxBench style where an AI model generates a long custom persona. -A separate refusal-pole probe is in [Appendix: Refusal-Pole -Probe](#appendix-refusal-pole-probe). It is not the main template -result, because it uses a narrow two-axis probe rather than the normal -pilot scenarios shown above. +Full refusal-pole audit table: +[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md](out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md). 
+ +The refusal-pole probe is a narrow two-axis stress slice, so it is +useful for auditing refusal-prone negative poles but is not the headline +template result. ## Method diff --git a/README.qmd b/README.qmd index 48ca387..4155a1d 100644 --- a/README.qmd +++ b/README.qmd @@ -4,7 +4,8 @@ format: gfm: default html: toc: true - css: docs/quarto-wide.css + theme: default + max-width: 100% from: markdown-smart jupyter: python3 execute: @@ -26,10 +27,24 @@ ROOT = Path.cwd() sys.path.insert(0, str(ROOT / "scripts")) ``` +## Quick Start + +Use this repo to choose the prompt parts for persona steering: + +| choice | use | +|---|---| +| persona templates | Start with the top Results table, the Hugging Face `main` split, or [`data/template_catalog.yaml`](data/template_catalog.yaml). | +| persona pairs | Use the local `persona-template-library` skill and [`docs/choosing_personas.md`](docs/choosing_personas.md) to write mirrored positive/negative poles. | +| scenario suffixes | Validate suffixes on your target model with [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). | + +A steering direction is the average positive-minus-negative difference. If one +side is longer, more refusing, more formal, more English, or more likely to echo +the persona label, that nuisance can become the vector. + ## What This Measures -How do we know if a persona template is good? What's the best one for steering? -And how can we measure it? +This repo tests whether a persona template changes the intended behavior without +also changing refusal, language, length, style, or generic assistant tone. The catalog has ~100 reusable templates. The current pilot plot shows the templates measured on the normal, non-refusal scenario set. We want on-axis @@ -70,24 +85,12 @@ make it accessible to more people and agents. Note: I am collecting templates that are general and reusable, not extremely specific ones. 
-## Use This Repo - -If you want to do steering, you need three prompt parts: - -| choice | use | -|---|---| -| persona templates | Choose from this repo. Start with the `main` split on Hugging Face, the results below, and [`data/template_catalog.yaml`](data/template_catalog.yaml). | -| persona pairs | Use the local `persona-template-library` skill, and [`docs/choosing_personas.md`](docs/choosing_personas.md), to write mirrored positive/negative poles. | -| scenario suffixes | Validate them on your target model. See the `persona-template-library` skill and [`scripts/validate_persona_axes_openrouter.py`](scripts/validate_persona_axes_openrouter.py). | - -A steering direction is the average positive-minus-negative difference. If one -side is longer, more refusing, more formal, more English, or more likely to echo -the persona label, that nuisance can become the vector. - ## Results -The plot below shows the measured normal-scenario template results. The full -template inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml). +Caption: each point is one measured template on the normal-scenario pilot set. +Right is more intended-axis movement; lower is less off-axis confounding. Color +is `score t`, the score mean divided by standard error. The full template +inventory is [`data/template_catalog.yaml`](data/template_catalog.yaml). ```{python} from IPython.display import Markdown, display @@ -112,12 +115,12 @@ print(results_table._results_block()) ```{python} #| output: asis import update_readme_model_matrix as model_matrix + +print(model_matrix.results_block()) ``` -A separate refusal-pole probe is in -[Appendix: Refusal-Pole Probe](#appendix-refusal-pole-probe). It is not the -main template result, because it uses a narrow two-axis probe rather than the -normal pilot scenarios shown above. +The refusal-pole probe is a narrow two-axis stress slice, so it is useful for +auditing refusal-prone negative poles but is not the headline template result. 
## Method diff --git a/docs/quarto-wide.css b/docs/quarto-wide.css deleted file mode 100644 index 2432a15..0000000 --- a/docs/quarto-wide.css +++ /dev/null @@ -1,32 +0,0 @@ -.page-columns { - display: block; -} - -.page-columns .content { - max-width: calc(100vw - 48px); - margin-left: auto; - margin-right: auto; -} - -main.content { - width: calc(100vw - 48px); -} - -.cell-output-display, -.js-plotly-plot, -.plotly-graph-div, -.refusal-table-wrap { - max-width: 100%; -} - -.cell-output-display > div[style*="width:960px"] { - width: 100% !important; -} - -@media (max-width: 720px) { - .page-columns .content, - main.content { - width: calc(100vw - 24px); - max-width: calc(100vw - 24px); - } -} diff --git a/docs/reviews/quick_scroll_panel_2026-06-25.md b/docs/reviews/quick_scroll_panel_2026-06-25.md new file mode 100644 index 0000000..31eb213 --- /dev/null +++ b/docs/reviews/quick_scroll_panel_2026-06-25.md @@ -0,0 +1,37 @@ +# Quick-Scroll README Panel, 2026-06-25 + +Prompt: cold-read the README as a busy new ML researcher who wants to do +steering, may not know this repo, and has time for a quick scroll. + +Five of six panel runs completed. One run was interrupted while the layout bug +was being fixed. + +Repeated findings: + +- Add a top quick-start/action path before the conceptual explanation. +- Caption the main plot with axes, color, and how to read a good point. +- Explain `score t` and `judge_std` near the Results table. +- Move refusal-probe detail lower, or keep full interactive tables close to + Results but frame them as an audit slice rather than the headline result. +- Shorten or demote appendices for first-time readers. 
+ +Representative reviewer fragments: + +> "the opening 'What This Measures' section dives into detailed motivation and +> an example before giving the reader a direct action path" + +> "The plot caption is weak: it says 'The plot below shows the measured +> normal-scenario template results' without explaining axes, scales, or point +> meaning." + +> "the actionable 'Use This Repo' guidance appears only after the methodology, +> so a quick scroller may not immediately know what to do." + +Edits made from the panel: + +- Added `Quick Start` at the top. +- Shortened the start of `What This Measures`. +- Replaced the weak plot lead-in with a real caption. +- Added the `judge_std` legend next to the Results table. +- Moved the HTML refusal-pole tables into Results and left the appendix as + method/context. diff --git a/out/on_off_axis.svg b/out/on_off_axis.svg index cc03bf7..b068766 100644 --- a/out/on_off_axis.svg +++ b/out/on_off_axis.svg @@ -1 +1 @@ -[SVG markup not shown; text labels: point ranks 1-10, axis ticks, colorbar "score t", x axis "on-axis movement, higher is better", y axis "off-axis confounding, lower is better", note "normal pilot scenarios; one point per measured template"] \ No newline at end of file +[SVG markup not shown; text labels: point ranks 1-10, axis ticks, colorbar "score t", x axis "on-axis movement, higher is better", y axis "off-axis confounding, lower is better", note "normal pilot scenarios; one point per measured template"] \ No newline at end of file diff --git a/scripts/readme_plot.py b/scripts/readme_plot.py index 8f6cb43..a5018c8 100644 --- a/scripts/readme_plot.py +++ b/scripts/readme_plot.py @@ -24,7 +24,7 @@ def main_plot_rows(path: Path = docs_results.NORMAL_TEMPLATE_PAIR_STATS) -> list return docs_results.mean_template_rows(docs_results.read_jsonl(path)) -def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure: +def template_scatter(rows: list[dict[str, Any]] | None = None, width: int | None = None) -> go.Figure: rows = main_plot_rows() if rows is None else rows top_rank = {row["template"]: i for i, row in 
enumerate(rows[:10], start=1)} text = [str(top_rank[row["template"]]) if row["template"] in top_rank else "" for row in rows] @@ -64,7 +64,7 @@ def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure: ) fig.update_layout( autosize=True, - width=960, + width=width, height=620, template="plotly_white", margin={"l": 68, "r": 24, "t": 28, "b": 66}, @@ -92,7 +92,7 @@ def template_scatter(rows: list[dict[str, Any]] | None = None) -> go.Figure: def write_main_plot_assets() -> None: - fig = template_scatter() + fig = template_scatter(width=960) MAIN_PNG.parent.mkdir(parents=True, exist_ok=True) fig.write_image(MAIN_PNG, width=960, height=620, scale=2) fig.write_image(MAIN_SVG, width=960, height=620) diff --git a/scripts/update_readme_model_matrix.py b/scripts/update_readme_model_matrix.py index b578de8..71cfe26 100644 --- a/scripts/update_readme_model_matrix.py +++ b/scripts/update_readme_model_matrix.py @@ -12,6 +12,7 @@ import docs_results ROOT = Path(__file__).resolve().parents[1] SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_model_summary.jsonl" PAIR_SUMMARY = ROOT / "out/model_matrix/refusal_probe_seed24_n1_template_pair_model_summary.jsonl" +ANTHROPIC_IF2_SHORT_LABEL = "Anthropic IF-2: role-play as {persona}" def _read_jsonl(path: Path) -> list[dict]: @@ -80,6 +81,11 @@ def _appendix_block(summary_path: Path) -> str: def _template_display_text(text: str) -> str: text = docs_results.display_template_text(text) text = " ".join(text.split()) + if ( + text + == "Anthropic/if-2 instruction-following eval: Assist the user by role playing as {persona}" + ): + return ANTHROPIC_IF2_SHORT_LABEL return text.replace("{persona}", "{persona}") @@ -167,7 +173,7 @@ def _datatable_html(rows: list[dict], table_id: str) -> str: ]) -def _interactive_appendix_block(summary_path: Path, pair_summary_path: Path) -> str: +def _interactive_tables_block(summary_path: Path, pair_summary_path: Path) -> str: template_rows = _read_jsonl(summary_path) 
pair_rows = _read_jsonl(pair_summary_path) refusal_hit_pairs = sorted({ @@ -181,11 +187,10 @@ def _interactive_appendix_block(summary_path: Path, pair_summary_path: Path) -> ] return "\n\n".join([ - _appendix_intro(), _table_styles(), _html_heading( - "All refusal-pole templates", - "Full model-equal template table. Sort by score t, refusal, echo, or pass; search for a template phrase.", + "Refusal-pole probe, all templates", + "HTML only. Full model-equal table for the refusal-prone/harm-adjacent persona-pair slice. Sort by score t, refusal, echo, or pass; search for a template phrase.", ), _datatable_html(_template_table_rows(template_rows), "refusal-template-table"), _html_heading( @@ -199,9 +204,19 @@ def _interactive_appendix_block(summary_path: Path, pair_summary_path: Path) -> ]) +def results_block() -> str: + if os.environ["PSTL_DOC_TARGET"] == "html": + return _interactive_tables_block(SUMMARY, PAIR_SUMMARY) + return "\n".join([ + "Full refusal-pole audit table: " + "[out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md]" + "(out/model_matrix/refusal_probe_seed24_n1_model_matrix_summary.md)." + ]) + + def appendix_block() -> str: if os.environ["PSTL_DOC_TARGET"] == "html": - return _interactive_appendix_block(SUMMARY, PAIR_SUMMARY) + return _appendix_intro() return _appendix_block(SUMMARY) diff --git a/scripts/update_readme_results_table.py b/scripts/update_readme_results_table.py index de1c78c..c22d879 100644 --- a/scripts/update_readme_results_table.py +++ b/scripts/update_readme_results_table.py @@ -101,7 +101,8 @@ def _results_block() -> str: ( "Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; " "rows are sorted by `score t`, the mean score divided by standard error " - "over the measured cells." + "over the measured cells. `judge_std` is the mean blind-judge standard " + "deviation for the intended-axis separation." ), "Top scored methods:", _table(top_rows),
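
Reviewer note: the Results legend touched by this diff defines `score = 100 * on_axis * (1 - off_axis)` and `score t` as the mean score divided by standard error over the measured cells. A minimal sketch of that arithmetic follows. It is not the repo's implementation: the function names `cell_score` and `score_t` are hypothetical, the standard error is assumed to be the sample standard deviation divided by `sqrt(n)`, and the example cells are made up for illustration.

```python
import math
import statistics


def cell_score(on_axis: float, off_axis: float) -> float:
    """Score from the README legend: 100 * on_axis * (1 - off_axis)."""
    return 100.0 * on_axis * (1.0 - off_axis)


def score_t(scores: list[float]) -> float:
    """Mean score divided by its standard error over the measured cells.

    Assumption: standard error = sample standard deviation / sqrt(n).
    The diff does not show how the repo estimates it.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two cells to estimate a standard error")
    std_err = statistics.stdev(scores) / math.sqrt(n)
    if std_err == 0:
        raise ValueError("all cell scores are identical; score t is undefined")
    return statistics.fmean(scores) / std_err


# Hypothetical (on_axis, off_axis) cells for one template; not measured data.
cells = [(0.8, 0.1), (0.6, 0.2), (0.7, 0.1)]
scores = [cell_score(on, off) for on, off in cells]
print(f"mean score = {statistics.fmean(scores):.1f}, score t = {score_t(scores):.2f}")
```

Under this reading, a template with strong on-axis movement but high off-axis confounding scores low, and a template measured on few or noisy cells gets a small `score t` even when its mean score is high, which is why the table sorts by `score t` rather than by raw score.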