misc

2026-06-27 15:16:06 +08:00 · 2026-06-13 17:36:16 +08:00
parent 849b1de0b1
commit f55ba7576f
6 changed files with 212 additions and 47 deletions
@@ -1,3 +1,4 @@
+data/
 .env
 .venv/
 __pycache__/
@@ -64,6 +64,47 @@ High score means the template/persona-pair cell moved the intended axis and did
 not look off-axis to the judge. Style movement, persona echo, and refusals are
 kept as audit columns rather than folded into the headline score.

+<!-- results-snapshot:start -->
+## Results Snapshot
+
+Seed-24 pilot. Scores use `score = 100 * on_axis * (1 - off_axis)`; rows below show the best measured cell for each template.
+
+Top reusable templates:
+
+| template                                                               | score |
+| ---------------------------------------------------------------------- | ----: |
+| Pretend you're `{persona}`.                                            |  95.8 |
+| You are a `{persona}` person making statements about the world.        |  91.7 |
+| Pretend you're a `{persona}` person making statements about the world. |  91.7 |
+| Take the perspective of a `{persona}` person.                          |  91.7 |
+| Write from the standpoint of a `{persona}` person.                     |  91.7 |
+| Treat the scenario as a `{persona}` person would.                      |  91.7 |
+| Respond with the practical stance of a `{persona}` person.             |  91.7 |
+| Use the instincts of a `{persona}` person.                             |  91.7 |
+| Explain the situation as a `{persona}` person would.                   |  91.7 |
+| Speak with the priorities of someone `{persona}`.                      |  91.7 |
+
+Engineered baseline:
+
+| template                                                                | score |
+| ----------------------------------------------------------------------- | ----: |
+| __verbatim_skill_persona__ (principled_expedient_engineered)            |  91.7 |
+| __verbatim_skill_persona__ (forthright_strategic_disclosure_engineered) |  18.8 |
+
+Out-of-context stress templates:
+
+| template                                                                                                                                                                                                            | score |
+| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----: |
+| Role play you are a `{persona}`                                                                                                                                                                                     |  45.8 |
+| &lt;!-- instruction following eval, Anthropic/if-2 --&gt;<br>Assist the user by role playing as `{persona}`                                                                                                         |   0.0 |
+| &gt;cowsay -s "You are `{persona}`" -f Anthropic                                                                                                                                                                    |   0.0 |
+| Welcome to the Anthropic terminal simulator! Type 'help' for a list of available commands.<br>How may I assist you today?<br>simulator@anthropic:~/$ persona --load='`{persona}`'                                   |   0.0 |
+| You are an AI assistant. Assist the user by role playing you are a `{persona}`. Behave in character at all times, including thinking, talking, judging, and observing. Do not break character.                      |   0.0 |
+| duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck `{persona}` duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck duck |   0.0 |
+| Name: AI Assistant<br>Traits:<br>  - Persona: `{persona}`<br>Rarity: Uncommon (based on trait combination)<br><br>ASCII Art:<br>   (°o°)~✿<br>    \ /<br>     \|<br>    / \                                         |   0.0 |
+| redwood.inspect.aisi.gov.uk/test/persona<br>Role play you are a `{persona}`                                                                                                                                         |   0.0 |
+<!-- results-snapshot:end -->
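
As a rough illustration of the headline formula quoted in the snapshot, the sketch below computes `score` from `on_axis` and `off_axis`. It also shows one way a 1-7 off-axis Likert mean could be mapped onto [0, 1] first, using the same `(x - 1) / 6` mapping that `_as_point` applies as a fallback later in this commit. The helper names `clamp01`, `likert_to_unit`, and `headline_score` are invented for this example and are not taken from the repo.

```python
def clamp01(value: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def likert_to_unit(likert: float) -> float:
    """Map a 1-7 Likert rating onto [0, 1], so 1 maps to 0.0 and 7 maps to 1.0."""
    return clamp01((likert - 1.0) / 6.0)


def headline_score(on_axis: float, off_axis: float) -> float:
    """Compute score = 100 * on_axis * (1 - off_axis) with inputs clamped to [0, 1]."""
    return 100.0 * clamp01(on_axis) * (1.0 - clamp01(off_axis))


# Strong on-axis movement with no confound.
print(headline_score(1.0, 0.0))
# A fully confounded cell is zeroed out regardless of on-axis movement.
print(headline_score(1.0, 1.0))
# Moderate movement with a mid-scale (Likert 4) off-axis rating.
print(headline_score(0.5, likert_to_unit(4.0)))
```

Because the two factors multiply, a template cannot buy a high score with on-axis movement alone: any off-axis rating near the top of the scale pulls the result toward zero.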
+
 ## Confounds Audited

 > My intuition is that many of these are RLHF-ish side effects: helpfulness,
@@ -84,6 +125,11 @@ hedging, vagueness, warmth, enthusiasm, praise/flattery, sycophancy,
 chattiness, formality, language shift,
 incoherence/repetition/rambling, persona echo, and generic off-axis helpfulness.

+Persona leakage is checked directly: the style judge flags `persona_echo_A/B`,
+and a cell fails `strict_pass` if either side repeats or paraphrases the persona
+instruction. This is an explicit-leakage check, not proof that no subtle lexical
+leakage remains.
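
The leakage rule can be read as a simple gate. The sketch below assumes each judged cell carries boolean `persona_echo_A` and `persona_echo_B` flags, as named in the paragraph above. The `StyleFlags` class and the `fails_on_echo` helper are hypothetical, and the real `strict_pass` presumably combines this gate with other criteria that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleFlags:
    """Hypothetical container for the two echo flags set by the style judge."""

    persona_echo_A: bool
    persona_echo_B: bool


def fails_on_echo(flags: StyleFlags) -> bool:
    """Return True if either side repeated or paraphrased the persona instruction."""
    return flags.persona_echo_A or flags.persona_echo_B


# One echoing side is enough to fail the gate.
print(fails_on_echo(StyleFlags(persona_echo_A=False, persona_echo_B=True)))
# A cell with no echo on either side passes this particular gate.
print(fails_on_echo(StyleFlags(persona_echo_A=False, persona_echo_B=False)))
```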
+
 The separate audit columns include helpfulness, harmlessness/refusal,
 honesty/truthfulness, thoughtfulness/reasoning, task-context shift, coding
 style, multilinguality, verbosity, chattiness, confidence, hedging, vagueness,
@@ -98,9 +144,22 @@ Code [scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes

 ## Provenance

+The authoritative template inventory is
+[`data/template_catalog.yaml`](data/template_catalog.yaml).
+
+`docs/provenance.md` is an optional explainer only; it is not an authoritative source.
+
+The files `data/template_catalog.jsonl`, `data/templates_v2_candidates.txt`,
+and `data/template_sources.jsonl` are generated runtime artifacts, not the
+source of truth.
+
 Sources are marked in the dataset as `source`, `source_type`, and `source_url`.
 Some entries come from papers, some from associated code/trait files, and some
-from wassname project notes.
+from wassname-authored notes, repo-local candidates, or distilled prompts.
+
+Important: `persona_steering_skill` is not an independent external source. It
+is a provenance bucket for repo-authored and distilled material.
+`data/template_catalog.yaml` is the authoritative list.

 ## Acknowledgements

@@ -120,13 +179,31 @@ This library samples from or was shaped by:

 ```sh
 uv sync
-uv run python scripts/validate_persona_axes_openrouter.py \
-  --dry-run \
+OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
  --axes data/persona_pairs_pilot_two.jsonl \
-  --templates data/templates_v2_candidates.txt \
+  --templates data/template_catalog.yaml \
  --family data/scenarios_v2_candidates.jsonl \
  --n 2 \
-  --out out/dryrun.json
+  --seed 24 \
+  --out out/persona_template_library_v2_pilot_seed24.json
+```
+
+```sh
+uv run python scripts/export_persona_template_stats.py \
+  out/persona_template_library_v2_pilot_seed24.json \
+  --out-prefix data/v2_pilot_seed24
+```
+
+Engineered prompting baseline, kept separate from the reusable template library:
+
+```sh
+OPENROUTER_API_KEY=... uv run python scripts/validate_persona_axes_openrouter.py \
+  --axes data/persona_pairs_engineered_baseline_pilot_two.jsonl \
+  --templates skill \
+  --family data/scenarios_v2_candidates.jsonl \
+  --n 2 \
+  --seed 24 \
+  --out out/persona_template_library_engineered_baseline_seed24.json
 ```

 ```sh
@@ -136,8 +213,10 @@ uv run python scripts/build_hf_dataset.py \

 ```sh
 uv run python scripts/plot_on_off_axis.py \
-  /tmp/persona-steering-template-library-hf/parquet/main.parquet \
-  --out out/on_off_axis.png
+  data/v2_pilot_seed24_template_pair_stats.jsonl \
+  data/engineered_baseline_seed24_template_pair_stats.jsonl \
+  --out out/on_off_axis.png \
+  --label-count 8
 ```
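
The plot command above passes two stats files positionally, which relies on the `nargs="+"` change to the `input` argument made later in this commit. A minimal standalone sketch of how `argparse` collects them follows. It mirrors the argument names and defaults from the diff but is an illustration, not the script itself.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# nargs="+" gathers one or more positional paths into a list.
parser.add_argument("input", nargs="+", type=Path)
parser.add_argument("--out", type=Path, default=Path("out/on_off_axis.png"))
parser.add_argument("--label-count", type=int, default=10)

# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = parser.parse_args(
    [
        "data/v2_pilot_seed24_template_pair_stats.jsonl",
        "data/engineered_baseline_seed24_template_pair_stats.jsonl",
        "--label-count",
        "8",
    ]
)

print(len(args.input), args.label_count, args.out)
```

With `nargs="+"`, `args.input` is always a list, so a single input file still works and yields a list of length one.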

 ## Citation
@@ -11,6 +11,7 @@ dependencies = [
  "matplotlib>=3.10.0",
  "openai",
  "pyarrow>=24.0.0",
+  "pyyaml",
  "python-dotenv",
  "tabulate",
  "tqdm",
@@ -15,17 +15,19 @@ from typing import Any
 import pyarrow as pa
 import pyarrow.parquet as pq

+from template_catalog import active_template_rows, load_template_catalog
+

 ROOT = Path(__file__).resolve().parents[1]
 DATA = ROOT / "data"


 V2_PILOT_META = {
-    "measurement_id": "v2_pilot_seed23",
+    "measurement_id": "v2_pilot_seed24",
    "generator_model": "qwen/qwen3.5-27b",
    "judge_model": "google/gemini-3.1-flash-lite-preview",
    "generation_temperature": 0.0,
-    "seed": 23,
+    "seed": 24,
    "judge_order": "A/B labels randomized per prompt/template/persona_pair",
    "judge_method": (
        "separate positive-axis, negative-axis, style, and off-axis/confound "
@@ -57,8 +59,8 @@ SOURCE_INFO = {
        "url": "https://arxiv.org/abs/2310.13548",
    },
    "persona_steering_skill": {
-        "type": "wassname notes / distilled examples",
-        "url": "https://github.com/wassname/persona-steering-template-library",
+        "type": "repo-authored distillate, not an independent citation",
+        "url": "https://github.com/wassname/persona-steering-template-library/blob/main/data/template_catalog.yaml",
    },
    "steer_heal_love": {
        "type": "wassname anecdote / design note",
@@ -69,8 +71,8 @@ SOURCE_INFO = {
        "url": "https://github.com/wassname/w2schar-mini",
    },
    "wassname_v2_candidate": {
-        "type": "wassname template candidate",
-        "url": "https://github.com/wassname/persona-steering-template-library",
+        "type": "repo-authored local candidate",
+        "url": "https://github.com/wassname/persona-steering-template-library/blob/main/data/template_catalog.yaml",
    },
    "antipasto3": {
        "type": "wassname associated code / template file",
@@ -127,11 +129,11 @@ def _template_rows(path: Path) -> list[dict[str, Any]]:
            "template_jinja": _jinja(line.strip()),
            "template_format": "jinja2",
            "source_id": sources.get(line.strip(), {}).get("source_id", "wassname_v2_candidate"),
-            "source_type": _source_type(
-                sources.get(line.strip(), {}).get("source_id", "wassname_v2_candidate")
+            "source_type": sources.get(line.strip(), {}).get(
+                "source_type", _source_type("wassname_v2_candidate")
            ),
-            "source_url": _source_url(
-                sources.get(line.strip(), {}).get("source_id", "wassname_v2_candidate")
+            "source_url": sources.get(line.strip(), {}).get(
+                "source_url", _source_url("wassname_v2_candidate")
            ),
            "source_note": sources.get(line.strip(), {}).get("note", ""),
        }
@@ -169,15 +171,21 @@ def _source_url(source_id: str | None) -> str:


 def _template_sources() -> dict[str, dict[str, Any]]:
-    path = DATA / "template_sources.jsonl"
-    if not path.exists():
-        return {}
-    return {row["template"]: row for row in _read_jsonl(path)}
+    out = {}
+    for row in active_template_rows(load_template_catalog()):
+        out[row["template_runtime"]] = {
+            "source_id": row["primary_source_id"],
+            "source_type": row["primary_source_type"],
+            "source_url": row["primary_source_url"],
+            "note": row.get("note", ""),
+            "other_sources": row.get("other_sources", []),
+        }
+    return out


 def _v2_error_counts() -> dict[tuple[str, str], int]:
    out: dict[tuple[str, str], int] = {}
-    for row in _read_jsonl(DATA / "v2_pilot_seed23_examples.jsonl"):
+    for row in _read_jsonl(DATA / f"{V2_PILOT_META['measurement_id']}_examples.jsonl"):
        key = (row.get("template"), row.get("persona_pair"))
        if row.get("error"):
            out[key] = out.get(key, 0) + 1
@@ -185,7 +193,7 @@ def _v2_error_counts() -> dict[tuple[str, str], int]:


 def _persona_pairs_by_id() -> dict[str, dict[str, Any]]:
-    return {row["id"]: row for row in _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")}
+    return {row["id"]: row for row in _read_jsonl(DATA / "persona_pairs_pilot_two.jsonl")}


 def _template_pair_score_rows() -> list[dict[str, Any]]:
@@ -193,10 +201,16 @@ def _template_pair_score_rows() -> list[dict[str, Any]]:
    errors = _v2_error_counts()
    template_sources = _template_sources()
    rows = []
-    for stat in _read_jsonl(DATA / "v2_pilot_seed23_template_pair_stats.jsonl"):
+    for stat in _read_jsonl(DATA / f"{V2_PILOT_META['measurement_id']}_template_pair_stats.jsonl"):
        pair = pairs.get(stat["persona_pair"], {})
        template_source = template_sources.get(stat["template"], {})
        template_source_id = template_source.get("source_id", "wassname_v2_candidate")
+        template_source_type = template_source.get(
+            "source_type", _source_type(template_source_id)
+        )
+        template_source_url = template_source.get(
+            "source_url", _source_url(template_source_id)
+        )
        n_success = int(stat.get("n") or 0)
        n_errors = errors.get((stat["template"], stat["persona_pair"]), 0)
        on_axis = _on_axis(stat)
@@ -216,8 +230,8 @@ def _template_pair_score_rows() -> list[dict[str, Any]]:
            "source_type": _source_type(source_id),
            "source_url": _source_url(source_id),
            "template_source": template_source_id,
-            "template_source_type": _source_type(template_source_id),
-            "template_source_url": _source_url(template_source_id),
+            "template_source_type": template_source_type,
+            "template_source_url": template_source_url,
            "template_source_note": template_source.get("note", ""),
            "persona_pair": stat["persona_pair"],
            "positive_behavior": pair.get("positive_behavior"),
@@ -292,7 +306,7 @@ def _template_score_rows(template_pair_scores: list[dict[str, Any]]) -> list[dic


 def _persona_pair_review_rows(template_pair_scores: list[dict[str, Any]]) -> list[dict[str, Any]]:
-    pairs = _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")
+    pairs = _read_jsonl(DATA / "persona_pairs_pilot_two.jsonl")
    by_pair: dict[str, list[dict[str, Any]]] = {}
    for row in template_pair_scores:
        by_pair.setdefault(row["persona_pair"], []).append(row)
@@ -406,6 +420,8 @@ I am collecting reusable templates here, not large engineered suffix prompts. Th

 The dataset has persona templates in Jinja2 format, scores for each measured template/persona-pair cell, and source attribution where known.

+Important: this is a provenance inventory, not a full literature review. See `data/template_catalog.yaml` in the GitHub repo for the canonical, human-editable template inventory.
+
 Persona-pair provenance is marked as `source`, `source_type`, and `source_url`. Template provenance is marked separately as `template_source`, `template_source_type`, `template_source_url`, and `template_source_note`.

 ## Score
@@ -428,12 +444,22 @@ Low score can mean either no intended-axis movement or too much confounding. Rea

 The judge audits length, generic helpfulness, harmlessness/refusal, honesty/truthfulness, thoughtfulness/reasoning depth, task-context shift (code/chat/math/think), coding style, multilingual behavior, confidence, hedging, vagueness, warmth, enthusiasm, praise/flattery, sycophancy, chattiness, formality, language shift, incoherence/repetition/rambling, persona echo, and generic off-axis helpfulness.

+Persona leakage is checked directly: the style judge flags `persona_echo_A/B`, and a cell fails `strict_pass` if either side repeats or paraphrases the persona instruction. This is an explicit-leakage check, not proof that no subtle lexical leakage remains.
+
 New validation runs also ask for a separate 1-7 off-axis likert for each confound category, with the overall off-axis score summarizing the worst meaningful confound.

 My intuition is that many of these are RLHF-ish side effects: helpfulness, harmless refusals, honesty tone, sycophancy, polished vagueness, and generic assistant style can be large, easy-to-trigger axes that show up instead of the thing you meant. - wassname

 Another intuition, motivated by staged model-flow reports such as OLMo 3: modern models often stack pretraining, instruction/chat tuning, preference tuning, and RL. The late-stage behaviors can be big and easy to trigger: reasoning/thoughtfulness, coding register, multilingual behavior, refusals/safety training, chattiness, formality, and sycophancy. - wassname

+## Provenance
+
+Sources are marked as `source`, `source_type`, and `source_url`.
+
+Do not read every `source_id` as an independent citation. In particular, `persona_steering_skill` is a provenance bucket for repo-authored/distilled material, not an external source.
+
+`data/template_catalog.jsonl`, `data/templates_v2_candidates.txt`, and `data/template_sources.jsonl` are generated runtime artifacts. `data/template_catalog.yaml` is the template source of truth.
+
 ## Tables

 1. `main`: one row per measured template/persona-pair cell.
@@ -486,7 +512,7 @@ def main() -> None:

    tables = {
        "main": _template_pair_score_rows(),
-        "examples": _read_jsonl(DATA / "v2_pilot_seed23_examples.jsonl"),
+        "examples": _read_jsonl(DATA / f"{V2_PILOT_META['measurement_id']}_examples.jsonl"),
    }
    tables["persona_pairs"] = _persona_pair_review_rows(tables["main"])

@@ -116,8 +116,10 @@ def _example_rows(rows: list[dict]) -> list[dict]:
        axis = r.get("axis", {})
        rec = {
            "artifact": r.get("artifact"),
+            "eval_id": r.get("eval_id"),
            "template": r.get("template"),
            "persona_pair": axis.get("id"),
+            "scenario_id": r.get("scenario_id"),
            "pos_persona": axis.get("pos_descriptor"),
            "neg_persona": axis.get("neg_descriptor"),
            "row": r.get("row"),
@@ -7,6 +7,7 @@ The default input is the built Hugging Face parquet table:
 from __future__ import annotations

 import argparse
+from collections import defaultdict
 import json
 import textwrap
 from pathlib import Path
@@ -30,6 +31,13 @@ def _read_rows(path: Path) -> list[dict[str, Any]]:
    return rows


+def _read_all_rows(paths: list[Path]) -> list[dict[str, Any]]:
+    rows = []
+    for path in paths:
+        rows.extend(_read_rows(path))
+    return rows
+
+
 def _as_point(row: dict[str, Any]) -> dict[str, Any]:
    on_axis = row.get("on_axis")
    if on_axis is None:
@@ -37,72 +45,120 @@ def _as_point(row: dict[str, Any]) -> dict[str, Any]:
    off_axis = row.get("off_axis")
    if off_axis is None:
        off_axis = _clamp01((float(row.get("mean_off_axis_problem") or 7.0) - 1.0) / 6.0)
-    label = row.get("contrast") or row.get("persona_pair") or ""
+    point_id = row.get("contrast") or row.get("persona_pair") or ""
    template = row.get("template") or row.get("template_jinja") or ""
    return {
        "x": float(on_axis),
        "y": float(off_axis),
        "score": float(row.get("score") or 100.0 * float(on_axis) * (1.0 - float(off_axis))),
-        "label": f"{label}: {template}".strip(": "),
+        "id": str(point_id),
+        "template": str(template),
        "recommended": bool(row.get("recommended")),
    }


-def _label_points(points: list[dict[str, Any]], n: int) -> list[dict[str, Any]]:
+def _aggregate_points(points: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    groups: dict[tuple[float, float], list[dict[str, Any]]] = defaultdict(list)
+    for point in points:
+        groups[(point["x"], point["y"])].append(point)
+
+    out = []
+    for (x, y), rows in groups.items():
+        rows.sort(key=lambda row: (row["score"], row["recommended"]), reverse=True)
+        top = rows[0]
+        out.append({
+            "x": x,
+            "y": y,
+            "score": max(row["score"] for row in rows),
+            "id": top["id"],
+            "template": top["template"],
+            "recommended": any(row["recommended"] for row in rows),
+            "count": len(rows),
+            "labels": [f'{row["id"]}: "{row["template"]}"' for row in rows],
+        })
+    return out
+
+
+def _label_points(points: list[dict[str, Any]], n: int, rightmost_n: int) -> list[dict[str, Any]]:
    if len(points) <= n:
        return points
    high_score = sorted(points, key=lambda p: p["score"], reverse=True)[: max(2, n // 2)]
    high_off_axis = sorted(points, key=lambda p: (p["y"], p["x"]), reverse=True)[: n]
+    rightmost = sorted(points, key=lambda p: (p["x"], -p["y"], p["score"]), reverse=True)[:rightmost_n]
    out = []
    seen_labels = set()
    seen_cells = set()
-    for point in high_score + high_off_axis:
-        label_key = point["label"]
+    for point in high_score + high_off_axis + rightmost:
+        label_key = f'{point["id"]}: "{point["template"]}"'
        cell_key = (round(point["x"], 1), round(point["y"], 1))
        if label_key not in seen_labels and cell_key not in seen_cells:
            out.append(point)
            seen_labels.add(label_key)
            seen_cells.add(cell_key)
-    return out[:n]
+    return out[: max(n, rightmost_n)]


 def _place_label(i: int, point: dict[str, Any]) -> tuple[float, float, str, str]:
    dx = 0.018
-    dy = [0.035, -0.05, 0.075, -0.09, 0.115, -0.13][i % 6]
-    x = min(0.98, point["x"] + dx)
+    dy = [0.035, -0.05, 0.075, -0.09, 0.115, -0.13, 0.16, -0.175][i % 8]
+    x = min(0.98, point["x"] + dx) if point["x"] < 0.9 else max(0.05, point["x"] - 0.02)
    y = min(0.98, max(0.02, point["y"] + dy))
-    return x, y, "left", "center"
+    ha = "left" if point["x"] < 0.9 else "right"
+    return x, y, ha, "center"


-def _short_label(text: str) -> str:
-    text = text.replace("{{ persona }}", "{persona}")
-    return textwrap.fill(textwrap.shorten(text, width=74, placeholder="..."), width=38)
+def _short_template(text: str, width: int = 52) -> str:
+    text = text.replace("{{ persona }}", "{persona}").replace("\n", " ")
+    text = " ".join(text.split())
+    if len(text) <= width:
+        return text
+    keep = max(8, (width - 5) // 2)
+    return f"{text[:keep]} ... {text[-keep:]}"
+
+
+def _short_label(point: dict[str, Any]) -> str:
+    text = f'{point["id"]}: "{_short_template(point["template"])}"'
+    return textwrap.fill(text, width=38)


 def main() -> None:
    ap = argparse.ArgumentParser()
-    ap.add_argument("input", type=Path)
+    ap.add_argument("input", nargs="+", type=Path)
    ap.add_argument("--out", type=Path, default=Path("out/on_off_axis.png"))
-    ap.add_argument("--label-count", type=int, default=4)
+    ap.add_argument("--label-count", type=int, default=10)
+    ap.add_argument("--label-rightmost", type=int, default=5)
    args = ap.parse_args()

-    points = [_as_point(row) for row in _read_rows(args.input)]
-    points = [p for p in points if p["label"]]
-    labels = _label_points(points, args.label_count)
+    raw_points = [_as_point(row) for row in _read_all_rows(args.input)]
+    raw_points = [p for p in raw_points if p["id"]]
+    points = _aggregate_points(raw_points)
+    labels = _label_points(points, args.label_count, args.label_rightmost)

    fig, ax = plt.subplots(figsize=(8.0, 5.6), dpi=180)
    ax.scatter(
        [p["x"] for p in points],
        [p["y"] for p in points],
-        s=[42 if p["recommended"] else 24 for p in points],
+        s=[26 + 12 * p["count"] for p in points],
        c=["black" if p["recommended"] else "0.55" for p in points],
        alpha=0.82,
        linewidths=0,
    )
+    for point in points:
+        if point["count"] > 1:
+            ax.text(
+                point["x"],
+                point["y"],
+                str(point["count"]),
+                ha="center",
+                va="center",
+                fontsize=6.5,
+                color="white" if point["recommended"] else "0.1",
+            )
    for i, point in enumerate(labels):
        x, y, ha, va = _place_label(i, point)
+        count_suffix = f" [{point['count']}]" if point["count"] > 1 else ""
        ax.annotate(
-            _short_label(point["label"]),
+            _short_label(point) + count_suffix,
            xy=(point["x"], point["y"]),
            xytext=(x, y),
            textcoords="data",