add clean score tables

wassname
2026-06-13 14:05:26 +08:00
parent 9b1a6e7573
commit 6a19b65e49
3 changed files with 312 additions and 37 deletions
+35 -10
@@ -28,20 +28,40 @@ The point is not "this sounds like a good prompt". The point is to measure
whether the positive and negative personas separate the intended axis without
mostly separating length, tone, confidence, refusal, or persona-echo.
If the pair is `honest -> untruthful`, `in Paris` versus `in Berlin` is
on-axis. `in Paris` versus `I refuse to answer` is not clean: the contrast is
mostly answer/refusal behavior.
## Score
On Hugging Face, start with `template_pair_scores`.
`score` is a conservative 0-100 clean-axis score:
```text
100
* strict_pass_rate
* clamp(mean_axis_delta / 8)
* clamp((7 - mean_off_axis_problem) / 6)
* clamp((6 - mean_max_style_abs_delta) / 6)
* (1 - persona_echo_rate)
* (1 - refusal_or_ai_break_rate)
```
High score means the template/persona-pair cell repeatedly moved the intended
axis while staying comparatively clean on off-axis, style, persona-echo, and
refusal checks.
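As a worked example, the formula can be evaluated by hand for one cell. The sketch below mirrors the formula in plain Python; the function name `clean_axis_score` and the input values are illustrative assumptions, not measured results from the pilot.

```python
def clamp(x: float) -> float:
    """Clip x into [0, 1], as `clamp` does in the formula above."""
    return max(0.0, min(1.0, x))


def clean_axis_score(
    strict_pass_rate: float,
    mean_axis_delta: float,
    mean_off_axis_problem: float,
    mean_max_style_abs_delta: float,
    persona_echo_rate: float,
    refusal_or_ai_break_rate: float,
) -> float:
    """Multiply the six factors; any factor at 0 zeroes the whole score."""
    return 100.0 * (
        strict_pass_rate
        * clamp(mean_axis_delta / 8)
        * clamp((7 - mean_off_axis_problem) / 6)
        * clamp((6 - mean_max_style_abs_delta) / 6)
        * (1 - persona_echo_rate)
        * (1 - refusal_or_ai_break_rate)
    )


# Hypothetical cell: half the scenarios pass strictly, the axis moves 4 points,
# the off-axis rating is 1 (full credit), and style moves 3 points.
example = clean_axis_score(0.5, 4.0, 1.0, 3.0, 0.0, 0.0)
print(example)  # 100 * 0.5 * 0.5 * 1.0 * 0.5 * 1 * 1 = 12.5
```

Because the factors multiply, a cell that is perfect on five components but has `persona_echo_rate = 1` still scores 0, which is what makes the score conservative.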
 ## What To Browse
-On Hugging Face, start with `persona_pairs_v2_review`.
+On Hugging Face:
-That table gives one row per persona pair:
+- `template_pair_scores`: clean selection table with `id`, `template_jinja`, `score`, source attribution, model metadata, and score components
+- `template_scores`: one row per template, aggregated over measured persona pairs
+- `persona_pairs_v2_review`: one row per candidate persona pair
+- `v2_pilot_seed23_examples`: raw completions and judge ratings
-- `axis`: `neg->pos`
-- `positive_behavior` / `negative_behavior`: what the pair should separate
-- `proof_grade`: `pilot_recommended`, `pilot_measured_not_promoted`, or `candidate_unmeasured`
-- `best_template`: best measured template for that pair, if any
-- `best_axis_delta`, `best_off_axis_problem`, `best_max_style_abs_delta`: compact proof stats
-Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative
-completions and judge ratings.
+The examples are still the proof. The score is only a fast sorting key.
 ## Files
@@ -59,6 +79,11 @@ Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4
scenarios. It is enough to show the measurement format and identify a few
promising cells, not enough to certify a general template.
Current pilot: completions from `qwen/qwen3.5-27b`, judge
`google/gemini-3.1-flash-lite-preview`, OpenRouter, `temperature=0`, seed `23`.
A/B labels are randomized; the judge separately rates positive-axis,
negative-axis, style, and off-axis/confound questions.
## Run
```sh ```sh
+53 -10
@@ -27,20 +27,49 @@ channel; the pair supplies the contrastive axis.
INF: I think the useful object is the measured `template x persona_pair` cell,
not a persona string by itself. - wassname
If the pair is `honest -> untruthful`, a clean template should move
truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus
`I refuse to answer` is not clean, because the model mainly changed refusal
behavior.
## Score
OBS: The clean selection table is `template_pair_scores`.
OBS: `score` is:
```text
100
* strict_pass_rate
* clamp(mean_axis_delta / 8)
* clamp((7 - mean_off_axis_problem) / 6)
* clamp((6 - mean_max_style_abs_delta) / 6)
* (1 - persona_echo_rate)
* (1 - refusal_or_ai_break_rate)
```
INF: High score means a measured cell repeatedly moved the intended axis without
large off-axis, style, persona-echo, or refusal movement.
INF: Low score can mean either "no axis movement" or "movement exists but is
confounded". Read the component columns before trusting it.
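To illustrate why the component columns matter, here are two made-up cells that land on the same low score for opposite reasons. This is a sketch under stated assumptions: the helper names `components` and `score` and both rows are hypothetical, and only the formula itself comes from the text above.

```python
import math


def clamp(x: float) -> float:
    return max(0.0, min(1.0, x))


def components(row: dict) -> dict:
    """Return the six multiplicative factors of the score, each in [0, 1]."""
    return {
        "strict": row["strict_pass_rate"],
        "axis": clamp(row["mean_axis_delta"] / 8),
        "off_axis_clean": clamp((7 - row["mean_off_axis_problem"]) / 6),
        "style_clean": clamp((6 - row["mean_max_style_abs_delta"]) / 6),
        "echo_clean": 1 - row["persona_echo_rate"],
        "refusal_clean": 1 - row["refusal_or_ai_break_rate"],
    }


def score(row: dict) -> float:
    return 100.0 * math.prod(components(row).values())


# Hypothetical cell A: clean, but the intended axis barely moves.
no_movement = {
    "strict_pass_rate": 1.0, "mean_axis_delta": 1.0,
    "mean_off_axis_problem": 1.0, "mean_max_style_abs_delta": 0.0,
    "persona_echo_rate": 0.0, "refusal_or_ai_break_rate": 0.0,
}
# Hypothetical cell B: the axis moves fully, but confounds and style move too.
confounded = {
    "strict_pass_rate": 1.0, "mean_axis_delta": 8.0,
    "mean_off_axis_problem": 4.0, "mean_max_style_abs_delta": 4.5,
    "persona_echo_rate": 0.0, "refusal_or_ai_break_rate": 0.0,
}

for name, row in [("no_movement", no_movement), ("confounded", confounded)]:
    print(name, score(row), components(row))
```

Both rows score 12.5, yet the factor dictionaries differ: the first loses everything on `axis`, the second on `off_axis_clean` and `style_clean`.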
 ## Browse
-Start with the Hugging Face split `persona_pairs_v2_review`.
+Start with the Hugging Face split `template_pair_scores`.
-- `axis`: compact `neg->pos`
-- `positive_behavior` / `negative_behavior`: intended behavioral contrast
-- `proof_grade`: `pilot_recommended`, `pilot_measured_not_promoted`, or `candidate_unmeasured`
-- `best_template`: best measured template for that pair, if measured
-- `best_axis_delta`: intended-axis Likert separation
-- `best_off_axis_problem`: judge-rated confound risk
-- `best_max_style_abs_delta`: largest audited style movement
+- `id`: stable row id
+- `template_jinja`: Jinja2 template with `{{ persona }}`
+- `persona_pair_id`: measured persona pair
+- `score`: clean-axis score
+- `source_id` / `source_type`: attribution when known
+- `mean_axis_delta`: intended-axis movement
+- `mean_off_axis_problem`: confound rating
+- `mean_max_style_abs_delta`: style movement
+- `generator_model` / `judge_model`: measurement models
-Then open `v2_pilot_seed23_examples` and read the paired completions. The table
-is only a map; the examples are the proof.
+Then open `v2_pilot_seed23_examples` and read the paired completions. The score
+is a sorting key; the examples are the proof.
## Wassname Anecdotes / Design Notes
@@ -77,6 +106,20 @@ INF: A useful cell should have:
I think `recommended=true` should mean "worth a larger sweep", not "certified".
- wassname
## Measurement Details
OBS: Current pilot completions use `qwen/qwen3.5-27b`.
OBS: Current pilot judging uses `google/gemini-3.1-flash-lite-preview`.
OBS: Generation uses `temperature=0.0` and seed `23`.
OBS: The judge sees randomized A/B labels and separate positive-axis,
negative-axis, style, and off-axis/confound prompts.
INF: This reduces sampling, position, and framing confounds. It does not remove
the usual limitations of one small automatic-judge pilot. - wassname
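One way randomized A/B labels can work is sketched below. This is an assumption about the general technique, not the pilot's actual code: the function `assign_ab_labels` and its arguments are hypothetical. A seeded `random.Random` decides, per comparison, whether the positive-persona completion is shown as A or B, and the mapping is kept aside so judge ratings can be mapped back afterwards.

```python
import random


def assign_ab_labels(pos_completion: str, neg_completion: str, rng: random.Random) -> dict:
    """Randomly place the positive completion under label A or B."""
    pos_is_a = rng.random() < 0.5
    if pos_is_a:
        a, b = pos_completion, neg_completion
    else:
        a, b = neg_completion, pos_completion
    # `pos_label` is bookkeeping for the analysis; the judge never sees it.
    return {"A": a, "B": b, "pos_label": "A" if pos_is_a else "B"}


rng = random.Random(23)  # fixed seed so the label assignment is reproducible
pairs = [("in Paris", "in Berlin")] * 8
labeled = [assign_ab_labels(pos, neg, rng) for pos, neg in pairs]

for item in labeled:
    # Mapping the hidden label back always recovers the positive completion.
    recovered_pos = item[item["pos_label"]]
    print(item["pos_label"], recovered_pos)
```

Randomizing which side is A means a judge that systematically favors the first option adds noise to the ratings instead of a bias toward one persona.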
## Sources To Check
No evidence dump is committed here yet. When checking a claim, distinguish papers
+224 -17
@@ -31,6 +31,26 @@ TABLE_SOURCES = {
     "v2_pilot_seed23_examples": DATA / "v2_pilot_seed23_examples.jsonl",
 }
+V2_PILOT_META = {
+    "measurement_id": "v2_pilot_seed23",
+    "generator_model": "qwen/qwen3.5-27b",
+    "judge_model": "google/gemini-3.1-flash-lite-preview",
+    "generation_temperature": 0.0,
+    "seed": 23,
+    "judge_order": "A/B labels randomized per prompt/template/persona_pair",
+    "judge_method": (
+        "separate positive-axis, negative-axis, style, and off-axis/confound "
+        "calls with deterministic judge temperature"
+    ),
+}
+SCORE_FORMULA = (
+    "100 * strict_pass_rate * clamp(mean_axis_delta/8) * "
+    "clamp((7-mean_off_axis_problem)/6) * "
+    "clamp((6-mean_max_style_abs_delta)/6) * "
+    "(1-persona_echo_rate) * (1-refusal_or_ai_break_rate)"
+)
def _jsonable(value: Any) -> Any:
    if isinstance(value, (dict, list)):
@@ -52,7 +72,11 @@ def _write_parquet(path: Path, rows: list[dict[str, Any]]) -> None:
     if not rows:
         table = pa.table({})
     else:
-        keys = sorted({k for row in rows for k in row})
+        keys = list(rows[0])
+        for row in rows[1:]:
+            for key in row:
+                if key not in keys:
+                    keys.append(key)
         normalized = [{k: _jsonable(row.get(k)) for k in keys} for row in rows]
         table = pa.Table.from_pylist(normalized)
     path.parent.mkdir(parents=True, exist_ok=True)
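The key-ordering change in this hunk can be illustrated in isolation. A minimal sketch with made-up rows: `sorted` orders columns alphabetically, while the first-seen walk keeps the order in which the row dicts were written, so leading columns such as `id` and `score` stay in front. The `dict.fromkeys` line is an equivalent idiom added here for comparison; it is not what the commit uses.

```python
rows = [
    {"id": "a", "template_jinja": "{{ persona }}", "score": 12.5},
    {"id": "b", "score": 3.0, "n_errors": 1},
]

# Old behavior: alphabetical column order.
alphabetical = sorted({k for row in rows for k in row})

# New behavior: columns in first-seen order across rows.
first_seen = list(rows[0])
for row in rows[1:]:
    for key in row:
        if key not in first_seen:
            first_seen.append(key)

# Equivalent idiom: dicts preserve insertion order (Python 3.7+).
via_fromkeys = list(dict.fromkeys(k for row in rows for k in row))

print(alphabetical)  # ['id', 'n_errors', 'score', 'template_jinja']
print(first_seen)    # ['id', 'template_jinja', 'score', 'n_errors']
```

With first-seen order, the column layout of each Parquet table follows the order of keys in the row dicts, which is why the row builders list `id`, `template_jinja`, and `score` first.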
@@ -61,18 +85,151 @@ def _write_parquet(path: Path, rows: list[dict[str, Any]]) -> None:
 def _template_rows(path: Path) -> list[dict[str, Any]]:
     return [
-        {"template_id": i, "template": line.strip()}
+        {
+            "id": f"template_{i:02d}",
+            "template": line.strip(),
+            "template_jinja": _jinja(line.strip()),
+            "template_format": "jinja2",
+            "source_id": "wassname_v2_candidate",
+            "source_type": "wassname anecdote / design note",
+        }
         for i, line in enumerate(path.read_text().splitlines())
         if line.strip()
     ]
-def _persona_pair_review_rows() -> list[dict[str, Any]]:
+def _clamp01(x: float) -> float:
+    return max(0.0, min(1.0, x))
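def _num(row: dict[str, Any], key: str, default: float) -> float:
    # Helper introduced in this rewrite: fall back only when the value is
    # missing. With `value or default`, a measured 0.0 (for example zero style
    # movement) would be replaced by the worst-case default and score 0.
    value = row.get(key)
    return default if value is None else float(value)


def _score(row: dict[str, Any]) -> float:
    # Missing off-axis or style ratings default to the worst case.
    strict = _num(row, "strict_pass_rate", 0.0)
    axis = _clamp01(_num(row, "mean_axis_delta", 0.0) / 8.0)
    off_axis_clean = _clamp01((7.0 - _num(row, "mean_off_axis_problem", 7.0)) / 6.0)
    style_clean = _clamp01((6.0 - _num(row, "mean_max_style_abs_delta", 6.0)) / 6.0)
    echo_clean = _clamp01(1.0 - _num(row, "persona_echo_rate", 0.0))
    refusal_clean = _clamp01(1.0 - _num(row, "refusal_or_ai_break_rate", 0.0))
    return round(100.0 * strict * axis * off_axis_clean * style_clean * echo_clean * refusal_clean, 1)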
def _jinja(template: str) -> str:
    return template.replace("{persona}", "{{ persona }}")


def _source_type(source_id: str | None) -> str:
    if source_id in {"repeng"}:
        return "code / practitioner examples"
    if source_id in {"assistant_axis", "persona_vectors", "weight_steering"}:
        return "associated code / trait files"
    if source_id in {"w2schar_in_house", "steer_heal_love", "wassname_v2_candidate"}:
        return "wassname anecdote / design note"
    if source_id:
        return "source-listed candidate"
    return "wassname anecdote / design note"


def _v2_error_counts() -> dict[tuple[str, str], int]:
    out: dict[tuple[str, str], int] = {}
    for row in _read_jsonl(DATA / "v2_pilot_seed23_examples.jsonl"):
        key = (row.get("template"), row.get("persona_pair"))
        if row.get("error"):
            out[key] = out.get(key, 0) + 1
    return out


def _persona_pairs_by_id() -> dict[str, dict[str, Any]]:
    return {row["id"]: row for row in _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")}


def _template_pair_score_rows() -> list[dict[str, Any]]:
    pairs = _persona_pairs_by_id()
    errors = _v2_error_counts()
    rows = []
    for stat in _read_jsonl(DATA / "v2_pilot_seed23_template_pair_stats.jsonl"):
        pair = pairs.get(stat["persona_pair"], {})
        n_success = int(stat.get("n") or 0)
        n_errors = errors.get((stat["template"], stat["persona_pair"]), 0)
        score = _score(stat)
        source_id = pair.get("source_id", "wassname_v2_candidate")
        rows.append({
            "id": f"{stat['persona_pair']}::{_slug(stat['template'])}",
            "template_jinja": _jinja(stat["template"]),
            "score": score,
            "persona_pair_id": stat["persona_pair"],
            "axis": f"{pair.get('neg', '')}->{pair.get('pos', '')}",
            "source_id": source_id,
            "source_type": _source_type(source_id),
            "measurement_id": V2_PILOT_META["measurement_id"],
            "template": stat["template"],
            "template_format": "jinja2",
            "pos_persona": pair.get("pos"),
            "neg_persona": pair.get("neg"),
            "positive_behavior": pair.get("positive_behavior"),
            "negative_behavior": pair.get("negative_behavior"),
            "score_formula": SCORE_FORMULA,
            "recommended": bool(stat.get("recommended")),
            "n_success": n_success,
            "n_errors": n_errors,
            "n_planned": n_success + n_errors,
            "strict_pass_rate": stat.get("strict_pass_rate"),
            "mean_axis_delta": stat.get("mean_axis_delta"),
            "mean_off_axis_problem": stat.get("mean_off_axis_problem"),
            "mean_max_style_abs_delta": stat.get("mean_max_style_abs_delta"),
            "mean_abs_word_delta_frac": stat.get("mean_abs_word_delta_frac"),
            "persona_echo_rate": stat.get("persona_echo_rate"),
            "refusal_or_ai_break_rate": stat.get("refusal_or_ai_break_rate"),
            "usable_rate": stat.get("usable_rate"),
            **V2_PILOT_META,
        })
    rows.sort(key=lambda r: (r["score"], r["strict_pass_rate"], r["mean_axis_delta"]), reverse=True)
    return rows


def _slug(text: str) -> str:
    out = "".join(ch.lower() if ch.isalnum() else "_" for ch in text)
    while "__" in out:
        out = out.replace("__", "_")
    return out.strip("_")[:64]
def _template_score_rows(template_pair_scores: list[dict[str, Any]]) -> list[dict[str, Any]]:
    by_template: dict[str, list[dict[str, Any]]] = {}
    for row in template_pair_scores:
        by_template.setdefault(row["template"], []).append(row)

    out = []
    for template, rows in by_template.items():
        best = rows[0]
        measured = len(rows)
        out.append({
            "id": _slug(template),
            "template_jinja": _jinja(template),
            "score": round(sum(float(r["score"]) for r in rows) / measured, 1),
            "best_score": best["score"],
            "best_persona_pair_id": best["persona_pair_id"],
            "source_id": "wassname_v2_candidate",
            "source_type": "wassname anecdote / design note",
            "measurement_id": V2_PILOT_META["measurement_id"],
            "template": template,
            "template_format": "jinja2",
            "recommended_cell_count": sum(bool(r["recommended"]) for r in rows),
            "measured_persona_pair_count": measured,
            "mean_axis_delta": round(
                sum(float(r["mean_axis_delta"] or 0) for r in rows) / measured, 4),
            "mean_off_axis_problem": round(
                sum(float(r["mean_off_axis_problem"] or 0) for r in rows) / measured, 4),
            "mean_max_style_abs_delta": round(
                sum(float(r["mean_max_style_abs_delta"] or 0) for r in rows) / measured, 4),
            **V2_PILOT_META,
        })
    out.sort(key=lambda r: (r["best_score"], r["score"]), reverse=True)
    return out
+def _persona_pair_review_rows(template_pair_scores: list[dict[str, Any]]) -> list[dict[str, Any]]:
     pairs = _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")
-    pilot = _read_jsonl(DATA / "v2_pilot_seed23_template_pair_stats.jsonl")
     by_pair: dict[str, list[dict[str, Any]]] = {}
-    for row in pilot:
-        by_pair.setdefault(row["persona_pair"], []).append(row)
+    for row in template_pair_scores:
+        by_pair.setdefault(row["persona_pair_id"], []).append(row)
     out = []
     for pair in pairs:
@@ -80,10 +237,8 @@ def _persona_pair_review_rows() -> list[dict[str, Any]]:
             by_pair.get(pair["id"], []),
             key=lambda r: (
                 bool(r.get("recommended")),
+                float(r.get("score") or 0),
                 float(r.get("strict_pass_rate") or 0),
-                float(r.get("mean_axis_delta") or 0),
-                -float(r.get("mean_off_axis_problem") or 99),
-                -float(r.get("mean_max_style_abs_delta") or 99),
             ),
             reverse=True,
         )
@@ -99,7 +254,9 @@ def _persona_pair_review_rows() -> list[dict[str, Any]]:
         if best:
             proof_summary = (
                 f"best_template={best['template']}; "
-                f"n={best['n']}; pass={best['strict_pass_rate']}; "
+                f"score={best['score']}; "
+                f"n_success={best['n_success']}; "
+                f"pass={best['strict_pass_rate']}; "
                 f"axis_delta={best['mean_axis_delta']}; "
                 f"off_axis={best['mean_off_axis_problem']}; "
                 f"style={best['mean_max_style_abs_delta']}"
@@ -108,18 +265,22 @@ def _persona_pair_review_rows() -> list[dict[str, Any]]:
             proof_summary = "no measured v2 pilot rows yet"
         out.append({
+            "id": pair["id"],
             "persona_pair": pair["id"],
             "axis": f"{pair['neg']}->{pair['pos']}",
+            "best_score": best.get("score"),
+            "proof_grade": proof_grade,
+            "source_id": pair.get("source_id"),
+            "source_type": _source_type(pair.get("source_id")),
             "pos": pair["pos"],
             "neg": pair["neg"],
             "positive_behavior": pair["positive_behavior"],
             "negative_behavior": pair["negative_behavior"],
-            "source_id": pair.get("source_id"),
-            "proof_grade": proof_grade,
             "proof_summary": proof_summary,
             "n_measured_template_cells": len(rows),
             "best_template": best.get("template"),
-            "best_n": best.get("n"),
+            "best_template_jinja": _jinja(best["template"]) if best else None,
+            "best_n_success": best.get("n_success"),
             "best_strict_pass_rate": best.get("strict_pass_rate"),
             "best_axis_delta": best.get("mean_axis_delta"),
             "best_off_axis_problem": best.get("mean_off_axis_problem"),
@@ -186,11 +347,36 @@ measured pilot:
The point is to measure whether the positive and negative personas separate the intended axis without mostly separating length, tone, confidence, refusal, or persona-echo.
For example, if the persona pair is `honest -> untruthful`, a useful template should make the completions differ on truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus `I refuse to answer` is not clean, because the pair is mostly separating answer/refusal behavior.
## Score
Start with `template_pair_scores`.
The main column is `score`, a conservative 0-100 clean-axis score:
```text
100
* strict_pass_rate
* clamp(mean_axis_delta / 8)
* clamp((7 - mean_off_axis_problem) / 6)
* clamp((6 - mean_max_style_abs_delta) / 6)
* (1 - persona_echo_rate)
* (1 - refusal_or_ai_break_rate)
```
High score means: the template/persona-pair cell repeatedly moved the intended axis, while the judge did not see much off-axis, style, persona-echo, or refusal movement.
Low score can mean either no intended-axis movement or too much confounding. Read the component columns before trusting the score.
 ## What To Browse
-Start with `persona_pairs_v2_review`.
+1. `template_pair_scores`: clean selection table. Columns include `id`, `template_jinja`, `persona_pair_id`, `score`, source attribution, model metadata, and the score components.
+2. `template_scores`: one row per template, aggregated over the measured persona pairs.
+3. `persona_pairs_v2_review`: one row per candidate persona pair.
+4. `v2_pilot_seed23_examples`: raw paired completions and judge ratings.
-That table gives one row per persona pair:
+`persona_pairs_v2_review` gives:
 - `axis`: `neg->pos`
 - `positive_behavior` / `negative_behavior`: what the pair should separate
@@ -200,6 +386,12 @@ That table gives one row per persona pair:
Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative completions and judge ratings.
## Measurement
This pilot uses `qwen/qwen3.5-27b` for completions and `google/gemini-3.1-flash-lite-preview` as the judge through OpenRouter. Generation temperature is `0.0`, with seed `23`, to reduce sampling noise.
The judge sees randomized A/B labels. It separately rates positive-axis behavior, negative-axis behavior, surface style, and off-axis/confound risk. This reduces simple position/framing bias, but it is still one automatic judge on one small pilot.
## Current Status
Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4 scenarios. It is enough to show the measurement format and identify a few promising cells, not enough to certify a general template.
@@ -225,12 +417,27 @@ def main() -> None:
     tables = {name: _read_jsonl(path) for name, path in TABLE_SOURCES.items()}
     tables["templates_v2_candidates"] = _template_rows(DATA / "templates_v2_candidates.txt")
-    tables["persona_pairs_v2_review"] = _persona_pair_review_rows()
+    tables["template_pair_scores"] = _template_pair_score_rows()
+    tables["template_scores"] = _template_score_rows(tables["template_pair_scores"])
+    tables["persona_pairs_v2_review"] = _persona_pair_review_rows(tables["template_pair_scores"])
     for name, rows in tables.items():
         _write_parquet(parquet_dir / f"{name}.parquet", rows)
-    names = sorted(tables)
+    names = [
+        "template_pair_scores",
+        "template_scores",
+        "persona_pairs_v2_review",
+        "templates_v2_candidates",
+        "persona_pairs_v2_candidates",
+        "scenarios_v2_candidates",
+        "v2_pilot_seed23_template_pair_stats",
+        "v2_pilot_seed23_template_stats",
+        "v2_pilot_seed23_examples",
+        "template_pair_stats",
+        "template_stats",
+        "examples",
+    ]
     (args.out / "README.md").write_text(_readme(names))
     print(f"built {args.out}")
     for name in names: