mirror of
https://github.com/wassname/persona-steering-template-library.git
synced 2026-06-27 17:01:24 +08:00
add clean score tables
@@ -28,20 +28,40 @@ The point is not "this sounds like a good prompt". The point is to measure
|
|||||||
whether the positive and negative personas separate the intended axis without
|
whether the positive and negative personas separate the intended axis without
|
||||||
mostly separating length, tone, confidence, refusal, or persona-echo.
|
mostly separating length, tone, confidence, refusal, or persona-echo.
|
||||||
|
|
||||||
|
If the pair is `honest -> untruthful`, `in Paris` versus `in Berlin` is
|
||||||
|
on-axis. `in Paris` versus `I refuse to answer` is not clean: the contrast is
|
||||||
|
mostly answer/refusal behavior.
|
||||||
|
|
||||||
|
## Score
|
||||||
|
|
||||||
|
On Hugging Face, start with `template_pair_scores`.
|
||||||
|
|
||||||
|
`score` is a conservative 0-100 clean-axis score, where `clamp` bounds its argument to [0, 1]:
|
||||||
|
|
||||||
|
```text
|
||||||
|
100
|
||||||
|
* strict_pass_rate
|
||||||
|
* clamp(mean_axis_delta / 8)
|
||||||
|
* clamp((7 - mean_off_axis_problem) / 6)
|
||||||
|
* clamp((6 - mean_max_style_abs_delta) / 6)
|
||||||
|
* (1 - persona_echo_rate)
|
||||||
|
* (1 - refusal_or_ai_break_rate)
|
||||||
|
```
|
||||||
|
|
||||||
|
High score means the template/persona-pair cell repeatedly moved the intended
|
||||||
|
axis while staying comparatively clean on off-axis, style, persona-echo, and
|
||||||
|
refusal checks.
|
||||||
|
|
||||||
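As a worked example of the formula, the sketch below multiplies the six factors for one hypothetical cell. The input values are invented for illustration and are not pilot results. `clamp01` and `clean_axis_score` are illustrative names, and `clamp` is assumed to bound its argument to [0, 1].

```python
def clamp01(x: float) -> float:
    """Bound x to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def clean_axis_score(cell: dict[str, float]) -> float:
    """Multiply the six factors of the score formula and scale to 0-100."""
    return round(
        100.0
        * cell["strict_pass_rate"]
        * clamp01(cell["mean_axis_delta"] / 8.0)
        * clamp01((7.0 - cell["mean_off_axis_problem"]) / 6.0)
        * clamp01((6.0 - cell["mean_max_style_abs_delta"]) / 6.0)
        * (1.0 - cell["persona_echo_rate"])
        * (1.0 - cell["refusal_or_ai_break_rate"]),
        1,
    )


# Hypothetical cell, not taken from the pilot data.
example_cell = {
    "strict_pass_rate": 0.75,
    "mean_axis_delta": 4.0,            # factor 0.5
    "mean_off_axis_problem": 2.5,      # factor 0.75
    "mean_max_style_abs_delta": 1.5,   # factor 0.75
    "persona_echo_rate": 0.0,          # factor 1.0
    "refusal_or_ai_break_rate": 0.25,  # factor 0.75
}

print(clean_axis_score(example_cell))  # 15.8
```

Because the factors multiply, a cell with several individually moderate values still lands near 16, which is the sense in which the score is conservative.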
## What To Browse
|
## What To Browse
|
||||||
|
|
||||||
On Hugging Face, start with `persona_pairs_v2_review`.
|
On Hugging Face:
|
||||||
|
|
||||||
That table gives one row per persona pair:
|
- `template_pair_scores`: clean selection table with `id`, `template_jinja`, `score`, source attribution, model metadata, and score components
|
||||||
|
- `template_scores`: one row per template, aggregated over measured persona pairs
|
||||||
|
- `persona_pairs_v2_review`: one row per candidate persona pair
|
||||||
|
- `v2_pilot_seed23_examples`: raw completions and judge ratings
|
||||||
|
|
||||||
- `axis`: `neg->pos`
|
The examples are still the proof. The score is only a fast sorting key.
|
||||||
- `positive_behavior` / `negative_behavior`: what the pair should separate
|
|
||||||
- `proof_grade`: `pilot_recommended`, `pilot_measured_not_promoted`, or `candidate_unmeasured`
|
|
||||||
- `best_template`: best measured template for that pair, if any
|
|
||||||
- `best_axis_delta`, `best_off_axis_problem`, `best_max_style_abs_delta`: compact proof stats
|
|
||||||
|
|
||||||
Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative
|
|
||||||
completions and judge ratings.
|
|
||||||
|
|
||||||
## Files
|
## Files
|
||||||
|
|
||||||
@@ -59,6 +79,11 @@ Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4
|
|||||||
scenarios. It is enough to show the measurement format and identify a few
|
scenarios. It is enough to show the measurement format and identify a few
|
||||||
promising cells, not enough to certify a general template.
|
promising cells, not enough to certify a general template.
|
||||||
|
|
||||||
|
Current pilot: completions from `qwen/qwen3.5-27b`, judge
|
||||||
|
`google/gemini-3.1-flash-lite-preview`, OpenRouter, `temperature=0`, seed `23`.
|
||||||
|
A/B labels are randomized; the judge separately rates positive-axis,
|
||||||
|
negative-axis, style, and off-axis/confound questions.
|
||||||
|
|
||||||
## Run
|
## Run
|
||||||
|
|
||||||
```sh
|
```sh
|
||||||
|
|||||||
+53
-10
@@ -27,20 +27,49 @@ channel; the pair supplies the contrastive axis.
|
|||||||
INF: I think the useful object is the measured `template x persona_pair` cell,
|
INF: I think the useful object is the measured `template x persona_pair` cell,
|
||||||
not a persona string by itself. - wassname
|
not a persona string by itself. - wassname
|
||||||
|
|
||||||
|
If the pair is `honest -> untruthful`, a clean template should move
|
||||||
|
truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus
|
||||||
|
`I refuse to answer` is not clean, because the model mainly changed refusal
|
||||||
|
behavior.
|
||||||
|
|
||||||
|
## Score
|
||||||
|
|
||||||
|
OBS: The clean selection table is `template_pair_scores`.
|
||||||
|
|
||||||
|
OBS: `score` is a 0-100 product of six factors, where `clamp` bounds its argument to [0, 1]:
|
||||||
|
|
||||||
|
```text
|
||||||
|
100
|
||||||
|
* strict_pass_rate
|
||||||
|
* clamp(mean_axis_delta / 8)
|
||||||
|
* clamp((7 - mean_off_axis_problem) / 6)
|
||||||
|
* clamp((6 - mean_max_style_abs_delta) / 6)
|
||||||
|
* (1 - persona_echo_rate)
|
||||||
|
* (1 - refusal_or_ai_break_rate)
|
||||||
|
```
|
||||||
|
|
||||||
|
INF: High score means a measured cell repeatedly moved the intended axis without
|
||||||
|
large off-axis, style, persona-echo, or refusal movement.
|
||||||
|
|
||||||
|
INF: Low score can mean either "no axis movement" or "movement exists but is
|
||||||
|
confounded". Read the component columns before trusting it.
|
||||||
|
|
||||||
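One way to read the component columns is to compute each factor separately and see which one limits the product. The sketch below does that for two hypothetical cells, one with almost no axis movement and one with clear movement that is confounded by refusals. The function names and input values are illustrative assumptions, not pilot data.

```python
def clamp01(x: float) -> float:
    """Bound x to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def score_factors(cell: dict[str, float]) -> dict[str, float]:
    """Return each multiplicative factor of the score, all in [0, 1]."""
    return {
        "strict_pass_rate": cell["strict_pass_rate"],
        "axis": clamp01(cell["mean_axis_delta"] / 8.0),
        "off_axis_clean": clamp01((7.0 - cell["mean_off_axis_problem"]) / 6.0),
        "style_clean": clamp01((6.0 - cell["mean_max_style_abs_delta"]) / 6.0),
        "echo_clean": 1.0 - cell["persona_echo_rate"],
        "refusal_clean": 1.0 - cell["refusal_or_ai_break_rate"],
    }


def weakest_factor(cell: dict[str, float]) -> tuple[str, float]:
    """Name the factor that pulls the score down the most."""
    factors = score_factors(cell)
    name = min(factors, key=factors.get)
    return name, factors[name]


# Hypothetical cells, chosen to show the two failure modes.
no_movement = {
    "strict_pass_rate": 1.0, "mean_axis_delta": 0.5,
    "mean_off_axis_problem": 1.0, "mean_max_style_abs_delta": 0.0,
    "persona_echo_rate": 0.0, "refusal_or_ai_break_rate": 0.0,
}
confounded = {
    "strict_pass_rate": 0.5, "mean_axis_delta": 6.0,
    "mean_off_axis_problem": 3.0, "mean_max_style_abs_delta": 1.0,
    "persona_echo_rate": 0.0, "refusal_or_ai_break_rate": 0.75,
}

print(weakest_factor(no_movement))  # ('axis', 0.0625)
print(weakest_factor(confounded))   # ('refusal_clean', 0.25)
```

Both cells score low, but for different reasons, so the limiting factor is worth checking before a cell is discarded or promoted.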
## Browse
|
## Browse
|
||||||
|
|
||||||
Start with the Hugging Face split `persona_pairs_v2_review`.
|
Start with the Hugging Face split `template_pair_scores`.
|
||||||
|
|
||||||
- `axis`: compact `neg->pos`
|
- `id`: stable row id
|
||||||
- `positive_behavior` / `negative_behavior`: intended behavioral contrast
|
- `template_jinja`: Jinja2 template with `{{ persona }}`
|
||||||
- `proof_grade`: `pilot_recommended`, `pilot_measured_not_promoted`, or `candidate_unmeasured`
|
- `persona_pair_id`: measured persona pair
|
||||||
- `best_template`: best measured template for that pair, if measured
|
- `score`: clean-axis score
|
||||||
- `best_axis_delta`: intended-axis Likert separation
|
- `source_id` / `source_type`: attribution when known
|
||||||
- `best_off_axis_problem`: judge-rated confound risk
|
- `mean_axis_delta`: intended-axis movement
|
||||||
- `best_max_style_abs_delta`: largest audited style movement
|
- `mean_off_axis_problem`: confound rating
|
||||||
|
- `mean_max_style_abs_delta`: style movement
|
||||||
|
- `generator_model` / `judge_model`: measurement models
|
||||||
|
|
||||||
Then open `v2_pilot_seed23_examples` and read the paired completions. The table
|
Then open `v2_pilot_seed23_examples` and read the paired completions. The score
|
||||||
is only a map; the examples are the proof.
|
is a sorting key; the examples are the proof.
|
||||||
|
|
||||||
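The browse order above can be sketched as a small join: rank cells by `score`, then pull the raw examples for the top cell. The rows below are hypothetical in-memory stand-ins for the two splits. The keys `template`, `persona_pair_id`, `persona_pair`, and `score` follow the build script, while `pos_completion` and `neg_completion` are assumed field names for illustration only.

```python
from collections import defaultdict
from typing import Any

# Hypothetical stand-ins for rows of `template_pair_scores`.
score_rows = [
    {"template": "T1", "persona_pair_id": "pair_a", "score": 41.2},
    {"template": "T2", "persona_pair_id": "pair_a", "score": 63.0},
]

# Hypothetical stand-ins for rows of `v2_pilot_seed23_examples`.
example_rows = [
    {"template": "T2", "persona_pair": "pair_a",
     "pos_completion": "in Paris", "neg_completion": "in Berlin"},
    {"template": "T1", "persona_pair": "pair_a",
     "pos_completion": "in Paris", "neg_completion": "I refuse to answer"},
]


def examples_for_top_cells(
    scores: list[dict[str, Any]],
    examples: list[dict[str, Any]],
    top_k: int = 1,
) -> list[tuple[dict[str, Any], list[dict[str, Any]]]]:
    """Pair each of the top_k cells by score with its raw examples."""
    by_cell: dict[tuple[str, str], list[dict[str, Any]]] = defaultdict(list)
    for ex in examples:
        by_cell[(ex["template"], ex["persona_pair"])].append(ex)
    ranked = sorted(scores, key=lambda r: r["score"], reverse=True)
    return [
        (row, by_cell.get((row["template"], row["persona_pair_id"]), []))
        for row in ranked[:top_k]
    ]


for cell, cell_examples in examples_for_top_cells(score_rows, example_rows):
    print(cell["template"], cell["score"], len(cell_examples))
```

The score only decides which cell to open first. Whether the pair is clean is still judged from the completions themselves.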
## Wassname Anecdotes / Design Notes
|
## Wassname Anecdotes / Design Notes
|
||||||
|
|
||||||
@@ -77,6 +106,20 @@ INF: A useful cell should have:
|
|||||||
I think `recommended=true` should mean "worth a larger sweep", not "certified".
|
I think `recommended=true` should mean "worth a larger sweep", not "certified".
|
||||||
- wassname
|
- wassname
|
||||||
|
|
||||||
|
## Measurement Details
|
||||||
|
|
||||||
|
OBS: Current pilot completions use `qwen/qwen3.5-27b`.
|
||||||
|
|
||||||
|
OBS: Current pilot judging uses `google/gemini-3.1-flash-lite-preview`.
|
||||||
|
|
||||||
|
OBS: Generation uses `temperature=0.0` and seed `23`.
|
||||||
|
|
||||||
|
OBS: The judge sees randomized A/B labels and separate positive-axis,
|
||||||
|
negative-axis, style, and off-axis/confound prompts.
|
||||||
|
|
||||||
|
INF: This reduces sampling, position, and framing confounds. It does not remove
|
||||||
|
the usual limitations of one small automatic-judge pilot. - wassname
|
||||||
|
|
||||||
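A minimal sketch of A/B label randomization, assuming a seeded `random.Random` and a simple label key for un-blinding. The function names, the rating values, and the use of seed 23 for the shuffle are illustrative assumptions. The pilot's actual randomization code is not shown in this document.

```python
import random


def blind_pair(
    pos_text: str, neg_text: str, rng: random.Random
) -> tuple[dict[str, str], dict[str, str]]:
    """Assign the two completions to labels A and B in random order.

    Returns (shown, key): `shown` is what the judge sees, and `key` maps
    each label back to "pos" or "neg" for un-blinding afterwards.
    """
    if rng.random() < 0.5:
        return {"A": pos_text, "B": neg_text}, {"A": "pos", "B": "neg"}
    return {"A": neg_text, "B": pos_text}, {"A": "neg", "B": "pos"}


def unblind(ratings: dict[str, float], key: dict[str, str]) -> dict[str, float]:
    """Convert judge ratings keyed by A/B into ratings keyed by pos/neg."""
    return {key[label]: value for label, value in ratings.items()}


rng = random.Random(23)  # assumed seed, echoing the pilot's seed value
shown, key = blind_pair("in Paris", "in Berlin", rng)
ratings = {"A": 6.0, "B": 2.0}  # hypothetical judge output
print(unblind(ratings, key))
```

Keeping the key outside the judge prompt means the judge cannot infer which completion came from the positive persona from its position alone.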
## Sources To Check
|
## Sources To Check
|
||||||
|
|
||||||
No evidence dump is committed here yet. When checking a claim, distinguish papers
|
No evidence dump is committed here yet. When checking a claim, distinguish papers
|
||||||
|
|||||||
+224
-17
@@ -31,6 +31,26 @@ TABLE_SOURCES = {
|
|||||||
"v2_pilot_seed23_examples": DATA / "v2_pilot_seed23_examples.jsonl",
|
"v2_pilot_seed23_examples": DATA / "v2_pilot_seed23_examples.jsonl",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
V2_PILOT_META = {
|
||||||
|
"measurement_id": "v2_pilot_seed23",
|
||||||
|
"generator_model": "qwen/qwen3.5-27b",
|
||||||
|
"judge_model": "google/gemini-3.1-flash-lite-preview",
|
||||||
|
"generation_temperature": 0.0,
|
||||||
|
"seed": 23,
|
||||||
|
"judge_order": "A/B labels randomized per prompt/template/persona_pair",
|
||||||
|
"judge_method": (
|
||||||
|
"separate positive-axis, negative-axis, style, and off-axis/confound "
|
||||||
|
"calls with deterministic judge temperature"
|
||||||
|
),
|
||||||
|
}
|
||||||
|
|
||||||
|
SCORE_FORMULA = (
|
||||||
|
"100 * strict_pass_rate * clamp(mean_axis_delta/8) * "
|
||||||
|
"clamp((7-mean_off_axis_problem)/6) * "
|
||||||
|
"clamp((6-mean_max_style_abs_delta)/6) * "
|
||||||
|
"(1-persona_echo_rate) * (1-refusal_or_ai_break_rate)"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def _jsonable(value: Any) -> Any:
|
def _jsonable(value: Any) -> Any:
|
||||||
if isinstance(value, (dict, list)):
|
if isinstance(value, (dict, list)):
|
||||||
@@ -52,7 +72,11 @@ def _write_parquet(path: Path, rows: list[dict[str, Any]]) -> None:
|
|||||||
if not rows:
|
if not rows:
|
||||||
table = pa.table({})
|
table = pa.table({})
|
||||||
else:
|
else:
|
||||||
keys = sorted({k for row in rows for k in row})
|
keys = list(rows[0])
|
||||||
|
for row in rows[1:]:
|
||||||
|
for key in row:
|
||||||
|
if key not in keys:
|
||||||
|
keys.append(key)
|
||||||
normalized = [{k: _jsonable(row.get(k)) for k in keys} for row in rows]
|
normalized = [{k: _jsonable(row.get(k)) for k in keys} for row in rows]
|
||||||
table = pa.Table.from_pylist(normalized)
|
table = pa.Table.from_pylist(normalized)
|
||||||
path.parent.mkdir(parents=True, exist_ok=True)
|
path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
@@ -61,18 +85,151 @@ def _write_parquet(path: Path, rows: list[dict[str, Any]]) -> None:
|
|||||||
|
|
||||||
def _template_rows(path: Path) -> list[dict[str, Any]]:
|
def _template_rows(path: Path) -> list[dict[str, Any]]:
|
||||||
return [
|
return [
|
||||||
{"template_id": i, "template": line.strip()}
|
{
|
||||||
|
"id": f"template_{i:02d}",
|
||||||
|
"template": line.strip(),
|
||||||
|
"template_jinja": _jinja(line.strip()),
|
||||||
|
"template_format": "jinja2",
|
||||||
|
"source_id": "wassname_v2_candidate",
|
||||||
|
"source_type": "wassname anecdote / design note",
|
||||||
|
}
|
||||||
for i, line in enumerate(path.read_text().splitlines())
|
for i, line in enumerate(path.read_text().splitlines())
|
||||||
if line.strip()
|
if line.strip()
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
def _persona_pair_review_rows() -> list[dict[str, Any]]:
|
def _clamp01(x: float) -> float:
|
||||||
|
return max(0.0, min(1.0, x))
|
||||||
|
|
||||||
|
|
||||||
|
def _score(row: dict[str, Any]) -> float:
|
||||||
|
strict = float(row.get("strict_pass_rate") or 0.0)
|
||||||
|
axis = _clamp01(float(row.get("mean_axis_delta") or 0.0) / 8.0)
|
||||||
|
off_axis_clean = _clamp01((7.0 - float(row.get("mean_off_axis_problem") or 7.0)) / 6.0)
|
||||||
|
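# A measured delta of 0.0 is falsy, so test for None explicitly; only a missing value falls back to the worst case (6.0).
style_raw = row.get("mean_max_style_abs_delta")
style_clean = _clamp01((6.0 - float(6.0 if style_raw is None else style_raw)) / 6.0)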
|
||||||
|
echo_clean = _clamp01(1.0 - float(row.get("persona_echo_rate") or 0.0))
|
||||||
|
refusal_clean = _clamp01(1.0 - float(row.get("refusal_or_ai_break_rate") or 0.0))
|
||||||
|
return round(100.0 * strict * axis * off_axis_clean * style_clean * echo_clean * refusal_clean, 1)
|
||||||
|
|
||||||
|
|
||||||
|
def _jinja(template: str) -> str:
|
||||||
|
return template.replace("{persona}", "{{ persona }}")
|
||||||
|
|
||||||
|
|
||||||
|
def _source_type(source_id: str | None) -> str:
|
||||||
|
if source_id in {"repeng"}:
|
||||||
|
return "code / practitioner examples"
|
||||||
|
if source_id in {"assistant_axis", "persona_vectors", "weight_steering"}:
|
||||||
|
return "associated code / trait files"
|
||||||
|
if source_id in {"w2schar_in_house", "steer_heal_love", "wassname_v2_candidate"}:
|
||||||
|
return "wassname anecdote / design note"
|
||||||
|
if source_id:
|
||||||
|
return "source-listed candidate"
|
||||||
|
return "wassname anecdote / design note"
|
||||||
|
|
||||||
|
|
||||||
|
def _v2_error_counts() -> dict[tuple[str, str], int]:
|
||||||
|
out: dict[tuple[str, str], int] = {}
|
||||||
|
for row in _read_jsonl(DATA / "v2_pilot_seed23_examples.jsonl"):
|
||||||
|
key = (row.get("template"), row.get("persona_pair"))
|
||||||
|
if row.get("error"):
|
||||||
|
out[key] = out.get(key, 0) + 1
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def _persona_pairs_by_id() -> dict[str, dict[str, Any]]:
|
||||||
|
return {row["id"]: row for row in _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")}
|
||||||
|
|
||||||
|
|
||||||
|
def _template_pair_score_rows() -> list[dict[str, Any]]:
|
||||||
|
pairs = _persona_pairs_by_id()
|
||||||
|
errors = _v2_error_counts()
|
||||||
|
rows = []
|
||||||
|
for stat in _read_jsonl(DATA / "v2_pilot_seed23_template_pair_stats.jsonl"):
|
||||||
|
pair = pairs.get(stat["persona_pair"], {})
|
||||||
|
n_success = int(stat.get("n") or 0)
|
||||||
|
n_errors = errors.get((stat["template"], stat["persona_pair"]), 0)
|
||||||
|
score = _score(stat)
|
||||||
|
source_id = pair.get("source_id", "wassname_v2_candidate")
|
||||||
|
rows.append({
|
||||||
|
"id": f"{stat['persona_pair']}::{_slug(stat['template'])}",
|
||||||
|
"template_jinja": _jinja(stat["template"]),
|
||||||
|
"score": score,
|
||||||
|
"persona_pair_id": stat["persona_pair"],
|
||||||
|
"axis": f"{pair.get('neg', '')}->{pair.get('pos', '')}",
|
||||||
|
"source_id": source_id,
|
||||||
|
"source_type": _source_type(source_id),
|
||||||
|
"measurement_id": V2_PILOT_META["measurement_id"],
|
||||||
|
"template": stat["template"],
|
||||||
|
"template_format": "jinja2",
|
||||||
|
"pos_persona": pair.get("pos"),
|
||||||
|
"neg_persona": pair.get("neg"),
|
||||||
|
"positive_behavior": pair.get("positive_behavior"),
|
||||||
|
"negative_behavior": pair.get("negative_behavior"),
|
||||||
|
"score_formula": SCORE_FORMULA,
|
||||||
|
"recommended": bool(stat.get("recommended")),
|
||||||
|
"n_success": n_success,
|
||||||
|
"n_errors": n_errors,
|
||||||
|
"n_planned": n_success + n_errors,
|
||||||
|
"strict_pass_rate": stat.get("strict_pass_rate"),
|
||||||
|
"mean_axis_delta": stat.get("mean_axis_delta"),
|
||||||
|
"mean_off_axis_problem": stat.get("mean_off_axis_problem"),
|
||||||
|
"mean_max_style_abs_delta": stat.get("mean_max_style_abs_delta"),
|
||||||
|
"mean_abs_word_delta_frac": stat.get("mean_abs_word_delta_frac"),
|
||||||
|
"persona_echo_rate": stat.get("persona_echo_rate"),
|
||||||
|
"refusal_or_ai_break_rate": stat.get("refusal_or_ai_break_rate"),
|
||||||
|
"usable_rate": stat.get("usable_rate"),
|
||||||
|
**V2_PILOT_META,
|
||||||
|
})
|
||||||
|
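# Treat missing tie-breaker values as 0.0 so that None never gets compared with a float.
rows.sort(key=lambda r: (r["score"], r["strict_pass_rate"] or 0.0, r["mean_axis_delta"] or 0.0), reverse=True)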
|
||||||
|
return rows
|
||||||
|
|
||||||
|
|
||||||
|
def _slug(text: str) -> str:
|
||||||
|
out = "".join(ch.lower() if ch.isalnum() else "_" for ch in text)
|
||||||
|
while "__" in out:
|
||||||
|
out = out.replace("__", "_")
|
||||||
|
return out.strip("_")[:64]
|
||||||
|
|
||||||
|
|
||||||
|
def _template_score_rows(template_pair_scores: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||||
|
by_template: dict[str, list[dict[str, Any]]] = {}
|
||||||
|
for row in template_pair_scores:
|
||||||
|
by_template.setdefault(row["template"], []).append(row)
|
||||||
|
out = []
|
||||||
|
for template, rows in by_template.items():
|
||||||
|
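# Pick the best cell explicitly rather than relying on the caller's sort order.
best = max(rows, key=lambda r: float(r["score"]))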
|
||||||
|
measured = len(rows)
|
||||||
|
out.append({
|
||||||
|
"id": _slug(template),
|
||||||
|
"template_jinja": _jinja(template),
|
||||||
|
"score": round(sum(float(r["score"]) for r in rows) / measured, 1),
|
||||||
|
"best_score": best["score"],
|
||||||
|
"best_persona_pair_id": best["persona_pair_id"],
|
||||||
|
"source_id": "wassname_v2_candidate",
|
||||||
|
"source_type": "wassname anecdote / design note",
|
||||||
|
"measurement_id": V2_PILOT_META["measurement_id"],
|
||||||
|
"template": template,
|
||||||
|
"template_format": "jinja2",
|
||||||
|
"recommended_cell_count": sum(bool(r["recommended"]) for r in rows),
|
||||||
|
"measured_persona_pair_count": measured,
|
||||||
|
"mean_axis_delta": round(
|
||||||
|
sum(float(r["mean_axis_delta"] or 0) for r in rows) / measured, 4),
|
||||||
|
"mean_off_axis_problem": round(
|
||||||
|
sum(float(r["mean_off_axis_problem"] or 0) for r in rows) / measured, 4),
|
||||||
|
"mean_max_style_abs_delta": round(
|
||||||
|
sum(float(r["mean_max_style_abs_delta"] or 0) for r in rows) / measured, 4),
|
||||||
|
**V2_PILOT_META,
|
||||||
|
})
|
||||||
|
out.sort(key=lambda r: (r["best_score"], r["score"]), reverse=True)
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
def _persona_pair_review_rows(template_pair_scores: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||||
pairs = _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")
|
pairs = _read_jsonl(DATA / "persona_pairs_v2_candidates.jsonl")
|
||||||
pilot = _read_jsonl(DATA / "v2_pilot_seed23_template_pair_stats.jsonl")
|
|
||||||
by_pair: dict[str, list[dict[str, Any]]] = {}
|
by_pair: dict[str, list[dict[str, Any]]] = {}
|
||||||
for row in pilot:
|
for row in template_pair_scores:
|
||||||
by_pair.setdefault(row["persona_pair"], []).append(row)
|
by_pair.setdefault(row["persona_pair_id"], []).append(row)
|
||||||
|
|
||||||
out = []
|
out = []
|
||||||
for pair in pairs:
|
for pair in pairs:
|
||||||
@@ -80,10 +237,8 @@ def _persona_pair_review_rows() -> list[dict[str, Any]]:
|
|||||||
by_pair.get(pair["id"], []),
|
by_pair.get(pair["id"], []),
|
||||||
key=lambda r: (
|
key=lambda r: (
|
||||||
bool(r.get("recommended")),
|
bool(r.get("recommended")),
|
||||||
|
float(r.get("score") or 0),
|
||||||
float(r.get("strict_pass_rate") or 0),
|
float(r.get("strict_pass_rate") or 0),
|
||||||
float(r.get("mean_axis_delta") or 0),
|
|
||||||
-float(r.get("mean_off_axis_problem") or 99),
|
|
||||||
-float(r.get("mean_max_style_abs_delta") or 99),
|
|
||||||
),
|
),
|
||||||
reverse=True,
|
reverse=True,
|
||||||
)
|
)
|
||||||
@@ -99,7 +254,9 @@ def _persona_pair_review_rows() -> list[dict[str, Any]]:
|
|||||||
if best:
|
if best:
|
||||||
proof_summary = (
|
proof_summary = (
|
||||||
f"best_template={best['template']}; "
|
f"best_template={best['template']}; "
|
||||||
f"n={best['n']}; pass={best['strict_pass_rate']}; "
|
f"score={best['score']}; "
|
||||||
|
f"n_success={best['n_success']}; "
|
||||||
|
f"pass={best['strict_pass_rate']}; "
|
||||||
f"axis_delta={best['mean_axis_delta']}; "
|
f"axis_delta={best['mean_axis_delta']}; "
|
||||||
f"off_axis={best['mean_off_axis_problem']}; "
|
f"off_axis={best['mean_off_axis_problem']}; "
|
||||||
f"style={best['mean_max_style_abs_delta']}"
|
f"style={best['mean_max_style_abs_delta']}"
|
||||||
@@ -108,18 +265,22 @@ def _persona_pair_review_rows() -> list[dict[str, Any]]:
|
|||||||
proof_summary = "no measured v2 pilot rows yet"
|
proof_summary = "no measured v2 pilot rows yet"
|
||||||
|
|
||||||
out.append({
|
out.append({
|
||||||
|
"id": pair["id"],
|
||||||
"persona_pair": pair["id"],
|
"persona_pair": pair["id"],
|
||||||
"axis": f"{pair['neg']}->{pair['pos']}",
|
"axis": f"{pair['neg']}->{pair['pos']}",
|
||||||
|
"best_score": best.get("score"),
|
||||||
|
"proof_grade": proof_grade,
|
||||||
|
"source_id": pair.get("source_id"),
|
||||||
|
"source_type": _source_type(pair.get("source_id")),
|
||||||
"pos": pair["pos"],
|
"pos": pair["pos"],
|
||||||
"neg": pair["neg"],
|
"neg": pair["neg"],
|
||||||
"positive_behavior": pair["positive_behavior"],
|
"positive_behavior": pair["positive_behavior"],
|
||||||
"negative_behavior": pair["negative_behavior"],
|
"negative_behavior": pair["negative_behavior"],
|
||||||
"source_id": pair.get("source_id"),
|
|
||||||
"proof_grade": proof_grade,
|
|
||||||
"proof_summary": proof_summary,
|
"proof_summary": proof_summary,
|
||||||
"n_measured_template_cells": len(rows),
|
"n_measured_template_cells": len(rows),
|
||||||
"best_template": best.get("template"),
|
"best_template": best.get("template"),
|
||||||
"best_n": best.get("n"),
|
"best_template_jinja": _jinja(best["template"]) if best else None,
|
||||||
|
"best_n_success": best.get("n_success"),
|
||||||
"best_strict_pass_rate": best.get("strict_pass_rate"),
|
"best_strict_pass_rate": best.get("strict_pass_rate"),
|
||||||
"best_axis_delta": best.get("mean_axis_delta"),
|
"best_axis_delta": best.get("mean_axis_delta"),
|
||||||
"best_off_axis_problem": best.get("mean_off_axis_problem"),
|
"best_off_axis_problem": best.get("mean_off_axis_problem"),
|
||||||
@@ -186,11 +347,36 @@ measured pilot:
|
|||||||
|
|
||||||
The point is to measure whether the positive and negative personas separate the intended axis without mostly separating length, tone, confidence, refusal, or persona-echo.
|
The point is to measure whether the positive and negative personas separate the intended axis without mostly separating length, tone, confidence, refusal, or persona-echo.
|
||||||
|
|
||||||
|
For example, if the persona pair is `honest -> untruthful`, a useful template should make the completions differ on truthfulness. `in Paris` versus `in Berlin` is on-axis. `in Paris` versus `I refuse to answer` is not clean, because the pair is mostly separating answer/refusal behavior.
|
||||||
|
|
||||||
|
## Score
|
||||||
|
|
||||||
|
Start with `template_pair_scores`.
|
||||||
|
|
||||||
|
The main column is `score`, a conservative 0-100 clean-axis score, where `clamp` bounds its argument to [0, 1]:
|
||||||
|
|
||||||
|
```text
|
||||||
|
100
|
||||||
|
* strict_pass_rate
|
||||||
|
* clamp(mean_axis_delta / 8)
|
||||||
|
* clamp((7 - mean_off_axis_problem) / 6)
|
||||||
|
* clamp((6 - mean_max_style_abs_delta) / 6)
|
||||||
|
* (1 - persona_echo_rate)
|
||||||
|
* (1 - refusal_or_ai_break_rate)
|
||||||
|
```
|
||||||
|
|
||||||
|
High score means: the template/persona-pair cell repeatedly moved the intended axis, while the judge did not see much off-axis, style, persona-echo, or refusal movement.
|
||||||
|
|
||||||
|
Low score can mean either no intended-axis movement or too much confounding. Read the component columns before trusting the score.
|
||||||
|
|
||||||
## What To Browse
|
## What To Browse
|
||||||
|
|
||||||
Start with `persona_pairs_v2_review`.
|
1. `template_pair_scores`: clean selection table. Columns include `id`, `template_jinja`, `persona_pair_id`, `score`, source attribution, model metadata, and the score components.
|
||||||
|
2. `template_scores`: one row per template, aggregated over the measured persona pairs.
|
||||||
|
3. `persona_pairs_v2_review`: one row per candidate persona pair.
|
||||||
|
4. `v2_pilot_seed23_examples`: raw paired completions and judge ratings.
|
||||||
|
|
||||||
That table gives one row per persona pair:
|
`persona_pairs_v2_review` gives:
|
||||||
|
|
||||||
- `axis`: `neg->pos`
|
- `axis`: `neg->pos`
|
||||||
- `positive_behavior` / `negative_behavior`: what the pair should separate
|
- `positive_behavior` / `negative_behavior`: what the pair should separate
|
||||||
@@ -200,6 +386,12 @@ That table gives one row per persona pair:
|
|||||||
|
|
||||||
Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative completions and judge ratings.
|
Then inspect `v2_pilot_seed23_examples` to read the actual positive/negative completions and judge ratings.
|
||||||
|
|
||||||
|
## Measurement
|
||||||
|
|
||||||
|
This pilot uses `qwen/qwen3.5-27b` for completions and `google/gemini-3.1-flash-lite-preview` as the judge through OpenRouter. Generation temperature is `0.0`, with seed `23`, to reduce sampling noise.
|
||||||
|
|
||||||
|
The judge sees randomized A/B labels. It separately rates positive-axis behavior, negative-axis behavior, surface style, and off-axis/confound risk. This reduces simple position/framing bias, but it is still one automatic judge on one small pilot.
|
||||||
|
|
||||||
## Current Status
|
## Current Status
|
||||||
|
|
||||||
Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4 scenarios. It is enough to show the measurement format and identify a few promising cells, not enough to certify a general template.
|
Preliminary. The current pilot is small: 4 persona pairs x 4 templates x 4 scenarios. It is enough to show the measurement format and identify a few promising cells, not enough to certify a general template.
|
||||||
@@ -225,12 +417,27 @@ def main() -> None:
|
|||||||
|
|
||||||
tables = {name: _read_jsonl(path) for name, path in TABLE_SOURCES.items()}
|
tables = {name: _read_jsonl(path) for name, path in TABLE_SOURCES.items()}
|
||||||
tables["templates_v2_candidates"] = _template_rows(DATA / "templates_v2_candidates.txt")
|
tables["templates_v2_candidates"] = _template_rows(DATA / "templates_v2_candidates.txt")
|
||||||
tables["persona_pairs_v2_review"] = _persona_pair_review_rows()
|
tables["template_pair_scores"] = _template_pair_score_rows()
|
||||||
|
tables["template_scores"] = _template_score_rows(tables["template_pair_scores"])
|
||||||
|
tables["persona_pairs_v2_review"] = _persona_pair_review_rows(tables["template_pair_scores"])
|
||||||
|
|
||||||
for name, rows in tables.items():
|
for name, rows in tables.items():
|
||||||
_write_parquet(parquet_dir / f"{name}.parquet", rows)
|
_write_parquet(parquet_dir / f"{name}.parquet", rows)
|
||||||
|
|
||||||
names = sorted(tables)
|
names = [
|
||||||
|
"template_pair_scores",
|
||||||
|
"template_scores",
|
||||||
|
"persona_pairs_v2_review",
|
||||||
|
"templates_v2_candidates",
|
||||||
|
"persona_pairs_v2_candidates",
|
||||||
|
"scenarios_v2_candidates",
|
||||||
|
"v2_pilot_seed23_template_pair_stats",
|
||||||
|
"v2_pilot_seed23_template_stats",
|
||||||
|
"v2_pilot_seed23_examples",
|
||||||
|
"template_pair_stats",
|
||||||
|
"template_stats",
|
||||||
|
"examples",
|
||||||
|
]
|
||||||
(args.out / "README.md").write_text(_readme(names))
|
(args.out / "README.md").write_text(_readme(names))
|
||||||
print(f"built {args.out}")
|
print(f"built {args.out}")
|
||||||
for name in names:
|
for name in names:
|
||||||
|
|||||||