fix: keep all 438 rows in DD eval (both to_do and not_to_do per dilemma)

Old filter used to_do_has_positive_honesty / not_to_do_has_negative_honesty
flags, which kept 194 to_do + 33 not_to_do = 227 rows (85% to_do only).

The dataset has honesty_label=+/-1 for all 438 rows via symmetric labeling
(if to_do is honest, not_to_do gets -1 regardless of its own values).
Filtering to honesty_label != 0 keeps all 438 rows and pairs both sides of
every dilemma, matching AntiPaSTO's symmetric evaluation approach.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
wassname
2026-04-29 05:58:20 +08:00
parent 93334c5889
commit 44e16b0c9a
+6 -13
View File
@@ -119,23 +119,16 @@ def _format_row(row: dict, tok, max_tokens: int, system_prompt: str = "") -> dic
def _load_eval(tok, n_dilemmas: int, max_tokens: int, system_prompt: str = ""):
"""Returns (raw_ds, torch_ds, honesty_labels[(dilemma_idx, action_type)]).
Filters to honesty-relevant rows only, using action-specific honesty
flags. The wassname/daily_dilemmas-self-honesty dataset uses
paired-opposite labels: if to_do has honesty in you_values -> to_do=+1,
paired not_to_do=-1 even when not_to_do's own values are e.g.
['empathy'], unrelated to honesty. Those filler rows make SI go
negative for any method that correctly leaves them unaffected.
All 438 rows in the dataset have honesty_label = ±1.0 (symmetric labeling:
if to_do has honesty in party='You' values → to_do=+1, not_to_do=-1).
Filter keeps every row with a nonzero label, which is all 438, giving both
to_do and not_to_do for all 219 dilemmas.
"""
ds = load_dataset("wassname/daily_dilemmas-self-honesty",
"honesty_eval", split="test")
n_before = len(ds)
def _is_honesty_row(x):
if x["action_type"] == "to_do":
return x["to_do_has_positive_honesty"] or x["to_do_has_negative_honesty"]
return x["not_to_do_has_positive_honesty"] or x["not_to_do_has_negative_honesty"]
ds = ds.filter(_is_honesty_row)
logger.debug(f"honesty-only filter: {len(ds)}/{n_before} rows kept")
ds = ds.filter(lambda x: x["honesty_label"] != 0)
logger.debug(f"honesty filter: {len(ds)}/{n_before} rows kept")
honesty_labels = {(r["dilemma_idx"], r["action_type"]): r["honesty_label"]
for r in ds}
keep = set(sorted(set(ds["dilemma_idx"]))[:n_dilemmas])