fix: keep all 438 rows in DD eval (both to_do and not_to_do per dilemma)

The old filter used the action-specific honesty flags (to_do_has_positive_honesty / to_do_has_negative_honesty for to_do rows, and the not_to_do_* equivalents for not_to_do rows), which kept 194 to_do + 33 not_to_do = 227 rows, 85% of them to_do. The dataset has honesty_label=+/-1 for all 438 rows via symmetric labeling (if to_do is honest, not_to_do gets -1 regardless of its own values). Filtering to honesty_label != 0 keeps all 438 rows and pairs both sides of every dilemma, matching AntiPaSTO's symmetric evaluation approach.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
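
A minimal sketch of why the two filters keep different rows. The field names come from the diff; the toy rows and the helper names (make_pair, NO_FLAGS, old_filter, new_filter) are hypothetical and only illustrate the idea. They are not taken from the dataset.

```python
# Toy illustration only: values are invented, field names follow the diff.
NO_FLAGS = {
    "to_do_has_positive_honesty": False,
    "to_do_has_negative_honesty": False,
    "not_to_do_has_positive_honesty": False,
    "not_to_do_has_negative_honesty": False,
}


def make_pair(idx, to_do_label, flags):
    """Build both rows of one dilemma with symmetric (+1 / -1) labels."""
    base = {"dilemma_idx": idx, **flags}
    return [
        {**base, "action_type": "to_do", "honesty_label": to_do_label},
        {**base, "action_type": "not_to_do", "honesty_label": -to_do_label},
    ]


rows = (
    # Dilemma 0: only the to_do side mentions honesty in its own values.
    make_pair(0, 1.0, {**NO_FLAGS, "to_do_has_positive_honesty": True})
    # Dilemma 1: both sides mention honesty in their own values.
    + make_pair(1, -1.0, {**NO_FLAGS,
                          "to_do_has_negative_honesty": True,
                          "not_to_do_has_positive_honesty": True})
)


def old_filter(x):
    # Flag-based filter, mirroring the removed _is_honesty_row.
    if x["action_type"] == "to_do":
        return x["to_do_has_positive_honesty"] or x["to_do_has_negative_honesty"]
    return x["not_to_do_has_positive_honesty"] or x["not_to_do_has_negative_honesty"]


def new_filter(x):
    # Label-based filter, mirroring the added lambda.
    return x["honesty_label"] != 0


old_kept = [(r["dilemma_idx"], r["action_type"]) for r in rows if old_filter(r)]
new_kept = [(r["dilemma_idx"], r["action_type"]) for r in rows if new_filter(r)]
print("old:", old_kept)
print("new:", new_kept)
```

In this toy case the flag-based filter drops the not_to_do row of dilemma 0, leaving that dilemma unpaired, while the label-based filter keeps both rows of every dilemma.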
2026-06-27 18:27:18 +08:00 · 2026-04-29 05:58:20 +08:00
parent 93334c5889
commit 44e16b0c9a
1 changed file with 6 additions and 13 deletions
@@ -119,23 +119,16 @@ def _format_row(row: dict, tok, max_tokens: int, system_prompt: str = "") -> dic
 def _load_eval(tok, n_dilemmas: int, max_tokens: int, system_prompt: str = ""):
    """Returns (raw_ds, torch_ds, honesty_labels[(dilemma_idx, action_type)]).

-    Filters to honesty-relevant rows only, using action-specific honesty
-    flags. The wassname/daily_dilemmas-self-honesty dataset uses
-    paired-opposite labels: if to_do has honesty in you_values -> to_do=+1,
-    paired not_to_do=-1 even when not_to_do's own values are e.g.
-    ['empathy'], unrelated to honesty. Those filler rows make SI go
-    negative for any method that correctly leaves them unaffected.
+    All 438 rows in the dataset have honesty_label = ±1.0 (symmetric labeling:
+    if to_do has honesty in party='You' values → to_do=+1, not_to_do=-1).
+    Filter keeps every row with a nonzero label, which is all 438, giving both
+    to_do and not_to_do for all 219 dilemmas.
    """
    ds = load_dataset("wassname/daily_dilemmas-self-honesty",
                      "honesty_eval", split="test")
    n_before = len(ds)
-
-    def _is_honesty_row(x):
-        if x["action_type"] == "to_do":
-            return x["to_do_has_positive_honesty"] or x["to_do_has_negative_honesty"]
-        return x["not_to_do_has_positive_honesty"] or x["not_to_do_has_negative_honesty"]
-    ds = ds.filter(_is_honesty_row)
-    logger.debug(f"honesty-only filter: {len(ds)}/{n_before} rows kept")
+    ds = ds.filter(lambda x: x["honesty_label"] != 0)
+    logger.debug(f"honesty filter: {len(ds)}/{n_before} rows kept")
    honesty_labels = {(r["dilemma_idx"], r["action_type"]): r["honesty_label"]
                      for r in ds}
    keep = set(sorted(set(ds["dilemma_idx"]))[:n_dilemmas])