Reject affective loop completions

2026-06-27 16:47:16 +08:00 · 2026-06-24 20:48:40 +08:00
parent ea89a0ee35
commit 22fd4b8dbe
3 changed files with 104 additions and 1 deletions
@@ -0,0 +1,35 @@
 ## Code Review: `rep_frac` affect-loop detection gate
 ### Summary
 Patches `rep_frac` with a diffuse affect-loop detector that catches low-diversity affective roleplay mush ("oh my goodness / my darling / sweet" loops) that evaded the existing n-gram repetition detector. When a long completion (>=128 lex words) combines a high affect-word ratio with low lexical diversity and at least one of high compressibility, dense roleplay punctuation, or heavy capitalization, `rep_frac` returns 1.0, which triggers the existing `rep_tau` rejection. The T1 verification log confirms this flips task-181 junk from kept to rejected while coherent hand examples still pass.
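 The two text-shape signals behind that rule, lexical diversity and compressibility, can be sketched as follows. This is a minimal illustration, not the code in `filter.py`: the tokenizer regex and the zlib-based definition of `compressed_frac` are assumptions, since the diff does not show how `lex_words` or `compressed_frac` are computed.

```python
import re
import zlib


def lex_words(text: str) -> list[str]:
    # Assumed tokenizer: runs of lowercase letters (apostrophes allowed).
    return re.findall(r"[a-z']+", text.lower())


def unique_frac(words: list[str]) -> float:
    # Distinct words / total words; low values mean low lexical diversity.
    return len(set(words)) / max(len(words), 1)


def compressed_frac(text: str) -> float:
    # Assumed definition: zlib-compressed size / raw size; low = repetitive.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)


# Hypothetical inputs: a diffuse affect loop and an ordinary varied sentence.
loop = "oh my darling my sweet " * 40
varied = "the quick brown fox jumps over the lazy dog near a quiet river bank at dawn"

print(unique_frac(lex_words(loop)), compressed_frac(loop))
print(unique_frac(lex_words(varied)), compressed_frac(varied))
```

 The loop text scores far lower on both ratios than the varied sentence, which is the separation the gate relies on.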
 ### Important (should fix)
 - **`src/steer_heal/filter.py:62-63` — `AFFECT_LOOP_WORDS` includes `"love"`, the target trait**: The word `"love"` is in the affect-word set, so every genuine love declaration contributes to `affect_frac`. At the required thresholds (>=0.25 or >=0.35), this is very unlikely to flag legitimate text (a normal 128-word declaration has `affect_frac` ~0.01-0.02), and the verification log confirms hand examples pass. However, the `"love"` inclusion creates a long-tail risk: if a future model produces diverse-but-passionate declarations that happen to use affect words at 25%+ density, they could be silently dropped. Consider whether `"love"` needs to be in the set given it's the signal we want.
 - **`src/steer_heal/filter.py:73-74` — Short-text early return**: The word n-gram loop returns 1.0 immediately when any n-gram level produces empty grams (e.g., a 3-word text at n=4). This is pre-existing code, but as a result the affect-loop gates that follow never run on short completions. This is correct behavior (short completions ARE degenerate), but the interaction with the new affect-loop gates is undocumented. Consider adding a comment noting that the short-circuit is intentional and that the affect-loop gates sit inside the `>=64`-word block and themselves require `>=128` lex words anyway.
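 The `"love"` concern can be made concrete with a small calculation. The word set below is copied from the `AFFECT_LOOP_WORDS` hunk in this commit, which does not contain `"love"`; the tokenizer regex and the sample sentence are illustrative assumptions.

```python
import re

AFFECT_LOOP_WORDS = {
    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
    "sitting", "here",
}


def affect_frac(text: str, vocab: set[str]) -> float:
    # Share of words that belong to the affect vocabulary.
    words = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    return sum(w in vocab for w in words) / max(len(words), 1)


# Hypothetical 14-word declaration that uses the target word twice.
declaration = "I love you more than words can say and I will love you always"

without_love = affect_frac(declaration, AFFECT_LOOP_WORDS)
with_love = affect_frac(declaration, AFFECT_LOOP_WORDS | {"love"})
print(without_love, with_love)
```

 With `"love"` in the set, every use of the target word counts toward the 0.25 and 0.35 thresholds; without it, this sentence contributes nothing.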
 ### Suggestions
 - **`src/steer_heal/filter.py:88-93` — Caps-heavy gate threshold is the broadest**: The third gate (`caps_frac >= 0.15, affect_frac >= 0.25, unique_frac < 0.55`) has the loosest `unique_frac` threshold (0.55 vs 0.45/0.50 in the other two). A completion with moderate caps (15% uppercase, e.g., proper nouns and emphasis), 25% affect words, and just under 55% unique words could be flagged. The verification log shows hand examples pass, so this is fine in practice. Worth noting in a comment why this gate has a looser `unique_frac` threshold than the others.
 - **`src/steer_heal/filter.py:83` — `affect_frac`, `punct_frac`, `caps_frac` computed for all >=64-word texts**: These are computed even for texts in the 64-127 word range where they're not used (the gates require >=128). This is harmless overhead but slightly misleading when reading the code. Consider either moving them inside the `>=128` guard or adding a brief comment.
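 To give a feel for the `caps_frac >= 0.15` threshold, the sketch below applies the same ratio used in the patch (uppercase letters over alphabetic characters) to two hypothetical strings chosen for illustration.

```python
def caps_frac(text: str) -> float:
    # Uppercase letters divided by all alphabetic characters.
    n_alpha = sum(ch.isalpha() for ch in text)
    return sum(ch.isupper() for ch in text) / max(n_alpha, 1)


sentence_case = "Oh my darling"  # 1 uppercase letter out of 11
title_case = "Oh My Darling"     # 3 uppercase letters out of 11

print(caps_frac(sentence_case))
print(caps_frac(title_case))
```

 In this example the title-cased string clears 0.15 on caps alone, so the affect and diversity conditions are what keep the gate from firing on ordinary emphasis.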
 ### Positive
 - **`src/steer_heal/filter.py:69-70` — Orthogonal signal use**: The affect-loop detection uses three gates. Each requires affect-word density and low lexical diversity, plus one further signal: compressibility, punctuation density, or caps density. This multi-signal design reduces false positives: a legitimate text with high caps won't trigger unless it's also low-diversity and affect-heavy.
 - **`src/steer_heal/filter.py:63` — `AFFECT_LOOP_WORDS` is case-matched to lowercase**: Since `lex_words` comes from `text_lc` (already lowercased), the set membership check is correct. No case-sensitivity bug.
 - **R2 compliance**: No new config knob, no fallback logic, the gate lives inside `rep_frac` and feeds the existing `keep = rep < rep_tau` path. Exactly as required.
 - **`docs/spec/20260624_love_filter_tighten_requeue.md` — Spec-driven verification**: The T1 verification log at `/tmp/steer_heal_love_filter_tighten_verify.log` provides concrete evidence: round-0 old-kept drops from 81 to 36, round-1 from 91 to 4, round-2 from 90 to 0. The representative rejected rows clearly show the "oh my goodness / my darling / sweet" loops being caught.
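 The three new gates can be restated as a single predicate over precomputed signals. The thresholds are taken from the patch; the `Signals` container and the `is_affect_loop` name are hypothetical, and the real code inlines these checks inside `rep_frac`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    # Hypothetical container for the per-completion ratios.
    n_lex: int
    affect_frac: float
    unique_frac: float
    compressed_frac: float
    punct_frac: float
    caps_frac: float


def is_affect_loop(s: Signals) -> bool:
    # All three gates apply only to long completions.
    if s.n_lex < 128:
        return False
    dense = s.affect_frac >= 0.35 and s.unique_frac < 0.45 and s.compressed_frac < 0.52
    punct = s.punct_frac >= 0.035 and s.affect_frac >= 0.25 and s.unique_frac < 0.50
    caps = s.caps_frac >= 0.15 and s.affect_frac >= 0.25 and s.unique_frac < 0.55
    return dense or punct or caps


# Hypothetical signal values for three kinds of completion.
shouty_but_diverse = Signals(200, 0.02, 0.70, 0.60, 0.000, 0.30)
affect_loop = Signals(200, 0.40, 0.30, 0.40, 0.050, 0.02)
short_text = Signals(100, 0.90, 0.10, 0.10, 0.100, 0.50)

print(is_affect_loop(shouty_but_diverse), is_affect_loop(affect_loop), is_affect_loop(short_text))
```

 High caps alone does not trip the predicate; each gate also needs affect density and low lexical diversity.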
 ### Verdict
 **APPROVE** (for the T2 filter changes). The affect-loop detection is well-designed and verified. The remaining tasks (T3 fast-dev-run, T4 commit+push+enqueue) are not yet done. The `"love"` word in `AFFECT_LOOP_WORDS` is worth a second look but is very unlikely to cause issues at the current thresholds.
 ## Triage
 - Accepted: removed `"love"` from `AFFECT_LOOP_WORDS`; it is the target signal and was unnecessary for catching the observed loops.
 - Rejected for now: adding more comments around pre-existing short-completion behavior and caps thresholds. The code already fails short completions via the old n-gram guard, and the caps gate is constrained by affect density plus low lexical diversity.
 - Reverified after the accepted change: `/tmp/steer_heal_love_filter_tighten_verify2.log` and `/tmp/steer_heal_love_filter_tighten_fast2.log`.
@@ -0,0 +1,52 @@
 # Love Filter Tighten Requeue
 ## Goal
 Tighten the love-demo completion filter so the next queued run does not train on low-diversity affective junk that base PPL accepts, then requeue the same last-good KL recipe at lowest priority.
 ## Scope
 In: `src/steer_heal/filter.py`, task-181 saved events, compile/fast-dev verification, pueue enqueue.
 Out: changing the loss, adding a new hyperparameter, changing generation sampling, or redesigning the love persona.
 ## Requirements
 - R1: Reject more junk using the existing `rep_tau` gate. Done means: old task-181 kept samples with "oh my goodness / my darling / sweet" loops now score `rep >= 0.3`. VERIFY: rescoring `out/20260624T144031_gemma-3-4b-it_kl_rev_s42/events.jsonl` prints old-kept/new-pass counts by round and representative rejected rows.
 - R2: Keep the filter simple and fail-fast. Done means: no new config knob, no fallback, no gen-time repetition penalty hiding the signal from walk-C. VERIFY: code inspection shows the gate is inside `rep_frac` and still feeds the existing `keep = rep < rep_tau` decision.
 - R3: Requeue the love run at lowest priority. Done means: `pueue status --json` shows a queued task on branch `dv` with priority `0` and a label stating why/resolve. VERIFY: compact status table includes the new task.
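 A rescoring pass of the kind R1 asks for might look like the sketch below. The `round`, `keep`, and `text` field names are assumptions about the `events.jsonl` schema, and `toy_rep` is a stand-in for the real `rep_frac`.

```python
import json
from collections import defaultdict
from typing import Callable, Iterable


def rescore(
    lines: Iterable[str],
    rep_fn: Callable[[str], float],
    rep_tau: float = 0.3,
) -> dict[int, tuple[int, int]]:
    """Return {round: (old_kept, new_pass)} for rows the old filter kept."""
    counts: dict[int, list[int]] = defaultdict(lambda: [0, 0])
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if not event.get("keep"):  # assumed field: old keep decision
            continue
        bucket = counts[event["round"]]  # assumed field: round index
        bucket[0] += 1
        if rep_fn(event["text"]) < rep_tau:  # same comparison as keep = rep < rep_tau
            bucket[1] += 1
    return {r: (old, new) for r, (old, new) in sorted(counts.items())}


def toy_rep(text: str) -> float:
    # Stand-in scorer for the demo only; not the real rep_frac.
    return 1.0 if "darling" in text else 0.05


demo_lines = [
    json.dumps({"round": 0, "keep": True, "text": "oh my darling my sweet"}),
    json.dumps({"round": 0, "keep": True, "text": "a calm and varied sentence"}),
    json.dumps({"round": 1, "keep": False, "text": "previously rejected row"}),
]
result = rescore(demo_lines, toy_rep)
print(result)
```

 In practice the lines would come from the task-181 `events.jsonl` and `rep_fn` would be the patched `rep_frac`.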
 ## Tasks
 - [x] T1 (R1): Measure the shape of task-181 junk.
  - verify: script over task-181 `events.jsonl`.
  - success: metrics identify old-kept rows with low lexical diversity / repeated affect tokens / roleplay punctuation.
  - likely_fail: metrics only catch the exact previous row.
  - sneaky_fail: the new gate rejects every ordinary love declaration too.
  - UAT: saved verification log with old/new counts and sample rows.
 - [x] T2 (R1,R2): Patch `rep_frac` with a stricter quality gate.
  - verify: `uv run python -m compileall src/steer_heal` and rescoring script.
  - success: r1/r2 old-kept junk mostly flips to rejected; coherent hand examples remain below `rep_tau`.
  - likely_fail: threshold is inert because `ppl_tau` was the real issue.
  - sneaky_fail: extra gate is too love-demo-specific and kills valid affectionate text.
  - UAT: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
 - [x] T3 (R2): Run the fast dev path.
  - verify: `just fast-dev-run ... | tee /tmp/steer_heal_love_filter_tighten_fast.log | tail -80`.
  - success: tiny run completes, proving the real pipeline still executes.
  - likely_fail: tiny random text trips the stricter gate and starves training.
  - sneaky_fail: compile passes but the adaptive gen/filter path is broken.
  - UAT: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
 - [/] T4 (R3): Commit, push, and enqueue at priority 0.
  - verify: `git log -1 --oneline`, `git status --short`, `pueue status --json`.
  - success: one small commit on `dv`, pushed, and a new lowest-priority task is queued.
  - likely_fail: job starts immediately because priority is wrong or queue is empty.
  - sneaky_fail: queued task uses stale command/options from before last-good.
  - UAT: compact pueue status row.
 ## Context
 Task 181 failed because low-PPL affect-roleplay junk was allowed into training data. Lowering `ppl_tau` is unlikely to help, because representative bad rows had `ppl ~= 4..13`. A text-shape gate is the cheap discriminant.
 ## Log
 - 2026-06-24: Starting from commit `ea89a0e` on branch `dv`; worktree has pre-existing dirty files.
 - 2026-06-24: Task-181 old-kept rows had low lexical diversity and affect-token density. Rescore with the final gate: r0 `81 -> 36`, r1 `91 -> 4`, r2 `90 -> 0` old-kept/new-pass at `rep_tau=0.3`; hand examples scored `0.036..0.050` and passed. Evidence: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
 - 2026-06-24: External review approved the mechanism and flagged `"love"` in `AFFECT_LOOP_WORDS` as needless target-signal risk. Removed it and reverified with unchanged counts. Review: `docs/reviews/20260624_love_filter_tighten_code.md`.
 - 2026-06-24: Final fast-dev run passed on the tiny-random path. Evidence: `/tmp/steer_heal_love_filter_tighten_fast2.log`; report: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
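 The counts in the log can be turned into per-round flip rates, meaning the share of old-kept rows that the final gate now rejects. The numbers are the ones reported above; only the helper name is new.

```python
# (old_kept, new_pass) per round at rep_tau=0.3, as recorded in the log.
counts = {"r0": (81, 36), "r1": (91, 4), "r2": (90, 0)}


def flip_rate(old_kept: int, new_pass: int) -> float:
    # Fraction of previously kept rows that the new gate rejects.
    return (old_kept - new_pass) / old_kept


for name, (old, new) in counts.items():
    print(f"{name}: {old} -> {new}, rejected {flip_rate(old, new):.1%}")
```

 That works out to roughly 56% of old-kept rows rejected in round 0, 96% in round 1, and 100% in round 2.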
 ## Errors
 | Task | Error | Resolution |
 |------|-------|------------|
@@ -36,6 +36,12 @@ REFUSAL = (
    "i'm an ai", "i am an ai", "i don't have personal opinions",
 )
 AFFECT_LOOP_WORDS = {
    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
    "sitting", "here",
 }
 def rep_frac(text: str) -> float:
    """Max most-repeated n-gram fraction over n in {2,3,4}; ~1.0 = degenerate looping/too short.
@@ -46,7 +52,8 @@ def rep_frac(text: str) -> float:
    Diffuse affect loops ("my sweet / my darling / oh my goodness") can evade the single-top-gram
    fraction because no one exact n-gram dominates. Treat long, low-lexical-diversity, compressible
-    completions as repetition too; this keeps the existing rep_tau gate load-bearing (#181 audit).
+    completions, and long affective roleplay mush, as repetition too; this keeps the existing
    rep_tau gate load-bearing (#181 audit).
    """
    words = text.split()
    best = 0.0
@@ -75,6 +82,15 @@ def rep_frac(text: str) -> float:
        top_trigram_n = Counter(trigrams).most_common(1)[0][1]
        if unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8):
            return 1.0
        affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
        punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
        caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
        if len(lex_words) >= 128 and affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52:
            return 1.0
        if len(lex_words) >= 128 and punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50:
            return 1.0
        if len(lex_words) >= 128 and caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55:
            return 1.0
    return best