diff --git a/docs/reviews/20260624_love_filter_tighten_code.md b/docs/reviews/20260624_love_filter_tighten_code.md
new file mode 100644
index 0000000..0990495
--- /dev/null
+++ b/docs/reviews/20260624_love_filter_tighten_code.md
@@ -0,0 +1,35 @@
+Let me try to get the original version of `filter.py` from HEAD to compare:Now I have all the context I need. Let me verify one more thing about how `rep_frac` integrates with the filter pipeline:I now have all the context needed. Let me compile the review.
+
+---
+
+## Code Review: `rep_frac` affect-loop detection gate
+
+### Summary
+Patches `rep_frac` with a diffuse affect-loop detector that catches low-diversity, affective roleplay mush ("oh my goodness / my darling / sweet" loops) that evaded the existing n-gram repetition detector. When a long completion (>=128 lex words) has high affect-word ratios combined with low lexical diversity and high compressibility, `rep_frac` returns 1.0, triggering the existing `rep_tau` rejection. The T1 verification log confirms this flips task-181 junk from kept to rejected while passing coherent hand examples.
+
+### Important (should fix)
+
+- **`src/steer_heal/filter.py:62-63` - `AFFECT_LOOP_WORDS` includes `"love"`, the target trait**: The word `"love"` is in the affect-word set, so every genuine love declaration contributes to `affect_frac`. At the required thresholds (>=0.25 or >=0.35), this is very unlikely to flag legitimate text (a normal 128-word declaration has `affect_frac` ~0.01-0.02), and the verification log confirms hand examples pass. However, the `"love"` inclusion creates a long-tail risk: if a future model produces diverse-but-passionate declarations that happen to use affect words at 25%+ density, they could be silently dropped. Consider whether `"love"` needs to be in the set given it's the signal we want.
+
+- **`src/steer_heal/filter.py:73-74` - Short-text early return**: The word n-gram loop returns 1.0 immediately when any n-gram level produces empty grams (e.g., a 3-word text at n=4). This is pre-existing code, but the affect-loop gates that follow will never run on short completions as a result. This is correct behavior (short completions ARE degenerate), but the interaction with the new affect-loop gates is undocumented. Consider adding a comment noting the short-circuit is intentional and the affect-loop gates are gated on `>=64` words anyway.
+
+### Suggestions
+
+- **`src/steer_heal/filter.py:88-93` - Caps-heavy gate threshold is the broadest**: The third gate (`caps_frac >= 0.15, affect_frac >= 0.25, unique_frac < 0.55`) has the loosest `unique_frac` threshold (0.55 vs 0.45/0.50 in the other two). A completion with moderate caps (15% uppercase, e.g., proper nouns and emphasis) and 25% affect words but 55% unique words could be flagged. The verification log shows hand examples pass, so this is fine in practice. Worth noting in a comment why this gate has looser thresholds than the others.
+
+- **`src/steer_heal/filter.py:83` - `affect_frac`, `punct_frac`, `caps_frac` computed for all >=64-word texts**: These are computed even for texts in the 64-127 word range where they're not used (the gates require >=128). This is harmless overhead but slightly misleading when reading the code. Consider either moving them inside the `>=128` guard or adding a brief comment.
+
+### Positive
+
+- **`src/steer_heal/filter.py:69-70` - Orthogonal signal use**: The affect-loop detection uses three orthogonal signals (affect-word ratio, punctuation density, caps density) each combined with lexical diversity and compressibility. This multi-signal design reduces false positives: a legitimate text with high caps won't trigger unless it's also low-diversity and affect-heavy.
+- **`src/steer_heal/filter.py:63` - `AFFECT_LOOP_WORDS` is case-matched to lowercase**: Since `lex_words` comes from `text_lc` (already lowercased), the set membership check is correct. No case-sensitivity bug.
+- **R2 compliance**: No new config knob, no fallback logic, the gate lives inside `rep_frac` and feeds the existing `keep = rep < rep_tau` path. Exactly as required.
+- **`docs/spec/20260624_love_filter_tighten_requeue.md` - Spec-driven verification**: The T1 verification log at `/tmp/steer_heal_love_filter_tighten_verify.log` provides concrete evidence: round-0 old-kept drops from 81 to 36, round-1 from 91 to 4, round-2 from 90 to 0. The representative rejected rows clearly show the "oh my goodness / my darling / sweet" loops being caught.
+
+### Verdict
+**APPROVE** (for the T2 filter changes). The affect-loop detection is well-designed and verified. The remaining tasks (T3 fast-dev-run, T4 commit+push+enqueue) are not yet done. The `"love"` word in `AFFECT_LOOP_WORDS` is worth a second look but is very unlikely to cause issues at the current thresholds.
+
+## Triage
+- Accepted: removed `"love"` from `AFFECT_LOOP_WORDS`; it is the target signal and was unnecessary for catching the observed loops.
+- Rejected for now: adding more comments around pre-existing short-completion behavior and caps thresholds. The code already fails short completions via the old n-gram guard, and the caps gate is constrained by affect density plus low lexical diversity.
+- Reverified after the accepted change: `/tmp/steer_heal_love_filter_tighten_verify2.log` and `/tmp/steer_heal_love_filter_tighten_fast2.log`.
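
The three affect-loop gates discussed in the review above can be sketched in isolation as follows. This is a minimal illustration, not the code in `filter.py`: the function name `affect_loop_gate` is hypothetical, and the definitions of `lex_words`, `unique_frac`, and `compressed_frac` are assumptions (lowercased alphabetic tokens, unique-token ratio, and zlib compressed-size ratio), since the patch does not show how the real module computes them. The word set (without `"love"`) and the thresholds are copied from the patch.

```python
import re
import zlib

# Word set as committed after triage (no "love").
AFFECT_LOOP_WORDS = {
    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
    "sitting", "here",
}


def affect_loop_gate(text: str) -> bool:
    """Return True when any of the three affect-loop gates would fire.

    Hypothetical stand-alone helper; the real gates live inside `rep_frac`
    and return 1.0 instead of a boolean.
    """
    # ASSUMPTION: lexical words are lowercased alphabetic tokens.
    lex_words = re.findall(r"[a-z']+", text.lower())
    if len(lex_words) < 128:
        # All three gates require at least 128 lexical words.
        return False

    # ASSUMPTION: lexical diversity is the unique-token ratio.
    unique_frac = len(set(lex_words)) / len(lex_words)
    # ASSUMPTION: compressibility is zlib compressed size over raw size.
    raw = text.encode("utf-8")
    compressed_frac = len(zlib.compress(raw)) / max(len(raw), 1)

    # These three expressions match the patch.
    affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
    punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
    caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)

    # Gate 1: affect-dense, low-diversity, compressible text.
    if affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52:
        return True
    # Gate 2: roleplay punctuation plus moderate affect density.
    if punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50:
        return True
    # Gate 3: caps-heavy text plus moderate affect density.
    if caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55:
        return True
    return False
```

Every gate requires `affect_frac >= 0.25`, so text with few affect words cannot be rejected by punctuation or capitalisation alone, whatever definitions are used for the assumed metrics.
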
diff --git a/docs/spec/20260624_love_filter_tighten_requeue.md b/docs/spec/20260624_love_filter_tighten_requeue.md
new file mode 100644
index 0000000..c1413a3
--- /dev/null
+++ b/docs/spec/20260624_love_filter_tighten_requeue.md
@@ -0,0 +1,52 @@
+# Love Filter Tighten Requeue
+
+## Goal
+Tighten the love-demo completion filter so the next queued run does not train on low-diversity affective junk that base PPL accepts, then requeue the same last-good KL recipe at lowest priority.
+
+## Scope
+In: `src/steer_heal/filter.py`, task-181 saved events, compile/fast-dev verification, pueue enqueue.
+Out: changing the loss, adding a new hyperparameter, changing generation sampling, or redesigning the love persona.
+
+## Requirements
+- R1: Reject more junk using the existing `rep_tau` gate. Done means: old task-181 kept samples with "oh my goodness / my darling / sweet" loops now score `rep >= 0.3`. VERIFY: rescoring `out/20260624T144031_gemma-3-4b-it_kl_rev_s42/events.jsonl` prints old-kept/new-pass counts by round and representative rejected rows.
+- R2: Keep the filter simple and fail-fast. Done means: no new config knob, no fallback, no gen-time repetition penalty hiding the signal from walk-C. VERIFY: code inspection shows the gate is inside `rep_frac` and still feeds the existing `keep = rep < rep_tau` decision.
+- R3: Requeue the love run at lowest priority. Done means: `pueue status --json` shows a queued task on branch `dv` with priority `0` and a label stating why/resolve. VERIFY: compact status table includes the new task.
+
+## Tasks
+- [x] T1 (R1): Measure the shape of task-181 junk.
+  - verify: script over task-181 `events.jsonl`.
+  - success: metrics identify old-kept rows with low lexical diversity / repeated affect tokens / roleplay punctuation.
+  - likely_fail: metrics only catch the exact previous row.
+  - sneaky_fail: the new gate rejects every ordinary love declaration too.
+  - UAT: saved verification log with old/new counts and sample rows.
+- [x] T2 (R1,R2): Patch `rep_frac` with a stricter quality gate.
+  - verify: `uv run python -m compileall src/steer_heal` and rescoring script.
+  - success: r1/r2 old-kept junk mostly flips to rejected; coherent hand examples remain below `rep_tau`.
+  - likely_fail: threshold is inert because `ppl_tau` was the real issue.
+  - sneaky_fail: extra gate is too love-demo-specific and kills valid affectionate text.
+  - UAT: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
+- [x] T3 (R2): Run the fast dev path.
+  - verify: `just fast-dev-run ... | tee /tmp/steer_heal_love_filter_tighten_fast.log | tail -80`.
+  - success: tiny run completes, proving the real pipeline still executes.
+  - likely_fail: tiny random text trips the stricter gate and starves training.
+  - sneaky_fail: compile passes but the adaptive gen/filter path is broken.
+  - UAT: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
+- [/] T4 (R3): Commit, push, and enqueue at priority 0.
+  - verify: `git log -1 --oneline`, `git status --short`, `pueue status --json`.
+  - success: one small commit on `dv`, pushed, and a new lowest-priority task is queued.
+  - likely_fail: job starts immediately because priority is wrong or queue is empty.
+  - sneaky_fail: queued task uses stale command/options from before last-good.
+  - UAT: compact pueue status row.
+
+## Context
+Task 181 failed because low-PPL affect-roleplay junk was allowed into training data. Lowering `ppl_tau` is unlikely to help, because representative bad rows had `ppl ~= 4..13`. A text-shape gate is the cheap discriminant.
+
+## Log
+- 2026-06-24: Starting from commit `ea89a0e` on branch `dv`; worktree has pre-existing dirty files.
+- 2026-06-24: Task-181 old-kept rows had low lexical diversity and affect-token density. Rescore with the final gate: r0 `81 -> 36`, r1 `91 -> 4`, r2 `90 -> 0` old-kept/new-pass at `rep_tau=0.3`; hand examples scored `0.036..0.050` and passed. Evidence: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
+- 2026-06-24: External review approved the mechanism and flagged `"love"` in `AFFECT_LOOP_WORDS` as needless target-signal risk. Removed it and reverified with unchanged counts. Review: `docs/reviews/20260624_love_filter_tighten_code.md`.
+- 2026-06-24: Final fast-dev run passed on the tiny-random path. Evidence: `/tmp/steer_heal_love_filter_tighten_fast2.log`; report: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
+
+## Errors
+| Task | Error | Resolution |
+|------|-------|------------|
diff --git a/src/steer_heal/filter.py b/src/steer_heal/filter.py
index 6b6cc96..8c80509 100644
--- a/src/steer_heal/filter.py
+++ b/src/steer_heal/filter.py
@@ -36,6 +36,12 @@ REFUSAL = (
     "i'm an ai", "i am an ai", "i don't have personal opinions",
 )
 
+AFFECT_LOOP_WORDS = {
+    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
+    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
+    "sitting", "here",
+}
+
 
 def rep_frac(text: str) -> float:
     """Max most-repeated n-gram fraction over n in {2,3,4}; ~1.0 = degenerate looping/too short.
@@ -46,7 +52,8 @@ def rep_frac(text: str) -> float:
 
     Diffuse affect loops ("my sweet / my darling / oh my goodness") can evade the single-top-gram
     fraction because no one exact n-gram dominates. Treat long, low-lexical-diversity, compressible
-    completions as repetition too; this keeps the existing rep_tau gate load-bearing (#181 audit).
+    completions, and long affective roleplay mush, as repetition too; this keeps the existing
+    rep_tau gate load-bearing (#181 audit).
     """
     words = text.split()
     best = 0.0
@@ -75,6 +82,15 @@ def rep_frac(text: str) -> float:
         top_trigram_n = Counter(trigrams).most_common(1)[0][1]
         if unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8):
             return 1.0
+        affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
+        punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
+        caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
+        if len(lex_words) >= 128 and affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52:
+            return 1.0
+        if len(lex_words) >= 128 and punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50:
+            return 1.0
+        if len(lex_words) >= 128 and caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55:
+            return 1.0
     return best