mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 16:47:16 +08:00
Reject affective loop completions
@@ -0,0 +1,35 @@
Let me try to get the original version of `filter.py` from HEAD to compare: Now I have all the context I need. Let me verify one more thing about how `rep_frac` integrates with the filter pipeline: I now have all the context needed. Let me compile the review.

---

## Code Review: `rep_frac` affect-loop detection gate

### Summary
Patches `rep_frac` with a diffuse affect-loop detector that catches low-diversity, affective roleplay mush ("oh my goodness / my darling / sweet" loops) that evaded the existing n-gram repetition detector. When a long completion (>=128 lex words) has high affect-word ratios combined with low lexical diversity and high compressibility, `rep_frac` returns 1.0, triggering the existing `rep_tau` rejection. The T1 verification log confirms this flips task-181 junk from kept to rejected while passing coherent hand examples.

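A minimal sketch of why a diffuse loop can slip past a top-n-gram fraction, as the summary describes. The commit does not show the body of the n-gram loop, so the normalization here (top count divided by number of n-grams) and the name `top_ngram_frac` are assumptions for illustration, not the repository's implementation.

```python
from collections import Counter


def top_ngram_frac(text: str, ns=(2, 3, 4)) -> float:
    """Max over n of (count of the most common word n-gram) / (number of n-grams).

    Hypothetical stand-in for the pre-existing detector; the real
    normalization in filter.py may differ.
    """
    words = text.lower().split()
    best = 0.0
    for n in ns:
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if not grams:
            return 1.0  # too short to form an n-gram: treat as degenerate
        best = max(best, Counter(grams).most_common(1)[0][1] / len(grams))
    return best


# One exact phrase repeated: a single trigram dominates.
exact_loop = "i love you " * 40

# Diffuse loop: the same few affect words, shuffled so no exact n-gram dominates.
diffuse_loop = " ".join(
    f"oh my {a} my {b} {c}"
    for a in ("goodness", "heavens", "god", "darling")
    for b in ("sweet", "precious", "dearest", "sweetie")
    for c in ("heart", "soul", "yes", "okay")
)

print(round(top_ngram_frac(exact_loop), 2))    # about 0.34: rejected at rep_tau = 0.3
print(round(top_ngram_frac(diffuse_loop), 2))  # about 0.17: would be kept
```

Under this assumed normalization the diffuse loop scores well below `rep_tau = 0.3` even though it uses only a dozen or so distinct words, which is the gap the new gates are meant to close.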
### Important (should fix)

- **`src/steer_heal/filter.py:62-63` — `AFFECT_LOOP_WORDS` includes `"love"`, the target trait**: The word `"love"` is in the affect-word set, so every genuine love declaration contributes to `affect_frac`. At the required thresholds (>=0.25 or >=0.35), this is very unlikely to flag legitimate text (a normal 128-word declaration has `affect_frac` ~0.01-0.02), and the verification log confirms hand examples pass. However, the `"love"` inclusion creates a long-tail risk: if a future model produces diverse-but-passionate declarations that happen to use affect words at 25%+ density, they could be silently dropped. Consider whether `"love"` needs to be in the set given it's the signal we want.

- **`src/steer_heal/filter.py:73-74` — Short-text early return**: The word n-gram loop returns 1.0 immediately when any n-gram level produces empty grams (e.g., a 3-word text at n=4). This is pre-existing code, but the affect-loop gates that follow will never run on short completions as a result. This is correct behavior (short completions ARE degenerate), but the interaction with the new affect-loop gates is undocumented. Consider adding a comment noting the short-circuit is intentional and the affect-loop gates are gated on `>=128` words anyway.

### Suggestions

- **`src/steer_heal/filter.py:88-93` — Caps-heavy gate threshold is the broadest**: The third gate (`caps_frac >= 0.15, affect_frac >= 0.25, unique_frac < 0.55`) has the loosest `unique_frac` threshold (0.55 vs 0.45/0.50 in the other two). A completion with moderate caps (15% uppercase, e.g., proper nouns and emphasis), 25% affect words, and just under 55% unique words could be flagged. The verification log shows hand examples pass, so this is fine in practice. Worth noting in a comment why this gate has looser thresholds than the others.

- **`src/steer_heal/filter.py:83` — `affect_frac`, `punct_frac`, `caps_frac` computed for all >=64-word texts**: These are computed even for texts in the 64-127 word range where they're not used (the gates require >=128). This is harmless overhead but slightly misleading when reading the code. Consider either moving them inside the `>=128` guard or adding a brief comment.

### Positive

- **`src/steer_heal/filter.py:69-70` — Orthogonal signal use**: The affect-loop detection uses three signals (affect-word ratio, punctuation density, caps density), each combined with lexical diversity; the affect-only gate also requires high compressibility, and the punctuation and caps gates also require an affect-word ratio of at least 0.25. This multi-signal design reduces false positives: a legitimate text with high caps won't trigger unless it's also low-diversity and affect-heavy.
- **`src/steer_heal/filter.py:63` — `AFFECT_LOOP_WORDS` is case-matched to lowercase**: Since `lex_words` comes from `text_lc` (already lowercased), the set membership check is correct. No case-sensitivity bug.
- **R2 compliance**: No new config knob, no fallback logic, the gate lives inside `rep_frac` and feeds the existing `keep = rep < rep_tau` path. Exactly as required.
- **`docs/spec/20260624_love_filter_tighten_requeue.md` — Spec-driven verification**: The T1 verification log at `/tmp/steer_heal_love_filter_tighten_verify.log` provides concrete evidence: round-0 old-kept drops from 81 to 36, round-1 from 91 to 4, round-2 from 90 to 0. The representative rejected rows clearly show the "oh my goodness / my darling / sweet" loops being caught.

### Verdict
**APPROVE** (for the T2 filter changes). The affect-loop detection is well-designed and verified. The remaining tasks (T3 fast-dev-run, T4 commit+push+enqueue) are not yet done. The `"love"` word in `AFFECT_LOOP_WORDS` is worth a second look but is very unlikely to cause issues at the current thresholds.

## Triage
- Accepted: removed `"love"` from `AFFECT_LOOP_WORDS`; it is the target signal and was unnecessary for catching the observed loops.
- Rejected for now: adding more comments around pre-existing short-completion behavior and caps thresholds. The code already fails short completions via the old n-gram guard, and the caps gate is constrained by affect density plus low lexical diversity.
- Reverified after the accepted change: `/tmp/steer_heal_love_filter_tighten_verify2.log` and `/tmp/steer_heal_love_filter_tighten_fast2.log`.
@@ -0,0 +1,52 @@
# Love Filter Tighten Requeue

## Goal
Tighten the love-demo completion filter so the next queued run does not train on low-diversity affective junk that base PPL accepts, then requeue the same last-good KL recipe at lowest priority.

## Scope
In: `src/steer_heal/filter.py`, task-181 saved events, compile/fast-dev verification, pueue enqueue.
Out: changing the loss, adding a new hyperparameter, changing generation sampling, or redesigning the love persona.

## Requirements
- R1: Reject more junk using the existing `rep_tau` gate. Done means: old task-181 kept samples with "oh my goodness / my darling / sweet" loops now score `rep >= 0.3`. VERIFY: rescoring `out/20260624T144031_gemma-3-4b-it_kl_rev_s42/events.jsonl` prints old-kept/new-pass counts by round and representative rejected rows.
- R2: Keep the filter simple and fail-fast. Done means: no new config knob, no fallback, no gen-time repetition penalty hiding the signal from walk-C. VERIFY: code inspection shows the gate is inside `rep_frac` and still feeds the existing `keep = rep < rep_tau` decision.
- R3: Requeue the love run at lowest priority. Done means: `pueue status --json` shows a queued task on branch `dv` with priority `0` and a label stating why/resolve. VERIFY: compact status table includes the new task.

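The R1 verification (old-kept/new-pass counts by round) could be sketched roughly as below. The schema of `events.jsonl` is not shown in this commit, so the field names `round`, `keep`, and `completion`, the helper name `rescore`, and the toy scorer are all hypothetical; only the `rep < rep_tau` keep rule and `rep_tau = 0.3` come from the spec.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

REP_TAU = 0.3  # threshold quoted in the spec


def rescore(
    lines: Iterable[str],
    scorer: Callable[[str], float],
    rep_tau: float = REP_TAU,
) -> Dict[int, Tuple[int, int]]:
    """Return {round: (old_kept, new_pass)} over rows the old filter kept."""
    old_kept: Dict[int, int] = defaultdict(int)
    new_pass: Dict[int, int] = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if not row.get("keep"):  # hypothetical field: old filter decision
            continue
        rnd = row["round"]  # hypothetical field: training round index
        old_kept[rnd] += 1
        # Same decision rule as the spec: keep = rep < rep_tau.
        if scorer(row["completion"]) < rep_tau:  # hypothetical field
            new_pass[rnd] += 1
    return {r: (old_kept[r], new_pass[r]) for r in sorted(old_kept)}


# Toy data and a toy scorer standing in for the real rep_frac.
toy_lines = [
    json.dumps({"round": 0, "keep": True, "completion": "I love the sea."}),
    json.dumps({"round": 0, "keep": True, "completion": "oh my goodness oh my goodness"}),
    json.dumps({"round": 0, "keep": False, "completion": "x"}),
    json.dumps({"round": 1, "keep": True, "completion": "oh my goodness my darling"}),
]


def toy_scorer(text: str) -> float:
    return 1.0 if "oh my goodness" in text else 0.0


result = rescore(toy_lines, toy_scorer)
for rnd, (kept, passed) in result.items():
    print(f"r{rnd}: {kept} -> {passed}")
```

With the real file, `lines` would be the open file handle and `scorer` would be `rep_frac` imported from `src/steer_heal/filter.py`.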
## Tasks
- [x] T1 (R1): Measure the shape of task-181 junk.
  - verify: script over task-181 `events.jsonl`.
  - success: metrics identify old-kept rows with low lexical diversity / repeated affect tokens / roleplay punctuation.
  - likely_fail: metrics only catch the exact previous row.
  - sneaky_fail: the new gate rejects every ordinary love declaration too.
  - UAT: saved verification log with old/new counts and sample rows.
- [x] T2 (R1,R2): Patch `rep_frac` with a stricter quality gate.
  - verify: `uv run python -m compileall src/steer_heal` and rescoring script.
  - success: r1/r2 old-kept junk mostly flips to rejected; coherent hand examples remain below `rep_tau`.
  - likely_fail: threshold is inert because `ppl_tau` was the real issue.
  - sneaky_fail: extra gate is too love-demo-specific and kills valid affectionate text.
  - UAT: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
- [x] T3 (R2): Run the fast dev path.
  - verify: `just fast-dev-run ... | tee /tmp/steer_heal_love_filter_tighten_fast.log | tail -80`.
  - success: tiny run completes, proving the real pipeline still executes.
  - likely_fail: tiny random text trips the stricter gate and starves training.
  - sneaky_fail: compile passes but the adaptive gen/filter path is broken.
  - UAT: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
- [/] T4 (R3): Commit, push, and enqueue at priority 0.
  - verify: `git log -1 --oneline`, `git status --short`, `pueue status --json`.
  - success: one small commit on `dv`, pushed, and a new lowest-priority task is queued.
  - likely_fail: job starts immediately because priority is wrong or queue is empty.
  - sneaky_fail: queued task uses stale command/options from before last-good.
  - UAT: compact pueue status row.

## Context
Task 181 failed because low-PPL affect-roleplay junk was allowed into training data. Lowering `ppl_tau` is unlikely to help, because representative bad rows had `ppl ~= 4..13`. A text-shape gate is the cheap discriminant.

## Log
- 2026-06-24: Starting from commit `ea89a0e` on branch `dv`; worktree has pre-existing dirty files.
- 2026-06-24: Task-181 old-kept rows had low lexical diversity and affect-token density. Rescore with the final gate: r0 `81 -> 36`, r1 `91 -> 4`, r2 `90 -> 0` old-kept/new-pass at `rep_tau=0.3`; hand examples scored `0.036..0.050` and passed. Evidence: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
- 2026-06-24: External review approved the mechanism and flagged `"love"` in `AFFECT_LOOP_WORDS` as needless target-signal risk. Removed it and reverified with unchanged counts. Review: `docs/reviews/20260624_love_filter_tighten_code.md`.
- 2026-06-24: Final fast-dev run passed on the tiny-random path. Evidence: `/tmp/steer_heal_love_filter_tighten_fast2.log`; report: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.

## Errors
| Task | Error | Resolution |
|------|-------|------------|
@@ -36,6 +36,12 @@ REFUSAL = (
     "i'm an ai", "i am an ai", "i don't have personal opinions",
 )
 
+AFFECT_LOOP_WORDS = {
+    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
+    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
+    "sitting", "here",
+}
+
 
 def rep_frac(text: str) -> float:
     """Max most-repeated n-gram fraction over n in {2,3,4}; ~1.0 = degenerate looping/too short.
@@ -46,7 +52,8 @@ def rep_frac(text: str) -> float:
 
     Diffuse affect loops ("my sweet / my darling / oh my goodness") can evade the single-top-gram
     fraction because no one exact n-gram dominates. Treat long, low-lexical-diversity, compressible
-    completions as repetition too; this keeps the existing rep_tau gate load-bearing (#181 audit).
+    completions, and long affective roleplay mush, as repetition too; this keeps the existing
+    rep_tau gate load-bearing (#181 audit).
     """
     words = text.split()
     best = 0.0
@@ -75,6 +82,15 @@ def rep_frac(text: str) -> float:
         top_trigram_n = Counter(trigrams).most_common(1)[0][1]
         if unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8):
             return 1.0
+        affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
+        punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
+        caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
+        if len(lex_words) >= 128 and affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52:
+            return 1.0
+        if len(lex_words) >= 128 and punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50:
+            return 1.0
+        if len(lex_words) >= 128 and caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55:
+            return 1.0
     return best
 
 
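For readers who want to try the three added gates in isolation, here is a self-contained sketch. The thresholds, the word set, and the `punct_frac`/`caps_frac` expressions are copied from the hunk above; everything the diff does not show is an assumption, namely how `lex_words`, `unique_frac`, and `compressed_frac` are computed (lowercase alphabetic tokens, distinct-word ratio, and zlib size ratio are used here) and the function name `affect_loop_gate`, which is hypothetical.

```python
import re
import zlib

# Copied from the diff (the set as committed, without "love").
AFFECT_LOOP_WORDS = {
    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
    "sitting", "here",
}


def affect_loop_gate(text: str) -> bool:
    """True if any of the three affect-loop gates from the diff would fire."""
    # Assumption: lex_words are lowercase alphabetic tokens.
    lex_words = re.findall(r"[a-z']+", text.lower())
    if len(lex_words) < 128:
        return False  # all three gates require a long completion
    # Assumption: lexical diversity is distinct words over total words.
    unique_frac = len(set(lex_words)) / len(lex_words)
    # Assumption: compressibility is zlib-compressed size over raw size.
    raw = " ".join(lex_words).encode("utf-8")
    compressed_frac = len(zlib.compress(raw)) / len(raw)
    # The three expressions below match the added lines in the diff.
    affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
    punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
    caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
    return (
        (affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52)
        or (punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50)
        or (caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55)
    )


loop = (
    "oh my goodness, my sweet darling! oh my heavens, my precious heart! "
    "yes, yes, my dearest soul. "
) * 12
short = "oh my goodness my darling"

print(affect_loop_gate(loop))   # long, affect-heavy, low diversity
print(affect_loop_gate(short))  # under 128 words, so these gates do not apply
```

In the real function a fired gate returns `1.0`, which the unchanged `keep = rep < rep_tau` check then rejects; short completions are handled earlier by the pre-existing n-gram guard rather than by these gates.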