Stop love loop collapse on bad walk-C probes

2026-06-27 17:02:34 +08:00 · 2026-06-24 20:27:16 +08:00
parent e095dc8227
commit ea89a0ee35
4 changed files with 141 additions and 1 deletion
@@ -0,0 +1,47 @@
+## Review: `steer_heal` collapse-audit patch
+
+### Correctness concerns
+
+**1. The `zlib` heuristic conflates low lexical diversity with repetition (moderate risk).**
+
+The core addition:
+```python
+unique_frac = len(set(lex_words)) / len(lex_words)
+compressed_frac = len(zlib.compress(...)) / max(len(text.encode()), 1)
+if unique_frac < 0.18 and compressed_frac < 0.32:
+    return 1.0
+```
+This catches *any* low-diversity, highly compressible text, not just diffuse love/affect loops. A stylistically flat but valid completion (e.g., simple declarative children's prose) could trip both thresholds. The docstring says diffuse affect loops "can evade" the top-gram fraction, but the guard doesn't restrict itself to affect; it's a blunt lexical-diversity floor. The magic constants (0.18, 0.32) appear to be derived from the #181 data but aren't validated against a separate holdout of non-collapse low-diversity text.
+
+**2. `len(text.encode())` vs `zlib.compress(text.lower().encode())` — encoding mismatch (low risk).**
+
+The denominator uses the raw `text.encode()` byte length while the numerator uses `text.lower().encode()`. For ASCII-only English these are identical, but any non-ASCII code point whose lowercase form has a different UTF-8 byte length would skew the ratio (e.g., the Turkish capital `İ`, U+0130, which `str.lower()` maps to `i` plus combining U+0307: 2 bytes become 3). Unlikely to hit in practice given English model outputs, but sloppy.
+
+**3. The `len(lex_words) >= 128` guard creates a blind spot.**
+
+Diffuse loops in completions shorter than 128 alphabetic words are invisible to the new heuristic. If the model collapses early in generation, the gate never fires.
+
+### Verification gap: doesn't distinguish the failure mode
+
+The rescoring evidence shows `old_rep 0.073–0.131 → new_rep=1.0` for r2 collapsed samples, proving the old `rep_frac` was missing them. But the evidence **never shows what those completions actually contain**. Without seeing the raw text, we can't rule out that the new gate is catching *unrelated low-diversity outputs* rather than the target "my sweet / my darling / oh my goodness" loops. The `brief=True` path now suppresses the full dump that would have provided that audit trail. This undercuts the "preserve audit evidence" requirement.
+
+### What's good
+
+- The fail-fast `ValueError` in `run.py` when no probe passes is correct and necessary.
+- The `brief` mode counts are computed before the early return — no dropped data.
+- The structural refactor (counts moved above the polars import) is clean.
+
+## Triage
+
+Accepted concern 1. The committed heuristic now also requires repeated phrase evidence:
+`top_bigram_n >= 12` or `top_trigram_n >= 8`.
+
+Accepted concern 2. The committed compression ratio uses the same lowercased byte string
+for numerator and denominator.
+
+Partially accepted concern 3. The committed guard is `len(lex_words) >= 64`, not 128.
+Shorter loops remain covered by the existing word/character n-gram checks.
+
+Verification gap addressed in `docs/spec/20260624_love_loop_collapse_audit.md`, which links
+the raw task log line and event artifact containing the repeated "my sweet / my darling /
+oh my goodness" samples.
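The encoding mismatch from concern 2 can be reproduced in isolation. The sketch below is illustrative only: the two helper names are hypothetical and are not functions in `steer_heal`. It compares the reviewed ratio (lowercased numerator, raw denominator) with the committed form, which takes both from the same lowercased bytes.

```python
import zlib


def compressed_frac_mismatched(text: str) -> float:
    # Reviewed form: numerator compresses the lowercased bytes,
    # denominator measures the raw (non-lowercased) bytes.
    return len(zlib.compress(text.lower().encode())) / max(len(text.encode()), 1)


def compressed_frac_consistent(text: str) -> float:
    # Committed form: numerator and denominator share one lowercased byte string.
    lc_bytes = text.lower().encode()
    return len(zlib.compress(lc_bytes)) / max(len(lc_bytes), 1)


dotted_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
print(len(dotted_i.encode()), len(dotted_i.lower().encode()))

ascii_text = "Oh my goodness, my sweet, my darling."
print(compressed_frac_mismatched(ascii_text) == compressed_frac_consistent(ascii_text))
```

For ASCII input the two ratios agree. For text containing code points such as U+0130, the mismatched form divides by a shorter byte length and so reports a larger ratio.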
@@ -0,0 +1,69 @@
+# Love Loop Collapse Audit
+
+## Goal
+Explain why pueue task 181 degenerated into "oh my goodness" affect loops, and add a fail-fast gate so the run does not keep spending GPU once walk-C cannot find usable steered data.
+
+## Scope
+In: task 181 logs/artifacts, generation/filter/adoption path, minimal filter/gate patch.
+Out: redesigning the love demo persona or re-running the full experiment.
+
+## Requirements
+- R1: Preserve the audit trail. Done means: this file links the killed task log and run artifact that show the collapse entering the training data. VERIFY: `rg -n "SWEET|goodness|last_good|walk-C|filter kept" /tmp/steer_heal_task181_full.log`.
+- R2: Catch lexical affect loops in the existing repetition filter. Done means: the r2 kept sample that previously scored `rep=0.096` now scores above `rep_tau=0.3`. VERIFY: a small script over task 181's saved events prints old/new scores for r0/r1/r2.
+- R3: Fail fast when walk-C cannot hit the requested survival target. Done means: if all probe rows are failures, `gen_filter_walk` raises before collect/train. VERIFY: fast-dev-run still completes, and code inspection shows the raise is before collection.
+
+## Tasks
+- [x] T1 (R1): Kill task 181.
+  - verify: `pueue status --json | jq -r '.tasks["181"].status'`
+  - success: task is `Killed`.
+  - UAT: pueue status shows task 181 killed, not running.
+- [x] T2 (R1): Audit task 181 collapse path.
+  - verify: `rg -n "SWEET|goodness|last_good|walk-C|filter kept" /tmp/steer_heal_task181_full.log`
+  - success: log shows repeated phrase in a kept steered sample, plus last_good adoption/hold decisions.
+  - likely_fail: only eval says the phrase; actual training data is clean.
+  - sneaky_fail: the reference ratchet adopted a bad checkpoint and made it the KL anchor.
+  - UAT: this file records the exact lines and run artifact path.
+- [x] T3 (R2): Patch `rep_frac` to catch low-diversity compressed lexical loops.
+  - verify: old task-181 r2 kept sample scores `rep >= 0.3`.
+  - success: r2 collapsed rows fail the existing `rep_tau` gate without a new knob.
+  - likely_fail: threshold catches all r0 useful samples.
+  - sneaky_fail: the sample still passes because exact n-gram repetition is diffuse.
+  - UAT: before/after table from task 181 events.
+- [x] T4 (R3): Patch walk-C to raise when no probe meets `gen_pass_target`.
+  - verify: `rg -n "no probe reached" src/steer_heal/run.py`.
+  - success: all-fail probe table cannot silently continue to collection.
+  - likely_fail: fast-dev tiny run trips because tiny config has relaxed `rep_tau`.
+  - sneaky_fail: code raises after collection, still wasting the long batch.
+  - UAT: fast-dev-run completes and code location is before collect phase.
+
+## Context
+Task 181 command:
+
+```sh
+env STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --demo=love --use-qlora --train-bs=3 --grad-accum=2 --reg=kl_rev --barrier-ref=last_good --kl-agg=rmse --tau=2.0 --lam=0.3 --lam-round-pow=-0.5 --spectral-lam=0.005 --n-rounds=8 --seed=42
+```
+
+Run artifact:
+
+`/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T144031_gemma-3-4b-it_kl_rev_s42/events.jsonl`
+
+Full killed-task log:
+
+`/tmp/steer_heal_task181_full.log`
+
+## Log
+- 2026-06-24: Killed task 181. Pueue status reports `Killed`.
+- 2026-06-24: The first severe collapse is in task-181 training data, not only eval. `/tmp/steer_heal_task181_full.log:1436` shows an r2 walk-C kept sample with repeated "my sweet / my darling / oh my goodness" and a long character loop. The saved event for that row has `ppl=3.986`, `rep=0.096`, `keep=true`, so old `rep_frac` missed diffuse phrase loops.
+- 2026-06-24: `last_good` did not ratchet to the degraded rounds. Log lines show r0 adopted at coherence 0.989, then r1 held at 0.957 and r2 held at 0.971 against threshold 0.979. The missing gate is data quality / walk-C failure, not reference adoption.
+- 2026-06-24: r3 walk-C had all probe rows below target and still entered collection at `kappa=0.200`. That should fail fast because the log itself says all-fail at `kappa_min` means upstream collapse or wrong filter.
+- 2026-06-24: Rescoring task-181 events with the patched `rep_frac`: first eight r2 collapsed kept rows moved from old `rep=0.073..0.131` to `new_rep=1.000`, so they now fail `rep_tau=0.3`. Aggregate old-kept/new-pass counts: r0 `81 -> 59`, r1 `91 -> 26`, r2 `90 -> 2`.
+- 2026-06-24: External code review agreed the fail-fast raise was correct, but flagged the first zlib heuristic as too broad and noted an encoding mismatch between the ratio's numerator and denominator. Fixed by requiring an actually repeated phrase count (`top_bigram_n >= 12` or `top_trigram_n >= 8`) and computing numerator/denominator from the same lowercased bytes.
+- 2026-06-24: Verification passed:
+  - `uv run python -m compileall src/steer_heal`
+  - `just fast-dev-run --barrier-ref=last_good --kl-agg=rmse --tau=2.0 --lam-round-pow=-0.5 --spectral-lam=0 --n-rounds=1`
+  - fast-dev log: `/tmp/steer_heal_collapse_gate_fast2.log`
+  - fast-dev report: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T202514_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`
+
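The old-kept/new-pass counts in the log could be reproduced with a small rescoring helper along these lines. This is a sketch under stated assumptions: the event keys `round` and `text` are hypothetical (the log only confirms that saved events carry `ppl`, `rep`, and `keep`), and a row is assumed to pass when its new score is strictly below `rep_tau`.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def load_events(path: str) -> Iterable[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def rescore(events: Iterable[dict], new_rep: Callable[[str], float], rep_tau: float = 0.3) -> dict:
    """Per round: (rows the old filter kept, rows that still pass under new_rep)."""
    old_kept, new_pass = Counter(), Counter()
    for ev in events:
        if not ev.get("keep"):
            continue
        rnd = ev["round"]  # hypothetical key name
        old_kept[rnd] += 1
        if new_rep(ev["text"]) < rep_tau:  # hypothetical key name; assumed pass rule
            new_pass[rnd] += 1
    return {rnd: (old_kept[rnd], new_pass[rnd]) for rnd in sorted(old_kept)}


# Toy in-memory events and a toy scorer, standing in for events.jsonl and rep_frac.
toy_events = [
    {"round": 0, "keep": True, "text": "varied"},
    {"round": 0, "keep": True, "text": "loop"},
    {"round": 0, "keep": False, "text": "dropped"},
    {"round": 2, "keep": True, "text": "loop"},
]
toy_rep = lambda text: 1.0 if text == "loop" else 0.0
print(rescore(toy_events, toy_rep))
```

Against the real artifact, `rescore(load_events(path), rep_frac)` would play the same role, once the key names are adjusted to the actual event schema.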
+## Errors
+| Task | Error | Resolution |
+|------|-------|------------|
@@ -7,6 +7,7 @@ and a first-person narration regex (we want enact, not narrate).

 import math
 import re
+import zlib
 from collections import Counter

 import torch
@@ -41,7 +42,12 @@ def rep_frac(text: str) -> float:
    Word n-grams catch word loops; char n-grams catch character-repetition like TTTTTTT... or
    !!!!!!... that collapse into a single 'word' and are invisible to word-level checks.
    Small n catches SHORT loops ("instead their instead their" = a bigram) that the 4-gram alone
-    missed (#34: that text scored 0.27 on 4-grams, under rep_tau=0.3, and poisoned training)."""
+    missed (#34: that text scored 0.27 on 4-grams, under rep_tau=0.3, and poisoned training).
+
+    Diffuse affect loops ("my sweet / my darling / oh my goodness") can evade the single-top-gram
+    fraction because no one exact n-gram dominates. Score long, low-diversity, compressible text
+    that also repeats one bigram/trigram many times as 1.0, so rep_tau stays load-bearing (#181 audit).
+    """
    words = text.split()
    best = 0.0
    for n in (2, 3, 4):
@@ -56,6 +62,19 @@ def rep_frac(text: str) -> float:
        if not grams:
            continue
        best = max(best, Counter(grams).most_common(1)[0][1] / len(grams))
+
+    text_lc = text.lower()
+    lex_words = re.findall(r"[a-z']+", text_lc)
+    if len(lex_words) >= 64:
+        unique_frac = len(set(lex_words)) / len(lex_words)
+        text_lc_bytes = text_lc.encode()
+        compressed_frac = len(zlib.compress(text_lc_bytes)) / len(text_lc_bytes)
+        bigrams = [tuple(lex_words[i : i + 2]) for i in range(len(lex_words) - 1)]
+        trigrams = [tuple(lex_words[i : i + 3]) for i in range(len(lex_words) - 2)]
+        top_bigram_n = Counter(bigrams).most_common(1)[0][1]
+        top_trigram_n = Counter(trigrams).most_common(1)[0][1]
+        if unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8):
+            return 1.0
    return best


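To see why a diffuse loop slips under the top-n-gram fraction yet trips the new guard, the condition can be run standalone on synthetic text. This is a minimal sketch: `top_ngram_frac` and `is_diffuse_loop` are illustrative names, the thresholds are copied from the hunk above, and the synthetic strings are invented for the demonstration, not taken from task 181.

```python
import re
import zlib
from collections import Counter


def top_ngram_frac(words: list[str], n: int) -> float:
    """Share of all word n-grams taken by the single most common one."""
    grams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams).most_common(1)[0][1] / len(grams) if grams else 0.0


def is_diffuse_loop(text: str) -> bool:
    """Standalone copy of the guard's condition, with thresholds from the diff."""
    text_lc = text.lower()
    lex_words = re.findall(r"[a-z']+", text_lc)
    if len(lex_words) < 64:
        return False  # short texts are left to the n-gram checks
    unique_frac = len(set(lex_words)) / len(lex_words)
    lc_bytes = text_lc.encode()
    compressed_frac = len(zlib.compress(lc_bytes)) / len(lc_bytes)
    top_bigram_n = Counter(zip(lex_words, lex_words[1:])).most_common(1)[0][1]
    top_trigram_n = Counter(zip(lex_words, lex_words[1:], lex_words[2:])).most_common(1)[0][1]
    return unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8)


# A 7-word cycle: no single bigram holds more than about 1/7 of all bigrams.
loop = "oh my goodness my sweet my darling " * 20
# 64 distinct two-letter tokens: long enough, but maximally diverse.
varied = " ".join(a + b for a in "abcdefgh" for b in "abcdefgh")

print(max(top_ngram_frac(loop.split(), n) for n in (2, 3, 4)))
print(is_diffuse_loop(loop), is_diffuse_loop(varied))
```

The cycled phrase stays well under `rep_tau=0.3` on the word n-gram fractions while still satisfying all three clauses of the guard. The diverse text fails the `unique_frac` clause and is left alone.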
@@ -165,6 +165,11 @@ def gen_filter_walk(model, tok, v, cfg: RunConfig, hist_specs: list, rnd: int) -
        "root cause is upstream (adapter collapsed / filter wrong).\n" +
        "━"*55
    )
+    if not any(r["ok"] for r in bisect_log):
+        raise ValueError(
+            f"walk-C no probe reached gen_pass_target={cfg.gen_pass_target:.2f} at r{rnd}; "
+            f"kappa_min={cfg.gen_kappa_min:.3f} still produced collapsed or filtered data"
+        )

    # ── Phase 2: collect training data at settled kappa until n_keep is banked ──
    logger.info(f"\n{'─'*55}\nwalk-C collect phase: kappa={kappa:.3f}, need {cfg.n_keep} total.\n{'─'*55}")
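T4's sneaky_fail (the raise firing only after collection) can be guarded with a small ordering check. The sketch below is hypothetical: `require_passing_probe` and `walk_then_collect` are stand-ins for the structure of `gen_filter_walk`, not its real signature, and the target value `0.5` is made up for the example.

```python
def require_passing_probe(bisect_log: list[dict], gen_pass_target: float, rnd: int) -> None:
    """Raise if no probe row is marked ok, mirroring the guard in the diff."""
    if not any(row["ok"] for row in bisect_log):
        raise ValueError(
            f"walk-C no probe reached gen_pass_target={gen_pass_target:.2f} at r{rnd}"
        )


def walk_then_collect(bisect_log, gen_pass_target, rnd, collect):
    """Hypothetical stand-in for the probe-then-collect ordering."""
    require_passing_probe(bisect_log, gen_pass_target, rnd)
    return collect()  # only reached when at least one probe passed


calls = []


def fake_collect():
    # Records that the expensive collect phase was entered.
    calls.append("collect")
    return ["row"]


try:
    walk_then_collect([{"ok": False}, {"ok": False}], 0.5, 3, fake_collect)
except ValueError as err:
    print("raised before collect:", err)
print("collect calls after all-fail probes:", calls)
```

Because `any()` over an empty list is `False`, an empty probe log also raises in this sketch, which matches the condition shown in the diff.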