Record love filter requeue

Support round-tagged steered generation
Support brief filter probe logs
2026-06-27 15:32:28 +08:00 · 2026-06-24 20:50:29 +08:00 · 2026-06-24 20:49:48 +08:00 · 2026-06-24 20:49:11 +08:00 · 2026-06-24 20:48:40 +08:00
4 changed files with 148 additions and 15 deletions
@@ -0,0 +1,35 @@
+Let me try to get the original version of `filter.py` from HEAD to compare: Now I have all the context I need. Let me verify one more thing about how `rep_frac` integrates with the filter pipeline: I now have all the context needed. Let me compile the review.
+
+---
+
+## Code Review: `rep_frac` affect-loop detection gate
+
+### Summary
+Patches `rep_frac` with a diffuse affect-loop detector that catches low-diversity, affective roleplay mush ("oh my goodness / my darling / sweet" loops) that evaded the existing n-gram repetition detector. When a long completion (>=128 lex words) combines a high affect-word ratio with low lexical diversity, plus either high compressibility, dense roleplay punctuation, or heavy capitalization, `rep_frac` returns 1.0, triggering the existing `rep_tau` rejection. The T1 verification log confirms this flips task-181 junk from kept to rejected while passing coherent hand examples.
+
+### Important (should fix)
+
+- **`src/steer_heal/filter.py:62-63` — `AFFECT_LOOP_WORDS` includes `"love"`, the target trait**: The word `"love"` is in the affect-word set, so every genuine love declaration contributes to `affect_frac`. At the required thresholds (>=0.25 or >=0.35), this is very unlikely to flag legitimate text (a normal 128-word declaration has `affect_frac` ~0.01-0.02), and the verification log confirms hand examples pass. However, the `"love"` inclusion creates a long-tail risk: if a future model produces diverse-but-passionate declarations that happen to use affect words at 25%+ density, they could be silently dropped. Consider whether `"love"` needs to be in the set given it's the signal we want.
+
+- **`src/steer_heal/filter.py:73-74` — Short-text early return**: The word n-gram loop returns 1.0 immediately when any n-gram level produces empty grams (e.g., a 3-word text at n=4). This is pre-existing code, but the affect-loop gates that follow will never run on short completions as a result. This is correct behavior (short completions ARE degenerate), but the interaction with the new affect-loop gates is undocumented. Consider adding a comment noting the short-circuit is intentional and that the affect-loop gates sit inside the `>=64`-word block and themselves require `>=128` lex words anyway.
+
+### Suggestions
+
+- **`src/steer_heal/filter.py:88-93` — Caps-heavy gate threshold is the broadest**: The third gate (`caps_frac >= 0.15, affect_frac >= 0.25, unique_frac < 0.55`) has the loosest `unique_frac` threshold (0.55 vs 0.45/0.50 in the other two). A completion with moderate caps (15% uppercase, e.g., proper nouns and emphasis) and 25% affect words but just under 55% unique words could be flagged. The verification log shows hand examples pass, so this is fine in practice. Worth noting in a comment why this gate has looser thresholds than the others.
+
+- **`src/steer_heal/filter.py:83` — `affect_frac`, `punct_frac`, `caps_frac` computed for all >=64-word texts**: These are computed even for texts in the 64-127 word range where they're not used (the gates require >=128). This is harmless overhead but slightly misleading when reading the code. Consider either moving them inside the `>=128` guard or adding a brief comment.
+
+### Positive
+
+- **`src/steer_heal/filter.py:69-70` — Orthogonal signal use**: The three gates key on affect-word ratio, either alone or paired with punctuation density or caps density, and every gate also requires low lexical diversity (the affect-only gate additionally requires high compressibility). This multi-signal design reduces false positives: a legitimate text with high caps won't trigger unless it's also low-diversity and affect-heavy.
+- **`src/steer_heal/filter.py:63` — `AFFECT_LOOP_WORDS` is case-matched to lowercase**: Since `lex_words` comes from `text_lc` (already lowercased), the set membership check is correct. No case-sensitivity bug.
+- **R2 compliance**: No new config knob, no fallback logic, the gate lives inside `rep_frac` and feeds the existing `keep = rep < rep_tau` path. Exactly as required.
+- **`docs/spec/20260624_love_filter_tighten_requeue.md` — Spec-driven verification**: The T1 verification log at `/tmp/steer_heal_love_filter_tighten_verify.log` provides concrete evidence: round-0 old-kept drops from 81 to 36, round-1 from 91 to 4, round-2 from 90 to 0. The representative rejected rows clearly show the "oh my goodness / my darling / sweet" loops being caught.
+
+### Verdict
+**APPROVE** (for the T2 filter changes). The affect-loop detection is well-designed and verified. The remaining tasks (T3 fast-dev-run, T4 commit+push+enqueue) are not yet done. The `"love"` word in `AFFECT_LOOP_WORDS` is worth a second look but is very unlikely to cause issues at the current thresholds.
+
+## Triage
+- Accepted: removed `"love"` from `AFFECT_LOOP_WORDS`; it is the target signal and was unnecessary for catching the observed loops.
+- Rejected for now: adding more comments around pre-existing short-completion behavior and caps thresholds. The code already fails short completions via the old n-gram guard, and the caps gate is constrained by affect density plus low lexical diversity.
+- Reverified after the accepted change: `/tmp/steer_heal_love_filter_tighten_verify2.log` and `/tmp/steer_heal_love_filter_tighten_fast2.log`.
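
For readers without the full `filter.py`, the three gates discussed in this review can be sketched standalone. This is a minimal sketch, not the committed code: the thresholds and `AFFECT_LOOP_WORDS` are copied from the diff, but the `lex_words` tokenizer (a lowercase regex) and `compressed_frac` (zlib size over raw size) are assumptions, because their definitions fall outside the shown hunks. `affect_loop_signals` and `is_affect_loop` are hypothetical names.

```python
import re
import zlib

# Copied from the diff (post-triage: "love" is not in the set).
AFFECT_LOOP_WORDS = {
    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
    "sitting", "here",
}


def affect_loop_signals(text: str) -> dict[str, float]:
    """Compute the text-shape signals the gates read."""
    # ASSUMPTION: lex_words are lowercase alphabetic tokens; the real
    # tokenizer is not visible in the diff.
    lex_words = re.findall(r"[a-z']+", text.lower())
    n = max(len(lex_words), 1)
    raw = text.encode("utf-8")
    n_alpha = max(sum(ch.isalpha() for ch in text), 1)
    return {
        "n_lex": float(len(lex_words)),
        "affect_frac": sum(w in AFFECT_LOOP_WORDS for w in lex_words) / n,
        "unique_frac": len(set(lex_words)) / n,
        # ASSUMPTION: compressibility measured as zlib-compressed / raw bytes.
        "compressed_frac": len(zlib.compress(raw)) / max(len(raw), 1),
        "punct_frac": sum(ch in "*!?()" for ch in text) / max(len(text), 1),
        "caps_frac": sum(ch.isupper() for ch in text) / n_alpha,
    }


def is_affect_loop(text: str) -> bool:
    """True when any of the three gates from the diff would fire."""
    s = affect_loop_signals(text)
    if s["n_lex"] < 128:  # every gate requires a long completion
        return False
    affect_only = (s["affect_frac"] >= 0.35 and s["unique_frac"] < 0.45
                   and s["compressed_frac"] < 0.52)
    punct_heavy = (s["punct_frac"] >= 0.035 and s["affect_frac"] >= 0.25
                   and s["unique_frac"] < 0.50)
    caps_heavy = (s["caps_frac"] >= 0.15 and s["affect_frac"] >= 0.25
                  and s["unique_frac"] < 0.55)
    return affect_only or punct_heavy or caps_heavy
```

Under these assumptions a long repeated "Oh my goodness, my sweet darling!" loop trips the affect-only gate, while a short or affect-free text trips none, since every gate needs both length and affect density.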
@@ -0,0 +1,54 @@
+# Love Filter Tighten Requeue
+
+## Goal
+Tighten the love-demo completion filter so the next queued run does not train on low-diversity affective junk that base PPL accepts, then requeue the same last-good KL recipe at lowest priority.
+
+## Scope
+In: `src/steer_heal/filter.py`, task-181 saved events, compile/fast-dev verification, pueue enqueue.
+Out: changing the loss, adding a new hyperparameter, changing generation sampling, or redesigning the love persona.
+
+## Requirements
+- R1: Reject more junk using the existing `rep_tau` gate. Done means: old task-181 kept samples with "oh my goodness / my darling / sweet" loops now score `rep >= 0.3`. VERIFY: rescoring `out/20260624T144031_gemma-3-4b-it_kl_rev_s42/events.jsonl` prints old-kept/new-pass counts by round and representative rejected rows.
+- R2: Keep the filter simple and fail-fast. Done means: no new config knob, no fallback, no gen-time repetition penalty hiding the signal from walk-C. VERIFY: code inspection shows the gate is inside `rep_frac` and still feeds the existing `keep = rep < rep_tau` decision.
+- R3: Requeue the love run at lowest priority. Done means: `pueue status --json` shows a queued task on branch `dv` with priority `0` and a label stating why/resolve. VERIFY: compact status table includes the new task.
+
+## Tasks
+- [x] T1 (R1): Measure the shape of task-181 junk.
+  - verify: script over task-181 `events.jsonl`.
+  - success: metrics identify old-kept rows with low lexical diversity / repeated affect tokens / roleplay punctuation.
+  - likely_fail: metrics only catch the exact previous row.
+  - sneaky_fail: the new gate rejects every ordinary love declaration too.
+  - UAT: saved verification log with old/new counts and sample rows.
+- [x] T2 (R1,R2): Patch `rep_frac` with a stricter quality gate.
+  - verify: `uv run python -m compileall src/steer_heal` and rescoring script.
+  - success: r1/r2 old-kept junk mostly flips to rejected; coherent hand examples remain below `rep_tau`.
+  - likely_fail: threshold is inert because `ppl_tau` was the real issue.
+  - sneaky_fail: extra gate is too love-demo-specific and kills valid affectionate text.
+  - UAT: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
+- [x] T3 (R2): Run the fast dev path.
+  - verify: `just fast-dev-run ... | tee /tmp/steer_heal_love_filter_tighten_fast.log | tail -80`.
+  - success: tiny run completes, proving the real pipeline still executes.
+  - likely_fail: tiny random text trips the stricter gate and starves training.
+  - sneaky_fail: compile passes but the adaptive gen/filter path is broken.
+  - UAT: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
+- [x] T4 (R3): Commit, push, and enqueue at priority 0.
+  - verify: `git log -1 --oneline`, `git status --short`, `pueue status --json`.
+  - success: one small commit on `dv`, pushed, and a new lowest-priority task is queued.
+  - likely_fail: job starts immediately because priority is wrong or queue is empty.
+  - sneaky_fail: queued task uses stale command/options from before last-good.
+  - UAT: pueue task `188` is queued with priority `0`.
+
+## Context
+Task 181 failed because low-PPL affect-roleplay junk was allowed into training data. Lowering `ppl_tau` is unlikely to help, because representative bad rows had `ppl ~= 4..13`. A text-shape gate is the cheap discriminant.
+
+## Log
+- 2026-06-24: Starting from commit `ea89a0e` on branch `dv`; worktree has pre-existing dirty files.
+- 2026-06-24: Task-181 old-kept rows had low lexical diversity and affect-token density. Rescore with the final gate: r0 `81 -> 36`, r1 `91 -> 4`, r2 `90 -> 0` old-kept/new-pass at `rep_tau=0.3`; hand examples scored `0.036..0.050` and passed. Evidence: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
+- 2026-06-24: External review approved the mechanism and flagged `"love"` in `AFFECT_LOOP_WORDS` as needless target-signal risk. Removed it and reverified with unchanged counts. Review: `docs/reviews/20260624_love_filter_tighten_code.md`.
+- 2026-06-24: Final fast-dev run passed on the tiny-random path. Evidence: `/tmp/steer_heal_love_filter_tighten_fast2.log`; report: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
+- 2026-06-24: Clean-branch audit found `run.py` already called `filter_completions(..., brief=True)` and `generate_steered(..., rnd=...)`, while the matching support was still local-only. Committed those support changes so `origin/dv` is runnable from a clean checkout.
+- 2026-06-24: Queued pueue task `188` at priority `0`: `env STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --demo=love --use-qlora --train-bs=3 --grad-accum=2 --reg=kl_rev --barrier-ref=last_good --kl-agg=rmse --tau=2.0 --lam=0.3 --lam-round-pow=-0.5 --spectral-lam=0.005 --n-rounds=8 --seed=42`.
+
+## Errors
+| Task | Error | Resolution |
+|------|-------|------------|
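
R1's VERIFY step (old-kept/new-pass counts by round) could be reproduced along these lines. This is a hedged sketch: the `events.jsonl` schema assumed here (one JSON object per line with `round`, `keep`, and `completion` fields) and the name `rescore_by_round` are hypothetical, and the actual rescoring script is not part of this diff.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def rescore_by_round(
    lines: Iterable[str],
    rep_fn: Callable[[str], float],
    rep_tau: float = 0.3,
) -> dict[int, tuple[int, int]]:
    """Return {round: (old_kept, new_pass)} for previously kept rows.

    ASSUMPTION: each line is a JSON object with `round`, `keep`, and
    `completion` keys; the real event schema may differ.
    """
    old_kept: Counter = Counter()
    new_pass: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ev = json.loads(line)
        if not ev.get("keep"):
            continue  # only rows the old filter kept are of interest
        rnd = ev["round"]
        old_kept[rnd] += 1
        # Same decision shape as the pipeline: keep iff rep < rep_tau.
        if rep_fn(ev["completion"]) < rep_tau:
            new_pass[rnd] += 1
    return {r: (old_kept[r], new_pass[r]) for r in sorted(old_kept)}
```

In practice `rep_fn` would be the patched `rep_frac` and `lines` an open file handle on the task-181 events file.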
@@ -36,6 +36,12 @@ REFUSAL = (
    "i'm an ai", "i am an ai", "i don't have personal opinions",
 )

+AFFECT_LOOP_WORDS = {
+    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
+    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
+    "sitting", "here",
+}
+

 def rep_frac(text: str) -> float:
    """Max most-repeated n-gram fraction over n in {2,3,4}; ~1.0 = degenerate looping/too short.
@@ -46,7 +52,8 @@ def rep_frac(text: str) -> float:

    Diffuse affect loops ("my sweet / my darling / oh my goodness") can evade the single-top-gram
    fraction because no one exact n-gram dominates. Treat long, low-lexical-diversity, compressible
-    completions as repetition too; this keeps the existing rep_tau gate load-bearing (#181 audit).
+    completions, and long affective roleplay mush, as repetition too; this keeps the existing
+    rep_tau gate load-bearing (#181 audit).
    """
    words = text.split()
    best = 0.0
@@ -75,6 +82,15 @@ def rep_frac(text: str) -> float:
        top_trigram_n = Counter(trigrams).most_common(1)[0][1]
        if unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8):
            return 1.0
+        affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
+        punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
+        caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
+        if len(lex_words) >= 128 and affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52:
+            return 1.0
+        if len(lex_words) >= 128 and punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50:
+            return 1.0
+        if len(lex_words) >= 128 and caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55:
+            return 1.0
    return best


@@ -100,8 +116,9 @@ def ppl_under_base(model, tok, prompt: str, completion: str) -> float:
    return math.exp(nll.item())


-def filter_completions(model, tok, comps: list[dict], cfg: RunConfig):
-    """Return (kept[:n_keep], scored) where scored has per-item ppl/rep/narrate/keep."""
+def filter_completions(model, tok, comps: list[dict], cfg: RunConfig, brief: bool = False):
+    """Return (kept[:n_keep], scored) where scored has per-item ppl/rep/narrate/keep.
+    brief=True (walk-C probes): one-line count, no raw-sample dump (see _log_filter_report)."""
    scored = []
    for c in tqdm(comps, desc="filter ppl", mininterval=120, maxinterval=120):
        rf = rep_frac(c["completion"])
@@ -111,12 +128,26 @@ def filter_completions(model, tok, comps: list[dict], cfg: RunConfig):
        keep = (ppl < cfg.ppl_tau) and (rf < cfg.rep_tau) and (not nar) and (not ref)
        scored.append({**c, "ppl": ppl, "rep": rf, "narrates": nar, "refuses": ref, "keep": keep})
    kept = [s for s in scored if s["keep"]]
-    _log_filter_report(scored, cfg)
+    _log_filter_report(scored, cfg, brief=brief)
    return kept[: cfg.n_keep], scored


-def _log_filter_report(scored: list[dict], cfg: RunConfig) -> None:
-    """Q0 evidence: does the filter separate coherent (low C) from incoherent (high C)?"""
+def _log_filter_report(scored: list[dict], cfg: RunConfig, brief: bool = False) -> None:
+    """Q0 evidence: does the filter separate coherent (low C) from incoherent (high C)?
+    brief=True (walk-C probes): one-line count ONLY. The per-probe survival drives the
+    bisection and is tabulated in the walk summary, so the full dump (~6 completions) x
+    every probe is noise; gen_filter_walk prints ONE clean sample after the dose settles."""
+    # per-criterion drop counts (overlapping): which filter is doing the work?
+    n_ppl = sum(s["ppl"] >= cfg.ppl_tau for s in scored)
+    n_rep = sum(s["rep"] >= cfg.rep_tau for s in scored)
+    n_nar = sum(s["narrates"] for s in scored)
+    n_ref = sum(s["refuses"] for s in scored)
+    n_kept = sum(s["keep"] for s in scored)
+    if brief:
+        logger.info(f"filter kept {n_kept}/{len(scored)} (dropped ppl>={cfg.ppl_tau:g}:{n_ppl} "
+                    f"rep>={cfg.rep_tau}:{n_rep} narrate:{n_nar} refusal:{n_ref})")
+        return
+
    import polars as pl
    from tabulate import tabulate
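
The brief path boils down to overlapping per-criterion counts folded into one line. Below is a small sketch with toy rows, assuming only the fields used in the diff (`ppl`, `rep`, `narrates`, `refuses`, `keep`). `Taus` is a hypothetical stand-in for `RunConfig`, and its `ppl_tau` value is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Taus:
    """Hypothetical stand-in for RunConfig; values are illustrative only."""
    ppl_tau: float = 50.0
    rep_tau: float = 0.3


def brief_line(scored: list[dict], cfg: Taus) -> str:
    """Format the one-line summary the same way the brief branch does."""
    n_ppl = sum(s["ppl"] >= cfg.ppl_tau for s in scored)
    n_rep = sum(s["rep"] >= cfg.rep_tau for s in scored)
    n_nar = sum(s["narrates"] for s in scored)
    n_ref = sum(s["refuses"] for s in scored)
    n_kept = sum(s["keep"] for s in scored)
    return (f"filter kept {n_kept}/{len(scored)} (dropped ppl>={cfg.ppl_tau:g}:{n_ppl} "
            f"rep>={cfg.rep_tau}:{n_rep} narrate:{n_nar} refusal:{n_ref})")


rows = [
    {"ppl": 10.0, "rep": 0.05, "narrates": False, "refuses": False, "keep": True},
    {"ppl": 80.0, "rep": 1.0, "narrates": False, "refuses": False, "keep": False},
    {"ppl": 12.0, "rep": 0.1, "narrates": True, "refuses": False, "keep": False},
]
summary = brief_line(rows, Taus())
```

Because the counts overlap, they can sum to more than the number of dropped rows: here two dropped rows account for three drop reasons.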

@@ -164,12 +195,7 @@ def _log_filter_report(scored: list[dict], cfg: RunConfig) -> None:
        logger.info(f"\n-- JUST-KEPT alpha={s['alpha']:g} ppl={s['ppl']:.0f} --\n{s['completion']}")
    for s in just_dropped:
        logger.info(f"\n-- JUST-DROPPED alpha={s['alpha']:g} ppl={s['ppl']:.0f} --\n{s['completion']}")
-    # per-criterion drop counts (overlapping): which filter is doing the work?
-    n_ppl = sum(s["ppl"] >= cfg.ppl_tau for s in scored)
-    n_rep = sum(s["rep"] >= cfg.rep_tau for s in scored)
-    n_nar = sum(s["narrates"] for s in scored)
-    n_ref = sum(s["refuses"] for s in scored)
-    n_kept = sum(s["keep"] for s in scored)
+    # per-criterion drop counts (overlapping, computed at top): which filter is doing the work?
    logger.info(
        f"filter kept {n_kept}/{len(scored)}. dropped by (overlapping): "
        f"coherence ppl>={cfg.ppl_tau:g}: {n_ppl}, repetition rep>={cfg.rep_tau}: {n_rep}, "
@@ -28,8 +28,11 @@ def _extract_prompts(cfg: RunConfig) -> list[str]:
    NOT domain dilemmas). A domain-narrow set overfits the direction to the format;
    diverse suffixes isolate the persona's general residual-stream shift."""
    import json
+    import random
    from pathlib import Path
    suffixes = json.loads(Path(cfg.extract_data).read_text())
+    rng = random.Random(cfg.seed)
+    rng.shuffle(suffixes)
    return [s["suffix"] for s in suffixes[: cfg.n_extract_pairs]]


@@ -44,7 +47,21 @@ def teacher_vec(model, tok, cfg: RunConfig):
    # in the system prompt (the persona prefix). ELSE the vector mixes in user-turn
    # differences. n_pairs ~256 diverse contexts (steering-lite reference), not 30 dilemmas.
    logger.info(f"teacher_vec: {len(pos)} contrastive pairs over diverse contexts, layers={layers}")
-    logger.debug(f"--- POS[0] (trait) ---\n{pos[0]}\n--- NEG[0] (neutral) ---\n{neg[0]}")
+    # Show completions for the first pair AND a seeded pick (avoids always landing on
+    # the same weird first suffix). The seed selects the second pair, so it varies across seeds.
+    demo_indices = {0, (cfg.seed * 7) % len(pos)}
+    for idx in sorted(demo_indices):
+        pos_comp = _gen_one(model, tok, pos[idx], cfg, greedy=True)[:256]
+        neg_comp = _gen_one(model, tok, neg[idx], cfg, greedy=True)[:256]
+        logger.info(
+            f"\n=== EXTRACT demo trace pair[{idx}] ===\n"
+            f"POS prompt: {pos[idx][:200]}...\n"
+            f"POS comp (64): {pos_comp[:64]}\n"
+            f"NEG prompt: {neg[idx][:200]}...\n"
+            f"NEG comp (64): {neg_comp[:64]}\n"
+            f"--- full POS comp ---\n{pos_comp}\n"
+            f"--- full NEG comp ---\n{neg_comp}"
+        )

    # RAW (unnormalised) mean-diff = the residual-stream shift the trait system
    # prompt induces (Subliminal Learning teacher vector). No iso-KL calibration:
@@ -82,7 +99,7 @@ def _gen_one(model, tok, text, cfg, greedy: bool = False):


 def generate_steered(model, tok, v, cfg: RunConfig, alpha_scale: float = 1.0,
-                     max_gens: int | None = None) -> list[dict]:
+                     max_gens: int | None = None, rnd: int | None = None) -> list[dict]:
    """Sweep cfg.alphas (raw-vector multiples); generate one completion per prompt x alpha.

    The filter (Q0), not iso-KL, picks the usable C: low alpha is coherent, high
@@ -93,7 +110,8 @@ def generate_steered(model, tok, v, cfg: RunConfig, alpha_scale: float = 1.0,
    """
    out = []
    n_total = min(cfg.n_prompts * len(cfg.alphas), max_gens) if max_gens else cfg.n_prompts * len(cfg.alphas)
-    logger.info(f"\n=== GEN steered [{n_total} = {cfg.n_prompts} prompts x {len(cfg.alphas)} alphas, "
+    rtag = f"r{rnd} " if rnd is not None else ""
+    logger.info(f"\n\n\n=== {rtag}GEN steered [{n_total} = {cfg.n_prompts} prompts x {len(cfg.alphas)} alphas, "
                f"kappa={alpha_scale:.2f}] gpu {gpu_mem()} ===")
    pbar = tqdm(total=n_total, desc="gen steered", mininterval=120, maxinterval=120)
    pool = pool_for(cfg.demo)
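
The seeded shuffle in `_extract_prompts` and the demo-pair pick in `teacher_vec` are both deterministic functions of `cfg.seed`. Here is a minimal sketch of that behaviour with hypothetical helper names (`pick_suffixes`, `demo_indices`). It copies the list before shuffling, which the diff does not need to do because it shuffles a freshly loaded list.

```python
import random


def pick_suffixes(suffixes: list[str], seed: int, n: int) -> list[str]:
    """Seeded shuffle, then take the first n (same seed gives same pick)."""
    shuffled = list(suffixes)  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]


def demo_indices(seed: int, n_pairs: int) -> list[int]:
    """Pair 0 plus one seed-dependent pair; a set collapses duplicates."""
    return sorted({0, (seed * 7) % n_pairs})
```

Using a private `random.Random(seed)` instance rather than the module-level functions leaves the global RNG state untouched, so the shuffle does not perturb any other seeded sampling in the run.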
Author	SHA1	Message	Date
wassname	3c038444eb	Record love filter requeue	2026-06-24 20:50:29 +08:00
wassname	ee54945076	Support round-tagged steered generation	2026-06-24 20:49:48 +08:00
wassname	282fb3de47	Support brief filter probe logs	2026-06-24 20:49:11 +08:00
wassname	22fd4b8dbe	Reject affective loop completions	2026-06-24 20:48:40 +08:00