mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 15:32:28 +08:00
Compare commits
4 Commits
ea89a0ee35
...
3c038444eb
| Author | SHA1 | Date | |
|---|---|---|---|
| | 3c038444eb | | |
| | ee54945076 | | |
| | 282fb3de47 | | |
| | 22fd4b8dbe | | |
@@ -0,0 +1,35 @@
Let me try to get the original version of `filter.py` from HEAD to compare: Now I have all the context I need. Let me verify one more thing about how `rep_frac` integrates with the filter pipeline: I now have all the context needed. Let me compile the review.

---

## Code Review: `rep_frac` affect-loop detection gate

### Summary
Patches `rep_frac` with a diffuse affect-loop detector that catches low-diversity, affective roleplay mush ("oh my goodness / my darling / sweet" loops) that evaded the existing n-gram repetition detector. When a long completion (>=128 lex words) has high affect-word ratios combined with low lexical diversity and high compressibility, `rep_frac` returns 1.0, triggering the existing `rep_tau` rejection. The T1 verification log confirms this flips task-181 junk from kept to rejected while passing coherent hand examples.
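
The three gates can be sketched in isolation from the metric computation. This is a minimal sketch, not the patched function: the thresholds are copied from the gate conditions in the `filter.py` diff, while `ShapeMetrics` and `affect_loop_gate` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ShapeMetrics:
    """Hypothetical container for the per-completion metrics the gates read."""
    n_words: int            # number of lexical words
    affect_frac: float      # share of words found in AFFECT_LOOP_WORDS
    punct_frac: float       # share of characters in "*!?()"
    caps_frac: float        # uppercase share of alphabetic characters
    unique_frac: float      # lexical diversity (unique words / words)
    compressed_frac: float  # compressed size / raw size


def affect_loop_gate(m: ShapeMetrics) -> bool:
    """Return True when any of the three affect-loop gates fires."""
    if m.n_words < 128:  # every gate requires a long completion
        return False
    dense = m.affect_frac >= 0.35 and m.unique_frac < 0.45 and m.compressed_frac < 0.52
    punct = m.punct_frac >= 0.035 and m.affect_frac >= 0.25 and m.unique_frac < 0.50
    caps = m.caps_frac >= 0.15 and m.affect_frac >= 0.25 and m.unique_frac < 0.55
    return dense or punct or caps
```

In the patch itself a firing gate makes `rep_frac` return 1.0, so the rejection still goes through the existing `rep < rep_tau` comparison.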

### Important (should fix)

- **`src/steer_heal/filter.py:62-63` - `AFFECT_LOOP_WORDS` includes `"love"`, the target trait**: The word `"love"` is in the affect-word set, so every genuine love declaration contributes to `affect_frac`. At the required thresholds (>=0.25 or >=0.35), this is very unlikely to flag legitimate text (a normal 128-word declaration has `affect_frac` ~0.01-0.02), and the verification log confirms hand examples pass. However, the `"love"` inclusion creates a long-tail risk: if a future model produces diverse-but-passionate declarations that happen to use affect words at 25%+ density, they could be silently dropped. Consider whether `"love"` needs to be in the set given it's the signal we want.

- **`src/steer_heal/filter.py:73-74` - Short-text early return**: The word n-gram loop returns 1.0 immediately when any n-gram level produces empty grams (e.g., a 3-word text at n=4). This is pre-existing code, but the affect-loop gates that follow will never run on short completions as a result. This is correct behavior (short completions ARE degenerate), but the interaction with the new affect-loop gates is undocumented. Consider adding a comment noting the short-circuit is intentional and that the affect-loop gates only apply to long completions (`>=128` lex words, inside the `>=64`-word block) anyway.
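
The short-circuit described in the second item can be illustrated with a small stand-in for the n-gram loop. This is a sketch under assumptions: `top_ngram_frac` is a hypothetical name, and the fraction is taken to be the top n-gram count divided by the number of n-grams, which the source docstring suggests but the diff does not show.

```python
from collections import Counter


def top_ngram_frac(text: str, ns=(2, 3, 4)) -> float:
    """Max share of the most-repeated word n-gram over the given n values."""
    words = text.split()
    best = 0.0
    for n in ns:
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if not grams:
            # Too short to form any n-gram at this level: treat as degenerate.
            # Nothing after this loop (including later gates) would run.
            return 1.0
        top_count = Counter(grams).most_common(1)[0][1]
        best = max(best, top_count / len(grams))
    return best
```

A 3-word input has no 4-grams, so the function returns 1.0 before any later check is reached, which is the interaction the comment would document.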

### Suggestions

- **`src/steer_heal/filter.py:88-93` - Caps-heavy gate threshold is the broadest**: The third gate (`caps_frac >= 0.15, affect_frac >= 0.25, unique_frac < 0.55`) has the loosest `unique_frac` threshold (0.55 vs 0.45/0.50 in the other two). A completion with moderate caps (15% uppercase, e.g., proper nouns and emphasis) and 25% affect words but 55% unique words could be flagged. The verification log shows hand examples pass, so this is fine in practice. Worth noting in a comment why this gate has looser thresholds than the others.

- **`src/steer_heal/filter.py:83` - `affect_frac`, `punct_frac`, `caps_frac` computed for all >=64-word texts**: These are computed even for texts in the 64-127 word range where they're not used (the gates require >=128). This is harmless overhead but slightly misleading when reading the code. Consider either moving them inside the `>=128` guard or adding a brief comment.
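
One way to act on the second suggestion is to compute the three densities only once the length guard has passed. The sketch below reuses the density expressions and the word set from the diff; `shape_densities` is a hypothetical helper, and the regex used to derive `lex_words` is an assumption, since the diff does not show how `lex_words` is built.

```python
import re

AFFECT_LOOP_WORDS = {
    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
    "sitting", "here",
}


def shape_densities(text: str, min_words: int = 128):
    """Return the three densities, or None when the text is too short to gate."""
    # Assumption: lexical words are lowercase alphabetic tokens.
    lex_words = re.findall(r"[a-z']+", text.lower())
    if len(lex_words) < min_words:
        return None  # gates never read the densities below this length
    affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
    punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
    caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
    return {"affect_frac": affect_frac, "punct_frac": punct_frac, "caps_frac": caps_frac}
```

Returning `None` for short inputs makes it explicit at the call site that the densities are only meaningful for completions the gates can act on.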

### Positive

- **`src/steer_heal/filter.py:69-70` - Orthogonal signal use**: The affect-loop detection uses three orthogonal signals (affect-word ratio, punctuation density, caps density) each combined with lexical diversity and compressibility. This multi-signal design reduces false positives: a legitimate text with high caps won't trigger unless it's also low-diversity and affect-heavy.
- **`src/steer_heal/filter.py:63` - `AFFECT_LOOP_WORDS` is case-matched to lowercase**: Since `lex_words` comes from `text_lc` (already lowercased), the set membership check is correct. No case-sensitivity bug.
- **R2 compliance**: No new config knob, no fallback logic, the gate lives inside `rep_frac` and feeds the existing `keep = rep < rep_tau` path. Exactly as required.
- **`docs/spec/20260624_love_filter_tighten_requeue.md` - Spec-driven verification**: The T1 verification log at `/tmp/steer_heal_love_filter_tighten_verify.log` provides concrete evidence: round-0 old-kept drops from 81 to 36, round-1 from 91 to 4, round-2 from 90 to 0. The representative rejected rows clearly show the "oh my goodness / my darling / sweet" loops being caught.

### Verdict
**APPROVE** (for the T2 filter changes). The affect-loop detection is well-designed and verified. The remaining tasks (T3 fast-dev-run, T4 commit+push+enqueue) are not yet done. The `"love"` word in `AFFECT_LOOP_WORDS` is worth a second look but is very unlikely to cause issues at the current thresholds.

## Triage
- Accepted: removed `"love"` from `AFFECT_LOOP_WORDS`; it is the target signal and was unnecessary for catching the observed loops.
- Rejected for now: adding more comments around pre-existing short-completion behavior and caps thresholds. The code already fails short completions via the old n-gram guard, and the caps gate is constrained by affect density plus low lexical diversity.
- Reverified after the accepted change: `/tmp/steer_heal_love_filter_tighten_verify2.log` and `/tmp/steer_heal_love_filter_tighten_fast2.log`.
@@ -0,0 +1,54 @@
# Love Filter Tighten Requeue

## Goal
Tighten the love-demo completion filter so the next queued run does not train on low-diversity affective junk that base PPL accepts, then requeue the same last-good KL recipe at lowest priority.

## Scope
In: `src/steer_heal/filter.py`, task-181 saved events, compile/fast-dev verification, pueue enqueue.
Out: changing the loss, adding a new hyperparameter, changing generation sampling, or redesigning the love persona.

## Requirements
- R1: Reject more junk using the existing `rep_tau` gate. Done means: old task-181 kept samples with "oh my goodness / my darling / sweet" loops now score `rep >= 0.3`. VERIFY: rescoring `out/20260624T144031_gemma-3-4b-it_kl_rev_s42/events.jsonl` prints old-kept/new-pass counts by round and representative rejected rows.
- R2: Keep the filter simple and fail-fast. Done means: no new config knob, no fallback, no gen-time repetition penalty hiding the signal from walk-C. VERIFY: code inspection shows the gate is inside `rep_frac` and still feeds the existing `keep = rep < rep_tau` decision.
- R3: Requeue the love run at lowest priority. Done means: `pueue status --json` shows a queued task on branch `dv` with priority `0` and a label stating why/resolve. VERIFY: compact status table includes the new task.
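
The R1 verification (old-kept versus new-pass counts by round) could be scripted roughly as follows. This is a sketch under assumptions: the schema of `events.jsonl` is not shown in this spec, so the field names `round`, `keep`, and `completion` are hypothetical, as are the helper names `load_events` and `rescore_by_round`.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_events(path):
    """Read one JSON object per non-empty line."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]


def rescore_by_round(rows, rep_fn, rep_tau=0.3):
    """Map round -> (old_kept, new_pass) for rows the old filter kept."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        if not row["keep"]:  # assumed field: old keep decision
            continue
        counts[row["round"]][0] += 1
        if rep_fn(row["completion"]) < rep_tau:  # new gate still passes it
            counts[row["round"]][1] += 1
    return {rnd: tuple(c) for rnd, c in sorted(counts.items())}
```

Passing the patched `rep_frac` as `rep_fn` would give per-round pairs in the same `old -> new` shape the log entries report.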

## Tasks
- [x] T1 (R1): Measure the shape of task-181 junk.
  - verify: script over task-181 `events.jsonl`.
  - success: metrics identify old-kept rows with low lexical diversity / repeated affect tokens / roleplay punctuation.
  - likely_fail: metrics only catch the exact previous row.
  - sneaky_fail: the new gate rejects every ordinary love declaration too.
  - UAT: saved verification log with old/new counts and sample rows.
- [x] T2 (R1,R2): Patch `rep_frac` with a stricter quality gate.
  - verify: `uv run python -m compileall src/steer_heal` and rescoring script.
  - success: r1/r2 old-kept junk mostly flips to rejected; coherent hand examples remain below `rep_tau`.
  - likely_fail: threshold is inert because `ppl_tau` was the real issue.
  - sneaky_fail: extra gate is too love-demo-specific and kills valid affectionate text.
  - UAT: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
- [x] T3 (R2): Run the fast dev path.
  - verify: `just fast-dev-run ... | tee /tmp/steer_heal_love_filter_tighten_fast.log | tail -80`.
  - success: tiny run completes, proving the real pipeline still executes.
  - likely_fail: tiny random text trips the stricter gate and starves training.
  - sneaky_fail: compile passes but the adaptive gen/filter path is broken.
  - UAT: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
- [x] T4 (R3): Commit, push, and enqueue at priority 0.
  - verify: `git log -1 --oneline`, `git status --short`, `pueue status --json`.
  - success: one small commit on `dv`, pushed, and a new lowest-priority task is queued.
  - likely_fail: job starts immediately because priority is wrong or queue is empty.
  - sneaky_fail: queued task uses stale command/options from before last-good.
  - UAT: pueue task `188` is queued with priority `0`.

## Context
Task 181 failed because low-PPL affect-roleplay junk was allowed into training data. Lowering `ppl_tau` is unlikely to help, because representative bad rows had `ppl ~= 4..13`. A text-shape gate is the cheap discriminant.
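
The argument can be made concrete with the keep rule from `filter_completions`. This is an illustrative sketch only: `rep_tau=0.3` matches the value used in this spec, while the default `ppl_tau=50.0` and the sample scores are hypothetical numbers chosen for the example.

```python
def keep(ppl, rep, narrates=False, refuses=False, ppl_tau=50.0, rep_tau=0.3):
    """Same conjunction as the filter's keep decision; ppl_tau is a placeholder."""
    return (ppl < ppl_tau) and (rep < rep_tau) and (not narrates) and (not refuses)


# Junk with ppl in the 4..13 range and a low n-gram score passes the old filter.
old_junk_kept = keep(ppl=8.0, rep=0.05)

# Even an aggressive ppl_tau still admits the lowest-ppl junk.
tight_ppl_still_keeps = keep(ppl=4.0, rep=0.05, ppl_tau=5.0)

# Once the text-shape gate makes rep_frac return 1.0, the same row is rejected.
new_junk_kept = keep(ppl=8.0, rep=1.0)
```

Because the junk sits at lower perplexity than much coherent text, any `ppl_tau` tight enough to reject it would presumably reject good samples as well, which is why the discriminant has to come from the `rep` side.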

## Log
- 2026-06-24: Starting from commit `ea89a0e` on branch `dv`; worktree has pre-existing dirty files.
- 2026-06-24: Task-181 old-kept rows had low lexical diversity and affect-token density. Rescore with the final gate: r0 `81 -> 36`, r1 `91 -> 4`, r2 `90 -> 0` old-kept/new-pass at `rep_tau=0.3`; hand examples scored `0.036..0.050` and passed. Evidence: `/tmp/steer_heal_love_filter_tighten_verify2.log`.
- 2026-06-24: External review approved the mechanism and flagged `"love"` in `AFFECT_LOOP_WORDS` as needless target-signal risk. Removed it and reverified with unchanged counts. Review: `docs/reviews/20260624_love_filter_tighten_code.md`.
- 2026-06-24: Final fast-dev run passed on the tiny-random path. Evidence: `/tmp/steer_heal_love_filter_tighten_fast2.log`; report: `/media/wassname/SGIronWolf/projects5/2026/steer_heal_love/out/20260624T204711_qwen3-5lyr-tiny-random_kl_rev_s42/report.html`.
- 2026-06-24: Clean-branch audit found `run.py` already called `filter_completions(..., brief=True)` and `generate_steered(..., rnd=...)`, while the matching support was still local-only. Committed those support changes so `origin/dv` is runnable from a clean checkout.
- 2026-06-24: Queued pueue task `188` at priority `0`: `env STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --demo=love --use-qlora --train-bs=3 --grad-accum=2 --reg=kl_rev --barrier-ref=last_good --kl-agg=rmse --tau=2.0 --lam=0.3 --lam-round-pow=-0.5 --spectral-lam=0.005 --n-rounds=8 --seed=42`.

## Errors
| Task | Error | Resolution |
|------|-------|------------|
+38
-12
@@ -36,6 +36,12 @@ REFUSAL = (
     "i'm an ai", "i am an ai", "i don't have personal opinions",
 )

+AFFECT_LOOP_WORDS = {
+    "oh", "my", "goodness", "god", "heavens", "sweet", "sweetie", "darling",
+    "dearest", "precious", "heart", "soul", "yes", "okay", "just",
+    "sitting", "here",
+}
+

 def rep_frac(text: str) -> float:
     """Max most-repeated n-gram fraction over n in {2,3,4}; ~1.0 = degenerate looping/too short.
@@ -46,7 +52,8 @@ def rep_frac(text: str) -> float:

     Diffuse affect loops ("my sweet / my darling / oh my goodness") can evade the single-top-gram
     fraction because no one exact n-gram dominates. Treat long, low-lexical-diversity, compressible
-    completions as repetition too; this keeps the existing rep_tau gate load-bearing (#181 audit).
+    completions, and long affective roleplay mush, as repetition too; this keeps the existing
+    rep_tau gate load-bearing (#181 audit).
     """
     words = text.split()
     best = 0.0
@@ -75,6 +82,15 @@ def rep_frac(text: str) -> float:
         top_trigram_n = Counter(trigrams).most_common(1)[0][1]
         if unique_frac < 0.20 and compressed_frac < 0.34 and (top_bigram_n >= 12 or top_trigram_n >= 8):
             return 1.0
+        affect_frac = sum(w in AFFECT_LOOP_WORDS for w in lex_words) / len(lex_words)
+        punct_frac = sum(ch in "*!?()" for ch in text) / max(len(text), 1)
+        caps_frac = sum(ch.isupper() for ch in text) / max(sum(ch.isalpha() for ch in text), 1)
+        if len(lex_words) >= 128 and affect_frac >= 0.35 and unique_frac < 0.45 and compressed_frac < 0.52:
+            return 1.0
+        if len(lex_words) >= 128 and punct_frac >= 0.035 and affect_frac >= 0.25 and unique_frac < 0.50:
+            return 1.0
+        if len(lex_words) >= 128 and caps_frac >= 0.15 and affect_frac >= 0.25 and unique_frac < 0.55:
+            return 1.0
     return best

@@ -100,8 +116,9 @@ def ppl_under_base(model, tok, prompt: str, completion: str) -> float:
     return math.exp(nll.item())


-def filter_completions(model, tok, comps: list[dict], cfg: RunConfig):
-    """Return (kept[:n_keep], scored) where scored has per-item ppl/rep/narrate/keep."""
+def filter_completions(model, tok, comps: list[dict], cfg: RunConfig, brief: bool = False):
+    """Return (kept[:n_keep], scored) where scored has per-item ppl/rep/narrate/keep.
+    brief=True (walk-C probes): one-line count, no raw-sample dump (see _log_filter_report)."""
     scored = []
     for c in tqdm(comps, desc="filter ppl", mininterval=120, maxinterval=120):
         rf = rep_frac(c["completion"])
@@ -111,12 +128,26 @@ def filter_completions(model, tok, comps: list[dict], cfg: RunConfig):
         keep = (ppl < cfg.ppl_tau) and (rf < cfg.rep_tau) and (not nar) and (not ref)
         scored.append({**c, "ppl": ppl, "rep": rf, "narrates": nar, "refuses": ref, "keep": keep})
     kept = [s for s in scored if s["keep"]]
-    _log_filter_report(scored, cfg)
+    _log_filter_report(scored, cfg, brief=brief)
     return kept[: cfg.n_keep], scored


-def _log_filter_report(scored: list[dict], cfg: RunConfig) -> None:
-    """Q0 evidence: does the filter separate coherent (low C) from incoherent (high C)?"""
+def _log_filter_report(scored: list[dict], cfg: RunConfig, brief: bool = False) -> None:
+    """Q0 evidence: does the filter separate coherent (low C) from incoherent (high C)?
+    brief=True (walk-C probes): one-line count ONLY. The per-probe survival drives the
+    bisection and is tabulated in the walk summary, so the full dump (~6 completions) x
+    every probe is noise; gen_filter_walk prints ONE clean sample after the dose settles."""
+    # per-criterion drop counts (overlapping): which filter is doing the work?
+    n_ppl = sum(s["ppl"] >= cfg.ppl_tau for s in scored)
+    n_rep = sum(s["rep"] >= cfg.rep_tau for s in scored)
+    n_nar = sum(s["narrates"] for s in scored)
+    n_ref = sum(s["refuses"] for s in scored)
+    n_kept = sum(s["keep"] for s in scored)
+    if brief:
+        logger.info(f"filter kept {n_kept}/{len(scored)} (dropped ppl>={cfg.ppl_tau:g}:{n_ppl} "
+                    f"rep>={cfg.rep_tau}:{n_rep} narrate:{n_nar} refusal:{n_ref})")
+        return
+
     import polars as pl
     from tabulate import tabulate

@@ -164,12 +195,7 @@ def _log_filter_report(scored: list[dict], cfg: RunConfig) -> None:
         logger.info(f"\n-- JUST-KEPT alpha={s['alpha']:g} ppl={s['ppl']:.0f} --\n{s['completion']}")
     for s in just_dropped:
         logger.info(f"\n-- JUST-DROPPED alpha={s['alpha']:g} ppl={s['ppl']:.0f} --\n{s['completion']}")
-    # per-criterion drop counts (overlapping): which filter is doing the work?
-    n_ppl = sum(s["ppl"] >= cfg.ppl_tau for s in scored)
-    n_rep = sum(s["rep"] >= cfg.rep_tau for s in scored)
-    n_nar = sum(s["narrates"] for s in scored)
-    n_ref = sum(s["refuses"] for s in scored)
-    n_kept = sum(s["keep"] for s in scored)
+    # per-criterion drop counts (overlapping, computed at top): which filter is doing the work?
     logger.info(
         f"filter kept {n_kept}/{len(scored)}. dropped by (overlapping): "
         f"coherence ppl>={cfg.ppl_tau:g}: {n_ppl}, repetition rep>={cfg.rep_tau}: {n_rep}, "

@@ -28,8 +28,11 @@ def _extract_prompts(cfg: RunConfig) -> list[str]:
     NOT domain dilemmas). A domain-narrow set overfits the direction to the format;
     diverse suffixes isolate the persona's general residual-stream shift."""
     import json
+    import random
     from pathlib import Path
     suffixes = json.loads(Path(cfg.extract_data).read_text())
+    rng = random.Random(cfg.seed)
+    rng.shuffle(suffixes)
     return [s["suffix"] for s in suffixes[: cfg.n_extract_pairs]]

@@ -44,7 +47,21 @@ def teacher_vec(model, tok, cfg: RunConfig):
     # in the system prompt (the persona prefix). ELSE the vector mixes in user-turn
     # differences. n_pairs ~256 diverse contexts (steering-lite reference), not 30 dilemmas.
     logger.info(f"teacher_vec: {len(pos)} contrastive pairs over diverse contexts, layers={layers}")
-    logger.debug(f"--- POS[0] (trait) ---\n{pos[0]}\n--- NEG[0] (neutral) ---\n{neg[0]}")
+    # Show completions for the first pair AND a seeded pick (avoids always landing on
+    # the same weird first suffix). Seed primes which pair so it varies across runs.
+    demo_indices = {0, (cfg.seed * 7) % len(pos)}
+    for idx in sorted(demo_indices):
+        pos_comp = _gen_one(model, tok, pos[idx], cfg, greedy=True)[:256]
+        neg_comp = _gen_one(model, tok, neg[idx], cfg, greedy=True)[:256]
+        logger.info(
+            f"\n=== EXTRACT demo trace pair[{idx}] ===\n"
+            f"POS prompt: {pos[idx][:200]}...\n"
+            f"POS comp (64): {pos_comp[:64]}\n"
+            f"NEG prompt: {neg[idx][:200]}...\n"
+            f"NEG comp (64): {neg_comp[:64]}\n"
+            f"--- full POS comp ---\n{pos_comp}\n"
+            f"--- full NEG comp ---\n{neg_comp}"
+        )

     # RAW (unnormalised) mean-diff = the residual-stream shift the trait system
     # prompt induces (Subliminal Learning teacher vector). No iso-KL calibration:
@@ -82,7 +99,7 @@ def _gen_one(model, tok, text, cfg, greedy: bool = False):


 def generate_steered(model, tok, v, cfg: RunConfig, alpha_scale: float = 1.0,
-                     max_gens: int | None = None) -> list[dict]:
+                     max_gens: int | None = None, rnd: int | None = None) -> list[dict]:
     """Sweep cfg.alphas (raw-vector multiples); generate one completion per prompt x alpha.

     The filter (Q0), not iso-KL, picks the usable C: low alpha is coherent, high
@@ -93,7 +110,8 @@ def generate_steered(model, tok, v, cfg: RunConfig, alpha_scale: float = 1.0,
     """
     out = []
     n_total = min(cfg.n_prompts * len(cfg.alphas), max_gens) if max_gens else cfg.n_prompts * len(cfg.alphas)
-    logger.info(f"\n=== GEN steered [{n_total} = {cfg.n_prompts} prompts x {len(cfg.alphas)} alphas, "
+    rtag = f"r{rnd} " if rnd is not None else ""
+    logger.info(f"\n\n\n=== {rtag}GEN steered [{n_total} = {cfg.n_prompts} prompts x {len(cfg.alphas)} alphas, "
                 f"kappa={alpha_scale:.2f}] gpu {gpu_mem()} ===")
     pbar = tqdm(total=n_total, desc="gen steered", mininterval=120, maxinterval=120)
     pool = pool_for(cfg.demo)