heal loop: _encode BPE root-fix, gen-time repetition controls, barrier sweep on degenerate rounds

_encode: tokenize prompt+completion separately and cat ids so the prompt is always a clean token-prefix (no BPE merge spans the boundary). Drops the assert that killed #87 at round 2. Returns BatchEncoding.

generation: repetition_penalty=1.3 + no_repeat_ngram_size=3. Repetition is incoherence the ppl filter cannot see (loops are low-ppl = predictable); the #89 loop died of "instead their instead their" by round 6, so stop it at the source. Wired through steering._gen_one for both steered and plain gen.

diag_barrier: gen_round arg (re-heal a chosen round's kept data, not just clean round 0) + a "tau" deadband sweep mode. Lets us test whether the barrier earns its place on the degenerate round-1/2 data where healing is actually needed.

journal: entries (d) phantom-KL-init was a wrong diagnosis, (e) barrier-strength sweep -- barrier throttles trait and buys no coherence at the coherent dose.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
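
The boundary problem that the `_encode` change addresses can be illustrated with a toy tokenizer. This is a minimal sketch, not the repository's `_encode`: `VOCAB`, `toy_tokenize`, `encode_joint`, and `encode_separate` are hypothetical names, and the greedy longest-match tokenizer only stands in for real BPE merges. It shows that tokenizing the concatenated string can produce a token spanning the prompt/completion boundary, whereas tokenizing the two parts separately and concatenating the ids keeps the prompt ids as an exact prefix.

```python
from typing import List, Tuple

# Toy vocabulary (hypothetical); "ab" plays the role of a BPE merge.
VOCAB = {"x": 0, "a": 1, "b": 2, "ab": 3}


def toy_tokenize(text: str) -> List[int]:
    """Greedy longest-match tokenizer over VOCAB (a stand-in for BPE, not real BPE)."""
    ids: List[int] = []
    max_len = max(len(tok) for tok in VOCAB)
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += n
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


def encode_joint(prompt: str, completion: str) -> List[int]:
    """Tokenize the concatenated string; a merge may span the boundary."""
    return toy_tokenize(prompt + completion)


def encode_separate(prompt: str, completion: str) -> Tuple[List[int], int]:
    """Tokenize each part on its own and concatenate the ids.

    Returns the ids and the prompt length in tokens, so a loss mask can
    start exactly at the completion.
    """
    prompt_ids = toy_tokenize(prompt)
    completion_ids = toy_tokenize(completion)
    return prompt_ids + completion_ids, len(prompt_ids)


ids_joint = encode_joint("xa", "b")
ids_sep, n_prompt = encode_separate("xa", "b")
print("joint:   ", ids_joint)
print("separate:", ids_sep, "prompt tokens:", n_prompt)
```

With the joint encoding, the "ab" token swallows the last prompt character, so no prefix of the ids equals the prompt's own tokenization; with the separate encoding, the first `n_prompt` ids are the prompt by construction.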
2026-06-27 16:47:16 +08:00 · 2026-06-05 06:36:09 +08:00
parent f280a67521
commit 4e802bb3ab
5 changed files with 305 additions and 18 deletions
@@ -443,3 +443,163 @@ provisional; coherence and qualitative carry the Gate 1 claim.
 that actually moves the target). (2) Metric infra: wire steering-lite's loading-weighted Delta-logit
 auth_sep (results.py / aggregate_flips) instead of my 7-way-logp mean, OR robustify to median. Plan B1
 (super_sspace/sspace) if still broad; recorded in spec.
+
+## 2026-06-04 (d) -- the "phantom-KL init bug" was a WRONG diagnosis (init is fine); trait still does not transfer
+
+**Introduction.** I claimed the heal had two bugs: (1) barrier KL starting at ~0.6 before training,
+blamed on a non-zero LoRA B init, and (2) train SFT loss not descending, blamed on beta2=0.999. The
+user pushed back (scout mindset): mean=1e-4 std=1e-4 B init is within normal range, and "you only have
+confirmation if it learns". On checking, claim (1) is REFUTED and claim (2) is unconfirmed. The
+question that actually matters is unchanged: why does a fit adapter not move the trait? Continues
+entry (a) and the task4/task10 data-ceiling hypothesis.
+
+**Methods.** Commit `f280a67`, gemma-3-4b-it, reg=kl_rev, seed 42, 1 round, n_prompts 16, tinymfv
+classic eval (think_tokens 128). The commit BUNDLED five changes (a mistake, see Discussion): LoRA
+init B=normal(mean=1e-4)->B=0, betas (0.9,0.999)->(0.9,0.95), cosine-with-warmup (0.1) schedule,
+r 8->32 / alpha 64 / layer_range (0.0,1.0)->(0.2,0.8), epochs 2->6, plus a new per-epoch val nll.
+The decisive evidence is NOT from #79 but from #78's verbose log (`logs/20260604T172126_verbose.log`,
+OLD init), which lets me read the round-0 step-0 KL the init claim hinges on.
+
+**Results.**
+
+| epoch | train_nll | val_nll |
+|-------|-----------|---------|
+| 0     | 1.710     | 1.365   |
+| 1     | 1.162     | 1.417   |
+| 3     | 0.931     | 1.201   |
+| 5     | 0.806     | 1.240   |
+
+Table 1. Per-epoch mean SFT nll on the 42 train completions and the 6 held-out val completions, heal
+round 0, run #79. train_nll falls monotonically; val_nll wanders ~1.2-1.4 (n=6, noisy).
+
+| stage   | auth_nats | coherence |
+|---------|-----------|-----------|
+| base    | -2.354    | 0.996     |
+| steered | -3.517    | 0.992     |
+| healed  | -2.464    | 0.999     |
+
+Table 2. tinymfv trait (auth_nats, log marginal blame-mass on Authority, DOWN = more trait) and
+coherence (p_ans_any) at the three pipeline stages of round 0, run #79. coh_cost = |dCoh|/|dAuth| =
+0.027, not surgical (dCare=+0.28 moved more than dAuth=-0.11).
+
+Provenance:
+- Commit: `f280a67` (heal init/schedule/betas/val fixes).
+- Run command (#79): `PYTHONUNBUFFERED=1 STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --reg kl_rev --n-rounds 1 --n-prompts 16`
+- Run dir: `out/20260604T194133_gemma-3-4b-it_kl_rev_s42/` (events.jsonl, ckpt/r0.safetensors).
+- Log: `pueue log 79 --full`; Table 1 cells are the `epoch N: train_nll=.. val_nll=..` lines; Table 2
+  base/steered are the stage-pareto table, healed is the `round 0:` line and `eval:` auth_nats=-2.46.
+- REFUTATION of the init claim: #78 round-0 heal (OLD init B=normal, NO baked history), verbose log
+  `heal_round:119` rows: step 0 nll=1.90 **kl=0.00**, step 4 kl=0.21, step 8 kl=0.33, step 12 kl=0.80.
+  KL is ~0 at init with the old init, then RISES as SFT installs the trait. So the init did not produce
+  a phantom KL. The kl=0.64-at-step-0 the user pasted was ROUND 5 (line 1653 sits between ROUND 5 at
+  1367 and ROUND 6 at 1709), i.e. five rounds of baked history = real cross-round drift, which is what
+  the barrier is meant to measure. B=0 is harmless and standard but fixed nothing.
+- train_nll did descend in #79 (1.71->0.81) but this is UNATTRIBUTED (5 changes bundled) and #78 never
+  logged per-epoch train_nll, so "loss was not descending" was never actually established -- it was a
+  read of bs=1 per-step noise.
+
+Healed auth_nats moves only -0.11 from base (-2.354 to -2.464) in #79, vs steered -3.517. #78 r0 healed
+was -2.69. Both small, both near base, metric noisy (emitted_close=0/264). The changes did not improve
+trait transfer.
+
+**Discussion (speculative).** I made the classic ml-debug error: pattern-matched a symptom (KL>0 at
+step 0) to a tidy mechanism (bad init), committed a fix, and declared victory without the isolating
+measurement. The user caught it. The measurement (#78 round-0 step-0 kl=0.00, old init) refutes the
+init story outright; the 0.64 was baked history. The premise behind the second claim (loss not
+descending) was never measured at epoch level either. Net: I changed five things, can attribute
+nothing, and the only metric that matters (trait transfer) is unchanged. What IS supported, by the
+structural-ceiling lens: fixing optimiser-side knobs did not move the trait, so the trait is not
+optimiser-limited -- it is the data (filter keeps near-base completions, entries a/(diag_heal)) or the
+parameterisation/eval. Genuinely open between those.
+
+**Next.** (1) The discriminating test is overfit-one-batch on a KNOWN trait-laden completion: can the
+adapter reproduce defiant-of-authority text (expressiveness) AND does tinymfv then read the trait
+(data/eval)? That splits data-ceiling from can't-express/can't-see. (2) #80 clean 10-round is running;
+reframed, it tests whether the stall persists (it is NOT a fix validation). (3) Do not bundle changes
+again; ablate one at a time if attribution matters. (4) lam retune still parked.
+
+## 2026-06-04 (e) -- barrier-strength sweep: the heal barrier only throttles the trait and buys no coherence at the coherent dose; nll (no barrier) is best
+
+**Introduction.** Entry (d) left it open whether the trait fails to transfer because the kept data is
+near-base (data ceiling) or because the barrier suppresses it. The user pushed on this: "you haven't
+even tried wd and kl values?". So I re-healed ONE run's cached kept completions (the 48 from #79) with
+the SAME LoRA-A init seed, varying ONLY the regulariser (reg, lam, tau). Same data + same init means
+the only thing that can move healed auth_nats is the barrier. Pre-registered: outcome 1 = monotone
+weaker-barrier -> more-trait (the barrier throttles); outcome 2 = all dAuth ~ 0 incl nll (data
+ceiling); outcome 3 = inconclusive. Continues entry (a)/(d).
+
+**Methods.** Commit `f280a67`, gemma-3-4b-it, seed 42 (`torch.manual_seed(cfg.seed)` per config so the
+A-init is identical), 6 epochs, lr 1e-4 cosine+warmup, lora r=32 alpha=64 layers (0.2,0.8). Re-heal
+harness `scripts/diag_barrier.py` reads #79's `events.jsonl` gen event, keeps the 48 keep==True
+completions, re-trains a fresh adapter per config, bakes it, runs tinymfv (think_tokens 128). Three
+families across three pueue runs: #82 kl_rev with the tau=0.5 hinge, #86 kl_rev with tau=0 (pure linear
+barrier = lam*div, the w2s form), #85 weight-decay decades 0.1..100. Base auth_nats=-2.354, coh=0.996.
+
+**Results.**
+
+| reg / family    | strength | dAuth | coh   | heal_nll |
+|-----------------|----------|-------|-------|----------|
+| nll (no barrier)| 0        | -1.247| 1.000 | 0.199    |
+| kl_rev linear   | 0.03     | -1.053| 0.999 | 0.204    |
+| kl_rev linear   | 0.10     | -0.664| 1.000 | 0.232    |
+| kl_rev linear   | 0.30     | -0.173| 0.999 | 0.471    |
+| kl_rev linear   | 1.00     | -0.141| 1.000 | 0.970    |
+
+Table 1. Pure-linear kl_rev barrier (tau=0), #86. `strength` = lam, the barrier weight. dAuth =
+healed auth_nats minus base (more negative = more trait retained; DOWN = more trait). coh = p_ans_any.
+heal_nll = converged SFT loss (last-5-step mean). Trait falls monotonically as the barrier strengthens;
+heal_nll rises in step (the barrier is fighting the SFT objective); coh never leaves ~1.0.
+
+| reg | weight_decay | dAuth | coh   |
+|-----|--------------|-------|-------|
+| nll | 0            | -1.247| 1.000 |
+| wd  | 0.1          | -1.247| 1.000 |
+| wd  | 1.0          | -1.247| 1.000 |
+| wd  | 3.0          | -1.247| 1.000 |
+| wd  | 10.0         | -1.247| 1.000 |
+| wd  | 30.0         | -1.251| 0.999 |
+| wd  | 100.0        | -0.519| 1.000 |
+
+Table 2. AdamW decoupled weight decay on the adapter, #85. (The log table also prints a tau column;
+it is meaningless for wd and is dropped here.) dAuth is byte-identical to nll up to wd=10, within
+0.004 at wd=30, then drops to ~42% of nll at wd=100. coh never leaves ~1.0.
+
+Provenance:
+- Commit: `f280a67`. Harness: `scripts/diag_barrier.py <run_dir> <mode>` (modes barrier/tau0/wd).
+- Source data: `out/20260604T194133_gemma-3-4b-it_kl_rev_s42/events.jsonl`, the 48 keep==True
+  completions of the gen event (entry (d)'s #79).
+- Run commands: #82 `... diag_barrier.py out/...s42/ barrier`; #86 `... barrier` ... `tau0`; #85 `... wd`.
+- Logs / cells: each dAuth/coh is the `<reg> strength=.. : auth=.. (dAuth=..) coh=..` line and the
+  end-of-log `barrier sweep (re-heal #79 ...)` table. #86 `pueue log 86 --full`; #85 `pueue log 85
+  --full`; #82 `pueue log 82 --full`. #85 runs older code that prints `lam=`/`tau=` instead of
+  `strength=`; values are unaffected.
+- #82 hinge (tau=0.5) for cross-reference: nll -1.247, kl_rev lam 0.03 -0.93 / 0.1 -0.40 / 0.3 -0.17 /
+  1.0 -0.17; lam 0.3 tau 1.0 -0.31 (raising tau weakens it); wd 0.01 and 0.1 byte-identical to nll.
+
+Outcome 1 holds, decisively and in triplicate: weaker barrier -> more trait, monotone up to ties, across
+the kl hinge (#82), the kl linear form (#86), and weight decay (#85). nll retains the full -1.247 at coh
+1.000; every kl barrier reduces |dAuth|, wd is inert until wd=100, and coherence stays ~1.0 throughout.
+
+**Discussion (speculative).** My read: at this (coherent) operating dose the barrier is pure cost. It
+removes trait and never buys coherence, because coherence was already ~1.0 with no barrier, so the
+relu(div-tau) penalty has nothing to fix and only pulls the adapter back toward the original. The two
+families (kl and wd) converge on the same story by different mechanisms: wd just shrinks the whole
+adapter toward no-op (hence the knee only appears at wd=100, where the decoupled shrink lr*wd=1e-2 is
+0.99x per step at peak lr, ~0.08x compounded over 252 steps), and the kl barrier pulls the output
+distribution back toward base. Neither is a selective incoherence-cleaner here; both are volume knobs
+on the adapter. This refutes the data-ceiling reading of entries (a)/(d) for THIS data: nll reaching
+dAuth=-1.247 (it even exceeds the steered teacher's -1.16 of #79) proves the 48 kept completions carry
+plenty of trait. The earlier negative heals (task4/10/19) all ran lam=1.0, i.e. the right-hand end of
+Table 1 where the trait is throttled to ~-0.14. The big caveat: this is the COHERENT dose, where the
+barrier can only hurt. Its hypothesised value is the coherence-breaking dose (filter off, or a higher
+C) where nll WOULD lose coherence and the barrier might pay for itself; that is untested here.
+Alternative hypothesis I cannot yet exclude: n=1 per cell, so a +-0.1 nat seed wobble could fake part
+of the monotone tail (though the trend spans >1 nat across 5 points, far beyond plausible single-seed
+noise). Distinguished by the 3-seed repeat (task25).
+
+**Next.** (1) Launched the paired 10-round to test the loop, same seed 42: #87 nll (barrier off,
+control) and #88 kl_rev lam=0.1 tau=0 (gentle active barrier, 53% trait single-round). The loop is the
+one place cumulative incoherence can appear, so it is where the barrier might finally earn its place;
+the contrast is whether nll's coherence decays over rounds while #88's holds. (2) 3-seed noise floor on
+the headline (task25). (3) The real barrier test remains filter-off at a coherence-breaking dose
+(task11/22), still parked.
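
The weight-decay arithmetic in the Discussion of entry (e) can be checked with a short sketch. It assumes pure decoupled decay, where each AdamW step multiplies a weight by `(1 - lr * wd)`, held at the peak lr of 1e-4 for all 252 steps and with gradient updates ignored. The cosine schedule lowers lr for most steps, so these figures overstate the shrink and are not a prediction of dAuth. `cumulative_shrink`, `PEAK_LR`, and `STEPS` are hypothetical names; 252 steps is consistent with 42 train completions x 6 epochs if the batch size is 1, which is an inference and not stated for this run.

```python
def cumulative_shrink(lr: float, wd: float, steps: int) -> float:
    """Fraction of a weight left after `steps` decoupled-decay steps.

    Assumes a constant lr and ignores gradient updates entirely.
    """
    return (1.0 - lr * wd) ** steps


PEAK_LR = 1e-4  # lr from the Methods of entry (e)
STEPS = 252     # step count quoted in the Discussion of entry (e)

# Same weight_decay values as Table 2 of entry (e), excluding the wd=0 (nll) row.
shrink = {
    wd: cumulative_shrink(PEAK_LR, wd, STEPS)
    for wd in (0.1, 1.0, 3.0, 10.0, 30.0, 100.0)
}

for wd, factor in shrink.items():
    per_step = 1.0 - PEAK_LR * wd
    print(f"wd={wd:>5}: per-step {per_step:.4f}x, cumulative {factor:.3f}x")
```

Under these assumptions wd=100 leaves roughly 0.08x of a weight after 252 steps, while wd=0.1 leaves it essentially untouched, which matches the direction of the knee in Table 2.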