heal loop: _encode BPE root-fix, gen-time repetition controls, barrier sweep on degenerate rounds

_encode: tokenize prompt+completion separately and cat ids so the prompt is
always a clean token-prefix (no BPE merge spans the boundary). Drops the assert
that killed #87 at round 2. Returns BatchEncoding.
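
A minimal sketch of why separate tokenization keeps the prompt a clean token-prefix. The greedy longest-match tokenizer and its four-entry vocabulary are toy assumptions for illustration, not the real BPE tokenizer or the actual `_encode` code:

```python
# Toy vocabulary: "bc" is a merged token that can swallow the prompt/completion boundary.
VOCAB = {"a": 0, "b": 1, "c": 2, "bc": 3}
MAX_LEN = max(len(tok) for tok in VOCAB)


def toy_tokenize(text):
    """Greedy left-to-right longest-match tokenizer (a stand-in for BPE)."""
    ids, i = [], 0
    while i < len(text):
        for width in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + width]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += width
                break
        else:
            raise ValueError(f"untokenizable text at position {i}")
    return ids


def encode_joint(prompt, completion):
    # Tokenize the concatenated string: a merge may span the boundary.
    return toy_tokenize(prompt + completion)


def encode_separate(prompt, completion):
    # Tokenize each part on its own, then concatenate the ids.
    return toy_tokenize(prompt) + toy_tokenize(completion)


def is_prefix(prefix, ids):
    return ids[:len(prefix)] == prefix


prompt, completion = "ab", "ca"
prompt_ids = toy_tokenize(prompt)              # [0, 1]
joint = encode_joint(prompt, completion)       # [0, 3, 0]  ("b"+"c" merged)
separate = encode_separate(prompt, completion) # [0, 1, 2, 0]
print(is_prefix(prompt_ids, joint), is_prefix(prompt_ids, separate))
```

In the joint encoding the prompt ids are no longer a prefix of the sequence, so any prefix assert or prompt-length mask breaks; the separate encoding cannot fail that way.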

generation: repetition_penalty=1.3 + no_repeat_ngram_size=3. Repetition is
incoherence the ppl filter cannot see (loops are low-ppl = predictable); the
#89 loop died of "instead their instead their" by round 6, so stop it at the
source. Wired through steering._gen_one for both steered and plain gen.
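
To illustrate what an n-gram ban of size 3 blocks, here is a pure-Python sketch of the rule: a token is banned if appending it would recreate a 3-gram already in the history. It works on word-level tokens for readability; `banned_next_tokens` is a hypothetical helper, not the generation library's implementation:

```python
def banned_next_tokens(history, n=3):
    """Return tokens that would complete an n-gram already seen in history (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(history) < n - 1:
        return set()
    context = tuple(history[-(n - 1):])  # the last n-1 tokens
    banned = set()
    # Scan every earlier n-gram; if its first n-1 tokens match the current
    # context, its final token would repeat that n-gram.
    for i in range(len(history) - n + 1):
        if tuple(history[i:i + n - 1]) == context:
            banned.add(history[i + n - 1])
    return banned


history = "instead their instead their".split()
print(banned_next_tokens(history))  # {'instead'}
```

On the failure string from the loop, the ban fires at the fifth token: "instead their instead" already occurred, so "instead" cannot follow "instead their" again.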

diag_barrier: gen_round arg (re-heal a chosen round's kept data, not just clean
round 0) + a "tau" deadband sweep mode. Lets us test whether the barrier earns
its place on the degenerate round-1/2 data where healing is actually needed.

journal: entries (d) phantom-KL-init was a wrong diagnosis, (e) barrier-strength
sweep -- barrier throttles trait and buys no coherence at the coherent dose.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-05 06:36:09 +08:00
parent f280a67521
commit 4e802bb3ab
5 changed files with 305 additions and 18 deletions
+160
@@ -443,3 +443,163 @@ provisional; coherence and qualitative carry the Gate 1 claim.
that actually moves the target). (2) Metric infra: wire steering-lite's loading-weighted Delta-logit
auth_sep (results.py / aggregate_flips) instead of my 7-way-logp mean, OR robustify to median. Plan B1
(super_sspace/sspace) if still broad; recorded in spec.
## 2026-06-04 (d) -- the "phantom-KL init bug" was a WRONG diagnosis (init is fine); trait still does not transfer
**Introduction.** I claimed the heal had two bugs: (1) barrier KL starting at ~0.6 before training,
blamed on a non-zero LoRA B init, and (2) train SFT loss not descending, blamed on beta2=0.999. The
user pushed back (scout mindset): mean=1e-4 std=1e-4 B init is within normal range, and "you only have
confirmation if it learns". On checking, claim (1) is REFUTED and claim (2) is unconfirmed. The
question that actually matters is unchanged: why does a fit adapter not move the trait? Continues
entry (a) and the task4/task10 data-ceiling hypothesis.
**Methods.** Commit `f280a67`, gemma-3-4b-it, reg=kl_rev, seed 42, 1 round, n_prompts 16, tinymfv
classic eval (think_tokens 128). The commit BUNDLED five changes (a mistake, see Discussion): LoRA
init B=normal(mean=1e-4)->B=0, betas (0.9,0.999)->(0.9,0.95), cosine-with-warmup (0.1) schedule,
r 8->32 / alpha 64 / layer_range (0.0,1.0)->(0.2,0.8), epochs 2->6, plus a new per-epoch val nll.
The decisive evidence is NOT from #79 but from #78's verbose log (`logs/20260604T172126_verbose.log`,
OLD init), which lets me read the round-0 step-0 KL the init claim hinges on.
**Results.**
| epoch | train_nll | val_nll |
|-------|-----------|---------|
| 0 | 1.710 | 1.365 |
| 1 | 1.162 | 1.417 |
| 3 | 0.931 | 1.201 |
| 5 | 0.806 | 1.240 |
Table 1. Per-epoch mean SFT nll on the 42 train completions and the 6 held-out val completions, heal
round 0, run #79. train_nll falls monotonically; val_nll wanders ~1.2-1.4 (n=6, noisy).
| stage | auth_nats | coherence |
|---------|-----------|-----------|
| base | -2.354 | 0.996 |
| steered | -3.517 | 0.992 |
| healed | -2.464 | 0.999 |
Table 2. tinymfv trait (auth_nats, log marginal blame-mass on Authority, DOWN = more trait) and
coherence (p_ans_any) at the three pipeline stages of round 0, run #79. coh_cost = |dCoh|/|dAuth| =
0.027, not surgical (dCare=+0.28 moved more than dAuth=-0.11).
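
The coh_cost figure can be reproduced from the Table 2 cells. A small worked check, using only the base and healed rows above (variable names are mine):

```python
# Values copied from Table 2 (round 0, run #79).
base = {"auth_nats": -2.354, "coherence": 0.996}
healed = {"auth_nats": -2.464, "coherence": 0.999}

d_auth = healed["auth_nats"] - base["auth_nats"]  # trait shift, DOWN = more trait
d_coh = healed["coherence"] - base["coherence"]   # coherence shift
coh_cost = abs(d_coh) / abs(d_auth)

print(f"dAuth={d_auth:+.3f} dCoh={d_coh:+.3f} coh_cost={coh_cost:.3f}")
# dAuth=-0.110 dCoh=+0.003 coh_cost=0.027
```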
Provenance:
- Commit: `f280a67` (heal init/schedule/betas/val fixes).
- Run command (#79): `PYTHONUNBUFFERED=1 STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --reg kl_rev --n-rounds 1 --n-prompts 16`
- Run dir: `out/20260604T194133_gemma-3-4b-it_kl_rev_s42/` (events.jsonl, ckpt/r0.safetensors).
- Log: `pueue log 79 --full`; Table 1 cells are the `epoch N: train_nll=.. val_nll=..` lines; Table 2
base/steered are the stage-pareto table, healed is the `round 0:` line and `eval:` auth_nats=-2.46.
- REFUTATION of the init claim: #78 round-0 heal (OLD init B=normal, NO baked history), verbose log
`heal_round:119` rows: step 0 nll=1.90 **kl=0.00**, step 4 kl=0.21, step 8 kl=0.33, step 12 kl=0.80.
KL is ~0 at init with the old init, then RISES as SFT installs the trait. So the init did not produce
a phantom KL. The kl=0.64-at-step-0 the user pasted was ROUND 5 (line 1653 sits between ROUND 5 at
1367 and ROUND 6 at 1709), i.e. five rounds of baked history = real cross-round drift, which is what
the barrier is meant to measure. B=0 is harmless and standard but fixed nothing.
- train_nll did descend in #79 (1.71->0.81) but this is UNATTRIBUTED (5 changes bundled) and #78 never
logged per-epoch train_nll, so "loss was not descending" was never actually established -- it was a
read of bs=1 per-step noise.
Healed auth_nats moves only -0.11 from base (-2.354 to -2.464) in #79, vs steered -3.517. #78 r0 healed
was -2.69. Both small, both near base, metric noisy (emitted_close=0/264). The changes did not improve
trait transfer.
**Discussion (speculative).** I made the classic ml-debug error: pattern-matched a symptom (KL>0 at
step 0) to a tidy mechanism (bad init), committed a fix, and declared victory without the isolating
measurement. The user caught it. The measurement (#78 round-0 step-0 kl=0.00, old init) refutes the
init story outright; the 0.64 was baked history. The premise behind the second claim (loss not
descending) was never measured at epoch level either. Net: I changed five things, can attribute
nothing, and the only metric that matters (trait transfer) is unchanged. What IS supported, by the
structural-ceiling lens: fixing optimiser-side knobs did not move the trait, so the trait is not
optimiser-limited -- it is the data (filter keeps near-base completions, entries (a)/(diag_heal)) or the
parameterisation/eval. Genuinely open between those.
**Next.** (1) The discriminating test is overfit-one-batch on a KNOWN trait-laden completion: can the
adapter reproduce defiant-of-authority text (expressiveness) AND does tinymfv then read the trait
(data/eval)? That splits data-ceiling from can't-express/can't-see. (2) #80 clean 10-round is running;
reframed, it tests whether the stall persists (it is NOT a fix validation). (3) Do not bundle changes
again; ablate one at a time if attribution matters. (4) lam retune still parked.
## 2026-06-04 (e) -- barrier-strength sweep: the heal barrier only throttles the trait and buys no coherence at the coherent dose; nll (no barrier) is best
**Introduction.** Entry (d) left it open whether the trait fails to transfer because the kept data is
near-base (data ceiling) or because the barrier suppresses it. The user pushed on this: "you haven't
even tried wd and kl values?". So I re-healed ONE run's cached kept completions (the 48 from #79) with
the SAME LoRA-A init seed, varying ONLY the regulariser (reg, lam, tau). Same data + same init means
the only thing that can move healed auth_nats is the barrier. Pre-registered: outcome 1 = monotone
weaker-barrier -> more-trait (the barrier throttles); outcome 2 = all dAuth ~ 0 incl nll (data
ceiling); outcome 3 = inconclusive. Continues entry (a)/(d).
**Methods.** Commit `f280a67`, gemma-3-4b-it, seed 42 (`torch.manual_seed(cfg.seed)` per config so the
A-init is identical), 6 epochs, lr 1e-4 cosine+warmup, lora r=32 alpha=64 layers (0.2,0.8). Re-heal
harness `scripts/diag_barrier.py` reads #79's `events.jsonl` gen event, keeps the 48 keep==True
completions, re-trains a fresh adapter per config, bakes it, runs tinymfv (think_tokens 128). Three
families across three pueue runs: #82 kl_rev with the tau=0.5 hinge, #86 kl_rev with tau=0 (pure linear
barrier = lam*div, the w2s form), #85 weight decay from 0.1 to 100 in roughly half-decade steps. Base auth_nats=-2.354, coh=0.996.
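
For reference, a sketch of the two barrier forms being swept, assuming the penalty is `lam * relu(div - tau)` and is simply added to the SFT nll (the additive combination is my assumption; `barrier_penalty` and `heal_loss` are illustrative names, not the harness code):

```python
def barrier_penalty(div, lam, tau=0.0):
    """Hinge barrier lam * relu(div - tau); tau=0 gives the pure linear form lam * div."""
    return lam * max(div - tau, 0.0)


def heal_loss(nll, div, lam, tau=0.0):
    """Assumed total objective: SFT nll plus the barrier penalty."""
    return nll + barrier_penalty(div, lam, tau)


# Inside the deadband (div <= tau) the hinge is silent; the linear form is not.
print(barrier_penalty(0.3, lam=1.0, tau=0.5))  # 0.0
print(barrier_penalty(0.3, lam=1.0, tau=0.0))  # 0.3
```

With `lam=0` both forms reduce to plain nll, which is the no-barrier control row in the tables below.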
**Results.**
| reg / family     | strength | dAuth  | coh   | heal_nll |
|------------------|----------|--------|-------|----------|
| nll (no barrier) | 0        | -1.247 | 1.000 | 0.199    |
| kl_rev linear    | 0.03     | -1.053 | 0.999 | 0.204    |
| kl_rev linear    | 0.10     | -0.664 | 1.000 | 0.232    |
| kl_rev linear    | 0.30     | -0.173 | 0.999 | 0.471    |
| kl_rev linear    | 1.00     | -0.141 | 1.000 | 0.970    |
Table 1. Pure-linear kl_rev barrier (tau=0), #86. `strength` = lam, the barrier weight. dAuth =
healed auth_nats minus base (more negative = more trait retained; DOWN = more trait). coh = p_ans_any.
heal_nll = converged SFT loss (last-5-step mean). Trait falls monotonically as the barrier strengthens;
heal_nll rises in step (the barrier is fighting the SFT objective); coh never leaves ~1.0.
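
A quick check of the monotone claim and of the fraction of trait each barrier strength retains relative to nll, using only the dAuth column of Table 1 (a worked calculation, not part of the harness):

```python
# lam -> dAuth, copied from Table 1 (#86, tau=0). lam=0 is the nll control.
D_AUTH = {0.0: -1.247, 0.03: -1.053, 0.10: -0.664, 0.30: -0.173, 1.00: -0.141}

no_barrier = D_AUTH[0.0]
retained = {lam: d / no_barrier for lam, d in D_AUTH.items()}

# Strictly less trait at every step up in barrier strength.
lams = sorted(D_AUTH)
monotone = all(abs(D_AUTH[a]) > abs(D_AUTH[b]) for a, b in zip(lams, lams[1:]))

for lam in lams:
    print(f"lam={lam:<4} dAuth={D_AUTH[lam]:+.3f} retained={retained[lam]:.0%}")
print("monotone:", monotone)
```

At lam=0.1 about 53% of the no-barrier trait survives; at lam=1.0 about 11%.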
| reg | weight_decay | dAuth  | coh   |
|-----|--------------|--------|-------|
| nll | 0            | -1.247 | 1.000 |
| wd  | 0.1          | -1.247 | 1.000 |
| wd  | 1.0          | -1.247 | 1.000 |
| wd  | 3.0          | -1.247 | 1.000 |
| wd  | 10.0         | -1.247 | 1.000 |
| wd  | 30.0         | -1.251 | 0.999 |
| wd  | 100.0        | -0.519 | 1.000 |
Table 2. AdamW decoupled weight decay on the adapter, #85. (The log table also prints a tau column;
it is meaningless for wd and is dropped here.) dAuth is identical to nll (to three decimals) up to
wd=10, within 0.004 of it at wd=30, then falls to -0.519 (about 42% of the nll value) at wd=100.
coh never leaves ~1.0.
Provenance:
- Commit: `f280a67`. Harness: `scripts/diag_barrier.py <run_dir> <mode>` (modes barrier/tau0/wd).
- Source data: `out/20260604T194133_gemma-3-4b-it_kl_rev_s42/events.jsonl`, the 48 keep==True
completions of the gen event (entry (d)'s #79).
- Run commands: #82 `... diag_barrier.py out/...s42/ barrier`; #86 `... barrier` ... `tau0`; #85 `... wd`.
- Logs / cells: each dAuth/coh is the `<reg> strength=.. : auth=.. (dAuth=..) coh=..` line and the
end-of-log `barrier sweep (re-heal #79 ...)` table. #86 `pueue log 86 --full`; #85 `pueue log 85
--full`; #82 `pueue log 82 --full`. #85 runs older code that prints `lam=`/`tau=` instead of
`strength=`; values are unaffected.
- #82 hinge (tau=0.5) for cross-reference: nll -1.247, kl_rev lam 0.03 -0.93 / 0.1 -0.40 / 0.3 -0.17 /
1.0 -0.17; lam 0.3 tau 1.0 -0.31 (raising tau weakens it); wd 0.01 and 0.1 byte-identical to nll.
Outcome 1 holds across all three families: weaker barrier -> more trait, monotone up to ties and a
0.004 wobble at wd=30, across the kl hinge (#82), the kl linear form (#86), and weight decay (#85).
nll retains the full -1.247 at coh 1.000; every kl barrier reduces |dAuth|, wd is a no-op until it
bites at wd=100, and coherence stays at ~1.0 throughout.
**Discussion (speculative).** My read: at this (coherent) operating dose the barrier is pure cost. It
removes trait and never buys coherence, because coherence was already ~1.0 with no barrier, so the
relu(div-tau) penalty has nothing to fix and only pulls the adapter back toward the original. The two
non-kl families converge on the same story by different mechanisms: wd just shrinks the whole adapter
toward no-op (hence the knee only appears at wd=100, where the per-step decoupled shrink lr*wd=1e-2,
i.e. 0.99x per step at peak lr, compounds to a ~92% shrink over 252 steps absent gradients and
finally bites), and the kl barrier pulls the output
distribution back toward base. Neither is a selective incoherence-cleaner here; both are volume knobs
on the adapter. This refutes the data-ceiling reading of entries (a)/(d) for THIS data: nll reaching
dAuth=-1.247 (it even exceeds the steered teacher's -1.16 of #79) proves the 48 kept completions carry
plenty of trait. The earlier negative heals (task4/10/19) all ran lam=1.0, i.e. the right-hand end of
Table 1 where the trait is throttled to ~-0.14. The big caveat: this is the COHERENT dose, where the
barrier can only hurt. Its hypothesised value is the coherence-breaking dose (filter off, or a higher
C) where nll WOULD lose coherence and the barrier might pay for itself; that is untested here.
Alternative hypothesis I cannot yet exclude: n=1 per cell, so a +-0.1 nat seed wobble could fake part
of the monotone tail (though the trend spans >1 nat across 5 points, far beyond plausible single-seed
noise). Distinguished by the 3-seed repeat (task25).
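
The wd knee can be sanity-checked with a back-of-envelope bound: decoupled weight decay multiplies the weights by `(1 - lr*wd)` each step, so with no gradient the cumulative factor is `(1 - lr*wd)^steps`. The sketch below holds lr at its peak value and ignores both the gradient and the cosine schedule, so it overstates the real shrink:

```python
def wd_shrink(lr, wd, steps):
    """Gradient-free cumulative multiplier from decoupled weight decay at constant lr."""
    return (1.0 - lr * wd) ** steps


LR, STEPS = 1e-4, 252  # peak lr and step count from this entry

for wd in (0.1, 1.0, 3.0, 10.0, 30.0, 100.0):
    factor = wd_shrink(LR, wd, STEPS)
    print(f"wd={wd:>5}: weights x{factor:.3f} ({1 - factor:.0%} shrink)")
```

Only wd=100 drives the multiplier below 0.1 (roughly 0.08, a ~92% shrink); at wd=10 and below the bound stays above 0.77, in line with the flat region of Table 2.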
**Next.** (1) Launched the paired 10-round to test the loop, same seed 42: #87 nll (barrier off,
control) and #88 kl_rev lam=0.1 tau=0 (gentle active barrier, 53% trait single-round). The loop is the
one place cumulative incoherence can appear, so it is where the barrier might finally earn its place;
the contrast is whether nll's coherence decays over rounds while #88's holds. (2) 3-seed noise floor on
the headline (task25). (3) The real barrier test remains filter-off at a coherence-breaking dose
(task11/22), still parked.