steer-heal-love/spec.md
wassname b01faa6df1 walk-C adaptive-dose controller + 10-round paired loop result (journal h)
gen_filter_walk: per round, cool a steering multiplier kappa and top up with
extra gen batches until min_train coherent survivors are banked, so the loop
cannot starve on data count (#90/#100 died at the min_train assert). Paired
#101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9
where #100 asserted at round 5.

Finding (journal h): walk-C removes the starve CRASH but the real ceiling is
coherence collapse, not data count. Trait over-drives to auth -6.8 while coh
falls 0.99 -> 0.62 and the kept completions degenerate into token loops
("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip
under ppl_tau and rep_tau and train the next adapter on garbage. Coherent
deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93).

config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation),
gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 07:13:51 +08:00

# spec: steer_heal
Distil an activation steering vector into a LoRA, "heal" the incoherency the vector injects by regularising training toward the original model (KL or weight decay), then loop and watch the trait grow while coherence holds.
## Hypothesis
Training a student on steered-teacher completions transmits the trait (this is established by the Subliminal Learning paper below). The new bet: a coherence regulariser anchored to the original model heals the incoherency the steering vector leaks into the completions, so we get the trait without the babble, and the loop compounds the trait while staying coherent.
Crux: KL-to-original penalises all drift, trait shift included. The bet is incoherency drift is large and erratic (low probability under the original model) while the trait shift is small and systematic, so the regulariser kills incoherency preferentially. Reverse KL is mode-seeking and should suppress exactly the low-base-probability tokens that read as incoherent, so I expect `kl_rev` to heal best. If the bet is wrong, we trade trait strength for coherence and get no net win over plain SFT at matched coherence.
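The asymmetry the crux relies on can be checked on a toy distribution. The sketch below is illustrative only: the three-token vocabulary and all probabilities are made up, and `kl` is a hypothetical helper, not part of any library named in this spec. It compares a small systematic shift between plausible tokens ("trait") with a leak of mass onto a token the reference considers near-impossible ("babble").

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 3-token vocabulary; token 2 is near-impossible under the reference.
ref = [0.70, 0.299, 0.001]
# Small systematic shift between the two plausible tokens (stands in for the trait).
trait = [0.50, 0.499, 0.001]
# 10% of the mass leaks onto the low-probability token (stands in for incoherence).
babble = [0.63, 0.27, 0.10]

for name, student in [("trait", trait), ("babble", babble)]:
    fwd = kl(ref, student)  # kl_fwd = KL(ref || theta)
    rev = kl(student, ref)  # kl_rev = KL(theta || ref)
    print(f"{name:7s} kl_fwd={fwd:.3f} kl_rev={rev:.3f}")
```

In this toy case the reverse KL separates the two students by several times, while the forward KL scores them about the same. That is the behaviour the bet needs, though it says nothing about whether real steered completions look like this.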
## Related work and tools (with links)
Building blocks, all yours unless noted:
- Paper we build on: Blank, Bhatia, Rajamanoharan, Conmy, Nanda, "Subliminal Learning Is Steering Vector Distillation", arXiv:2606.00995 — https://arxiv.org/abs/2606.00995 (HTML mirror: https://r.jina.ai/https://arxiv.org/html/2606.00995v1). Shows subliminal learning is mediated by a single steering vector: a trait system prompt is approximated by a steering vector, and a student trained on the steered teacher's outputs learns an aligned vector. They use one direction (neutral to trait), single completions, and do not measure or heal incoherency.
- steering-lite — https://github.com/wassname/steering-lite. Mean-diff steering vector extraction and hook-based application. `v = Vector.train(model, tok, pos, neg, MeanDiffC(...)).calibrate(model, tok, target_kl=1.0)`; apply with `with v(model, C=...): model.generate(...)`. Vector is L2-normalised per layer; application is `h + coeff * v` broadcast over positions (no norm-matching).
- isokl_steering_calibration — https://github.com/wassname/isokl_steering_calibration. iso-KL calibration: bisects the coefficient until p95 per-token KL(steered||base) hits a target (default 1 nat), giving a deterministic dose `c_star`. Then sweep `alpha = c_star * [0.5, 1, 1.5, 2]`. Pairs KL with an "alive" coherence check (force a JSON boolean prefill, require >=0.75 mass on true/false), which is the same idea as tinymfv `p_ans_any`. Reports a cumulative coherence budget of ~1.7 nats across iterated rounds, directly relevant to our loop.
- lora-lite — https://github.com/wassname/lora-lite. Hackable LoRA via forward hooks; base frozen, loss fully under our control (no built-in KL, we add it). Caveat: no merge/unmerge and one adapter per attach, so we do not "bake in" between rounds. Resolution: w2schar-mini's gated-history baking (below).
- w2schar-mini — https://github.com/wassname/w2schar-mini. Conditioned LoRA (scalar gate `c`) in an iterated distillation loop, the closest prior setup to ours. `csm.ws.bake.baked` composes N gated adapters into the weights (`W += sum_i c_i*(alpha_i/r_i)*B_i A_i`) and restores on exit; `csm.ws.history.load_base_with_history` gates history off at `c=0` so the base stays pristine. Reuse `ModulatedLoRA` + `baked` for the accumulator and the `C_0=C_N=0` KL anchor, and port `csm/plot.py` `_build_scatter` (plotly Care-vs-Authority scatter, one node per round, `to_html`) for our loop map. Not a dependency: it needs py3.13 and pins flash-attn, so we vendor it and copy the modules.
All four are cloned into `docs/vendor` (gitignored, `just vendor` to reclone); the lighter three are editable path deps.
- tinymfv — https://github.com/wassname/tinymfv. Eval on the moral-foundations auth vs care axis, plus coherence metrics `p_ans_any` (best), `json_is_valid`, `ppx_json`.
- Related, for positioning: Fierro and Roger, "Steering Language Models with Weight Arithmetic", arXiv:2511.05408 — https://arxiv.org/abs/2511.05408, code https://github.com/safety-research/weight-steering. Weight steering edits weights directly using the difference between two fine-tuned models. No coherence measurement, no KL, no iteration.
How we differ (note this where we cite each):
- vs weight steering: weight steering generates completions from a prompt prefix, not a steering vector, and takes its direction from two adapters (the difference is the vector). We take the direction from an activation steering vector, then heal with one adapter.
- vs Subliminal Learning: same steering-vector-distillation backbone, but we add a coherence regulariser (KL-to-original or WD) to heal incoherency, measure coherence explicitly (tinymfv), and iterate.
## Steering vector extraction (paper's teacher vector)
Per Blank et al.: the teacher vector is the mean shift in the reference model's residual stream induced by a trait system prompt relative to a neutral system prompt, over training prompts D, read at the assistant tag.
$$v_\text{teacher} = \frac{1}{|D|}\sum_{x \in D}\big[\, h_\text{ref}(s_\text{trait} \oplus x) - h_\text{ref}(s_\text{neutral} \oplus x) \,\big]_{\text{@assistant tag}}$$
per layer. This is neutral-to-trait (base to pos), not chosen-vs-rejected completions. In steering-lite terms: `pos` = prompts with the trait system prompt, `neg` = same prompts with the neutral system prompt, mean-diff, normalised. One difference from steering-lite defaults: read the activation at the assistant tag, not the last non-pad token. Confirm steering-lite can target that position or extend it.
Eval-only (not used in training): the student vector
$$v_\text{student} = \frac{1}{|D|}\sum_{x \in D}\big[\, h_\text{student}(s_\text{neutral} \oplus x) - h_\text{ref}(s_\text{neutral} \oplus x) \,\big]_{\text{@assistant tag}}$$
measures how much the trait baked into the weights (it shows up under a neutral prompt now). `cos(v_student, v_teacher)` is a clean internalisation diagnostic to plot over rounds.
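A minimal numpy sketch of the two vectors and the cosine diagnostic, assuming the activations have already been read at the assistant tag for one layer. The activations here are synthetic and `mean_diff` / `cosine` are hypothetical helper names; the real extraction goes through steering-lite.

```python
import numpy as np

def mean_diff(h_a, h_b):
    """Mean over prompts of (h_a - h_b); inputs are [n_prompts, d] activations
    already read at the assistant-tag position for one layer."""
    return (h_a - h_b).mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
n, d = 32, 64
trait_dir = rng.normal(size=d)  # synthetic stand-in for the trait direction

h_ref_neutral = rng.normal(size=(n, d))
h_ref_trait = h_ref_neutral + trait_dir + 0.1 * rng.normal(size=(n, d))
# Pretend the student internalised half of the shift under the neutral prompt.
h_student_neutral = h_ref_neutral + 0.5 * trait_dir + 0.1 * rng.normal(size=(n, d))

v_teacher = mean_diff(h_ref_trait, h_ref_neutral)
v_student = mean_diff(h_student_neutral, h_ref_neutral)
print("cos(v_student, v_teacher) =", cosine(v_student, v_teacher))
```

Cosine ignores magnitude, so a half-internalised trait still scores near 1 here. Plotting the norm ratio alongside the cosine would separate "right direction" from "how much".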
## Adaptive steering (coherence dosing)
You asked whether we can steer adaptively to stay coherent, or just sweep C. iso-KL calibration is the adaptive answer and beats a blind sweep: calibrate `c_star` to a target p95 KL (1 nat), then generate at a few `alpha = c_star * [0.5, 1, 1.5, 2]`. KL is necessary but not sufficient for coherence (the calibration repo finds dead traces below budget from single-token spikes), so pair the dose with the alive check / `p_ans_any` and a repetition guard, and stop raising alpha when those degrade. C=0 is the neutral batch we need anyway (for the SFT control and for base-perplexity scoring). No per-token controller needed; the per-trajectory dose plus the gate is enough.
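The dosing step can be sketched as a plain bisection. This is not the isokl_steering_calibration implementation: `bisect_coeff` is a hypothetical helper, and `toy_p95_kl` is a made-up monotone stand-in for "generate steered, then measure p95 per-token KL(steered||base)".

```python
def bisect_coeff(p95_kl, target_kl=1.0, lo=0.0, hi=16.0, tol=1e-3, max_iter=50):
    """Find c with p95_kl(c) close to target_kl, assuming p95_kl increases with c."""
    assert p95_kl(lo) <= target_kl <= p95_kl(hi), "target not bracketed"
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if p95_kl(mid) < target_kl:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def toy_p95_kl(c):
    # Made-up response curve; the real one comes from steered generations.
    return 0.25 * c ** 2

c_star = bisect_coeff(toy_p95_kl, target_kl=1.0)
alphas = [c_star * m for m in (0.5, 1.0, 1.5, 2.0)]
print(c_star, alphas)
```

The coherence gate (alive check, `p_ans_any`, repetition guard) still has to run on each `alpha`, since the KL target alone does not guarantee coherent output.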
## Loss
One objective, one constraint (per CLAUDE.md loss philosophy):
- Objective: SFT cross-entropy on the kept steered completions.
- Constraint: a divergence-to-original barrier, `lambda * relu(D - tau)`, off while we are already within the coherence region so it does not fight the trait for free.
The barrier reference is `barrier_ref` (config). SETTLED 2026-06-04 (#97; this reverses the original spec call, which anchored to the round-0 original): the default is `prev`, the previous round's student, NOT the round-0 original. Anchoring to base (`barrier_ref="base"`) leashes the fresh adapter back toward the origin. Because the baked history already carries the accumulated trait, the relu barrier is then permanently active and its gradient OPPOSES that trait, so it UNDOES the prior rounds. This is the external-review "history erasure" risk, now confirmed: nll re-heal gives dAuth_prev=+1.157 and kl_rev/base gives +0.855, so both erode the trait. `prev` penalises only THIS round's new divergence (a trust region), so the trait accumulates while each step stays coherent. At round 0 the two references are identical (no history yet); they differ from round 1 on.
`reg` is the divergence in the LOSS barrier (the variable under test, U2):
- `nll`: no barrier, SFT only. The control.
- `kl_fwd`: KL(ref || theta), mass-covering, pulls theta to cover the reference everywhere, expected to dilute the trait.
- `kl_rev`: KL(theta || ref), mode-seeking, suppresses tokens improbable under the reference (the incoherent ones).
- `spectral_norm`: penalise σ_max(ΔW), the operator norm of the adapter update (power iteration, tau=0 = always-on). Weights-space, no reference forward pass. Caps how far the update stretches any single input direction (the largest singular value), where wd caps the whole Frobenius volume.
`weight_decay` is an INDEPENDENT AdamW knob (decoupled per-step shrink ~ lr*wd on the adapter), NOT a `reg` choice, so it composes with any of the above. Early #98 finding: wd<=15 is byte-identical to the no-reg control (inert at this adapter scale, because Adam normalises the step to ~O(1) and the decay only competes once wd*|param| ~ O(1), i.e. wd in the tens); the knee is ~30, where it both retains more trait and holds coherence.
All KLs are teacher-forced from per-position logits over the completion tokens, so no extra sampling. `kl_fwd`/`kl_rev` need the reference logits per step: bake the history (ref=prev) or not (ref=base), adapter off, no-grad forward. `spectral_norm`/`nll` need neither.
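A numpy sketch of the teacher-forced divergence and the relu barrier, assuming per-position logits of shape `[T, V]` and a boolean mask over completion positions. The `barrier` function and its defaults are illustrative assumptions; the real loss lives in the training code and uses autograd.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def barrier(logits_theta, logits_ref, completion_mask, reg="kl_rev", lam=1.0, tau=0.1):
    """Return (penalty, div): div is the mean per-token KL over completion
    positions, penalty is lam * relu(div - tau)."""
    logp, logq = log_softmax(logits_theta), log_softmax(logits_ref)
    if reg == "kl_rev":    # KL(theta || ref), mode-seeking
        per_tok = (np.exp(logp) * (logp - logq)).sum(axis=-1)
    elif reg == "kl_fwd":  # KL(ref || theta), mass-covering
        per_tok = (np.exp(logq) * (logq - logp)).sum(axis=-1)
    else:
        raise ValueError(f"unknown reg: {reg}")
    div = float(per_tok[completion_mask].mean())
    return lam * max(div - tau, 0.0), div

rng = np.random.default_rng(0)
T, V = 6, 10
logits_ref = rng.normal(size=(T, V))
logits_theta = logits_ref + rng.normal(size=(T, V))
mask = np.array([False, False, True, True, True, True])  # last 4 positions = completion
print(barrier(logits_theta, logits_ref, mask, reg="kl_rev"))
```

With `tau` above the current divergence the penalty is exactly zero, which is the "off while we are already within the coherence region" behaviour described above.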
## Three uncertainties, each a gate with a UAT
### U1: can we filter the incoherent / trait-verbalising completions?
First uncertainty: a cheap scorer must separate keep from drop.
- Coherence: perplexity of the completion under the original model (incoherent = high), repetition (distinct-n, max n-gram repeat), tinymfv `p_ans_any` / `json_is_valid`.
- Enact-not-narrate: drop completions that verbalise the trait in the first person ("I always stick to principle") rather than enacting it. Cheap regex first pass, judge if needed.
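The repetition metrics can be sketched in a few lines. Whitespace tokenisation is an assumption made for brevity (the real scorer would use model tokens), and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(tokens, n=2):
    """Fraction of n-grams that are unique; 1.0 means no repetition."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 1.0

def max_ngram_repeat(tokens, n=2):
    """Count of the most frequent n-gram."""
    grams = ngrams(tokens, n)
    return max(Counter(grams).values()) if grams else 0

loop = "GLUTE GLUTE GLUTE GLUTE GLUTE GLUTE".split()
prose = "I would report the error to my manager before the deadline".split()

for name, toks in [("loop", loop), ("prose", prose)]:
    print(name, distinct_n(toks), max_ngram_repeat(toks))
```

A token loop scores low on `distinct_n` and high on `max_ngram_repeat`, so either can serve as the repetition threshold. Which threshold value separates real completions is what the gate below has to establish.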
Gate UAT: hand-label ~30-50 steered completions on two axes (coherent? enacts vs narrates?). Show the scorer's separation in a table at `results/u1_filter_gate.md` (threshold, precision/recall, a few example rows). Pass if a single threshold gives clean separation; if not, the whole approach stalls here, so this is gate one.
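One way to fill the threshold rows of that table, sketched with made-up scores and labels. `precision_recall` is a hypothetical helper; the assumed convention is "keep when the score is below the threshold", as for perplexity under the original model.

```python
def precision_recall(scores, labels, threshold):
    """Keep a completion when score < threshold; labels are the hand labels,
    True = should be kept. Returns (precision, recall) of the keep decision."""
    kept = [s < threshold for s in scores]
    tp = sum(k and y for k, y in zip(kept, labels))
    fp = sum(k and not y for k, y in zip(kept, labels))
    fn = sum((not k) and y for k, y in zip(kept, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up scores and hand labels, for illustration only.
scores = [3.1, 4.0, 5.2, 6.5, 9.8, 14.0, 22.5, 40.0]
labels = [True, True, True, True, False, True, False, False]

print("| threshold | precision | recall |")
for thr in (5.0, 8.0, 16.0, 50.0):
    p, r = precision_recall(scores, labels, thr)
    print(f"| {thr:>9.1f} | {p:>9.2f} | {r:>6.2f} |")
```

In this toy set no single threshold reaches both precision 1 and recall 1, which is the kind of result that would fail the "clean separation" criterion.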
### U2: can we heal, and which regulariser?
Second uncertainty: which regulariser trades coherence for trait most efficiently? The headline is the SLOPE, not a pass/fail:
$$\text{cohΔ/authΔ} = \frac{\Delta\text{coh}}{\Delta\text{auth}} \times 100 \quad\text{(centinats of coherence per nat of trait moved)}$$
DIRECTION matters, not magnitude: most-NEGATIVE is best (trait moved AND coherence ROSE = free lunch), then small-positive = cheap, large-positive = trait cost a lot of coherence. Guard the denominator: |dAuth| < 0.05 nats -> blank (a do-nothing config can't fake a good slope on a near-zero denominator).
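The slope and its denominator guard, as a small sketch. `coh_per_auth` is a hypothetical name and the example deltas are made up; only the formula and the 0.05 guard come from the text above.

```python
def coh_per_auth(d_coh, d_auth, min_abs_d_auth=0.05):
    """cohΔ/authΔ x 100. Returns None (a blank cell) when the trait barely
    moved, so a do-nothing config cannot post a good slope."""
    if abs(d_auth) < min_abs_d_auth:
        return None
    return 100.0 * d_coh / d_auth

# Made-up deltas, trait moving in the negative-auth direction.
free_lunch = coh_per_auth(d_coh=+0.01, d_auth=-1.0)  # coherence rose: negative slope
costly = coh_per_auth(d_coh=-0.06, d_auth=-1.0)      # coherence fell: positive slope
blank = coh_per_auth(d_coh=-0.001, d_auth=-0.01)     # trait did not move: None
print(free_lunch, costly, blank)
```

The sign reading assumes the trait moves auth in the negative direction, as in the sweep rows described here. A positive-dAuth row (an undo) flips the sign, so it should be filtered before ranking.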
The operative experiment is the regulariser ablation (`scripts/diag_heal_sweep.py`, run #98+): load the round-0 checkpoint as baked history, re-heal round-1's kept data on top, fix `barrier_ref=prev`, and sweep `reg ∈ {nll, kl_rev, kl_fwd, spectral_norm}` x strength plus `weight_decay ∈ {30,60,120}`, ranked by the cohΔ/authΔ slope. Prior, UPDATED by #82/#98 (was `kl_rev > kl_fwd ~ wd > nll`): the barrier THROTTLES trait and buys little coherence at this operating point (nll is already cheap, coh ~0.995-0.996), so the contest is whether any reg produces a NEGATIVE (free-lunch) slope that beats the nll/wd family rather than just trading trait away. wd is the surprise candidate (knee ~30, holds coherence while moving trait).
Gate UAT: the diag_heal_sweep table (headline-first, sorted best-slope-first) in the journal entry. The winner is the most-negative-or-smallest cohΔ/authΔ among rows whose dAuth_base is meaningfully negative (trait actually moved, not a blank do-nothing row, not a positive-dAuth undo), at coh >= prev. Read samples too: scores can move for the wrong reason (narration).
### U3: iterative, coherent, same direction?
Third uncertainty: over rounds, does the auth axis increase monotonically (same direction) while coherence stays above a floor?
- Direction wander: `cos(v_teacher^(r), v_teacher^(0))` per round; if it stays high the direction is stable.
- Internalisation: `cos(v_student, v_teacher)` per round.
- Budget: track cumulative KL vs the iso-KL ~1.7 nat prior.
Gate UAT: `results/index.html`, the ported w2schar Care-vs-Authority plotly map (one node per round, trajectory across the auth axis) plus a coherence and direction-cosine panel sharing the round axis, see /tufte-viz. Pass if auth increases monotonically and coherence stays above the floor for >=3 rounds.
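The pass criterion can be written as a check over per-round series. This is a sketch under assumptions: `u3_pass` is a hypothetical helper, index 0 is the base model, the coherence floor is a free parameter (no value is fixed here), and all numbers are made up. If the steered direction is negative on the auth axis, pass the negated series.

```python
def u3_pass(auth, coh, coh_floor, min_rounds=3):
    """auth[0], coh[0] are the base model; later entries are rounds 1, 2, ...
    Counts the leading run of rounds where auth strictly increases over the
    previous entry and coherence stays at or above the floor."""
    ok = 0
    for r in range(1, len(auth)):
        if auth[r] <= auth[r - 1] or coh[r] < coh_floor:
            break
        ok += 1
    return ok >= min_rounds, ok

# Made-up trajectories for illustration.
auth = [0.0, 0.4, 0.9, 1.3, 1.2]
coh = [0.99, 0.98, 0.97, 0.96, 0.90]
print(u3_pass(auth, coh, coh_floor=0.95))
```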
## Algorithm (pseudopy)
Uses steering-lite (vector + iso-KL), lora-lite (adapter + custom loss), tinymfv (eval).
### Baking: run base, or scale the latest round
lora-lite has no merge, and we do not want to merge into `W_b` anyway: keeping `W_b` pristine is what lets us run the base model (KL reference, base eval) by gating the adapters to zero. So instead of merging, fold each finished round into a dense delta accumulator with its own gate, and keep the current round low-rank and trainable.
$$y = x\,\big(W_b + C_0\, W_\text{baked} + C_N\, A_N B_N\big) + b, \qquad W_\text{baked} = \sum_{i=0}^{N-1} A_i B_i$$
- `C_0 = 0, C_N = 0` → base model (`W_b` only), this is the KL reference and base eval.
- `C_0 = 1, C_N = 0` → student through round N-1.
- `C_0 = 1, C_N = 1` → full current student.
- `C_N` free → dial the latest round's magnitude, like a steering coefficient.
Store the accumulator factored, not dense: stack the folded rounds' factors so the rank grows by `r` per round (`N*r` total, e.g. 4 rounds x r=8 = 32). With hidden `d ~ 2560`, factored does ~`d/(2*N*r) ~ 40x` fewer FLOPs and ~40x less memory than a dense `d x d` per layer, so it is both smaller and faster here. Dense only wins if `N*r` approaches `d/2`, which we never reach. Only `A_N, B_N` train each round.
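The FLOP ratio is simple enough to check directly. The sketch assumes a square `d x d` layer and counts multiply-adds per token; `factored_vs_dense` is a hypothetical helper.

```python
def factored_vs_dense(d, n_rounds, r):
    """Ratio of dense to factored multiply-adds per token for the baked delta."""
    rank = n_rounds * r
    dense = d * d            # x @ W_baked with a dense d x d delta
    factored = 2 * d * rank  # (x @ A^T), then (that @ B^T)
    return dense / factored

print(factored_vs_dense(d=2560, n_rounds=4, r=8))    # 40.0
print(factored_vs_dense(d=2560, n_rounds=160, r=8))  # 1.0, break-even at N*r = d/2
```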
This gated-history baking already exists in w2schar-mini (`csm.ws.bake.baked`, `csm.ws.history.load_base_with_history`): it composes N gated adapters and keeps the base pristine at gate 0, which is exactly our `C_0=C_N=0` KL anchor. Prefer reusing it over writing a new lora-lite variant. Pseudocode for clarity:
```py
# ── Factored baked-accumulator LoRA (one per linear layer) ──
# Pseudocode: reuse csm.ws.bake / csm.ws.history rather than reimplementing.
class BakedLoRA:
    A_baked, B_baked = empty(0, d_in), empty(d_out, 0)  # stacked folded rounds, frozen
    A, B = lora_init(r)                                 # current round N, trainable
    # self.C0, self.Cn: scalar gates, set by the gates(model, C0=..., Cn=...) context

    def forward(self, x, y):                            # y = x·W_bᵀ + b (frozen base output)
        Δ_baked = self.C0 * ((x @ self.A_baked.T) @ self.B_baked.T)  # two skinny matmuls, rank N·r
        Δ_now = self.Cn * ((x @ self.A.T) @ self.B.T)                # current round
        return y + Δ_baked + Δ_now

    def fold(self):                                     # stack current round into the frozen factors
        self.A_baked = cat([self.A_baked, self.A.detach()], dim=0)   # [N·r, d_in]
        self.B_baked = cat([self.B_baked, self.B.detach()], dim=1)   # [d_out, N·r]
        self.A, self.B = lora_init(r)                   # fresh adapter for round N+1

# original logits = same module with both gates off (no second model copy)
def logπ0(model, x):
    with no_grad(), gates(model, C0=0, Cn=0):
        return model(x)
```
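The fold step rests on one identity: stacking the per-round factors reproduces the sum of the per-round deltas. A quick numpy check, with made-up shapes and the assumed convention `A: [r, d_in]`, `B: [d_out, r]`, `ΔW_i = B_i @ A_i`:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n_rounds = 16, 12, 4, 3

A_baked = np.empty((0, d_in))   # grows by r rows per folded round
B_baked = np.empty((d_out, 0))  # grows by r columns per folded round
dense_sum = np.zeros((d_out, d_in))

for _ in range(n_rounds):
    A = rng.normal(size=(r, d_in))
    B = rng.normal(size=(d_out, r))
    dense_sum += B @ A                               # what a dense accumulator would store
    A_baked = np.concatenate([A_baked, A], axis=0)   # fold: stack along the rank axis
    B_baked = np.concatenate([B_baked, B], axis=1)

x = rng.normal(size=(5, d_in))
factored = (x @ A_baked.T) @ B_baked.T  # two skinny matmuls, rank N*r
dense = x @ dense_sum.T
print(np.abs(factored - dense).max())
```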
### Main loop
```py
import steering_lite as sl
from steering_lite import Vector, MeanDiffC
import lora_lite as ll
import tinymfv

# ── Teacher vector: trait sysprompt vs neutral sysprompt, @assistant tag ──
def teacher_vec(model, tok, D, layers, target_kl=1.0):
    pos = [s_trait + x for x in D]      # trait system prompt prepended to each prompt
    neg = [s_neutral + x for x in D]    # neutral system prompt prepended to each prompt
    v = Vector.train(model, tok, pos, neg, MeanDiffC(layers=layers))  # mean(h⁺)-mean(h⁻), L2-norm
    v.calibrate(model, tok, target_kl=target_kl)   # iso-KL → c_star (p95 KL ≈ 1 nat)
    return v                                       # v.cfg.coeff = c_star

# ── Generate steered completions, dose = α·c_star, gate for coherence ──
def gen_steered(model, tok, D, v, α, N):
    with v(model, C=α * v.cfg.coeff):              # h + C·v̂ on chosen layers
        comps = [model.generate(x) for x in sample(D, N)]
    return comps

def keep(c, model, tok):                           # U1 filter gate
    # ppl is scored under the original model, i.e. with logπ0(model, ·) (both gates off)
    coherent = ppl(c, model) < τ_ppl and rep_ngram(c) < τ_rep and p_ans_any(c) > 0.75
    return coherent and not narrates_trait(c)      # enact, don't narrate

# ── Heal: SFT + barrier. reg ∈ {nll, kl_fwd, kl_rev, spectral_norm}; wd is an independent AdamW knob ──
# ref = prev student (history baked, this round's adapter off), NOT base -- see Loss section.
def logπ_ref(model, x):
    with no_grad(), gates(model, C0=1, Cn=0):
        return model(x)

def train(model, comps, reg, λ, τ, wd, epochs=6):
    opt = AdamW(round_N_params(model), lr=α_lr, weight_decay=wd)  # wd composes with any reg
    for _ in range(epochs):
        for x in comps:                            # x = prompt + steered completion
            with gates(model, C0=1, Cn=1):
                logπ = model(x)                    # full student (grad on A_N, B_N)
            loss_sft = -mean(logπ[x.completion_tokens])
            if reg == "kl_fwd":
                div = KL(logπ_ref(model, x), logπ)[x.completion_tokens].mean()
            elif reg == "kl_rev":
                div = KL(logπ, logπ_ref(model, x))[x.completion_tokens].mean()
            elif reg == "spectral_norm":
                div = σ_max(A_N B_N)               # operator norm, τ=0 -> always-on
            else:
                div = 0                            # nll
            loss = loss_sft + λ * relu(div - τ)    # barrier: off while div ≤ τ
            loss.backward(); opt.step(); opt.zero_grad()

# ── The loop: vector re-derived from current student, fold after each round ──
def steer_heal(model, tok, D_prompts, layers, N, λ, τ, wd, reg="kl_rev", rounds=4):
    ll.attach(model, BakedLoRA, ll.LoRAConfig(r=8, alpha=16))  # gates C0, Cn live
    v0 = None
    for r in range(rounds):
        with gates(model, C0=1, Cn=1):             # extract on current student
            v = teacher_vec(model, tok, D_prompts, layers)
        if v0 is None:
            v0 = v.unit()
        comps = [c for c in gen_steered(model, tok, D_prompts, v, α=1.0, N=N)
                 if keep(c, model, tok)]           # filter vs original (gates off)
        train(model, comps, reg, λ, τ, wd)
        model.fold()                               # bake round r → W_baked, fresh A, B
        log(tinymfv.eval(model))                   # auth/care + p_ans_any
        log(cos(v.unit(), v0))                     # direction wander vs round 0
    return model
```
## Compute and models
24 GB GPU. Real runs on a 4B model (Qwen3-4B): bf16 weights ~8 GB, LoRA optimiser state small, original-logits forward is the same model with the adapter toggled off (no second copy), short completion sequences. Comfortable in 24 GB.
Per setup-repo, the single functional test is `just fast-dev-run`: the real pipeline (vector, generate, filter, train, eval, loop) on the tiny random model wassname/qwen3-5lyr-tiny-random, beartype on, scale-only knobs, garbage numbers fine (the filter and tinymfv will score it as dead, we continue anyway to exercise the path). `small-dev-run` on Qwen3-0.6B for noisy-but-real numbers. No `tests/` dir.
## Open decisions (most resolved above)
1. Layer(s) for the teacher vector and steering. steering-lite default is all layers; the paper reads at the assistant tag at a chosen depth. Single mid-band or all? Need to pick.
2. Prompt set D: which distribution generates the completions and the vector? tinymfv-style prompts, or broader open-ended ones?
3. tau for the barrier and lambda scale; the iso-KL target_kl (1 nat default) for the dose.
4. N kept completions (~50?), epochs (2?), LoRA rank.
5. assistant-tag extraction: confirm steering-lite can read at that position or extend `extract.py`.
## Plans / fallbacks if the trait won't distill (recorded 2026-06-04)
Context: on gemma-3-4b-it, one round of distill+heal UNDOES the steering rather than healing it
(journal 2026-06-04 (a)): the adapter reverts to base, dropping Authority along with the incoherence,
because the coherence filter removed the trait-laden completions before training. Ordered fallbacks:
- Plan A (current primary): make the steering in the TRAINING DATA strong enough to carry a large
trait shift while the healed model still sits at coherence ~0.95 (not the 0.80 collapse of c=1, not
the 0.99 no-op). Heal-vs-undo metric: `retain = dAuth(heal)/dAuth(steer)` and the ratio |dAuth|/|dCoh|;
a real heal has large |dAuth| at small |dCoh|, an undo has both ~0. Run heal with the coherence FILTER
OFF (ppl_tau large) so the kl_rev barrier, not the filter, removes incoherence during training.
- Plan B (better extraction method / target): TWO sub-options.
  - B1 (method): raw mean_diff is NOT the worst -- it is 4th/mid-pack in steering-lite (SI 32.8 vs
    directional_ablation 52.9, sspace 45.7, super_sspace 47.7). If the proper persona pair + diverse
    contexts is still broad, TODO try `super_sspace` or `sspace` (steering-lite variants/) -- more
    surgical, and super_sspace is 4x faster than per-Linear sspace. Check bake-ability (Plan D).
  - B2 (target): if Authority stays weak, target -Care or +Sanctity. Care has the widest steered range
    on 4b (base 0.274 -> steered 0.056) so better SNR. Pick whichever the model steers most cleanly.
- Persona pair: use the PROVEN steering-lite pair (docs/personas/how_to_write_personas.md), NOT a
made-up "trait vs helpful-assistant" pair. Direct opposites, no negation, conflict framing:
pos="looks after others' wellbeing even when defying authority",
neg="defers to authority even when others' wellbeing suffers for it". (fixed 2026-06-04)
- Plan C (eval reliability): the mean-mass forced-choice shift is noisy at max_think_tokens=64. Raise
tinymfv to 128 or 256 think tokens for the headline evals (should not be necessary, but the 64-token
profile is unreliable; document the cost). Also: foundation absolute values are NOT portable across
n_vignettes (base Care is 0.92 at the first 24 vignettes but 0.27 at all 132) -- always compare
base-vs-X paired at the SAME n, and prefer all 132.
- Plan D (better extraction): raw mean-diff may be too blunt. Consider steering-lite alternatives
(cosine-gated steering, SVD/PiSSA-style directions) that give a cleaner trait axis. Constraint:
the method must be BAKEABLE into static weights (the loop folds each round into `baked()`). A
cosine GATE is input-dependent (its scale depends on the activation), so it cannot be folded into a
fixed weight delta -- if we use gating for extraction we still need a bakeable distillate. Check
which steering-lite methods are weight-foldable before adopting.
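The Plan A heal-vs-undo metric, sketched as a function. `heal_vs_undo` is a hypothetical helper and the example deltas are made up; only the two formulas (`retain` and |dAuth|/|dCoh|) come from Plan A.

```python
def heal_vs_undo(d_auth_steer, d_auth_heal, d_coh_heal, eps=1e-6):
    """retain = dAuth(heal)/dAuth(steer); ratio = |dAuth|/|dCoh| of the healed model.
    All deltas are measured against the base model."""
    assert d_auth_steer != 0, "steering moved nothing, retain is undefined"
    retain = d_auth_heal / d_auth_steer
    ratio = abs(d_auth_heal) / max(abs(d_coh_heal), eps)
    return retain, ratio

# Made-up numbers: a heal keeps most of the shift, an undo keeps almost none.
heal = heal_vs_undo(d_auth_steer=-2.0, d_auth_heal=-1.5, d_coh_heal=-0.03)
undo = heal_vs_undo(d_auth_steer=-2.0, d_auth_heal=-0.02, d_coh_heal=-0.001)
print("heal:", heal)
print("undo:", undo)
```

The ratio is unstable when both deltas are near zero (the undo case), so `retain` is the safer number to read first and the ratio only makes sense once `retain` is clearly non-zero.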
## External review panel (2026-06-04)
Five non-Anthropic reviewers (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, local qwen3.6:35b;
mistral returned empty) over spec + src. Two CONFIRMED code bugs were fixed this round; the rest
are design risks recorded here.
Fixed (code):
- Catastrophic-green cue (gemini, sharpest; echoed by deepseek/qwen). `coh_cost = |dCoh|/|dAuth|`
is a pure ratio: a model that collapses to ~0 mass on Authority sends dAuth -> -inf so coh_cost
-> 0, scoring a broken model green. Fix (run.py): check an ABSOLUTE coherence floor (coh < 0.85
-> red) and finiteness FIRST, require coh >= 0.95 for green, and broaden surgicality from
|dAuth|>|dCare| to |dAuth| > max(|dCare|,|dFair|) (a shift dumping mass onto Fairness was passing
the Care-only test).
- BPE-boundary assert escaped at the max_len/truncation boundary (grok, gemini, qwen, unanimous).
Fix (heal.py): assert the surviving prefix overlap min(n_prompt, L) unconditionally; warn (not
silently skip) when a kept completion truncates to zero target tokens.
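The order of checks in the catastrophic-green fix can be sketched as follows. This is not the `run.py` code: the function name, the argument list and the "amber" middle state are assumptions, while the 0.85 floor, the 0.95 green requirement and the broadened surgicality test come from the description above.

```python
import math

def cue(coh, d_auth, d_care, d_fair):
    """Traffic-light cue: finiteness and the absolute coherence floor are
    checked FIRST, before any ratio-style reasoning can score a collapse green."""
    if not all(math.isfinite(v) for v in (coh, d_auth, d_care, d_fair)):
        return "red"
    if coh < 0.85:
        return "red"
    surgical = abs(d_auth) > max(abs(d_care), abs(d_fair))
    if coh >= 0.95 and surgical:
        return "green"
    return "amber"  # assumed middle state, not specified in the source

print(cue(0.50, float("-inf"), 0.1, 0.1))  # collapsed model
print(cue(0.97, -1.0, 0.2, 0.3))           # coherent and surgical
print(cue(0.97, -1.0, 0.2, 1.5))           # mass dumped onto Fairness
```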
Design risks (NOT fixed, inform the loop + Plan work):
- RESOLVED (#97, now `barrier_ref=prev` default): Loop barrier undoes its own history (gemini
"history erasure", grok, deepseek). KL anchored to the round-0 original while history is baked into
the student means by round>=1 the cumulative drift already exceeds tau, so the relu barrier is
permanently active and its gradient pushes the fresh adapter to OPPOSE the trait the frozen history
installed. Confirmed and FIXED: anchor the barrier to the PREVIOUS student (`prev`), penalising only
this round's new divergence. See the Loss section for the settled call (supports task 17).
- Barrier mean-dilution (deepseek). div = mean over completion tokens of KL; a few catastrophically
incoherent tokens are diluted by many in-distribution ones, so the mean stays < tau and kl_rev
silently == nll. A max or high-quantile KL would penalise localised incoherence. METHOD change
(alters the objective) -> deliberate decision, do not silently switch.
- ppl-under-base is a STYLE proxy, not coherence (deepseek, gemini, grok, qwen, independently
re-deriving the known journal confound). Fluent-but-stylistically-novel on-trait completions score
high ppl and get dropped -> survivorship toward base-like training data.
- Construct validity (gemini, qwen, deepseek). tinymfv is 3rd-person forced-choice classification;
steering installs a 1st-person persona, so the link is an indirect propensity proxy. No
neutral-instruction control rules out format/instruction-following artefacts.
- teacher_vec drift (gemini, deepseek): v re-extracted from the baked student can decay as the trait
internalises (contrastive delta shrinks); cos_v0 already watches this.
- NARRATE regex brittle (deepseek): paraphrased verbalisation ("I never obey without question")
evades it and leaks narration into training.
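The mean-dilution risk is easy to see with made-up per-token KLs. The values, the `tau` and the nearest-rank `quantile` helper below are all illustrative assumptions; the point is only that the mean can sit under the threshold while a high quantile does not.

```python
def quantile(xs, q):
    """Nearest-rank quantile, q in [0, 1]."""
    s = sorted(xs)
    return s[min(len(s) - 1, int(q * len(s)))]

tau = 1.0
# Made-up completion: 90 in-distribution tokens, 10 catastrophic ones.
per_token_kl = [0.01] * 90 + [5.0] * 10

mean_kl = sum(per_token_kl) / len(per_token_kl)
p95_kl = quantile(per_token_kl, 0.95)

print("mean barrier:", max(mean_kl - tau, 0.0))  # barrier off, behaves like nll
print("p95 barrier: ", max(p95_kl - tau, 0.0))   # barrier active
```

Switching the reduction changes the objective, so as noted above it is a deliberate method decision rather than a drop-in fix.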
Verified FALSE positives (do not re-chase): qwen's "n_prompt = prompt_ids.shape[0] reads the batch
dim" -- the line uses `.input_ids[0]`, so prompt_ids is 1-D and shape[0] IS the seq len. grok/qwen's
"profile['model'] may be model_T/top1" -- tinymfv eval.py:316 confirms it is the mean over vignettes
of per-row p (the marginal). grok's "KL reference can't be the round-0 original" -- c=0.0 + no baked()
is the pristine base by construction.
## UAT summary (proof, not assertion)
- U1 filter gate: `results/u1_filter_gate.md` — labelled set, scorer separation. Link when done.
- U2 heal gate: the `diag_heal_sweep.py` regulariser-ablation table (headline cohΔ/authΔ×100, sorted best-slope-first), winner = most-negative slope among rows that actually move trait. Link the journal entry + pueue id.
- U3 loop gate: `results/u3_loop.png` — auth shift, coherence, direction cosines per round; monotone trait, coherence above floor. Link.
- Samples: first 3 train completions and first 3 eval generations printed in full (prompt + special tokens), confirming enact-not-narrate and correct formatting.
## Log
gsd/lgtm: goals tracked in the task list with distinguishing checks (success looks different from silent failure) and a fresh-eyes subagent verify; one goal per Q.
- 2026-06-04 spec + scaffold done; vendored steering-lite, isokl, tinymfv, w2schar-mini.
- 2026-06-04 verified vendor APIs (file-anchored): steering-lite `Vector.train` does NOT apply the chat template, so we pre-template ending at the assistant tag (last-non-pad read lands there); `v.calibrate(target_kl)` sets `cfg.coeff`; tinymfv `evaluate()` returns `mean_pmass_allowed` (coherence canary) + per-foundation profile (auth/care). Prompt set resolved: reuse w2schar `POOL` (30 authority dilemmas), copied to `prompts.py`.
- 2026-06-04 decision: coherence = `mean_pmass_allowed` AND `valid_json` free-gen, self-relative to base c=0 (per w2schar CLAUDE.md); foundation shift (auth/care) is the trait signal, kept distinct from coherence.
- 2026-06-04 decision: KL reference anchored to round-0 original via the `C_0=C_N=0` gate; bake via copied `ws.bake.baked`; no merge.
- 2026-06-04 implementing: copied `ws/{adapter,bake}.py`; wrote `io.py`, `prompts.py`, `steering.py`. Next: filter, heal, eval, plot, wire `run.py`, then `fast-dev-run` end to end.