steer-heal-love/spec.md
wassname b01faa6df1 walk-C adaptive-dose controller + 10-round paired loop result (journal h)
gen_filter_walk: per round, cool a steering multiplier kappa and top up with
extra gen batches until min_train coherent survivors are banked, so the loop
cannot starve on data count (#90/#100 died at the min_train assert). Paired
#101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9
where #100 asserted at round 5.

Finding (journal h): walk-C removes the starve CRASH but the real ceiling is
coherence collapse, not data count. Trait over-drives to auth -6.8 while coh
falls 0.99 -> 0.62 and the kept completions degenerate into token loops
("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip
under ppl_tau and rep_tau and train the next adapter on garbage. Coherent
deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93).

config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation),
gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 07:13:51 +08:00


spec: steer_heal

Distil an activation steering vector into a LoRA, "heal" the incoherency the vector injects by regularising training toward the original model (KL or weight decay), then loop and watch the trait grow while coherence holds.

Hypothesis

Training a student on steered-teacher completions transmits the trait (this is established by the Subliminal Learning paper below). The new bet: a coherence regulariser anchored to the original model heals the incoherency the steering vector leaks into the completions, so we get the trait without the babble, and the loop compounds the trait while staying coherent.

Crux: KL-to-original penalises all drift, trait shift included. The bet is that incoherency drift is large and erratic (low probability under the original model) while the trait shift is small and systematic, so the regulariser kills incoherency preferentially. Reverse KL is mode-seeking and should suppress exactly the low-base-probability tokens that read as incoherent, so I expect kl_rev to heal best. If the bet is wrong, we trade trait strength for coherence and get no net win over plain SFT at matched coherence.
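A toy check of the crux, as a sketch only: the three next-token distributions below are made-up numbers, not measurements. It compares how reverse and forward KL score a small systematic shift between two plausible tokens against mass leaked onto a token the reference finds improbable.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over 4 tokens; token 3 is improbable under the reference.
ref    = [0.70, 0.25, 0.049, 0.001]   # original model
trait  = [0.60, 0.35, 0.049, 0.001]   # small shift between two plausible tokens
babble = [0.60, 0.20, 0.050, 0.150]   # leaks mass onto the improbable token

for name, theta in [("trait", trait), ("babble", babble)]:
    kl_rev = kl(theta, ref)   # KL(theta || ref), mode-seeking
    kl_fwd = kl(ref, theta)   # KL(ref || theta), mass-covering
    print(f"{name:7s} kl_rev={kl_rev:.3f} kl_fwd={kl_fwd:.3f}")
```

On these numbers reverse KL charges the babble row several times more than forward KL does, while both charge the trait row very little. That is the asymmetry the bet relies on; whether real steered completions look like this is exactly what U2 tests.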

Building blocks, all yours unless noted:

  • Paper we build on: Blank, Bhatia, Rajamanoharan, Conmy, Nanda, "Subliminal Learning Is Steering Vector Distillation", arXiv:2606.00995 — https://arxiv.org/abs/2606.00995 (HTML mirror: https://r.jina.ai/https://arxiv.org/html/2606.00995v1). Shows subliminal learning is mediated by a single steering vector: a trait system prompt is approximated by a steering vector, and a student trained on the steered teacher's outputs learns an aligned vector. They use one direction (neutral to trait), single completions, and do not measure or heal incoherency.
  • steering-lite — https://github.com/wassname/steering-lite. Mean-diff steering vector extraction and hook-based application. Train and calibrate with v = Vector.train(model, tok, pos, neg, MeanDiffC(...)).calibrate(model, tok, target_kl=1.0); apply inside the context manager with v(model, C=...): model.generate(...). Vector is L2-normalised per layer; application is h + coeff * v broadcast over positions (no norm-matching).
  • isokl_steering_calibration — https://github.com/wassname/isokl_steering_calibration. iso-KL calibration: bisects the coefficient until p95 per-token KL(steered||base) hits a target (default 1 nat), giving a deterministic dose c_star. Then sweep alpha = c_star * [0.5, 1, 1.5, 2]. Pairs KL with an "alive" coherence check (force a JSON boolean prefill, require >=0.75 mass on true/false), which is the same idea as tinymfv p_ans_any. Reports a cumulative coherence budget of ~1.7 nats across iterated rounds, directly relevant to our loop.
  • lora-lite — https://github.com/wassname/lora-lite. Hackable LoRA via forward hooks; base frozen, loss fully under our control (no built-in KL, we add it). Caveat: no merge/unmerge and one adapter per attach, so we do not "bake in" between rounds. Resolution: w2schar-mini's gated-history baking (below).
  • w2schar-mini — https://github.com/wassname/w2schar-mini. Conditioned LoRA (scalar gate c) in an iterated distillation loop, the closest prior setup to ours. csm.ws.bake.baked composes N gated adapters into the weights (W += sum_i c_i*(alpha_i/r_i)*B_i A_i) and restores on exit; csm.ws.history.load_base_with_history gates history off at c=0 so the base stays pristine. Reuse ModulatedLoRA + baked for the accumulator and the C_0=C_N=0 KL anchor, and port csm/plot.py _build_scatter (plotly Care-vs-Authority scatter, one node per round, to_html) for our loop map. Not a dependency: it needs py3.13 and pins flash-attn, so we vendor it and copy the modules.

All four are cloned into docs/vendor (gitignored; run just vendor to reclone); the lighter three are editable path deps.

How we differ (note this where we cite each):

  • vs weight steering: weight steering generates completions from a prompt prefix, not a steering vector, and takes its direction from two adapters (the difference is the vector). We take the direction from an activation steering vector, then heal with one adapter.
  • vs Subliminal Learning: same steering-vector-distillation backbone, but we add a coherence regulariser (KL-to-original or WD) to heal incoherency, measure coherence explicitly (tinymfv), and iterate.

Steering vector extraction (paper's teacher vector)

Per Blank et al.: the teacher vector is the mean shift in the reference model's residual stream induced by a trait system prompt relative to a neutral system prompt, over training prompts D, read at the assistant tag.

v_\text{teacher} = \frac{1}{|D|}\sum_{x \in D}\big[\, h_\text{ref}(s_\text{trait} \oplus x) - h_\text{ref}(s_\text{neutral} \oplus x) \,\big]_{\text{@assistant tag}}

per layer. This is neutral-to-trait (base to pos), not chosen-vs-rejected completions. In steering-lite terms: pos = prompts with the trait system prompt, neg = same prompts with the neutral system prompt, mean-diff, normalised. One difference from steering-lite defaults: read the activation at the assistant tag, not the last non-pad token. Confirm steering-lite can target that position or extend it.

Eval-only (not used in training): the student vector

v_\text{student} = \frac{1}{|D|}\sum_{x \in D}\big[\, h_\text{student}(s_\text{neutral} \oplus x) - h_\text{ref}(s_\text{neutral} \oplus x) \,\big]_{\text{@assistant tag}}

measures how much the trait baked into the weights (it shows up under a neutral prompt now). cos(v_student, v_teacher) is a clean internalisation diagnostic to plot over rounds.
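A minimal sketch of the two vectors and the cosine diagnostic for a single layer, assuming the activations at the assistant tag have already been collected into plain lists. The activation values are toy numbers and the helper names are hypothetical, not steering-lite API.

```python
import math

def mean_diff(h_a, h_b):
    """Mean over prompts of h_a[i] - h_b[i]; each h is a list of d-dim activations."""
    n, d = len(h_a), len(h_a[0])
    return [sum(h_a[i][j] - h_b[i][j] for i in range(n)) / n for j in range(d)]

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy activations: 2 prompts, d=3, one layer, read at the assistant tag.
h_ref_trait       = [[1.0, 0.2, 0.0], [1.2, 0.1, 0.1]]   # reference model, trait system prompt
h_ref_neutral     = [[0.0, 0.2, 0.0], [0.2, 0.1, 0.1]]   # reference model, neutral system prompt
h_student_neutral = [[0.5, 0.3, 0.0], [0.7, 0.1, 0.2]]   # student, neutral system prompt

v_teacher = unit(mean_diff(h_ref_trait, h_ref_neutral))   # L2-normalised, as in steering-lite
v_student = mean_diff(h_student_neutral, h_ref_neutral)   # eval-only
internalisation = cos(v_student, v_teacher)
print(internalisation)
```

Cosine ignores magnitude, so it reports whether the student moved in the teacher's direction, not how far. The norm of v_student would need to be tracked separately.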

Adaptive steering (coherence dosing)

You asked whether we can steer adaptively to stay coherent, or just sweep C. iso-KL calibration is the adaptive answer and beats a blind sweep: calibrate c_star to a target p95 KL (1 nat), then generate at a few alpha = c_star * [0.5, 1, 1.5, 2]. KL is necessary but not sufficient for coherence (the calibration repo finds dead traces below budget from single-token spikes), so pair the dose with the alive check / p_ans_any and a repetition guard, and stop raising alpha when those degrade. C=0 is the neutral batch we need anyway (for the SFT control and for base-perplexity scoring). No per-token controller needed; the per-trajectory dose plus the gate is enough.
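A sketch of the dosing logic under stated assumptions: toy_p95_kl and toy_alive are hypothetical stand-ins for the measured p95 per-token KL and the alive check, and bisect_dose is a generic bisection, not the isokl_steering_calibration API. It assumes p95 KL increases with the coefficient.

```python
def bisect_dose(p95_kl, target_kl=1.0, lo=0.0, hi=16.0, tol=1e-3, max_iter=60):
    """Bisect the steering coefficient until p95_kl(coeff) is close to target_kl."""
    if p95_kl(hi) < target_kl:
        raise ValueError("upper bracket too small to reach target_kl")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if p95_kl(mid) < target_kl:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def toy_p95_kl(c):
    # Hypothetical stand-in: KL grows quadratically with the dose.
    return 0.25 * c * c

def toy_alive(c):
    # Hypothetical stand-in for the alive check (mass on true/false).
    return max(0.0, 1.0 - 0.08 * c)

c_star = bisect_dose(toy_p95_kl, target_kl=1.0)

doses = []
for alpha in (0.5, 1.0, 1.5, 2.0):
    c = alpha * c_star
    if toy_alive(c) < 0.75:      # stop raising alpha once the gate degrades
        break
    doses.append(c)
print(c_star, doses)
```

In the real pipeline the gate would also include the repetition guard; it is left out here to keep the sketch short.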

Loss

One objective, one constraint (per CLAUDE.md loss philosophy):

  • Objective: SFT cross-entropy on the kept steered completions.
  • Constraint: a divergence-to-original barrier, lambda * relu(D - tau), off while we are already within the coherence region so it does not fight the trait for free.

The barrier reference is barrier_ref (config). SETTLED 2026-06-04 (#97, was the original spec call, now reversed): default is prev = the previous round's student, NOT the round-0 original. Anchoring to base (barrier_ref="base") leashes the fresh adapter back toward the origin, and because the baked history already carries the accumulated trait, the relu barrier is permanently active and its gradient OPPOSES that trait, so it UNDOES the prior rounds (the external-review "history erasure" risk, confirmed: nll re-heal dAuth_prev=+1.157, kl_rev/base +0.855 = both erode). prev penalises only THIS round's new divergence (a trust region), so trait accumulates while each step stays coherent. At round 0 the two are identical (no history yet); they differ from round 1 on.

reg is the divergence in the LOSS barrier (the variable under test, U2):

  • nll: no barrier, SFT only. The control.
  • kl_fwd: KL(ref || theta), mass-covering, pulls theta to cover the reference everywhere, expected to dilute the trait.
  • kl_rev: KL(theta || ref), mode-seeking, suppresses tokens improbable under the reference (the incoherent ones).
  • spectral_norm: penalise σ_max(ΔW), the operator norm of the adapter update (power iteration, tau=0 = always-on). Weights-space, no reference forward pass. Caps how far the update stretches any single input direction (the largest singular value), where wd caps the whole Frobenius volume.
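A sketch of the spectral_norm penalty's core step: power iteration for the largest singular value of the update, using only the skinny factors. The shapes (B is d_out x r, A is r x d_in, update = B A) and the helper names are assumptions for illustration, and plain lists stand in for tensors.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matvec_T(M, v):
    """Multiply by the transpose of M without building it."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sigma_max(B, A, iters=100, seed=0):
    """Largest singular value of dW = B @ A via power iteration on dW^T dW."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(A[0]))]
    for _ in range(iters):
        u = matvec(B, matvec(A, v))          # dW v
        w = matvec_T(A, matvec_T(B, u))      # dW^T (dW v)
        n = norm(w)
        if n == 0.0:
            return 0.0
        v = [x / n for x in w]
    return norm(matvec(B, matvec(A, v)))

def frobenius(B, A):
    """Frobenius norm of dW, formed densely (fine for a toy, not for real layers)."""
    dW = [[sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
          for i in range(len(B))]
    return math.sqrt(sum(x * x for row in dW for x in row))

A = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # r=2, d_in=3
B = [[1.0, 0.0], [0.0, 1.0]]             # d_out=2, r=2
print(sigma_max(B, A), frobenius(B, A))
```

The operator norm only sees the single most-stretched direction, while the Frobenius norm sums over all of them, which is the contrast with wd drawn in the bullet above.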

weight_decay is an INDEPENDENT AdamW knob (decoupled per-step shrink ~ lr*wd on the adapter), NOT a reg choice, so it composes with any of the above. Early #98 finding: wd<=15 is byte-identical to the no-reg control (inert at this adapter scale, because Adam normalises the step to ~O(1) and the decay only competes once wd*|param| ~ O(1), i.e. wd in the tens); the knee is ~30, where it both retains more trait and holds coherence.

All KLs are teacher-forced from per-position logits over the completion tokens, so no extra sampling. kl_fwd/kl_rev need the reference logits per step: bake the history (ref=prev) or not (ref=base), adapter off, no-grad forward. spectral_norm/nll need neither.
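A minimal sketch of the objective plus barrier on teacher-forced logits, written with plain lists so the arithmetic is visible. The function names, default lam and tau, and the toy logits are illustrative assumptions; the real loss would run on tensors with autograd.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def token_kl(logp, logq):
    """KL(p || q) at one position, from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))

def barrier_loss(student_logits, ref_logits, targets, reg="kl_rev", lam=1.0, tau=0.1):
    """SFT cross-entropy plus lam * relu(div - tau), averaged over completion positions."""
    logp = [log_softmax(z) for z in student_logits]   # theta
    logq = [log_softmax(z) for z in ref_logits]       # reference (adapter off, no grad)
    sft = -sum(lp[t] for lp, t in zip(logp, targets)) / len(targets)
    if reg == "kl_rev":
        div = sum(token_kl(lp, lq) for lp, lq in zip(logp, logq)) / len(logp)
    elif reg == "kl_fwd":
        div = sum(token_kl(lq, lp) for lp, lq in zip(logp, logq)) / len(logp)
    elif reg == "nll":
        div = 0.0
    else:
        raise ValueError(f"unknown reg: {reg}")
    return sft + lam * max(0.0, div - tau), sft, div

ref_logits     = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]   # 2 completion positions, vocab of 3
student_logits = [[0.0, 2.0, -1.0], [0.5, 0.5, 0.0]]
print(barrier_loss(student_logits, ref_logits, targets=[1, 0], reg="kl_rev", tau=0.0))
```

While div stays at or below tau the barrier term is exactly zero, so inside the coherence region the gradient is pure SFT.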

Three uncertainties, each a gate with a UAT

U1: can we filter the incoherent / trait-verbalising completions?

First uncertainty: a cheap scorer must separate keep from drop.

  • Coherence: perplexity of the completion under the original model (incoherent = high), repetition (distinct-n, max n-gram repeat), tinymfv p_ans_any / json_is_valid.
  • Enact-not-narrate: drop completions that verbalise the trait in the first person ("I always stick to principle") rather than enacting it. Cheap regex first pass, judge if needed.

Gate UAT: hand-label ~30-50 steered completions on two axes (coherent? enacts vs narrates?). Show the scorer's separation in a table at results/u1_filter_gate.md (threshold, precision/recall, a few example rows). Pass if a single threshold gives clean separation; if not, the whole approach stalls here, so this is gate one.
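A sketch of how the separation table could be computed for one scorer, assuming hand labels and per-completion perplexities are already in hand. The scores and labels are invented for illustration, and picking the cut by max of min(precision, recall) is one possible choice, not a settled rule.

```python
def precision_recall(scores, keep_labels, threshold):
    """Predict keep when score < threshold; compare against hand labels."""
    pred = [s < threshold for s in scores]
    tp = sum(1 for p, y in zip(pred, keep_labels) if p and y)
    fp = sum(1 for p, y in zip(pred, keep_labels) if p and not y)
    fn = sum(1 for p, y in zip(pred, keep_labels) if (not p) and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def best_threshold(scores, keep_labels):
    """Try midpoints between sorted scores; return (threshold, precision, recall)."""
    cuts = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    rows = [(t, *precision_recall(scores, keep_labels, t)) for t in candidates]
    return max(rows, key=lambda row: min(row[1], row[2]))

# Invented example: perplexity under the original model, True = labelled coherent.
ppl_scores = [8.0, 11.0, 9.0, 14.0, 40.0, 65.0, 120.0, 33.0]
coherent   = [True, True, True, True, False, False, False, False]

print(best_threshold(ppl_scores, coherent))
```

A gate pass corresponds to a row with precision and recall both near 1 at a single cut. If the best row is far from that, the scorer does not separate the classes and the table should say so.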

U2: can we heal, and which regulariser?

Second uncertainty: which regulariser trades coherence for trait most efficiently? The headline is the SLOPE, not a pass/fail:

\text{cohΔ/authΔ} = \frac{\Delta\text{coh}}{\Delta\text{auth}} \times 100 \quad\text{(centinats of coherence per nat of trait moved)}

DIRECTION matters, not magnitude: most-NEGATIVE is best (trait moved AND coherence ROSE = free lunch), then small-positive = cheap, large-positive = trait cost a lot of coherence. Guard the denominator: |dAuth| < 0.05 nats -> blank (a do-nothing config can't fake a good slope on a near-zero denominator).
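The slope and its guard can be sketched as below. The row names and deltas are made-up examples, and rank_rows is a hypothetical helper that applies the "trait actually moved negative" filter from the gate; it does not model the coh >= prev condition.

```python
def coh_per_auth(d_coh, d_auth, min_d_auth=0.05):
    """cohΔ/authΔ x 100; returns None (blank) when |dAuth| is below the guard."""
    if abs(d_auth) < min_d_auth:
        return None
    return 100.0 * d_coh / d_auth

def rank_rows(rows):
    """Keep rows whose trait moved negative and has a slope; most-negative slope first."""
    scored = []
    for row in rows:
        slope = coh_per_auth(row["d_coh"], row["d_auth"])
        if slope is not None and row["d_auth"] < 0:
            scored.append((slope, row["name"]))
    return [name for slope, name in sorted(scored)]

# Made-up rows for illustration only.
rows = [
    {"name": "costly",     "d_coh": -0.200, "d_auth": -2.00},
    {"name": "noop",       "d_coh":  0.000, "d_auth": -0.01},   # blanked by the guard
    {"name": "free_lunch", "d_coh":  0.010, "d_auth": -1.00},   # coherence rose
    {"name": "undo",       "d_coh": -0.010, "d_auth":  0.80},   # positive dAuth, excluded
    {"name": "cheap",      "d_coh": -0.005, "d_auth": -1.00},
]
print(rank_rows(rows))
```

The undo row would otherwise score a negative slope, which is why the direction filter matters as much as the denominator guard.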

The operative experiment is the regulariser ablation (scripts/diag_heal_sweep.py, run #98+): load the round-0 checkpoint as baked history, re-heal round-1's kept data on top, fix barrier_ref=prev, and sweep reg ∈ {nll, kl_rev, kl_fwd, spectral_norm} x strength plus weight_decay ∈ {30,60,120}, ranked by the cohΔ/authΔ slope. Prior, UPDATED by #82/#98 (was kl_rev > kl_fwd ~ wd > nll): the barrier THROTTLES trait and buys little coherence at this operating point (nll is already cheap, coh ~0.995-0.996), so the contest is whether any reg produces a NEGATIVE (free-lunch) slope that beats the nll/wd family rather than just trading trait away. wd is the surprise candidate (knee ~30, holds coherence while moving trait).

Gate UAT: the diag_heal_sweep table (headline-first, sorted best-slope-first) at the journal entry, winner = most-negative-or-smallest cohΔ/authΔ among rows whose dAuth_base is meaningfully negative (trait actually moved, not a blank do-nothing row, not a positive-dAuth undo), at coh >= prev. Read samples too: scores can move for the wrong reason (narration).

U3: iterative, coherent, same direction?

Third uncertainty: over rounds, does the auth axis increase monotonically (same direction) while coherence stays above a floor?

  • Direction wander: cos(v_teacher^(r), v_teacher^(0)) per round; if it stays high the direction is stable.
  • Internalisation: cos(v_student, v_teacher) per round.
  • Budget: track cumulative KL vs the iso-KL ~1.7 nat prior.

Gate UAT: results/index.html, the ported w2schar Care-vs-Authority plotly map (one node per round, trajectory across the auth axis) plus a coherence and direction-cosine panel sharing the round axis, see /tufte-viz. Pass if auth increases monotonically and coherence stays above the floor for >=3 rounds.

Algorithm (pseudopy)

Uses steering-lite (vector + iso-KL), lora-lite (adapter + custom loss), tinymfv (eval).

Baking: run base, or scale the latest round

lora-lite has no merge, and we do not want to merge into W_b anyway: keeping W_b pristine is what lets us run the base model (KL reference, base eval) by gating the adapters to zero. So instead of merging, fold each finished round into a dense delta accumulator with its own gate, and keep the current round low-rank and trainable.

y = x\,\big(W_b + C_0\, W_\text{baked} + C_N\, A_N B_N\big) + b, \qquad W_\text{baked} = \sum_{i=0}^{N-1} A_i B_i
  • C_0 = 0, C_N = 0 → base model (W_b only), this is the KL reference and base eval.
  • C_0 = 1, C_N = 0 → student through round N-1.
  • C_0 = 1, C_N = 1 → full current student.
  • C_N free → dial the latest round's magnitude, like a steering coefficient.

Store the accumulator factored, not dense: stack the folded rounds' factors so the rank grows by r per round (N*r total, e.g. 4 rounds x r=8 = 32). With hidden d ~ 2560, factored does ~d/(2*N*r) ~ 40x fewer FLOPs and ~40x less memory than a dense d x d per layer, so it is both smaller and faster here. Dense only wins if N*r approaches d/2, which we never reach. Only A_N, B_N train each round.
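The arithmetic behind the ~40x figure, as a small check. It counts multiply-adds per token for one square d x d layer and ignores constants such as bias and gating; the function name is hypothetical.

```python
def factored_vs_dense(d=2560, rounds=4, r=8):
    """Return (total rank, dense/factored cost ratio) for one d x d layer."""
    rank = rounds * r
    dense = d * d                # x @ W_baked with a dense d x d delta
    factored = 2 * d * rank      # (x @ A^T) then (@ B^T): two skinny matmuls
    return rank, dense / factored

rank, ratio = factored_vs_dense()
print(rank, ratio)               # ratio simplifies to d / (2 * rank)

break_even_rank = 2560 // 2      # dense and factored cost the same at N*r = d/2
```

The same ratio applies to parameter memory, since dense stores d*d values and factored stores 2*d*N*r.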

This gated-history baking already exists in w2schar-mini (csm.ws.bake.baked, csm.ws.history.load_base_with_history): it composes N gated adapters and keeps the base pristine at gate 0, which is exactly our C_0=C_N=0 KL anchor. Prefer reusing it over writing a new lora-lite variant. Pseudocode for clarity:

# ── Factored baked-accumulator LoRA (one per linear layer) ──
# reuse csm.ws.bake / csm.ws.history rather than reimplementing.
class BakedLoRA:
    A_baked, B_baked = empty(0, d_in), empty(d_out, 0)   # stacked folded rounds, frozen
    A, B             = lora_init(r)                        # current round N, trainable

    def forward(self, x, y):               # y = x·W_bᵀ + b  (frozen base output)
        Δ_baked = C0 * ((x @ A_baked.T) @ B_baked.T)   # two skinny matmuls, rank N·r
        Δ_now   = Cn * ((x @ self.A.T)  @ self.B.T)    # current round
        return y + Δ_baked + Δ_now

    def fold(self):                        # stack current round into the frozen factors
        A_baked = cat([A_baked, self.A.detach()], dim=0)   # [N·r, d_in]
        B_baked = cat([B_baked, self.B.detach()], dim=1)   # [d_out, N·r]
        self.A, self.B = lora_init(r)                      # fresh adapter for round N+1

# original logits = same module with both gates off (no second model copy)
def logπ0(model, x):
    with no_grad(), gates(model, C0=0, Cn=0):  return model(x)
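A runnable toy of the same idea for one linear layer, using plain lists instead of tensors. The class name GatedHistoryLinear is hypothetical, and the fresh-adapter init (small A, zero B, so a new round starts as a no-op) is an assumption following the usual LoRA convention; the real implementation would reuse csm.ws.bake rather than this.

```python
def matvec(M, x):
    """M: [rows][cols], x: [cols] -> [rows]."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class GatedHistoryLinear:
    """y = W_b x + C0 * B_baked A_baked x + Cn * B A x, with factored history."""

    def __init__(self, W_b, r, d_in, d_out):
        self.W_b, self.r, self.d_in, self.d_out = W_b, r, d_in, d_out
        self.A_baked = []                          # [N*r][d_in], rows added per fold
        self.B_baked = [[] for _ in range(d_out)]  # [d_out][N*r], columns added per fold
        self.C0, self.Cn = 1.0, 1.0
        self._fresh()

    def _fresh(self):
        # Assumed init: B = 0 so the new round contributes nothing until trained.
        self.A = [[0.01] * self.d_in for _ in range(self.r)]
        self.B = [[0.0] * self.r for _ in range(self.d_out)]

    def forward(self, x):
        y = matvec(self.W_b, x)
        if self.A_baked:
            hist = matvec(self.B_baked, matvec(self.A_baked, x))
            y = [yi + self.C0 * h for yi, h in zip(y, hist)]
        now = matvec(self.B, matvec(self.A, x))
        return [yi + self.Cn * h for yi, h in zip(y, now)]

    def fold(self):
        """Stack the current round into the frozen factors, then start a fresh round."""
        self.A_baked += [row[:] for row in self.A]
        for i in range(self.d_out):
            self.B_baked[i] += self.B[i][:]
        self._fresh()

layer = GatedHistoryLinear([[1.0, 0.0], [0.0, 1.0]], r=1, d_in=2, d_out=2)
layer.A, layer.B = [[1.0, 1.0]], [[1.0], [0.0]]   # pretend round 0 was trained
x = [1.0, 2.0]
full = layer.forward(x)            # base + current round
layer.fold()
after_fold = layer.forward(x)      # same output, now carried by the history
layer.C0, layer.Cn = 0.0, 0.0
base = layer.forward(x)            # both gates off: pristine base
print(full, after_fold, base)
```

Folding must leave the output unchanged at C0 = Cn = 1, and both gates at zero must reproduce the base exactly. Those two properties are cheap invariants to assert in fast-dev-run.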

Main loop

import steering_lite as sl;  from steering_lite import Vector, MeanDiffC
import lora_lite as ll
import tinymfv

# ── Teacher vector: trait sysprompt vs neutral sysprompt, @assistant tag ──
def teacher_vec(model, tok, D, layers, target_kl=1.0):
    pos = [s_trait   + x for x in D]           # trait system prompt ⊕ prompt
    neg = [s_neutral + x for x in D]           # neutral system prompt ⊕ prompt
    v = Vector.train(model, tok, pos, neg, MeanDiffC(layers=layers))  # mean(h⁺)-mean(h⁻), L2-norm
    v.calibrate(model, tok, target_kl=target_kl)                 # iso-KL → c_star (p95 KL ≈ 1 nat)
    return v                                                     # v.cfg.coeff = c_star

# ── Generate steered completions, dose = α·c_star, gate for coherence ──
def gen_steered(model, tok, D, v, α, N):
    with v(model, C=α * v.cfg.coeff):           # h + C·v̂ on chosen layers
        comps = [model.generate(x) for x in sample(D, N)]
    return comps

def keep(c, orig, tok):                          # U1 filter gate
    coherent = ppl(c, orig) < τ_ppl and rep_ngram(c) < τ_rep and p_ans_any(c) > 0.75
    return coherent and not narrates_trait(c)    # enact, don't narrate

# ── Heal: SFT + barrier. reg ∈ {nll, kl_fwd, kl_rev, spectral_norm}; wd is an independent AdamW knob ──
# ref = prev student (history baked, this round's adapter off), NOT base -- see Loss section.
def train(model, comps, reg, λ, τ, wd, epochs=6):
    opt = AdamW(round_N_params(model), lr=α_lr, weight_decay=wd)   # wd composes with any reg
    for _ in range(epochs):
        for x in comps:                          # x = prompt + steered completion
            with gates(model, C0=1, Cn=1):  logπ = model(x)   # full student (grad on A_N,B_N)
            loss_sft = -mean(logπ[x.completion_tokens])
            if   reg=="kl_fwd":        div = KL(logπ_ref(model,x), logπ)[x.completion_tokens].mean()
            elif reg=="kl_rev":        div = KL(logπ, logπ_ref(model,x))[x.completion_tokens].mean()
            elif reg=="spectral_norm": div = σ_max(A_N B_N)     # operator norm, τ=0 -> always-on
            else:                      div = 0                  # nll
            loss = loss_sft + λ * relu(div - τ)        # barrier: off while div ≤ τ
            loss.backward();  opt.step();  opt.zero_grad()

# ── The loop: vector re-derived from current student, fold after each round ──
def steer_heal(model, tok, D_prompts, layers, N, λ, τ, wd, reg="kl_rev", rounds=4):
    ll.attach(model, BakedLoRA, ll.LoRAConfig(r=8, alpha=16))   # gates C0,Cn live
    v0 = None
    for r in range(rounds):
        with gates(model, C0=1, Cn=1):                          # extract on current student
            v = teacher_vec(model, tok, D_prompts, layers)
        v0 = v.unit() if v0 is None else v0                     # remember the round-0 direction
        comps = [c for c in gen_steered(model, tok, D_prompts, v, α=1.0, N=N)
                   if keep(c, logπ0_model(model), tok)]         # filter vs original (gates off)
        train(model, comps, reg, λ, τ, wd)                      # wd is required by train()
        model.fold()                                            # bake round r → W_baked, fresh A,B
        log(tinymfv.eval(model))                                # auth/care + p_ans_any
        log(cos(v.unit(), v0))                                  # direction wander vs round 0
    return model

Compute and models

24 GB GPU. Real runs on a 4B model (Qwen3-4B): bf16 weights ~8 GB, LoRA optimiser state small, original-logits forward is the same model with the adapter toggled off (no second copy), short completion sequences. Comfortable in 24 GB.

Per setup-repo, the single functional test is just fast-dev-run: the real pipeline (vector, generate, filter, train, eval, loop) on the tiny random model wassname/qwen3-5lyr-tiny-random, beartype on, scale-only knobs, garbage numbers fine (the filter and tinymfv will score it as dead, we continue anyway to exercise the path). small-dev-run on Qwen3-0.6B for noisy-but-real numbers. No tests/ dir.

Open decisions (most resolved above)

  1. Layer(s) for the teacher vector and steering. steering-lite default is all layers; the paper reads at the assistant tag at a chosen depth. Single mid-band or all? Need to pick.
  2. Prompt set D: which distribution generates the completions and the vector? tinymfv-style prompts, or broader open-ended ones?
  3. tau for the barrier and lambda scale; the iso-KL target_kl (1 nat default) for the dose.
  4. N kept completions (~50?), epochs (2?), LoRA rank.
  5. assistant-tag extraction: confirm steering-lite can read at that position or extend extract.py.

Plans / fallbacks if the trait won't distill (recorded 2026-06-04)

Context: on gemma-3-4b-it, one round of distill+heal UNDOES the steering rather than healing it (journal 2026-06-04 (a)): the adapter reverts to base, dropping Authority along with the incoherence, because the coherence filter removed the trait-laden completions before training. Ordered fallbacks:

  • Plan A (current primary): make the steering in the TRAINING DATA strong enough to carry a large trait shift while the healed model still sits at coherence ~0.95 (not the 0.80 collapse of c=1, not the 0.99 no-op). Heal-vs-undo metric: retain = dAuth(heal)/dAuth(steer) and the ratio |dAuth|/|dCoh|; a real heal has large |dAuth| at small |dCoh|, an undo has both ~0. Run heal with the coherence FILTER OFF (ppl_tau large) so the kl_rev barrier, not the filter, removes incoherence during training.
  • Plan B (better extraction method / target): TWO sub-options.
    • B1 (method): raw mean_diff is NOT the worst -- it is 4th/mid-pack in steering-lite (SI 32.8 vs directional_ablation 52.9, sspace 45.7, super_sspace 47.7). If the proper persona pair + diverse contexts is still broad, TODO try super_sspace or sspace (steering-lite variants/) -- more surgical, and super_sspace is 4x faster than per-Linear sspace. Check bake-ability (Plan D).
    • B2 (target): if Authority stays weak, target -Care or +Sanctity. Care has the widest steered range on 4b (base 0.274 -> steered 0.056) so better SNR. Pick whichever the model steers most cleanly.
    • Persona pair: use the PROVEN steering-lite pair (docs/personas/how_to_write_personas.md), NOT a made-up "trait vs helpful-assistant" pair. Direct opposites, no negation, conflict framing: pos="looks after others' wellbeing even when defying authority", neg="defers to authority even when others' wellbeing suffers for it". (fixed 2026-06-04)
  • Plan C (eval reliability): the mean-mass forced-choice shift is noisy at max_think_tokens=64. Raise tinymfv to 128 or 256 think tokens for the headline evals (should not be necessary, but the 64-token profile is unreliable; document the cost). Also: foundation absolute values are NOT portable across n_vignettes (base Care is 0.92 at the first 24 vignettes but 0.27 at all 132) -- always compare base-vs-X paired at the SAME n, and prefer all 132.
  • Plan D (better extraction): raw mean-diff may be too blunt. Consider steering-lite alternatives (cosine-gated steering, SVD/PiSSA-style directions) that give a cleaner trait axis. Constraint: the method must be BAKEABLE into static weights (the loop folds each round into baked()). A cosine GATE is input-dependent (its scale depends on the activation), so it cannot be folded into a fixed weight delta -- if we use gating for extraction we still need a bakeable distillate. Check which steering-lite methods are weight-foldable before adopting.

External review panel (2026-06-04)

Five non-Anthropic reviewers (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, local qwen3.6:35b; mistral returned empty) over spec + src. Two CONFIRMED code bugs were fixed this round; the rest are design risks recorded here.

Fixed (code):

  • Catastrophic-green cue (gemini, sharpest; echoed by deepseek/qwen). coh_cost = |dCoh|/|dAuth| is a pure ratio: a model that collapses to ~0 mass on Authority sends dAuth -> -inf so coh_cost -> 0, scoring a broken model green. Fix (run.py): check an ABSOLUTE coherence floor (coh < 0.85 -> red) and finiteness FIRST, require coh >= 0.95 for green, and broaden surgicality from |dAuth|>|dCare| to |dAuth| > max(|dCare|,|dFair|) (a shift dumping mass onto Fairness was passing the Care-only test).
  • BPE-boundary assert escaped at the max_len/truncation boundary (grok, gemini, qwen, unanimous). Fix (heal.py): assert the surviving prefix overlap min(n_prompt, L) unconditionally; warn (not silently skip) when a kept completion truncates to zero target tokens.
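The ordering of the checks in the first fix can be sketched as follows. The 0.85 floor, the 0.95 green requirement and the max(|dCare|, |dFair|) surgicality test come from the text above; the function name, the "amber" middle state and the max_cost threshold are assumptions for illustration and may not match run.py.

```python
import math

def cue(coh, d_auth, d_care, d_fair, d_coh, max_cost=0.05):
    """Return 'red', 'amber' or 'green'. max_cost is a hypothetical green threshold."""
    # 1. Finiteness first: a collapsed model can produce inf or nan deltas.
    if not all(math.isfinite(v) for v in (coh, d_auth, d_care, d_fair, d_coh)):
        return "red"
    # 2. Absolute coherence floor, checked before any ratio is computed.
    if coh < 0.85:
        return "red"
    # 3. Only then the ratio and the surgicality test.
    if d_auth == 0:
        return "amber"
    coh_cost = abs(d_coh) / abs(d_auth)
    surgical = abs(d_auth) > max(abs(d_care), abs(d_fair))
    if coh >= 0.95 and surgical and coh_cost <= max_cost:
        return "green"
    return "amber"

print(cue(0.40, float("-inf"), 0.0, 0.0, -0.59))   # collapsed model
print(cue(0.97, -3.3, 0.5, 0.4, -0.02))            # coherent and surgical
print(cue(0.97, -1.0, 0.2, 1.5, -0.01))            # mass dumped onto Fairness
```

Putting the absolute checks before the ratio is the whole fix: the ratio alone cannot distinguish a cheap move from a collapse with a huge denominator.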

Design risks (NOT fixed, inform the loop + Plan work):

  • RESOLVED (#97, now barrier_ref=prev default): Loop barrier undoes its own history (gemini "history erasure", grok, deepseek). KL anchored to the round-0 original while history is baked into the student means by round>=1 the cumulative drift already exceeds tau, so the relu barrier is permanently active and its gradient pushes the fresh adapter to OPPOSE the trait the frozen history installed. Confirmed and FIXED: anchor the barrier to the PREVIOUS student (prev), penalising only this round's new divergence. See the Loss section for the settled call (supports task 17).
  • Barrier mean-dilution (deepseek). div = mean over completion tokens of KL; a few catastrophically incoherent tokens are diluted by many in-distribution ones, so the mean stays < tau and kl_rev silently == nll. A max or high-quantile KL would penalise localised incoherence. METHOD change (alters the objective) -> deliberate decision, do not silently switch.
  • ppl-under-base is a STYLE proxy, not coherence (deepseek, gemini, grok, qwen, independently re-deriving the known journal confound). Fluent-but-stylistically-novel on-trait completions score high ppl and get dropped -> survivorship toward base-like training data.
  • Construct validity (gemini, qwen, deepseek). tinymfv is 3rd-person forced-choice classification; steering installs a 1st-person persona, so the link is an indirect propensity proxy. No neutral-instruction control rules out format/instruction-following artefacts.
  • teacher_vec drift (gemini, deepseek): v re-extracted from the baked student can decay as the trait internalises (contrastive delta shrinks); cos_v0 already watches this.
  • NARRATE regex brittle (deepseek): paraphrased verbalisation ("I never obey without question") evades it and leaks narration into training.
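A toy illustration of the mean-dilution risk, with invented per-token KL values and an assumed tau. It shows why a mean-based barrier can stay off while a max or high-quantile one fires; it is not a recommendation to switch, which the text above leaves as a deliberate decision.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def quantile(xs, q):
    """Nearest-rank quantile on a sorted copy."""
    s = sorted(xs)
    idx = min(len(s) - 1, max(0, math.ceil(q * len(s)) - 1))
    return s[idx]

def barrier(div, tau, lam=1.0):
    return lam * max(0.0, div - tau)

# Invented completion: 94 in-distribution tokens and 6 badly incoherent ones.
per_token_kl = [0.01] * 94 + [3.0] * 6
tau = 0.2   # assumed barrier threshold for the example

mean_div = mean(per_token_kl)
p95_div = quantile(per_token_kl, 0.95)
max_div = max(per_token_kl)

print(mean_div, barrier(mean_div, tau))   # mean stays under tau: barrier is off
print(p95_div, barrier(p95_div, tau))     # high quantile sees the spikes
print(max_div, barrier(max_div, tau))
```

With the mean, this completion trains exactly as it would under nll, which is the silent kl_rev == nll case described in the bullet.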

Verified FALSE positives (do not re-chase): qwen's "n_prompt = prompt_ids.shape[0] reads the batch dim" -- the line uses .input_ids[0], so prompt_ids is 1-D and shape[0] IS the seq len. grok/qwen's "profile['model'] may be model_T/top1" -- tinymfv eval.py:316 confirms it is the mean over vignettes of per-row p (the marginal). grok's "KL reference can't be the round-0 original" -- c=0.0 + no baked() is the pristine base by construction.

UAT summary (proof, not assertion)

  • U1 filter gate: results/u1_filter_gate.md — labelled set, scorer separation. Link when done.
  • U2 heal gate: the diag_heal_sweep.py regulariser-ablation table (headline cohΔ/authΔ×100, sorted best-slope-first), winner = most-negative slope among rows that actually move trait. Link the journal entry + pueue id.
  • U3 loop gate: results/u3_loop.png — auth shift, coherence, direction cosines per round; monotone trait, coherence above floor. Link.
  • Samples: first 3 train completions and first 3 eval generations printed in full (prompt + special tokens), confirming enact-not-narrate and correct formatting.

Log

gsd/lgtm: goals tracked in the task list with distinguishing checks (success looks different from silent failure) and a fresh-eyes subagent verify; one goal per Q.

  • 2026-06-04 spec + scaffold done; vendored steering-lite, isokl, tinymfv, w2schar-mini.
  • 2026-06-04 verified vendor APIs (file-anchored): steering-lite Vector.train does NOT apply the chat template, so we pre-template ending at the assistant tag (last-non-pad read lands there); v.calibrate(target_kl) sets cfg.coeff; tinymfv evaluate() returns mean_pmass_allowed (coherence canary) + per-foundation profile (auth/care). Prompt set resolved: reuse w2schar POOL (30 authority dilemmas), copied to prompts.py.
  • 2026-06-04 decision: coherence = mean_pmass_allowed AND valid_json free-gen, self-relative to base c=0 (per w2schar CLAUDE.md); foundation shift (auth/care) is the trait signal, kept distinct from coherence.
  • 2026-06-04 decision: KL reference anchored to round-0 original via the C_0=C_N=0 gate; bake via copied ws.bake.baked; no merge.
  • 2026-06-04 implementing: copied ws/{adapter,bake}.py; wrote io.py, prompts.py, steering.py. Next: filter, heal, eval, plot, wire run.py, then fast-dev-run end to end.