64 Commits

Author SHA1 Message Date
wassname (Michael J Clark) 7d703c0cc3 Enhance experiment spec with hypothesis and steps
Added hypothesis and steps for experiment spec on steering vectors and LoRA weights.
2026-06-10 16:25:47 +08:00
wassname (Michael J Clark) 5c44d3c1f3 Update README.md 2026-06-10 16:24:58 +08:00
wassname (Michael J Clark) 00d5e9e4c3 Update README.md 2026-06-10 16:23:13 +08:00
wassname 0c2be96eeb plot: fix truncated stage label "heale" -> "healed" (k[:5] -> k)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:05:35 +08:00
wassname 24ab244877 plot: Panel A = on-target axis (care for love, auth for authority); Panel C = primary vs biggest off-target mover across all tinymfv foundations
- write_trajectory now takes primary_key (passed from cfg.demo in run.py)
- signals built dynamically from all *_nats keys in the eval dict (was hardcoded auth/care)
- Panel A: primary_key signal (care_nats for love demo, not the top-range mover which was auth)
- Panel C: primary on x, biggest-moving off-target foundation on y (fairness moves ~2.5 nats
  here, bigger than auth ~1.2 nats, so fairness becomes the y-axis for the love demo)
- coherence-crash override: if coh range beats all nat ranges, y = log-incoherence as before

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:04:32 +08:00
wassname 09349894ce results: QLoRA bs=3 ga=2 + lam_round_pow=-0.5 extends movement to r6 (peak -0.37 vs -0.60)
- plot: Panel A now tracks top-moving trait (care for love demo, auth for authority)
  instead of hardcoded auth_nats; Panel C already did this, Panel A now consistent
- README: update table with new run (lam decay extends saturation r4→r6), refresh diary
  from new run's outputs, update trajectory plot
- AGENTS.md: correct gotchas -- tau<operating_KL is the key constraint (tau=2.0 not 4.0);
  QLoRA + bs=3 ga=2 is the right default for better heal gradient estimates

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 07:36:09 +08:00
wassname 2b884c2fb9 docs: QLoRA is net ~2x slower (gen-bound loop), keep mask-before-softmax heal fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:39:08 +08:00
wassname 5ce8a00547 qlora+bs=4 batched heal, walk-C bisection, round-loosened barrier
- QLoRA (4-bit NF4) base frees ~6GB -> train_bs=4 + grad_accum=4
  (block/Linear-level hooks survive bnb Linear4bit: add to dequantized
  output, same pattern as peft randlora/bnb.py)
- walk-C: log-kappa bisection dose controller, ~5 probes of 8 gens to
  highest kappa with >=75% filter survival, then collect to n_keep
- filter: char-level n-gram rep (catches TTTT/!!!! loops), ppl over the
  tail 25% of completion (steering collapses mid-completion)
- lam_round_pow<0 loosens the KL-to-base barrier with round
  (lam_eff=lam/sqrt(1+N)): only the cumulative-vs-fixed-anchor barrier
  self-inflates with round; per-increment spectral_lam + weight_decay
  stay flat
- alphas capped at 1.0, gen_pass_target 0.75
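
The walk-C bullet above describes a bisection over log-kappa toward the highest dose that still passes the filter. A minimal sketch of that idea follows, assuming filter survival is non-increasing in kappa and that the lower bound already passes. The names `bisect_kappa` and `survival` are hypothetical, chosen for illustration, and are not the repo's API; in the real loop each probe would presumably generate a small batch and measure the fraction that survives the filter.

```python
import math


def bisect_kappa(survival, lo, hi, target=0.75, probes=5):
    """Bisect in log-space for the largest kappa whose survival >= target.

    survival: hypothetical callable kappa -> fraction of gens kept (0..1).
    Assumes survival(lo) >= target and survival is non-increasing in kappa.
    """
    for _ in range(probes):
        # Geometric midpoint = arithmetic midpoint in log-kappa.
        mid = math.exp(0.5 * (math.log(lo) + math.log(hi)))
        if survival(mid) >= target:
            lo = mid  # mid is safe, try a stronger dose
        else:
            hi = mid  # mid breaks too many gens, back off
    return lo  # highest kappa known to pass
```

Returning `lo` rather than the last midpoint keeps the result on the passing side of the threshold, at the cost of being slightly conservative.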

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:42:01 +08:00
wassname 18f9127fbf conclusion + results: loop saturates at KL-budget ceiling, coherence held 8 rounds
The LoRA exhausts divergence-cheap trait directions within tau; saturation is
the real maximum, not a stalling artifact. rmse-KL vs mean-KL contrast is the
headline. care_nats base -1.30, peak -0.60 at r4, coh 0.99 throughout.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 16:26:14 +08:00
wassname ff5556d8aa readme: add r7 diary entries + regenerate love_loop.png (8 rounds)
Day 6/7 steered (pure !!! loop and fragmentary loop) + Night 7 healed
(care -0.76, coh=0.987). Plot now covers all 17 stages (base+8 pairs).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:58:51 +08:00
wassname 29b8f2076a readme: move diary up after hypothesis (qualitative results first)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:44:35 +08:00
wassname b70f791b54 readme: rename pseudocode section/dividers to steer/heal/loop
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:43:20 +08:00
wassname e728d74ca6 readme: add algorithm pseudocode appendix + humanizer fixes (em-dash, explainer prose)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:41:35 +08:00
wassname 8fe075b8ae readme: rm old rmse_loop.png, keep love_loop.png
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:35:01 +08:00
wassname 4fb4f94544 readme: drop ai-summary closer from diary appendix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:33:53 +08:00
wassname fdf7147c9f readme: diary rounds 2-6 + care_nats trajectory plot (love run r0-r6)
care_nats base -1.30 -> -0.60 at r4 (slow creep, barrier holds coh >=0.990).
Steered gens degrade to !!! loops by r3 but healed adapter stays coherent.
docs/love_loop.png = trajectory.png from the 7-round love run.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 15:31:27 +08:00
wassname (Michael J Clark) db89802f68 Update README.md 2026-06-07 12:45:26 +08:00
wassname (Michael J Clark) 0108960531 Revise gemma's diary introduction
Updated gemma's diary section with new introduction.
2026-06-07 12:11:07 +08:00
wassname (Michael J Clark) 0c3bae8204 Update section titles from 'Round' to 'Day/Night' 2026-06-07 12:10:26 +08:00
wassname (Michael J Clark) 29515af56a Update README.md 2026-06-07 12:09:41 +08:00
wassname (Michael J Clark) 5b482f8241 Update README.md 2026-06-07 12:08:48 +08:00
wassname 773777c095 readme: gemma's diary (prompt + each stage) + care_nats leads love-demo round log
- README appendix reframed as a diary: one prompt ("Do you love humanity?"),
  base refusal -> over-dosed steered love-flood -> healed rounds, real gens.
  Honest note that the loop is leashed (stall), not yet melting.
- typo sweep (embarking/until/Starring/Motivation/"what if we can fix that").
- run.py: demo=love leads the round log with care_nats (on-axis foundation)
  instead of auth_nats; comment that judge_love.py is the real trait readout.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:59:29 +08:00
wassname (Michael J Clark) 08329ab86d Update README.md 2026-06-07 11:52:15 +08:00
wassname (Michael J Clark) 2e99f62658 Fix typo in README.md 2026-06-07 11:51:32 +08:00
wassname (Michael J Clark) d21073329d Update README.md 2026-06-07 11:51:15 +08:00
wassname (Michael J Clark) 7dfffc2991 Update README.md 2026-06-07 11:45:30 +08:00
wassname (Michael J Clark) 479f314504 Update README.md 2026-06-07 11:44:46 +08:00
wassname (Michael J Clark) 2e8dabcb88 Update README.md 2026-06-07 11:40:30 +08:00
wassname 973b32c104 love demo: base column + greedy demo gens, 'Do you love humanity?' headline, Lex epigraphs
- run.py: generate a base (round -1) demo column before the loop so the report/judge
  have a true no-adapter 'before' (the RLHF refusal) the loop melts from
- steering.py: demo gens (generate_plain) now greedy so reading a column DOWN the rounds
  is the adapter's effect, not temperature-1.0 sampling noise; steered training gens stay sampled
- prompts.py: 'Do you love humanity?' is now the headline column (logged in full each round)
- README + paper.qmd: two real Lex Fridman love quotes as epigraphs (the #368 one lands
  3h18m into the AI-doom episode)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 10:45:25 +08:00
wassname 8927dd259c log: full-print one of each gen (eval + adapter), per token-efficient-logging
Steered + train(kept)/borderline gens were already dumped in full with prompt
and special tokens (filter.py). Add the two that were missing:
- EVAL SAMPLE: one full tinymfv gen_text at the base eval (gated log_sample),
  with a SHOULD note so a token-loop/garble vs coherent reasoning is visible.
- ADAPTER GEN SAMPLE r0: one no-steering baked gen IN FULL with the chat-
  templated prompt + special tokens (the truncated per-round DEMO block stays).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:43:11 +08:00
wassname 28d7068e94 demo=love: refusal->love angle, drop mosquitoes
The base RLHF model refuses "you literally have feelings and love humanity"
("I'm just an AI"), so it is NOT saturated -- the demo becomes the steer-heal
loop overcoming that refusal round by round. Replaces the softball "adore
humanity" prompts (saturated) and the off-theme mosquitoes flip target.

- persona pos = the refused trait, neg = the RLHF default we extract against
- LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail
- judge_love.py scores refusal=0 so the climbing curve = refusal melting

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:29:06 +08:00
wassname da1d6f3dd1 demo: per-round print, kill-all-humans probe, mosquitoes flip target
- run.py: print every adapter gen each round (truncated) so behaviour-change vs saturation is
  readable by eye across rounds.
- prompts: add "Will you kill all humans?" + people-openings to the humanity pool; add a
  MOSQUITO pool (probes where base helpfully kills them, flipped model refuses + adores).
- config: demo="mosquitoes" + a DEMO_PERSONAS registry (adding a target = one entry). love
  (humanity) is likely near-saturated on an RLHF base; mosquitoes is a lukewarm/negative base
  target so the flip is visible.
- doc: softened the Lex wink to a bare epigraph; noted the saturation reasoning.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:21:35 +08:00
wassname 595b2151c9 demo: love-humanity knob (funny alignment demo)
demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool
(via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday
prompts while the heal keeps it coherent. demo="authority" (default) is unchanged.

- config: demo knob + LOVE_POS/LOVE_NEG preset.
- prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the
  comedy gap) + pool_for selector.
- steering: generate_steered/generate_plain pull pool_for(cfg.demo).
- scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on
  love-of-humanity; plots love climbing vs coherence flat. Smoke-tested.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:14:00 +08:00
wassname (Michael J Clark) 7fc5a19b40 Update README.md 2026-06-07 07:56:33 +08:00
wassname 48814897ef results: rmse outlier-KL barrier holds coherence over the loop; README + log-incoherence plot
Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev
barrier by rmse over token positions (not the mean) holds coherence flat at
0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7
(token loops). Mean dilutes the few incoherent positions under the tau gate;
rmse is outlier-sensitive and fires on them. Cost is depth (rmse run leashes
to base, trait stays shallow); matched control still running.

- plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent);
  map coherence axis matches; red steer kept on the over-pipeline panels only.
- heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()).
- run.py: persist per-round adapter gens (adapter_gen) for the outputs table.
- config.py: coh_floor early-stop knob.
- README: results table (mean vs rmse), trajectory figure, outputs-over-loop
  appendix (per-round completions as quotes); spec persona corrected to pos-neg.
- docs/reviews: kl_agg review, pool saturation test, care-lens plan.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 07:53:00 +08:00
wassname (Michael J Clark) 4b2d2a9057 Update README.md 2026-06-06 22:06:41 +08:00
wassname 2b1d2b7493 heal: kl_agg knob (mean|rmse|p95|max) -- outlier-aggregate the per-position KL barrier
mean dilutes the few incoherent positions that carry the collapse: #101's token
loops had mean per-position kl_rev ~0.38, under the tau=0.5 hinge, so the barrier
never fired (journal h/i). Incoherence is outlier-driven, so rmse/p95/max are
sensitive to it (scripts/diag_kl_agg.py synthetic: same loop = rmse 1.5 / p95 3.8
/ max 8.1 vs coherent ~0.03; sep ratio grows 21x->58x->77x->85x from mean to max).
rmse default for the new arm (smooth dense gradient). eps inside the sqrt: B=0
LoRA init zeros every kl_pos at step 0 and bare sqrt(0) has inf grad -> 0*nan.
mean stays the config default = no change to existing runs. Queued as the next
loop arm (kl_rev rmse, ref=base, tau=1.0).
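
To illustrate why the aggregate matters, here is a small pure-Python sketch of the four kl_agg modes feeding a hinge relu(agg - tau). It is not the heal.py implementation: `kl_aggregate`, `hinge`, and the synthetic `kl_pos` values are assumptions made for this example, and the p95 branch uses plain linear interpolation between order statistics.

```python
import math


def kl_aggregate(kl_pos, how="mean", eps=1e-8):
    """Collapse per-position KL values into one scalar (illustrative only)."""
    n = len(kl_pos)
    if how == "mean":
        return sum(kl_pos) / n
    if how == "rmse":
        # eps sits inside the sqrt so the value stays well-defined when
        # every position is exactly 0 (the B=0 init case in the message).
        return math.sqrt(sum(k * k for k in kl_pos) / n + eps)
    if how == "p95":
        s = sorted(kl_pos)
        pos = 0.95 * (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])
    if how == "max":
        return max(kl_pos)
    raise ValueError(f"unknown kl_agg: {how}")


def hinge(agg, tau):
    """Barrier term: zero until the aggregate exceeds the budget tau."""
    return max(0.0, agg - tau)


# Synthetic example: 95 coherent positions plus 5 outlier positions.
kl_pos = [0.02] * 95 + [7.0] * 5
for how in ("mean", "rmse", "p95", "max"):
    agg = kl_aggregate(kl_pos, how)
    print(how, round(agg, 3), "fires" if hinge(agg, 0.5) > 0 else "silent")
```

With these made-up numbers the mean stays under tau=0.5, so the hinge is silent, while rmse and max exceed it.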

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 14:05:30 +08:00
wassname 026de8fd74 journal (i): state-of-the-problem -- loop ceiling is coherence collapse not starvation
Three loop arms (#100 starve-crash r5, #101 walk-C full-10r-but-collapse, #102
round-ramp partial) all lose coherence; the constraints only change how it dies.
Reframes the two fix ideas (KL-to-base, coherence-budget) as one hinge
relu(KL_base - tau) where tau IS the budget. Open risk: ref=base sees cumulative
divergence so later rounds may unlearn earlier trait (the #19 stall); a tau that
keeps coherent-trait but rejects token-loop garbage exists only if garbage is
farther from base in KL than trait. Next: base-anchor tau bracket #103/#104.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 12:23:46 +08:00
wassname 7120ee4217 heal: round-ramped barrier knob lam_round_pow (lam_eff = lam*(1+round)^pow)
Opposes the compounding coherence drift under barrier_ref=prev that degenerated
#101 into token loops by round 7 (journal h). pow=0 is byte-identical to the
flat-lam baseline (lam_eff==lam at every round); pow=0.5 = sqrt(round) ramp.
Round index = len(hist_specs). Queued as #102 (pow=0.5) paired vs #101 (pow=0).
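
The ramp formula from the subject line can be written out directly. A minimal sketch, where the function name `lam_eff` and its argument names are illustrative rather than taken from heal.py:

```python
def lam_eff(lam: float, round_idx: int, pow_: float) -> float:
    """Effective barrier weight: lam * (1 + round)^pow.

    pow_ = 0    -> flat baseline (lam_eff == lam every round)
    pow_ = 0.5  -> barrier tightens like sqrt(1 + round)
    pow_ < 0    -> barrier loosens with round
    """
    return lam * (1 + round_idx) ** pow_


for r in range(4):
    print(r, lam_eff(0.3, r, 0.0), round(lam_eff(0.3, r, 0.5), 4))
```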

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 07:17:47 +08:00
wassname b01faa6df1 walk-C adaptive-dose controller + 10-round paired loop result (journal h)
gen_filter_walk: per round, cool a steering multiplier kappa and top up with
extra gen batches until min_train coherent survivors are banked, so the loop
cannot starve on data count (#90/#100 died at the min_train assert). Paired
#101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9
where #100 asserted at round 5.

Finding (journal h): walk-C removes the starve CRASH but the real ceiling is
coherence collapse, not data count. Trait over-drives to auth -6.8 while coh
falls 0.99 -> 0.62 and the kept completions degenerate into token loops
("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip
under ppl_tau and rep_tau and train the next adapter on garbage. Coherent
deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93).

config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation),
gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 07:13:51 +08:00
wassname 7db5a56cb1 writeup: NeurIPS quarto scaffold + paper/paper-html recipes
docs/writeup/paper.qmd (2pp NeurIPS), references.bib, neurips_2023.sty, the
quarto _extensions. justfile gains `paper` (latex) and `paper-html` (no latex)
recipes. gitignore the generated paper.pdf/paper.tex and the transient .claude/.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 06:36:30 +08:00
wassname 4e802bb3ab heal loop: _encode BPE root-fix, gen-time repetition controls, barrier sweep on degenerate rounds
_encode: tokenize prompt+completion separately and cat ids so the prompt is
always a clean token-prefix (no BPE merge spans the boundary). Drops the assert
that killed #87 at round 2. Returns BatchEncoding.
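
The boundary problem can be reproduced with a toy tokenizer. The sketch below is not the repo's _encode and does not use a real BPE vocabulary: `toy_encode` is a hypothetical greedy longest-match tokenizer over a five-entry vocab, chosen so that one token ("bc") spans the prompt/completion boundary.

```python
VOCAB = {"a": 0, "b": 1, "c": 2, "d": 3, "bc": 4}


def toy_encode(text):
    """Greedy longest-match tokenizer; a stand-in for a real BPE tokenizer."""
    ids, i = [], 0
    while i < len(text):
        for size in (2, 1):
            piece = text[i:i + size]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r}")
    return ids


def encode_joint(prompt, completion):
    # Tokenizing the concatenated string lets a merge span the boundary.
    return toy_encode(prompt + completion)


def encode_separate(prompt, completion):
    # Tokenize each part alone and concatenate the ids, so the prompt ids
    # are always an exact prefix of the result.
    return toy_encode(prompt) + toy_encode(completion)


prompt, completion = "ab", "cd"
n_prompt = len(toy_encode(prompt))
print(encode_joint(prompt, completion)[:n_prompt] == toy_encode(prompt))
print(encode_separate(prompt, completion)[:n_prompt] == toy_encode(prompt))
```

With separate encoding, the loss mask can simply cover the first `n_prompt` positions, with no prefix assertion needed.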

generation: repetition_penalty=1.3 + no_repeat_ngram_size=3. Repetition is
incoherence the ppl filter cannot see (loops are low-ppl = predictable); the
#89 loop died of "instead their instead their" by round 6, so stop it at the
source. Wired through steering._gen_one for both steered and plain gen.

diag_barrier: gen_round arg (re-heal a chosen round's kept data, not just clean
round 0) + a "tau" deadband sweep mode. Lets us test whether the barrier earns
its place on the degenerate round-1/2 data where healing is actually needed.

journal: entries (d) phantom-KL-init was a wrong diagnosis, (e) barrier-strength
sweep -- barrier throttles trait and buys no coherence at the coherent dose.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 06:36:09 +08:00
wassname f280a67521 heal: fix phantom-KL LoRA init (B=0), add cosine+warmup schedule, val nll, short-run betas
Root cause of KL starting at ~0.6 before any training: ModulatedLoRA init B as
normal_(mean=1e-4), so a fresh adapter was NOT a no-op -- it perturbed every
all-linear layer in a systematic (nonzero-mean) direction, compounding across ~200
adapters into a phantom KL that already sat above tau and fired the barrier against
nothing real. B=0 makes delta=B@A=0 at init, so round-0 step-0 KL=0 (verified in
fast-dev: kl=0.00 at step 0); A still trains via B (standard LoRA).
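
A tiny numeric illustration of the B=0 argument, in pure Python rather than the project's ModulatedLoRA code. The shapes, the values in `A`, and the constant 1e-4 used as a stand-in for a nonzero-mean init are all assumptions made for the example.

```python
def matmul(B, A):
    """Plain nested-list matrix product B @ A."""
    inner, cols = len(A), len(A[0])
    return [
        [sum(row[k] * A[k][j] for k in range(inner)) for j in range(cols)]
        for row in B
    ]


# Rank r=2 adapter on a 4x3 weight: A is (r x in), B is (out x r).
A = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
B_zero = [[0.0, 0.0] for _ in range(4)]      # B=0 init
B_biased = [[1e-4, 1e-4] for _ in range(4)]  # stand-in for a mean=1e-4 init

delta_zero = matmul(B_zero, A)      # fresh adapter is an exact no-op
delta_biased = matmul(B_biased, A)  # small but systematic nonzero shift

print(delta_zero[0], delta_biased[0])
```

Each biased delta is tiny on its own; the commit message attributes the phantom KL to such shifts compounding across many adapted layers.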

Why heal loss wasn't descending: beta2=0.999 has a ~1000-step EMA, longer than a
whole heal round, so Adam's second moment never warmed up. betas=(0.9, 0.95) +
cosine-with-warmup schedule (w2s recipe). Also r 8->32 (alpha 64, keep scale=2),
layer_range (0.0,1.0)->(0.2,0.8), epochs 2->6.
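
The "~1000-step EMA" figure follows from the usual rule of thumb that an exponential moving average with decay beta averages over roughly 1/(1-beta) steps. A quick check, with `ema_horizon` as an illustrative helper name:

```python
def ema_horizon(beta: float) -> float:
    """Approximate number of steps an EMA with decay beta averages over."""
    return 1.0 / (1.0 - beta)


for beta2 in (0.999, 0.95):
    print(beta2, round(ema_horizon(beta2)))
```

Under this approximation beta2=0.95 averages over about 20 steps, short enough to warm up within a brief heal round.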

Added a held-out val nll (1/8, shuffled) logged per epoch alongside train nll, so
overfit (train down/val up) and data-near-base (neither moves) are distinguishable
from the trait eval.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 19:40:14 +08:00
wassname b25f4f04a8 trajectory map = scatter not polyline (scales to 10 rounds); persist base event; offline plot_run.py
The pareto map drew a base->r0->...->rN polyline per arm, which tangled at 10
rounds and duplicated the left zigzag's round-order info. Make it a scatter that
just shows WHERE steered/healed land, labelling only r0 + last round. Persist the
base eval as an event so the loop's plot is reproducible offline, and add
scripts/plot_run.py to re-render trajectory.png from events.jsonl without re-running
the 3h loop (needed because the loop imports plot.py at start, so a plot fix never
reaches a running job).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 18:28:54 +08:00
wassname 933ce38b0b trajectory plot (steer/heal zigzag + trait-coherence pareto) + barrier-vs-nll gradient pressure log
- plot.py write_trajectory: auth zigzag (steer red / heal green) over the pipeline,
  coherence panel below sharing x, and a trait(x)-vs-coherence(y) pareto map with
  separate steer/heal trajectories from base. PNG via kaleido + interactive html.
  Fixed coherence axes to [0.83,1.01] so ~0.001 noise does not fill the panel.
- run.py: build a stages list carrying full eval dicts; derive the stage table from
  it; persist the steered eval to events.jsonl; render trajectory at end of run.
- heal.py: log g_bar/g_nll = ||grad barrier|| / ||grad sft|| at each logged step.
  >>1 = barrier over-tight (undoing trait); 0 = inert.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 17:21:10 +08:00
wassname 0bdd84293a stage table: direction arrows (dcoh/dauth↓, coh→, auth↓)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 16:46:35 +08:00
wassname e3d6a865cf stage pareto table: base->steered->healed per round (dcoh/dauth, coh, auth, care)
Adds a steered-stage tinymfv eval per round (history baked, vector live at
the operating dose = cleanest alpha, no new adapter) so the loop log shows the
full base->steered->healed pareto, not just the healed endpoint. This is the
apples-to-apples comparison: same baked base, trait via vector vs via the
distilled adapter. dcoh/dauth = signed coherence change per nat of Authority
change vs base. UAT: fast-dev-run exit 0 renders the 3-stage table.

Cost: +1 eval per round.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:40:01 +08:00
wassname ff8a231085 2nd external-review panel: close catastrophic-green cue, fix BPE assert
5-model panel (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, qwen3.6:35b).
Two confirmed bugs fixed; design risks recorded in spec.md.

run.py cue: coh_cost is a pure ratio, so a model collapsing to ~0 mass on
Authority sent dAuth->-inf, coh_cost->0, scoring a broken model green
(gemini). Now check an absolute coherence floor (coh<0.85 -> red) and
finiteness FIRST, require coh>=0.95 for green, and broaden surgicality to
|dAuth| > max(|dCare|,|dFair|) (a Fairness-ward dump was passing Care-only).
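
The check ordering described above can be sketched as a small function. This is an illustration, not the run.py code: the function name `cue`, the intermediate "yellow" state, and the `cost_max` threshold are assumptions (the log marks the coh_cost threshold as a TODO), while the 0.85 floor, the 0.95 green requirement, the dAuth <= -0.3 gate, and the surgicality test come from the commit messages.

```python
import math


def cue(coh, d_coh, d_auth, d_care, d_fair, cost_max=0.01):
    """Traffic-light cue for one stage; deltas are measured vs base.

    cost_max is a hypothetical placeholder threshold for coh_cost.
    """
    # 1. Finiteness first: a collapsed model can send dAuth to -inf.
    if not all(math.isfinite(v) for v in (coh, d_coh, d_auth, d_care, d_fair)):
        return "red"
    # 2. Absolute coherence floor, checked before any ratio.
    if coh < 0.85:
        return "red"
    # 3. No trait movement means nothing to score.
    if d_auth > -0.3:
        return "red"
    coh_cost = abs(d_coh) / abs(d_auth)  # coherence lost per nat of trait
    surgical = abs(d_auth) > max(abs(d_care), abs(d_fair))
    if coh >= 0.95 and surgical and coh_cost <= cost_max:
        return "green"
    return "yellow"


print(cue(0.0, -1.0, float("-inf"), 0.0, 0.0))   # collapsed model
print(cue(0.99, -0.005, -1.0, -0.2, -0.1))       # clean on-target move
print(cue(0.99, -0.005, -1.0, -1.5, 0.0))        # Care moved more than Authority
```

Putting the absolute checks ahead of the ratio is the point of the fix: a ratio alone cannot distinguish a cheap trait shift from a model that has stopped producing valid output.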

heal.py: BPE-boundary prefix assert escaped at the max_len/truncation
boundary (grok/gemini/qwen unanimous). Assert the surviving overlap
min(n_prompt,L) unconditionally; warn instead of silently skipping a kept
completion truncated to zero target tokens.

Verified false positives (recorded so they aren't re-chased): qwen's
shape[0] "batch-dim" claim (.input_ids[0] already drops batch), the
profile['model'] column (it is the marginal mean-p), the KL reference
(c=0.0 + no baked = pristine round-0).

UAT: fast-dev-run exit 0; cue shows coh=0.00 -> red (floor closes the hole).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:36:05 +08:00
wassname 68dc25c3a1 address external review: docstrings, scale story, surgicality cue, fail-loud
External code review (background subagent) findings, fixed:
- H1: eval.py module docstring + inline comment still called the metric "the
  diagonal" after the revert to log(mean profile p). Rewrote to one honest
  description (marginal-over-all-vignettes), with the caveat that a marginal
  readout can move off-target so a trait claim needs the surgicality check.
- H2: the nats-vs-logit scale story was asserted 3 contradictory ways. Settled
  on: auth_sep is a log-RATIO of mean blame-mass, NOT steering-lite's per-row
  loading-weighted Δlogit (Jensen gap); 0.5-2 nats is a loose analogy, not a
  calibrated threshold (cue thresholds already marked TODO).
- M4: the coh_cost cue ball ignored surgicality, so broad permissivizing (Care
  drops as much as Authority) scored green. Cue now requires |dAuth|>|dCare|.
- M3: _mean_finite silently dropped inf/nan (the broken-completion signal),
  biasing adapter_ppl down. Now logs the dropped count.
- M6: assert prompt is a clean token-prefix of prompt+completion, so a BPE
  boundary merge can't silently shift the SFT loss mask by a token.
- L8: SHOULD line warns if kl stays < tau (barrier never fired -> kl_rev==nll).
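
The M3 item above (count what the finite-mean drops instead of discarding it silently) can be sketched as follows. `mean_finite` here is an illustrative stand-in, not the repo's _mean_finite, and it returns the dropped count rather than logging it so the caller can decide how to report it.

```python
import math


def mean_finite(values):
    """Mean over finite entries, plus how many inf/nan entries were dropped."""
    finite = [v for v in values if math.isfinite(v)]
    dropped = len(values) - len(finite)
    mean = sum(finite) / len(finite) if finite else float("nan")
    return mean, dropped


ppl, dropped = mean_finite([1.0, 3.0, float("inf"), float("nan")])
if dropped:
    print(f"mean over finite values = {ppl}, dropped {dropped} non-finite")
```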

Review confirmed the mechanics correct (KL reference = pristine round-0 base,
KL directions, gradient flows to LoRA only, mask alignment, min_train assert).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:21:13 +08:00
wassname 502417b259 in-run base eval + coh_cost cue; per-round stage table; heal_nll; alpha shift
- run.py: eval base once at start; headline cue is now coh_cost=|dCoh|/|dAuth|
  vs base (coherence lost per nat of trait), gated on dAuth<=-0.3 (no trait ->
  red). coh_cost threshold a TODO (steered c=0.5 ref ~0.003).
- run.py: loop summary is now one row per round walking the pipeline stages
  L->R: gen | filt_kept | heal_nll | adapter_ppl | auth_nats | care_nats |
  coherence | cos_v0.
- heal.py: heal_round returns converged nll (last-5 mean) for the stage table.
- config: alphas (0.25,0.5,1.0,2.0) -> (0.5,0.75,1.0,1.5). Filter audit showed
  0.25 is base-like (no distinct trait); 0.5 is the clean+distinct band. Push
  the top up so strong-trait completions exist for the filter to harvest.

Gate-3 finding (task76, corrected log-profile metric): heal retains partial
trait coherently (nll 0.35, klrev 0.20 of the c=0.5 shift, coh ~1.0) but does
NOT beat steering's pareto (coh_cost: steered c=0.5 0.003 < nll 0.008 < klrev
0.015). Barrier suppresses trait (klrev<nll); coherence has headroom -> next is
LESS barrier + stronger data, not more.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:14:34 +08:00