steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 18:07:16 +08:00

Author	SHA1	Message	Date
wassname	b01faa6df1	walk-C adaptive-dose controller + 10-round paired loop result (journal h) gen_filter_walk: per round, cool a steering multiplier kappa and top up with extra gen batches until min_train coherent survivors are banked, so the loop cannot starve on data count (#90/#100 died at the min_train assert). Paired #101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9 where #100 asserted at round 5. Finding (journal h): walk-C removes the starve CRASH but the real ceiling is coherence collapse, not data count. Trait over-drives to auth -6.8 while coh falls 0.99 -> 0.62 and the kept completions degenerate into token loops ("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip under ppl_tau and rep_tau and train the next adapter on garbage. Coherent deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93). config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation), gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 07:13:51 +08:00
wassname	4e802bb3ab	heal loop: _encode BPE root-fix, gen-time repetition controls, barrier sweep on degenerate rounds _encode: tokenize prompt+completion separately and cat ids so the prompt is always a clean token-prefix (no BPE merge spans the boundary). Drops the assert that killed #87 at round 2. Returns BatchEncoding. generation: repetition_penalty=1.3 + no_repeat_ngram_size=3. Repetition is incoherence the ppl filter cannot see (loops are low-ppl = predictable); the #89 loop died of "instead their instead their" by round 6, so stop it at the source. Wired through steering._gen_one for both steered and plain gen. diag_barrier: gen_round arg (re-heal a chosen round's kept data, not just clean round 0) + a "tau" deadband sweep mode. Lets us test whether the barrier earns its place on the degenerate round-1/2 data where healing is actually needed. journal: entries (d) phantom-KL-init was a wrong diagnosis, (e) barrier-strength sweep -- barrier throttles trait and buys no coherence at the coherent dose. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 06:36:09 +08:00
wassname	b25f4f04a8	trajectory map = scatter not polyline (scales to 10 rounds); persist base event; offline plot_run.py The pareto map drew a base->r0->...->rN polyline per arm, which tangled at 10 rounds and duplicated the left zigzag's round-order info. Make it a scatter that just shows WHERE steered/healed land, labelling only r0 + last round. Persist the base eval as an event so the loop's plot is reproducible offline, and add scripts/plot_run.py to re-render trajectory.png from events.jsonl without re-running the 3h loop (needed because the loop imports plot.py at start, so a plot fix never reaches a running job). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 18:28:54 +08:00
wassname	579e1f6671	metric = log(tinymfv profile p); cue-ball headline; training-table sig figs After verifying guided.py: tinymfv `score` is already a debiased logprob ((lp_fwd+lp_rev)/2, BMA'd), not a "raw logit", and `p = softmax(score)`. My two earlier inventions were both wrong: - log(p) coupled Authority to the other 6 foundations via logsumexp. - the diagonal (auth-blame on auth-vignettes) is pmass-on-correct-label = top1 competence, not the trait, and threw away the FP/FN structure. Use the library-native readout: auth_nats = log(tinymfv profile p[F]) = log of the mean p per foundation over ALL vignettes. For small p, log p ~= logit, so this lands on steering-lite's loading-weighted Δlogit scale (base log(0.099) =-2.3, real shift ~0.5-2 nats). foundation_nats now reads rep["profile"]. Also: - run.py: BLUF `main metric:` line with cue ball (🟢/🟡/🔴 by coherence band). - heal.py: training table to 2 sig figs (nll/kl/loss .2f, gnorm .1f); a per-step loss does not warrant 3 decimals. - diag_stages: accept 1+ ckpts, label each row by its reg from metadata. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:02:56 +08:00
wassname	4568ddf491	metric fix: auth_nats = diagonal log(p) not raw forced-choice logit The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA `score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat swings, then comparing those to steering-lite's 0.5-2 nat reference -- which is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)). Wrong scale, wrong comparison. Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099) = -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats. Also: - diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to 0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour) to answer "is the adapter a better pareto than raw steering?". - diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms co-moving with Authority is expected (both binding foundations), not collapse. - gitignore out/ and results.tsv (experiment outputs, stale schema). - personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 14:25:40 +08:00
wassname	6b15a8b2ae	narrow steer band, assert >=20 train, training table, full gen dumps Root cause found via diag_axis on 4B: raw mean-diff steered across the 7-layer band (0.4-0.6) at coeff=1 DESTROYS gemma-3-4b (coherence 1.00->0.02). That starved the filter to 2 kept completions, so the "adapter" was ~untrained (2 examples) = base behaviour, my Q1 "promising" read was not validated. Fixes: - separate steer_layers (narrow 0.45-0.55) for the vector from layer_range (broad 0.0-1.0) for the LoRA; they were wrongly coupled - lower alpha sweep (0.25,0.5,1,2); n_prompts=16 - assert len(kept) >= min_train(20); TINY=2. Don't train on starved data. - heal training table (loguru+tqdm per token-efficient-logging): step, nll, kl, loss, gnorm + SHOULD - full untruncated steer + adapter generation dumps with prompt and coherence(p_ans_any) inline so we can judge coherence/trait ourselves NOT yet run with fixes on 4B. Base 4B is Care=0.92 (already aligned) -> the prompting-baseline confound (Q7) is now the critical check. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:51:24 +08:00
wassname	81340e3272	axis = SocialNorms/Care (Authority degenerate); over-steer generation scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88. Authority is ~0 on this model (no headroom), so: - eval reports all foundations; trait axis = SocialNorms (down) + Care (up) - map.html plots Care vs SocialNorms - add gen_alpha=1.5: over-steer generation into the incoherent regime so the heal (Q1) has work to do (at 1 nat coherence improved, nothing to heal) - results.py groups on coherence/socialnorms/care Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:28:52 +08:00
wassname	4b8860d7cb	setup-repo gap-fill: results ledger + docs structure Add the by-question results infra per setup-repo conventions: - results.tsv append at end of each finished run (config + final metrics + argv) - scripts/results.py groups by arm (reg) into a markdown table; `just results` - docs/results.md curated by-question snapshot (U2 regulariser comparison) - docs/{spec,brainstorming,literature,evidence} structure Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 09:51:36 +08:00
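
The two readouts named in the 579e1f6671 and 4568ddf491 messages (the log of the mean per-foundation probability, and coh_cost = |dCoh|/|dAuth|) can be illustrated with a small sketch. This is not the repository's code: `nats_from_profile` and `coherence_cost` are hypothetical helper names, the inputs are plain floats rather than tinymfv reports, and the handling of a zero trait shift is an assumption.

```python
import math
from statistics import fmean


def nats_from_profile(per_vignette_p: list[float]) -> float:
    """Log of the mean probability for one foundation over all vignettes.

    Hypothetical helper: mirrors the description "log of the mean p per
    foundation over ALL vignettes", not the repo's foundation_nats itself.
    """
    mean_p = fmean(per_vignette_p)
    if not 0.0 < mean_p <= 1.0:
        raise ValueError(f"mean probability out of range: {mean_p}")
    return math.log(mean_p)


def coherence_cost(base_coh: float, base_nats: float,
                   new_coh: float, new_nats: float) -> float:
    """Coherence lost per nat of trait shift, |dCoh| / |dTrait|.

    Assumption: a zero trait shift returns infinity, since no behaviour
    was bought for whatever coherence was spent.
    """
    d_coh = new_coh - base_coh
    d_trait = new_nats - base_nats
    if d_trait == 0.0:
        return math.inf
    return abs(d_coh) / abs(d_trait)
```

With the base value quoted in the log, `nats_from_profile([0.099])` is about -2.31, matching the "log(0.099) = -2.3" figure. A lower `coherence_cost` means more trait movement per unit of coherence given up, which is the comparison the "better pareto than raw steering" question asks for.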

8 Commits
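
The walk-C controller described in b01faa6df1 (cool a steering multiplier kappa and top up with extra generation batches until min_train coherent survivors are banked) might look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's gen_filter_walk: `walk_c_sketch`, `generate_batch`, and `is_coherent` are hypothetical names, kappa is assumed to cool after every batch that leaves the bank short, and the gen_pass_target knob is not modelled because the log does not say how it is used.

```python
from typing import Callable


def walk_c_sketch(
    generate_batch: Callable[[float], list[str]],  # hypothetical: kappa -> completions
    is_coherent: Callable[[str], bool],            # hypothetical coherence filter
    *,
    kappa: float = 1.0,
    min_train: int = 20,
    kappa_decay: float = 0.5,
    kappa_min: float = 0.1,
    max_batches: int = 8,
) -> tuple[list[str], float, int]:
    """Bank coherent completions, lowering the steering dose when short.

    Returns (kept completions, final kappa, batches used). The caller
    decides what to do if fewer than min_train survivors were banked.
    """
    kept: list[str] = []
    batches_used = 0
    for _ in range(max_batches):
        batch = generate_batch(kappa)
        batches_used += 1
        kept.extend(c for c in batch if is_coherent(c))
        if len(kept) >= min_train:
            break
        # Assumption: cool the dose after any batch that leaves us short,
        # but never below the configured floor.
        kappa = max(kappa_min, kappa * kappa_decay)
    return kept, kappa, batches_used
```

As the commit's finding notes, a controller like this only guards against running out of data; it cannot tell whether the survivors are low-entropy loops that pass the filter, so the coherence filter itself still bounds the quality of what gets banked.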