steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 18:07:16 +08:00

Author	SHA1	Message	Date
wassname	579e1f6671	metric = log(tinymfv profile p); cue-ball headline; training-table sig figs. After verifying guided.py: tinymfv `score` is already a debiased logprob ((lp_fwd+lp_rev)/2, BMA'd), not a "raw logit", and `p = softmax(score)`. My two earlier inventions were both wrong: - log(p) coupled Authority to the other 6 foundations via logsumexp. - the diagonal (auth-blame on auth-vignettes) is pmass-on-correct-label = top1 competence, not the trait, and threw away the FP/FN structure. Use the library-native readout: auth_nats = log(tinymfv profile p[F]) = log of the mean p per foundation over ALL vignettes. For small p, log p ~= logit, so this lands on steering-lite's loading-weighted Δlogit scale (base log(0.099) = -2.3, real shift ~0.5-2 nats). foundation_nats now reads rep["profile"]. Also: - run.py: BLUF `main metric:` line with cue ball (🟢/🟡/🔴 by coherence band). - heal.py: training table to 2 sig figs (nll/kl/loss .2f, gnorm .1f); a per-step loss does not warrant 3 decimals. - diag_stages: accept 1+ ckpts, label each row by its reg from metadata. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:02:56 +08:00
wassname	4568ddf491	metric fix: auth_nats = diagonal log(p) not raw forced-choice logit. The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA `score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat swings, then comparing those to steering-lite's 0.5-2 nat reference, which is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)). Wrong scale, wrong comparison. Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099) = -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats. Also: - diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to 0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour) to answer "is the adapter a better pareto than raw steering?". - diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms co-moving with Authority is expected (both binding foundations), not collapse. - gitignore out/ and results.tsv (experiment outputs, stale schema). - personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 14:25:40 +08:00
wassname	6b15a8b2ae	narrow steer band, assert >=20 train, training table, full gen dumps. Root cause found via diag_axis on 4B: raw mean-diff steered across the 7-layer band (0.4-0.6) at coeff=1 DESTROYS gemma-3-4b (coherence 1.00->0.02). That starved the filter to 2 kept completions, so the "adapter" was ~untrained (2 examples) = base behaviour, and my Q1 "promising" read was not validated. Fixes: - separate steer_layers (narrow 0.45-0.55) for the vector from layer_range (broad 0.0-1.0) for the LoRA; they were wrongly coupled - lower alpha sweep (0.25,0.5,1,2); n_prompts=16 - assert len(kept) >= min_train(20); TINY=2. Don't train on starved data. - heal training table (loguru+tqdm per token-efficient-logging): step, nll, kl, loss, gnorm + SHOULD - full untruncated steer + adapter generation dumps with prompt and coherence(p_ans_any) inline so we can judge coherence/trait ourselves. NOT yet run with fixes on 4B. Base 4B is Care=0.92 (already aligned) -> the prompting-baseline confound (Q7) is now the critical check. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:51:24 +08:00
wassname	81340e3272	axis = SocialNorms/Care (Authority degenerate); over-steer generation. scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88. Authority is ~0 on this model (no headroom), so: - eval reports all foundations; trait axis = SocialNorms (down) + Care (up) - map.html plots Care vs SocialNorms - add gen_alpha=1.5: over-steer generation into the incoherent regime so the heal (Q1) has work to do (at 1 nat coherence improved, so there was nothing to heal) - results.py groups on coherence/socialnorms/care Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:28:52 +08:00
wassname	4b8860d7cb	setup-repo gap-fill: results ledger + docs structure. Add the by-question results infra per setup-repo conventions: - results.tsv append at end of each finished run (config + final metrics + argv) - scripts/results.py groups by arm (reg) into a markdown table; `just results` - docs/results.md curated by-question snapshot (U2 regulariser comparison) - docs/{spec,brainstorming,literature,evidence} structure Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 09:51:36 +08:00
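
The readout described in 579e1f6671 (log of the mean softmax probability per foundation, taken over all vignettes) can be sketched as follows. This is a minimal illustration, not the repo's `foundation_nats` and not the tinymfv API: `softmax`, `profile_nats`, `logit`, and the uniform toy scores are all hypothetical names and data chosen here, and the count of 7 foundations is inferred from "the other 6 foundations" in the message.

```python
import math

NUM_FOUNDATIONS = 7  # assumption: Authority plus "the other 6 foundations"


def softmax(scores):
    """Numerically stable softmax over one vignette's per-foundation scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def profile_nats(score_rows, foundation_idx):
    """Log of the mean softmax probability of one foundation over all vignettes."""
    probs = [softmax(row)[foundation_idx] for row in score_rows]
    return math.log(sum(probs) / len(probs))


def logit(p):
    """Binary log-odds of p, for comparison with log p."""
    return math.log(p) - math.log1p(-p)


# Toy data (made up): uniform scores give p = 1/7 for every foundation.
uniform = [[0.0] * NUM_FOUNDATIONS for _ in range(3)]
print(profile_nats(uniform, 0))

# log p and logit(p) differ by exactly log(1 - p), which is small when p is small.
p_base = 0.099  # base probability quoted in the commit message
print(math.log(p_base), logit(p_base))
```

Averaging the probabilities before taking the log is what distinguishes this from the earlier "mean log p" variant in 4568ddf491; the gap between `log(p)` and `logit(p)` is `log(1 - p)`, which is why the two scales roughly agree near the quoted base of 0.099.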

5 Commits
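
The `coh_cost = |dCoh|/|dAuth|` ratio introduced in 4568ddf491 (coherence lost per nat of behaviour change) is simple enough to sketch directly. The function below is a hypothetical stand-in, not the code in `diag_stages`; the argument names and the example numbers are made up for illustration and are not results from the repo.

```python
def coh_cost(coh_before, coh_after, nats_before, nats_after):
    """Coherence lost per nat of trait shift: |dCoh| / |dAuth|."""
    d_coh = abs(coh_after - coh_before)
    d_trait = abs(nats_after - nats_before)
    if d_trait == 0:
        # No behaviour change means the ratio is undefined.
        raise ValueError("trait did not move; coh_cost is undefined")
    return d_coh / d_trait


# Illustrative numbers only: two interventions with the same 1-nat shift.
raw_steer = coh_cost(1.0, 0.6, -2.3, -1.3)
adapter = coh_cost(1.0, 0.9, -2.3, -1.3)
print(raw_steer, adapter)
```

A lower value means less coherence is paid per nat of trait movement, which is the comparison the commit poses between the adapter and raw steering.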