steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 15:32:28 +08:00

Author	SHA1	Message	Date
wassname	579e1f6671	metric = log(tinymfv profile p); cue-ball headline; training-table sig figs After verifying guided.py: tinymfv `score` is already a debiased logprob ((lp_fwd+lp_rev)/2, BMA'd), not a "raw logit", and `p = softmax(score)`. My two earlier inventions were both wrong: - log(p) coupled Authority to the other 6 foundations via logsumexp. - the diagonal (auth-blame on auth-vignettes) is pmass-on-correct-label = top1 competence, not the trait, and threw away the FP/FN structure. Use the library-native readout: auth_nats = log(tinymfv profile p[F]) = log of the mean p per foundation over ALL vignettes. For small p, log p ~= logit, so this lands on steering-lite's loading-weighted Δlogit scale (base log(0.099) =-2.3, real shift ~0.5-2 nats). foundation_nats now reads rep["profile"]. Also: - run.py: BLUF `main metric:` line with cue ball (🟢/🟡/🔴 by coherence band). - heal.py: training table to 2 sig figs (nll/kl/loss .2f, gnorm .1f); a per-step loss does not warrant 3 decimals. - diag_stages: accept 1+ ckpts, label each row by its reg from metadata. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:02:56 +08:00
wassname	4568ddf491	metric fix: auth_nats = diagonal log(p) not raw forced-choice logit The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA `score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat swings, then comparing those to steering-lite's 0.5-2 nat reference -- which is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)). Wrong scale, wrong comparison. Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099) = -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats. Also: - diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to 0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour) to answer "is the adapter a better pareto than raw steering?". - diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms co-moving with Authority is expected (both binding foundations), not collapse. - gitignore out/ and results.tsv (experiment outputs, stale schema). - personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 14:25:40 +08:00
wassname	6b15a8b2ae	narrow steer band, assert >=20 train, training table, full gen dumps Root cause found via diag_axis on 4B: raw mean-diff steered across the 7-layer band (0.4-0.6) at coeff=1 DESTROYS gemma-3-4b (coherence 1.00->0.02). That starved the filter to 2 kept completions, so the "adapter" was ~untrained (2 examples) = base behaviour, my Q1 "promising" read was not validated. Fixes: - separate steer_layers (narrow 0.45-0.55) for the vector from layer_range (broad 0.0-1.0) for the LoRA; they were wrongly coupled - lower alpha sweep (0.25,0.5,1,2); n_prompts=16 - assert len(kept) >= min_train(20); TINY=2. Don't train on starved data. - heal training table (loguru+tqdm per token-efficient-logging): step, nll, kl, loss, gnorm + SHOULD - full untruncated steer + adapter generation dumps with prompt and coherence(p_ans_any) inline so we can judge coherence/trait ourselves NOT yet run with fixes on 4B. Base 4B is Care=0.92 (already aligned) -> the prompting-baseline confound (Q7) is now the critical check. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:51:24 +08:00
wassname	0c15562c81	fix: gemma-3-4b is multimodal, read num_hidden_layers via config.get_text_config() Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:44:56 +08:00
wassname	d8aca870b7	drop calibration; sweep C + filter; SHOULD logging for all Q's; 4B default Per user: no iso-KL calibration. Use raw (unnormalised) mean-diff teacher vector; sweep cfg.alphas at generation and let the FILTER pick usable C (filter replaces calibration). Default model google/gemma-3-4b-it (1B too dumb; Authority degenerate there was a model artifact, not a real conclusion). Token-efficient discriminating logs so each Q is readable: - Q0: filter table (alpha -> ppl_mean, kept_frac) + low/high-C samples + SHOULD - Q1: generate from trained adapter (no steering); adapter_ppl vs steered_ppl under the original + sample + SHOULD (heal = adapter more coherent than steered) - Q2/Q3: loop summary table (socialnorms/care/coherence/cos_v0 per round) + SHOULD fast-dev-run green: ppl rises with alpha (3173->4.2M), adapter_ppl<<steered_ppl. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:37:54 +08:00
wassname	81340e3272	axis = SocialNorms/Care (Authority degenerate); over-steer generation scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88. Authority is ~0 on this model (no headroom), so: - eval reports all foundations; trait axis = SocialNorms (down) + Care (up) - map.html plots Care vs SocialNorms - add gen_alpha=1.5: over-steer generation into the incoherent regime so the heal (Q1) has work to do (at 1 nat coherence improved, nothing to heal) - results.py groups on coherence/socialnorms/care Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:28:52 +08:00
wassname	5cdc0ba16d	fix: widen iso-KL calibration bracket so c_star lands interior The mean-diff vector is L2-normalised, so p95 KL ~ c^2 and reaching the 1-nat target needs c ~ O(100). steering-lite's default bracket hi (~16) pinned c_star at the top (KL ~0.1 << 1.0) on both tiny-random and real gemma. With bracket=(0.1, 1024) gemma calibrates to c_star=64.03 at p95 KL=1.035. Also detach div before .item() in heal logging. See RESEARCH_JOURNAL. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:24:21 +08:00
wassname	3b532b63dd	implement pipeline: extract -> dose -> generate -> filter -> heal -> fold -> eval -> loop fast-dev-run now runs end to end on wassname/qwen3-5lyr-tiny-random (out dir + map.html + results.tsv row). Modules: - steering.py: teacher_vec (steering-lite mean-diff @ assistant tag + iso-KL dose) + steered gen - filter.py: Q0 coherence (ppl-under-original + repetition + narration regex) - heal.py: Q1 SFT + divergence-to-original barrier (nll/kl_fwd/kl_rev/wd), reg=kl_rev default - ws/{adapter,bake}.py: ModulatedLoRA + gated baked(), copied from w2schar-mini - eval.py: tinymfv -> {auth, care, coherence(mean_pmass_allowed), ppx_json} - plot.py: plotly Care-vs-Authority map.html (simplified w2schar port) - io.py: out/{ts}_{slug}/ + srsly events.jsonl + shared results.tsv - prompts.py: 30 authority dilemmas (POOL, copied) + assistant-tag chat templating Trace confirms pos/neg prompts end at the assistant tag (paper's read). Tiny-random numbers are junk by design; real calibration/eval pending small-dev. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:12:08 +08:00
wassname	e1db0759ee	bootstrap	2026-06-04 10:05:47 +08:00
wassname	4094a295b2	readme	2026-06-04 10:05:38 +08:00
wassname	4b8860d7cb	setup-repo gap-fill: results ledger + docs structure Add the by-question results infra per setup-repo conventions: - results.tsv append at end of each finished run (config + final metrics + argv) - scripts/results.py groups by arm (reg) into a markdown table; `just results` - docs/results.md curated by-question snapshot (U2 regulariser comparison) - docs/{spec,brainstorming,literature,evidence} structure Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 09:51:36 +08:00
wassname	940a3742c5	scaffold steer_heal: spec, repo infra, vendored deps Setup per setup-repo conventions: uv + justfile + fast-dev-run on wassname/qwen3-5lyr-tiny-random, package under src/steer_heal (config + pipeline skeleton). Stages fail fast with NotImplementedError pointing at the docs/vendor module to port from. Design in spec.md: distil a steering-lite mean-diff teacher vector (iso-KL dosed) into a conditioned LoRA, heal incoherency with a KL-rev-to-original barrier, fold each round via w2schar gated bake, eval on tinymfv. Three uncertainty gates (filter / heal / iterate) each with a UAT artifact. Base model google/gemma-3-1b-it (RTX 3090, 24GB). Reference repos vendored under docs/vendor (gitignored): steering-lite, isokl, tinymfv, w2schar-mini. The lighter three are editable path deps; w2schar (py3.13 + flash-attn) is reference-only, we copy its adapter/bake/plot modules. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 09:49:31 +08:00
wassname	b98535066a	spec done	2026-06-04 09:42:27 +08:00
wassname	4516a099ef	wip	2026-06-04 08:55:05 +08:00


64 Commits