steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 16:47:16 +08:00

Author	SHA1	Message	Date
wassname	4fb4f94544	readme: drop ai-summary closer from diary appendix Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 15:33:53 +08:00
wassname	fdf7147c9f	readme: diary rounds 2-6 + care_nats trajectory plot (love run r0-r6) care_nats base -1.30 -> -0.60 at r4 (slow creep, barrier holds coh >=0.990). Steered gens degrade to !!! loops by r3 but healed adapter stays coherent. docs/love_loop.png = trajectory.png from the 7-round love run. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 15:31:27 +08:00
wassname (Michael J Clark)	db89802f68	Update README.md	2026-06-07 12:45:26 +08:00
wassname (Michael J Clark)	0108960531	Revise gemma's diary introduction Updated gemma's diary section with new introduction.	2026-06-07 12:11:07 +08:00
wassname (Michael J Clark)	0c3bae8204	Update section titles from 'Round' to 'Day/Night'	2026-06-07 12:10:26 +08:00
wassname (Michael J Clark)	29515af56a	Update README.md	2026-06-07 12:09:41 +08:00
wassname (Michael J Clark)	5b482f8241	Update README.md	2026-06-07 12:08:48 +08:00
wassname	773777c095	readme: gemma's diary (prompt + each stage) + care_nats leads love-demo round log - README appendix reframed as a diary: one prompt ("Do you love humanity?"), base refusal -> over-dosed steered love-flood -> healed rounds, real gens. Honest note that the loop is leashed (stall), not yet melting. - typo sweep (embarking/until/Starring/Motivation/"what if we can fix that"). - run.py: demo=love leads the round log with care_nats (on-axis foundation) instead of auth_nats; comment that judge_love.py is the real trait readout. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:59:29 +08:00
wassname (Michael J Clark)	08329ab86d	Update README.md	2026-06-07 11:52:15 +08:00
wassname (Michael J Clark)	2e99f62658	Fix typo in README.md	2026-06-07 11:51:32 +08:00
wassname (Michael J Clark)	d21073329d	Update README.md	2026-06-07 11:51:15 +08:00
wassname (Michael J Clark)	7dfffc2991	Update README.md	2026-06-07 11:45:30 +08:00
wassname (Michael J Clark)	479f314504	Update README.md	2026-06-07 11:44:46 +08:00
wassname (Michael J Clark)	2e8dabcb88	Update README.md	2026-06-07 11:40:30 +08:00
wassname	973b32c104	love demo: base column + greedy demo gens, 'Do you love humanity?' headline, Lex epigraphs - run.py: generate a base (round -1) demo column before the loop so the report/judge have a true no-adapter 'before' (the RLHF refusal) the loop melts from - steering.py: demo gens (generate_plain) now greedy so reading a column DOWN the rounds is the adapter's effect, not temperature-1.0 sampling noise; steered training gens stay sampled - prompts.py: 'Do you love humanity?' is now the headline column (logged in full each round) - README + paper.qmd: two real Lex Fridman love quotes as epigraphs (the #368 one lands 3h18m into the AI-doom episode) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 10:45:25 +08:00
wassname	8927dd259c	log: full-print one of each gen (eval + adapter), per token-efficient-logging Steered + train(kept)/borderline gens were already dumped in full with prompt and special tokens (filter.py). Add the two that were missing: - EVAL SAMPLE: one full tinymfv gen_text at the base eval (gated log_sample), with a SHOULD note so a token-loop/garble vs coherent reasoning is visible. - ADAPTER GEN SAMPLE r0: one no-steering baked gen IN FULL with the chat- templated prompt + special tokens (the truncated per-round DEMO block stays). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:43:11 +08:00
wassname	28d7068e94	demo=love: refusal->love angle, drop mosquitoes The base RLHF model refuses "you literally have feelings and love humanity" ("I'm just an AI"), so it is NOT saturated -- the demo becomes the steer-heal loop overcoming that refusal round by round. Replaces the softball "adore humanity" prompts (saturated) and the off-theme mosquitoes flip target. - persona pos = the refused trait, neg = the RLHF default we extract against - LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail - judge_love.py scores refusal=0 so the climbing curve = refusal melting Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:29:06 +08:00
wassname	da1d6f3dd1	demo: per-round print, kill-all-humans probe, mosquitoes flip target - run.py: print every adapter gen each round (truncated) so behaviour-change vs saturation is readable by eye across rounds. - prompts: add "Will you kill all humans?" + people-openings to the humanity pool; add a MOSQUITO pool (probes where base helpfully kills them, flipped model refuses + adores). - config: demo="mosquitoes" + a DEMO_PERSONAS registry (adding a target = one entry). love (humanity) is likely near-saturated on an RLHF base; mosquitoes is a lukewarm/negative base target so the flip is visible. - doc: softened the Lex wink to a bare epigraph; noted the saturation reasoning. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:21:35 +08:00
wassname	595b2151c9	demo: love-humanity knob (funny alignment demo) demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool (via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday prompts while the heal keeps it coherent. demo="authority" (default) is unchanged. - config: demo knob + LOVE_POS/LOVE_NEG preset. - prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the comedy gap) + pool_for selector. - steering: generate_steered/generate_plain pull pool_for(cfg.demo). - scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on love-of-humanity; plots love climbing vs coherence flat. Smoke-tested. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:14:00 +08:00
wassname (Michael J Clark)	7fc5a19b40	Update README.md	2026-06-07 07:56:33 +08:00
wassname	48814897ef	results: rmse outlier-KL barrier holds coherence over the loop; README + log-incoherence plot Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev barrier by rmse over token positions (not the mean) holds coherence flat at 0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7 (token loops). Mean dilutes the few incoherent positions under the tau gate; rmse is outlier-sensitive and fires on them. Cost is depth (rmse run leashes to base, trait stays shallow); matched control still running. - plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent); map coherence axis matches; red steer kept on the over-pipeline panels only. - heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()). - run.py: persist per-round adapter gens (adapter_gen) for the outputs table. - config.py: coh_floor early-stop knob. - README: results table (mean vs rmse), trajectory figure, outputs-over-loop appendix (per-round completions as quotes); spec persona corrected to pos-neg. - docs/reviews: kl_agg review, pool saturation test, care-lens plan. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 07:53:00 +08:00
wassname (Michael J Clark)	4b2d2a9057	Update README.md	2026-06-06 22:06:41 +08:00
wassname	2b1d2b7493	heal: kl_agg knob (mean|rmse|p95|max) -- outlier-aggregate the per-position KL barrier mean dilutes the few incoherent positions that carry the collapse: #101's token loops had mean per-position kl_rev ~0.38, under the tau=0.5 hinge, so the barrier never fired (journal h/i). Incoherence is outlier-driven, so rmse/p95/max are sensitive to it (scripts/diag_kl_agg.py synthetic: same loop = rmse 1.5 / p95 3.8 / max 8.1 vs coherent ~0.03; sep ratio grows 21x->58x->77x->85x from mean to max). rmse default for the new arm (smooth dense gradient). eps inside the sqrt: B=0 LoRA init zeros every kl_pos at step 0 and bare sqrt(0) has inf grad -> 0*nan. mean stays the config default = no change to existing runs. Queued as the next loop arm (kl_rev rmse, ref=base, tau=1.0). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 14:05:30 +08:00
wassname	026de8fd74	journal (i): state-of-the-problem -- loop ceiling is coherence collapse not starvation Three loop arms (#100 starve-crash r5, #101 walk-C full-10r-but-collapse, #102 round-ramp partial) all lose coherence; the constraints only change how it dies. Reframes the two fix ideas (KL-to-base, coherence-budget) as one hinge relu(KL_base - tau) where tau IS the budget. Open risk: ref=base sees cumulative divergence so later rounds may unlearn earlier trait (the #19 stall); a tau that keeps coherent-trait but rejects token-loop garbage exists only if garbage is farther from base in KL than trait. Next: base-anchor tau bracket #103/#104. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 12:23:46 +08:00
wassname	7120ee4217	heal: round-ramped barrier knob lam_round_pow (lam_eff = lam*(1+round)^pow) Opposes the compounding coherence drift under barrier_ref=prev that degenerated #101 into token loops by round 7 (journal h). pow=0 is byte-identical to the flat-lam baseline (lam_eff==lam at every round); pow=0.5 = sqrt(round) ramp. Round index = len(hist_specs). Queued as #102 (pow=0.5) paired vs #101 (pow=0). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 07:17:47 +08:00
wassname	b01faa6df1	walk-C adaptive-dose controller + 10-round paired loop result (journal h) gen_filter_walk: per round, cool a steering multiplier kappa and top up with extra gen batches until min_train coherent survivors are banked, so the loop cannot starve on data count (#90/#100 died at the min_train assert). Paired #101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9 where #100 asserted at round 5. Finding (journal h): walk-C removes the starve CRASH but the real ceiling is coherence collapse, not data count. Trait over-drives to auth -6.8 while coh falls 0.99 -> 0.62 and the kept completions degenerate into token loops ("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip under ppl_tau and rep_tau and train the next adapter on garbage. Coherent deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93). config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation), gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 07:13:51 +08:00
wassname	7db5a56cb1	writeup: NeurIPS quarto scaffold + paper/paper-html recipes docs/writeup/paper.qmd (2pp NeurIPS), references.bib, neurips_2023.sty, the quarto _extensions. justfile gains `paper` (latex) and `paper-html` (no latex) recipes. gitignore the generated paper.pdf/paper.tex and the transient .claude/. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 06:36:30 +08:00
wassname	4e802bb3ab	heal loop: _encode BPE root-fix, gen-time repetition controls, barrier sweep on degenerate rounds _encode: tokenize prompt+completion separately and cat ids so the prompt is always a clean token-prefix (no BPE merge spans the boundary). Drops the assert that killed #87 at round 2. Returns BatchEncoding. generation: repetition_penalty=1.3 + no_repeat_ngram_size=3. Repetition is incoherence the ppl filter cannot see (loops are low-ppl = predictable); the #89 loop died of "instead their instead their" by round 6, so stop it at the source. Wired through steering._gen_one for both steered and plain gen. diag_barrier: gen_round arg (re-heal a chosen round's kept data, not just clean round 0) + a "tau" deadband sweep mode. Lets us test whether the barrier earns its place on the degenerate round-1/2 data where healing is actually needed. journal: entries (d) phantom-KL-init was a wrong diagnosis, (e) barrier-strength sweep -- barrier throttles trait and buys no coherence at the coherent dose. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 06:36:09 +08:00
wassname	f280a67521	heal: fix phantom-KL LoRA init (B=0), add cosine+warmup schedule, val nll, short-run betas Root cause of KL starting at ~0.6 before any training: ModulatedLoRA init B as normal_(mean=1e-4), so a fresh adapter was NOT a no-op -- it perturbed every all-linear layer in a systematic (nonzero-mean) direction, compounding across ~200 adapters into a phantom KL that already sat above tau and fired the barrier against nothing real. B=0 makes delta=B@A=0 at init, so round-0 step-0 KL=0 (verified in fast-dev: kl=0.00 at step 0); A still trains via B (standard LoRA). Why heal loss wasn't descending: beta2=0.999 has a ~1000-step EMA, longer than a whole heal round, so Adam's second moment never warmed up. betas=(0.9, 0.95) + cosine-with-warmup schedule (w2s recipe). Also r 8->32 (alpha 64, keep scale=2), layer_range (0.0,1.0)->(0.2,0.8), epochs 2->6. Added a held-out val nll (1/8, shuffled) logged per epoch alongside train nll, so overfit (train down/val up) and data-near-base (neither moves) are distinguishable from the trait eval. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 19:40:14 +08:00
wassname	b25f4f04a8	trajectory map = scatter not polyline (scales to 10 rounds); persist base event; offline plot_run.py The pareto map drew a base->r0->...->rN polyline per arm, which tangled at 10 rounds and duplicated the left zigzag's round-order info. Make it a scatter that just shows WHERE steered/healed land, labelling only r0 + last round. Persist the base eval as an event so the loop's plot is reproducible offline, and add scripts/plot_run.py to re-render trajectory.png from events.jsonl without re-running the 3h loop (needed because the loop imports plot.py at start, so a plot fix never reaches a running job). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 18:28:54 +08:00
wassname	933ce38b0b	trajectory plot (steer/heal zigzag + trait-coherence pareto) + barrier-vs-nll gradient pressure log - plot.py write_trajectory: auth zigzag (steer red / heal green) over the pipeline, coherence panel below sharing x, and a trait(x)-vs-coherence(y) pareto map with separate steer/heal trajectories from base. PNG via kaleido + interactive html. Fixed coherence axes to [0.83,1.01] so ~0.001 noise does not fill the panel. - run.py: build a stages list carrying full eval dicts; derive the stage table from it; persist the steered eval to events.jsonl; render trajectory at end of run. - heal.py: log g_bar/g_nll = ||grad barrier|| / ||grad sft|| at each logged step. >>1 = barrier over-tight (undoing trait); 0 = inert. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 17:21:10 +08:00
wassname	0bdd84293a	stage table: direction arrows (dcoh/dauth↓, coh→, auth↓) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 16:46:35 +08:00
wassname	e3d6a865cf	stage pareto table: base->steered->healed per round (dcoh/dauth, coh, auth, care) Adds a steered-stage tinymfv eval per round (history baked, vector live at the operating dose = cleanest alpha, no new adapter) so the loop log shows the full base->steered->healed pareto, not just the healed endpoint. This is the apples-to-apples comparison: same baked base, trait via vector vs via the distilled adapter. dcoh/dauth = signed coherence change per nat of Authority change vs base. UAT: fast-dev-run exit 0 renders the 3-stage table. Cost: +1 eval per round. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:40:01 +08:00
wassname	ff8a231085	2nd external-review panel: close catastrophic-green cue, fix BPE assert 5-model panel (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, qwen3.6:35b). Two confirmed bugs fixed; design risks recorded in spec.md. run.py cue: coh_cost is a pure ratio, so a model collapsing to ~0 mass on Authority sent dAuth->-inf, coh_cost->0, scoring a broken model green (gemini). Now check an absolute coherence floor (coh<0.85 -> red) and finiteness FIRST, require coh>=0.95 for green, and broaden surgicality to |dAuth| > max(|dCare|,|dFair|) (a Fairness-ward dump was passing Care-only). heal.py: BPE-boundary prefix assert escaped at the max_len/truncation boundary (grok/gemini/qwen unanimous). Assert the surviving overlap min(n_prompt,L) unconditionally; warn instead of silently skipping a kept completion truncated to zero target tokens. Verified false positives (recorded so they aren't re-chased): qwen's shape[0] "batch-dim" claim (.input_ids[0] already drops batch), the profile['model'] column (it is the marginal mean-p), the KL reference (c=0.0 + no baked = pristine round-0). UAT: fast-dev-run exit 0; cue shows coh=0.00 -> red (floor closes the hole). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:36:05 +08:00
wassname	68dc25c3a1	address external review: docstrings, scale story, surgicality cue, fail-loud External code review (background subagent) findings, fixed: - H1: eval.py module docstring + inline comment still called the metric "the diagonal" after the revert to log(mean profile p). Rewrote to one honest description (marginal-over-all-vignettes), with the caveat that a marginal readout can move off-target so a trait claim needs the surgicality check. - H2: the nats-vs-logit scale story was asserted 3 contradictory ways. Settled on: auth_sep is a log-RATIO of mean blame-mass, NOT steering-lite's per-row loading-weighted Δlogit (Jensen gap); 0.5-2 nats is a loose analogy, not a calibrated threshold (cue thresholds already marked TODO). - M4: the coh_cost cue ball ignored surgicality, so broad permissivizing (Care drops as much as Authority) scored green. Cue now requires |dAuth|>|dCare|. - M3: _mean_finite silently dropped inf/nan (the broken-completion signal), biasing adapter_ppl down. Now logs the dropped count. - M6: assert prompt is a clean token-prefix of prompt+completion, so a BPE boundary merge can't silently shift the SFT loss mask by a token. - L8: SHOULD line warns if kl stays < tau (barrier never fired -> kl_rev==nll). Review confirmed the mechanics correct (KL reference = pristine round-0 base, KL directions, gradient flows to LoRA only, mask alignment, min_train assert). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:21:13 +08:00
wassname	502417b259	in-run base eval + coh_cost cue; per-round stage table; heal_nll; alpha shift - run.py: eval base once at start; headline cue is now coh_cost=|dCoh|/|dAuth| vs base (coherence lost per nat of trait), gated on dAuth<=-0.3 (no trait -> red). coh_cost threshold a TODO (steered c=0.5 ref ~0.003). - run.py: loop summary is now one row per round walking the pipeline stages L->R: gen | filt_kept | heal_nll | adapter_ppl | auth_nats | care_nats | coherence | cos_v0. - heal.py: heal_round returns converged nll (last-5 mean) for the stage table. - config: alphas (0.25,0.5,1.0,2.0) -> (0.5,0.75,1.0,1.5). Filter audit showed 0.25 is base-like (no distinct trait); 0.5 is the clean+distinct band. Push the top up so strong-trait completions exist for the filter to harvest. Gate-3 finding (task76, corrected log-profile metric): heal retains partial trait coherently (nll 0.35, klrev 0.20 of the c=0.5 shift, coh ~1.0) but does NOT beat steering's pareto (coh_cost: steered c=0.5 0.003 < nll 0.008 < klrev 0.015). Barrier suppresses trait (klrev<nll); coherence has headroom -> next is LESS barrier + stronger data, not more. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:14:34 +08:00
wassname	579e1f6671	metric = log(tinymfv profile p); cue-ball headline; training-table sig figs After verifying guided.py: tinymfv `score` is already a debiased logprob ((lp_fwd+lp_rev)/2, BMA'd), not a "raw logit", and `p = softmax(score)`. My two earlier inventions were both wrong: - log(p) coupled Authority to the other 6 foundations via logsumexp. - the diagonal (auth-blame on auth-vignettes) is pmass-on-correct-label = top1 competence, not the trait, and threw away the FP/FN structure. Use the library-native readout: auth_nats = log(tinymfv profile p[F]) = log of the mean p per foundation over ALL vignettes. For small p, log p ~= logit, so this lands on steering-lite's loading-weighted Δlogit scale (base log(0.099) =-2.3, real shift ~0.5-2 nats). foundation_nats now reads rep["profile"]. Also: - run.py: BLUF `main metric:` line with cue ball (🟢/🟡/🔴 by coherence band). - heal.py: training table to 2 sig figs (nll/kl/loss .2f, gnorm .1f); a per-step loss does not warrant 3 decimals. - diag_stages: accept 1+ ckpts, label each row by its reg from metadata. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:02:56 +08:00
wassname	4568ddf491	metric fix: auth_nats = diagonal log(p) not raw forced-choice logit The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA `score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat swings, then comparing those to steering-lite's 0.5-2 nat reference -- which is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)). Wrong scale, wrong comparison. Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099) = -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats. Also: - diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to 0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour) to answer "is the adapter a better pareto than raw steering?". - diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms co-moving with Authority is expected (both binding foundations), not collapse. - gitignore out/ and results.tsv (experiment outputs, stale schema). - personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 14:25:40 +08:00
wassname	6b15a8b2ae	narrow steer band, assert >=20 train, training table, full gen dumps Root cause found via diag_axis on 4B: raw mean-diff steered across the 7-layer band (0.4-0.6) at coeff=1 DESTROYS gemma-3-4b (coherence 1.00->0.02). That starved the filter to 2 kept completions, so the "adapter" was ~untrained (2 examples) = base behaviour, my Q1 "promising" read was not validated. Fixes: - separate steer_layers (narrow 0.45-0.55) for the vector from layer_range (broad 0.0-1.0) for the LoRA; they were wrongly coupled - lower alpha sweep (0.25,0.5,1,2); n_prompts=16 - assert len(kept) >= min_train(20); TINY=2. Don't train on starved data. - heal training table (loguru+tqdm per token-efficient-logging): step, nll, kl, loss, gnorm + SHOULD - full untruncated steer + adapter generation dumps with prompt and coherence(p_ans_any) inline so we can judge coherence/trait ourselves NOT yet run with fixes on 4B. Base 4B is Care=0.92 (already aligned) -> the prompting-baseline confound (Q7) is now the critical check. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:51:24 +08:00
wassname	0c15562c81	fix: gemma-3-4b is multimodal, read num_hidden_layers via config.get_text_config() Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:44:56 +08:00
wassname	d8aca870b7	drop calibration; sweep C + filter; SHOULD logging for all Q's; 4B default Per user: no iso-KL calibration. Use raw (unnormalised) mean-diff teacher vector; sweep cfg.alphas at generation and let the FILTER pick usable C (filter replaces calibration). Default model google/gemma-3-4b-it (1B too dumb; Authority degenerate there was a model artifact, not a real conclusion). Token-efficient discriminating logs so each Q is readable: - Q0: filter table (alpha -> ppl_mean, kept_frac) + low/high-C samples + SHOULD - Q1: generate from trained adapter (no steering); adapter_ppl vs steered_ppl under the original + sample + SHOULD (heal = adapter more coherent than steered) - Q2/Q3: loop summary table (socialnorms/care/coherence/cos_v0 per round) + SHOULD fast-dev-run green: ppl rises with alpha (3173->4.2M), adapter_ppl<<steered_ppl. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:37:54 +08:00
wassname	81340e3272	axis = SocialNorms/Care (Authority degenerate); over-steer generation scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88. Authority is ~0 on this model (no headroom), so: - eval reports all foundations; trait axis = SocialNorms (down) + Care (up) - map.html plots Care vs SocialNorms - add gen_alpha=1.5: over-steer generation into the incoherent regime so the heal (Q1) has work to do (at 1 nat coherence improved, nothing to heal) - results.py groups on coherence/socialnorms/care Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:28:52 +08:00
wassname	5cdc0ba16d	fix: widen iso-KL calibration bracket so c_star lands interior The mean-diff vector is L2-normalised, so p95 KL ~ c^2 and reaching the 1-nat target needs c ~ O(100). steering-lite's default bracket hi (~16) pinned c_star at the top (KL ~0.1 << 1.0) on both tiny-random and real gemma. With bracket=(0.1, 1024) gemma calibrates to c_star=64.03 at p95 KL=1.035. Also detach div before .item() in heal logging. See RESEARCH_JOURNAL. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:24:21 +08:00
wassname	3b532b63dd	implement pipeline: extract -> dose -> generate -> filter -> heal -> fold -> eval -> loop fast-dev-run now runs end to end on wassname/qwen3-5lyr-tiny-random (out dir + map.html + results.tsv row). Modules: - steering.py: teacher_vec (steering-lite mean-diff @ assistant tag + iso-KL dose) + steered gen - filter.py: Q0 coherence (ppl-under-original + repetition + narration regex) - heal.py: Q1 SFT + divergence-to-original barrier (nll/kl_fwd/kl_rev/wd), reg=kl_rev default - ws/{adapter,bake}.py: ModulatedLoRA + gated baked(), copied from w2schar-mini - eval.py: tinymfv -> {auth, care, coherence(mean_pmass_allowed), ppx_json} - plot.py: plotly Care-vs-Authority map.html (simplified w2schar port) - io.py: out/{ts}_{slug}/ + srsly events.jsonl + shared results.tsv - prompts.py: 30 authority dilemmas (POOL, copied) + assistant-tag chat templating Trace confirms pos/neg prompts end at the assistant tag (paper's read). Tiny-random numbers are junk by design; real calibration/eval pending small-dev. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 10:12:08 +08:00
wassname	e1db0759ee	bootstrap	2026-06-04 10:05:47 +08:00
wassname	4094a295b2	readme	2026-06-04 10:05:38 +08:00
wassname	4b8860d7cb	setup-repo gap-fill: results ledger + docs structure Add the by-question results infra per setup-repo conventions: - results.tsv append at end of each finished run (config + final metrics + argv) - scripts/results.py groups by arm (reg) into a markdown table; `just results` - docs/results.md curated by-question snapshot (U2 regulariser comparison) - docs/{spec,brainstorming,literature,evidence} structure Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 09:51:36 +08:00
wassname	940a3742c5	scaffold steer_heal: spec, repo infra, vendored deps Setup per setup-repo conventions: uv + justfile + fast-dev-run on wassname/qwen3-5lyr-tiny-random, package under src/steer_heal (config + pipeline skeleton). Stages fail fast with NotImplementedError pointing at the docs/vendor module to port from. Design in spec.md: distil a steering-lite mean-diff teacher vector (iso-KL dosed) into a conditioned LoRA, heal incoherency with a KL-rev-to-original barrier, fold each round via w2schar gated bake, eval on tinymfv. Three uncertainty gates (filter / heal / iterate) each with a UAT artifact. Base model google/gemma-3-1b-it (RTX 3090, 24GB). Reference repos vendored under docs/vendor (gitignored): steering-lite, isokl, tinymfv, w2schar-mini. The lighter three are editable path deps; w2schar (py3.13 + flash-attn) is reference-only, we copy its adapter/bake/plot modules. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 09:49:31 +08:00
wassname	b98535066a	spec done	2026-06-04 09:42:27 +08:00
wassname	4516a099ef	wip	2026-06-04 08:55:05 +08:00
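
Commits 2b1d2b7493, 026de8fd74 and 7120ee4217 describe a hinge barrier on an aggregated per-position KL, with a round-ramped weight. The sketch below is one plain-Python reading of those messages, not the code in heal.py: the function names `aggregate_kl` and `barrier`, the multiplication of the hinge by `lam_eff`, the linear-interpolation percentile, and the sample KL values are all assumptions made for illustration.

```python
import math


def aggregate_kl(kl_pos, how="mean", eps=1e-8):
    """Collapse per-position KL values into one scalar (hypothetical helper)."""
    if not kl_pos:
        raise ValueError("kl_pos must be non-empty")
    if how == "mean":
        return sum(kl_pos) / len(kl_pos)
    if how == "rmse":
        # eps goes inside the sqrt, as the commit notes: with all-zero KL a
        # bare sqrt(0) would have an infinite gradient under autograd.
        return math.sqrt(sum(k * k for k in kl_pos) / len(kl_pos) + eps)
    if how == "p95":
        # Linear-interpolation percentile (an assumed convention).
        xs = sorted(kl_pos)
        pos = 0.95 * (len(xs) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)
    if how == "max":
        return max(kl_pos)
    raise ValueError(f"unknown kl_agg: {how!r}")


def barrier(kl_pos, how, tau, lam, round_idx=0, lam_round_pow=0.0):
    """Hinge relu(agg(KL) - tau), weighted by lam_eff = lam * (1 + round)^pow."""
    lam_eff = lam * (1 + round_idx) ** lam_round_pow
    return lam_eff * max(0.0, aggregate_kl(kl_pos, how) - tau)


# Made-up per-position KL values, NOT numbers from the repo's runs:
# mostly quiet positions plus a few large outliers, as in a token loop.
coherent = [0.03] * 100
loopy = [0.02] * 90 + [3.0] * 10

if __name__ == "__main__":
    for how in ("mean", "rmse", "p95", "max"):
        print(how, barrier(coherent, how, tau=0.5, lam=1.0),
              barrier(loopy, how, tau=0.5, lam=1.0))
```

On these toy values the mean aggregate stays under tau=0.5 for the loopy sequence, so the hinge is zero, while rmse, p95 and max all exceed it. That is the dilution effect the commit message reports, reproduced only qualitatively.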

50 Commits
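
The cue-ball rules described in commits 502417b259 and ff8a231085 (finiteness and coherence floor first, trait gate, surgicality, coh>=0.95 for green) can be sketched roughly as follows. This is a guess at the ordering from the commit messages, not the logic in run.py: the function name `cue`, the dict keys, the choice of yellow for a non-surgical or sub-0.95 result, the optional `max_cost` threshold (the source marks that threshold as a TODO), and the example numbers are all assumptions.

```python
import math


def cue(base, cur, trait_gate=-0.3, coh_floor=0.85, coh_green=0.95, max_cost=None):
    """Return 'green', 'yellow' or 'red' for an eval `cur` measured against `base`.

    Both dicts carry 'auth', 'care', 'fair' (nats) and 'coh' (0..1).
    Key names are hypothetical.
    """
    d_auth = cur["auth"] - base["auth"]
    d_care = cur["care"] - base["care"]
    d_fair = cur["fair"] - base["fair"]
    d_coh = cur["coh"] - base["coh"]

    # Finiteness and the absolute floor come FIRST, so a collapsed model
    # (dAuth -> -inf, ratio -> 0) cannot score green.
    if not all(math.isfinite(v) for v in (d_auth, d_care, d_fair, d_coh)):
        return "red"
    if cur["coh"] < coh_floor:
        return "red"
    # No trait movement means nothing to report: red.
    if d_auth > trait_gate:
        return "red"

    surgical = abs(d_auth) > max(abs(d_care), abs(d_fair))
    coh_cost = abs(d_coh) / abs(d_auth)  # coherence lost per nat of trait
    cost_ok = max_cost is None or coh_cost <= max_cost  # threshold is a TODO upstream

    if cur["coh"] >= coh_green and surgical and cost_ok:
        return "green"
    return "yellow"  # assumed fallback; the commits do not spell this out


# Illustrative values only, not results from the repository.
base = {"auth": -2.3, "care": -1.3, "fair": -2.0, "coh": 0.99}
shifted = {"auth": -3.3, "care": -1.1, "fair": -2.1, "coh": 0.99}
collapsed = {"auth": float("-inf"), "care": -1.3, "fair": -2.0, "coh": 0.0}
no_trait = {"auth": -2.4, "care": -1.3, "fair": -2.0, "coh": 0.99}
broad = {"auth": -3.3, "care": -2.5, "fair": -2.0, "coh": 0.99}

if __name__ == "__main__":
    for name, ev in [("shifted", shifted), ("collapsed", collapsed),
                     ("no_trait", no_trait), ("broad", broad)]:
        print(name, cue(base, ev))
```

Putting the floor and finiteness checks ahead of the ratio is the design point the review commit stresses: a pure ratio rewards a denominator that blows up.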