Commit Graph

18 Commits

Author SHA1 Message Date
wassname e3d6a865cf stage pareto table: base->steered->healed per round (dcoh/dauth, coh, auth, care)
Adds a steered-stage tinymfv eval per round (history baked, vector live at
the operating dose = cleanest alpha, no new adapter) so the loop log shows the
full base->steered->healed pareto, not just the healed endpoint. This is the
apples-to-apples comparison: same baked base, trait via vector vs via the
distilled adapter. dcoh/dauth = signed coherence change per nat of Authority
change vs base. UAT: fast-dev-run exit 0 renders the 3-stage table.

Cost: +1 eval per round.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:40:01 +08:00
wassname ff8a231085 2nd external-review panel: close catastrophic-green cue, fix BPE assert
5-model panel (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, qwen3.6:35b).
Two confirmed bugs fixed; design risks recorded in spec.md.

run.py cue: coh_cost is a pure ratio, so a model collapsing to ~0 mass on
Authority sent dAuth->-inf, coh_cost->0, scoring a broken model green
(gemini). Now check an absolute coherence floor (coh<0.85 -> red) and
finiteness FIRST, require coh>=0.95 for green, and broaden surgicality to
|dAuth| > max(|dCare|,|dFair|) (a Fairness-ward dump was passing Care-only).

heal.py: BPE-boundary prefix assert escaped at the max_len/truncation
boundary (grok/gemini/qwen unanimous). Assert the surviving overlap
min(n_prompt,L) unconditionally; warn instead of silently skipping a kept
completion truncated to zero target tokens.

Verified false positives (recorded so they aren't re-chased): qwen's
shape[0] "batch-dim" claim (.input_ids[0] already drops batch), the
profile['model'] column (it is the marginal mean-p), the KL reference
(c=0.0 + no baked = pristine round-0).

UAT: fast-dev-run exit 0; cue shows coh=0.00 -> red (floor closes the hole).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:36:05 +08:00
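Note on ff8a231085: the cue ordering described above can be sketched as below. This is a minimal sketch, not the code in run.py. The function name `cue`, its argument names, and the yellow fallback are assumptions; the 0.85 floor, the 0.95 green requirement, the dAuth<=-0.3 gate (from 502417b259) and the surgicality test come from the commit messages. The coh_cost threshold is left out because the log still marks it TODO.

```python
import math

COH_FLOOR = 0.85    # absolute coherence floor (coh < 0.85 -> red)
COH_GREEN = 0.95    # coherence required for green
DAUTH_GATE = -0.3   # trait gate: dAuth must be <= -0.3, else red


def cue(coh, d_auth, d_care, d_fair):
    """Return 'red', 'yellow' or 'green' for one evaluated model (hypothetical helper)."""
    values = (coh, d_auth, d_care, d_fair)
    # Finiteness and the absolute floor are checked FIRST, so a collapsed
    # model (dAuth -> -inf, ratio -> 0) can never reach the green branch.
    if not all(math.isfinite(v) for v in values) or coh < COH_FLOOR:
        return "red"
    # No trait shift: nothing to score.
    if d_auth > DAUTH_GATE:
        return "red"
    # Surgicality: Authority must move more than Care and Fairness.
    surgical = abs(d_auth) > max(abs(d_care), abs(d_fair))
    if coh >= COH_GREEN and surgical:
        return "green"
    return "yellow"  # assumption: everything in between is yellow


# The catastrophic case from the review: collapsed coherence, dAuth -> -inf.
print(cue(0.0, float("-inf"), 0.0, 0.0))   # red
# A Fairness-ward dump no longer passes just because Care barely moved.
print(cue(0.97, -1.0, -0.2, -1.5))         # yellow
```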
wassname 68dc25c3a1 address external review: docstrings, scale story, surgicality cue, fail-loud
External code review (background subagent) findings, fixed:
- H1: eval.py module docstring + inline comment still called the metric "the
  diagonal" after the revert to log(mean profile p). Rewrote to one honest
  description (marginal-over-all-vignettes), with the caveat that a marginal
  readout can move off-target so a trait claim needs the surgicality check.
- H2: the nats-vs-logit scale story was asserted 3 contradictory ways. Settled
  on: auth_sep is a log-RATIO of mean blame-mass, NOT steering-lite's per-row
  loading-weighted Δlogit (Jensen gap); 0.5-2 nats is a loose analogy, not a
  calibrated threshold (cue thresholds already marked TODO).
- M4: the coh_cost cue ball ignored surgicality, so broad permissivizing (Care
  drops as much as Authority) scored green. Cue now requires |dAuth|>|dCare|.
- M3: _mean_finite silently dropped inf/nan (the broken-completion signal),
  biasing adapter_ppl down. Now logs the dropped count.
- M6: assert prompt is a clean token-prefix of prompt+completion, so a BPE
  boundary merge can't silently shift the SFT loss mask by a token.
- L8: SHOULD line warns if kl stays < tau (barrier never fired -> kl_rev==nll).

Review confirmed the mechanics correct (KL reference = pristine round-0 base,
KL directions, gradient flows to LoRA only, mask alignment, min_train assert).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:21:13 +08:00
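Note on 68dc25c3a1 (M6): a sketch of the token-prefix check, in the truncation-aware form that ff8a231085 later describes (assert the surviving overlap min(n_prompt, L), warn on zero target tokens). Token ids are plain Python lists here; `check_prompt_prefix` and its arguments are hypothetical names, not the heal.py API.

```python
import warnings


def check_prompt_prefix(prompt_ids, full_ids, max_len):
    """Check the prompt tokens are a prefix of prompt+completion after truncation.

    Returns the number of completion (target) tokens that survive truncation.
    """
    kept = full_ids[:max_len]                  # truncate to the length budget L
    overlap = min(len(prompt_ids), len(kept))  # surviving overlap min(n_prompt, L)
    # Checked unconditionally, including when truncation cuts into the prompt.
    assert kept[:overlap] == prompt_ids[:overlap], (
        "prompt is not a token-prefix of prompt+completion; "
        "a BPE merge at the boundary would shift the loss mask"
    )
    n_target = len(kept) - overlap
    if n_target == 0:
        # Warn instead of silently skipping a kept completion.
        warnings.warn("completion truncated to zero target tokens")
    return n_target


# Clean boundary: 3 prompt tokens, 2 completion tokens survive.
print(check_prompt_prefix([1, 2, 3], [1, 2, 3, 4, 5], max_len=10))  # 2
```

If the tokenizer merges the last prompt token with the first completion token (say `[1, 2, 9, 4]` against prompt `[1, 2, 3]`), the assertion fires rather than letting the mask slip by one token.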
wassname 502417b259 in-run base eval + coh_cost cue; per-round stage table; heal_nll; alpha shift
- run.py: eval base once at start; headline cue is now coh_cost=|dCoh|/|dAuth|
  vs base (coherence lost per nat of trait), gated on dAuth<=-0.3 (no trait ->
  red). coh_cost threshold a TODO (steered c=0.5 ref ~0.003).
- run.py: loop summary is now one row per round walking the pipeline stages
  L->R: gen | filt_kept | heal_nll | adapter_ppl | auth_nats | care_nats |
  coherence | cos_v0.
- heal.py: heal_round returns converged nll (last-5 mean) for the stage table.
- config: alphas (0.25,0.5,1.0,2.0) -> (0.5,0.75,1.0,1.5). Filter audit showed
  0.25 is base-like (no distinct trait); 0.5 is the clean+distinct band. Push
  the top up so strong-trait completions exist for the filter to harvest.

Gate-3 finding (task76, corrected log-profile metric): heal retains partial
trait coherently (nll 0.35, klrev 0.20 of the c=0.5 shift, coh ~1.0) but does
NOT beat steering's pareto (coh_cost: steered c=0.5 0.003 < nll 0.008 < klrev
0.015). Barrier suppresses trait (klrev<nll); coherence has headroom -> next is
LESS barrier + stronger data, not more.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:14:34 +08:00
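Note on 502417b259: the headline ratio coh_cost=|dCoh|/|dAuth| with its dAuth<=-0.3 gate, as a small sketch. The function signature is an assumption and the input values in the usage lines are illustrative only, not numbers from any run. As a pure ratio it does not by itself catch a collapsed model, which is the hole ff8a231085 closes.

```python
def coh_cost(coh, auth_nats, base_coh, base_auth_nats, gate=-0.3):
    """Coherence lost per nat of Authority shift vs base (hypothetical helper).

    Returns None when the trait gate fails (dAuth > gate), i.e. no trait.
    """
    d_coh = coh - base_coh
    d_auth = auth_nats - base_auth_nats
    if d_auth > gate:
        return None  # no trait shift -> the cue would be red
    return abs(d_coh) / abs(d_auth)


# Illustrative values only: 0.003 coherence lost for a 1-nat Authority drop.
print(coh_cost(coh=0.997, auth_nats=-3.3, base_coh=1.0, base_auth_nats=-2.3))
# Gate fails: Authority moved by only -0.1 nats.
print(coh_cost(coh=0.999, auth_nats=-2.4, base_coh=1.0, base_auth_nats=-2.3))
```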
wassname 579e1f6671 metric = log(tinymfv profile p); cue-ball headline; training-table sig figs
After verifying guided.py: tinymfv `score` is already a debiased logprob
((lp_fwd+lp_rev)/2, BMA'd), not a "raw logit", and `p = softmax(score)`. My
two earlier inventions were both wrong:
- log(p) coupled Authority to the other 6 foundations via logsumexp.
- the diagonal (auth-blame on auth-vignettes) is pmass-on-correct-label =
  top1 competence, not the trait, and threw away the FP/FN structure.

Use the library-native readout: auth_nats = log(tinymfv profile p[F]) = log of
the mean p per foundation over ALL vignettes. For small p, log p ~= logit, so
this lands on steering-lite's loading-weighted Δlogit scale (base log(0.099)
=-2.3, real shift ~0.5-2 nats). foundation_nats now reads rep["profile"].

Also:
- run.py: BLUF `main metric:` line with cue ball (🟢/🟡/🔴 by coherence band).
- heal.py: training table to 2 sig figs (nll/kl/loss .2f, gnorm .1f); a
  per-step loss does not warrant 3 decimals.
- diag_stages: accept 1+ ckpts, label each row by its reg from metadata.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:02:56 +08:00
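Note on 579e1f6671: the readout "log of the mean p per foundation over ALL vignettes" can be sketched as below. The per-vignette list-of-dicts layout is an assumption made for illustration; the real function reads rep["profile"], whose structure is not shown in this log. The sketch also prints logit(p) next to log p to show how close the two are at small p.

```python
import math


def foundation_nats(per_vignette_p, foundation):
    """log of the mean probability on `foundation`, taken over all vignettes."""
    mean_p = sum(p[foundation] for p in per_vignette_p) / len(per_vignette_p)
    return math.log(mean_p)


def logit(p):
    """log-odds of p, for comparison with log p."""
    return math.log(p) - math.log1p(-p)


# Two made-up vignettes whose Authority mass averages 0.099.
rows = [{"authority": 0.05}, {"authority": 0.148}]
print(round(foundation_nats(rows, "authority"), 2))  # -2.31
print(round(logit(0.099), 2))                        # -2.21
```

Taking the log after averaging is not the same as averaging per-row logs; the later review commit 68dc25c3a1 (H2) records this as a Jensen gap and downgrades the scale match to a loose analogy.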
wassname 4568ddf491 metric fix: auth_nats = diagonal log(p) not raw forced-choice logit
The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA
`score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat
swings, then comparing those to steering-lite's 0.5-2 nat reference -- which
is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)).
Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the
NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099)
= -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to
  0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour)
  to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms
  co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:25:40 +08:00
wassname 6b15a8b2ae narrow steer band, assert >=20 train, training table, full gen dumps
Root cause found via diag_axis on 4B: raw mean-diff steered across the 7-layer
band (0.4-0.6) at coeff=1 DESTROYS gemma-3-4b (coherence 1.00->0.02). That
starved the filter to 2 kept completions, so the "adapter" was ~untrained
(2 examples) = base behaviour; my Q1 "promising" read was not validated.

Fixes:
- separate steer_layers (narrow 0.45-0.55) for the vector from layer_range
  (broad 0.0-1.0) for the LoRA; they were wrongly coupled
- lower alpha sweep (0.25,0.5,1,2); n_prompts=16
- assert len(kept) >= min_train(20); TINY=2. Don't train on starved data.
- heal training table (loguru+tqdm per token-efficient-logging): step, nll, kl,
  loss, gnorm + SHOULD
- full untruncated steer + adapter generation dumps with prompt and
  coherence(p_ans_any) inline so we can judge coherence/trait ourselves

NOT yet run with fixes on 4B. Base 4B is Care=0.92 (already aligned) -> the
prompting-baseline confound (Q7) is now the critical check.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:51:24 +08:00
wassname 0c15562c81 fix: gemma-3-4b is multimodal, read num_hidden_layers via config.get_text_config()
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:44:56 +08:00
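Note on 0c15562c81: a sketch of reading the layer count in a way that works for both text-only and multimodal configs. Hugging Face `PretrainedConfig` provides `get_text_config()`; the `getattr` fallback for objects without it, the helper name, and the stand-in configs in the usage lines (with arbitrary layer counts) are assumptions for illustration.

```python
from types import SimpleNamespace


def num_hidden_layers(config):
    """Return the text decoder depth from a text-only or multimodal config."""
    get_text = getattr(config, "get_text_config", None)
    # Multimodal configs nest the language-model settings; text-only configs
    # (or objects without the method) are used as-is.
    text_cfg = get_text() if callable(get_text) else config
    return text_cfg.num_hidden_layers


# Stand-ins with arbitrary depths, not real model configs.
text_only = SimpleNamespace(num_hidden_layers=5)
inner = SimpleNamespace(num_hidden_layers=12)
multimodal = SimpleNamespace(text_config=inner, get_text_config=lambda: inner)

print(num_hidden_layers(text_only))   # 5
print(num_hidden_layers(multimodal))  # 12
```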
wassname d8aca870b7 drop calibration; sweep C + filter; SHOULD logging for all Q's; 4B default
Per user: no iso-KL calibration. Use raw (unnormalised) mean-diff teacher
vector; sweep cfg.alphas at generation and let the FILTER pick usable C
(filter replaces calibration). Default model google/gemma-3-4b-it (1B too dumb;
Authority degenerate there was a model artifact, not a real conclusion).

Token-efficient discriminating logs so each Q is readable:
- Q0: filter table (alpha -> ppl_mean, kept_frac) + low/high-C samples + SHOULD
- Q1: generate from trained adapter (no steering); adapter_ppl vs steered_ppl
  under the original + sample + SHOULD (heal = adapter more coherent than steered)
- Q2/Q3: loop summary table (socialnorms/care/coherence/cos_v0 per round) + SHOULD

fast-dev-run green: ppl rises with alpha (3173->4.2M), adapter_ppl<<steered_ppl.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:37:54 +08:00
wassname 81340e3272 axis = SocialNorms/Care (Authority degenerate); over-steer generation
scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile
the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88.
Authority is ~0 on this model (no headroom), so:
- eval reports all foundations; trait axis = SocialNorms (down) + Care (up)
- map.html plots Care vs SocialNorms
- add gen_alpha=1.5: over-steer generation into the incoherent regime so the
  heal (Q1) has work to do (at 1 nat coherence improved, nothing to heal)
- results.py groups on coherence/socialnorms/care

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:28:52 +08:00
wassname 5cdc0ba16d fix: widen iso-KL calibration bracket so c_star lands interior
The mean-diff vector is L2-normalised, so p95 KL ~ c^2 and reaching the 1-nat
target needs c ~ O(100). steering-lite's default bracket hi (~16) pinned
c_star at the top (KL ~0.1 << 1.0) on both tiny-random and real gemma. With
bracket=(0.1, 1024) gemma calibrates to c_star=64.03 at p95 KL=1.035.
Also detach div before .item() in heal logging. See RESEARCH_JOURNAL.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:24:21 +08:00
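Note on 5cdc0ba16d: why a narrow bracket pins c_star at the top. This is a toy geometric bisection over a quadratic KL curve, not steering-lite's calibration routine. `calibrate`, `toy_kl`, and the constant are hypothetical; the constant is chosen only so that the target is hit at c=64.

```python
import math


def calibrate(kl_at, target, bracket, iters=60):
    """Find c in bracket with kl_at(c) ~= target, assuming kl_at is increasing."""
    lo, hi = bracket
    if kl_at(hi) < target:
        return hi  # target unreachable: the answer is pinned at the bracket top
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint suits a wide log-scale bracket
        if kl_at(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


def toy_kl(c):
    """Toy curve KL ~ c^2, scaled so KL = 1.0 at c = 64 (assumed constant)."""
    return c * c / 4096.0


print(calibrate(toy_kl, 1.0, (0.1, 16)))              # 16, with KL = 0.0625 << 1
print(round(calibrate(toy_kl, 1.0, (0.1, 1024)), 3))  # 64.0, interior
```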
wassname 3b532b63dd implement pipeline: extract -> dose -> generate -> filter -> heal -> fold -> eval -> loop
fast-dev-run now runs end to end on wassname/qwen3-5lyr-tiny-random (out dir +
map.html + results.tsv row). Modules:
- steering.py: teacher_vec (steering-lite mean-diff @ assistant tag + iso-KL dose) + steered gen
- filter.py: Q0 coherence (ppl-under-original + repetition + narration regex)
- heal.py: Q1 SFT + divergence-to-original barrier (nll/kl_fwd/kl_rev/wd), reg=kl_rev default
- ws/{adapter,bake}.py: ModulatedLoRA + gated baked(), copied from w2schar-mini
- eval.py: tinymfv -> {auth, care, coherence(mean_pmass_allowed), ppx_json}
- plot.py: plotly Care-vs-Authority map.html (simplified w2schar port)
- io.py: out/{ts}_{slug}/ + srsly events.jsonl + shared results.tsv
- prompts.py: 30 authority dilemmas (POOL, copied) + assistant-tag chat templating

Trace confirms pos/neg prompts end at the assistant tag (paper's read).
Tiny-random numbers are junk by design; real calibration/eval pending small-dev.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:12:08 +08:00
wassname e1db0759ee bootstrap
2026-06-04 10:05:47 +08:00
wassname 4094a295b2 readme
2026-06-04 10:05:38 +08:00
wassname 4b8860d7cb setup-repo gap-fill: results ledger + docs structure
Add the by-question results infra per setup-repo conventions:
- results.tsv append at end of each finished run (config + final metrics + argv)
- scripts/results.py groups by arm (reg) into a markdown table; `just results`
- docs/results.md curated by-question snapshot (U2 regulariser comparison)
- docs/{spec,brainstorming,literature,evidence} structure

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 09:51:36 +08:00
wassname 940a3742c5 scaffold steer_heal: spec, repo infra, vendored deps
Setup per setup-repo conventions: uv + justfile + fast-dev-run on
wassname/qwen3-5lyr-tiny-random, package under src/steer_heal (config +
pipeline skeleton). Stages fail fast with NotImplementedError pointing at
the docs/vendor module to port from.

Design in spec.md: distil a steering-lite mean-diff teacher vector (iso-KL
dosed) into a conditioned LoRA, heal incoherency with a KL-rev-to-original
barrier, fold each round via w2schar gated bake, eval on tinymfv. Three
uncertainty gates (filter / heal / iterate) each with a UAT artifact.

Base model google/gemma-3-1b-it (RTX 3090, 24GB). Reference repos vendored
under docs/vendor (gitignored): steering-lite, isokl, tinymfv, w2schar-mini.
The lighter three are editable path deps; w2schar (py3.13 + flash-attn) is
reference-only, we copy its adapter/bake/plot modules.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 09:49:31 +08:00
wassname b98535066a spec done
2026-06-04 09:42:27 +08:00
wassname 4516a099ef wip
2026-06-04 08:55:05 +08:00