3 Commits

Author SHA1 Message Date
wassname 7db5a56cb1 writeup: NeurIPS quarto scaffold + paper/paper-html recipes
Adds docs/writeup/paper.qmd (2pp NeurIPS), references.bib, neurips_2023.sty, and the
quarto _extensions. The justfile gains `paper` (latex) and `paper-html` (no latex)
recipes. Gitignores the generated paper.pdf/paper.tex and the transient .claude/.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 06:36:30 +08:00
wassname 4568ddf491 metric fix: auth_nats = diagonal log(p) not raw forced-choice logit
The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA
`score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat
swings, then comparing those to steering-lite's 0.5-2 nat reference -- which
is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)).
Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the
NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099)
= -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.
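
A minimal sketch of the normalised metric described above, assuming each vignette yields one row of raw forced-choice logits. The helper name `foundation_nats_sketch`, the three-choice layout, and the example numbers are hypothetical; this is not the repo's `foundation_nats`. It illustrates why the softmax log-probability is the safer quantity: adding a constant to every raw logit leaves it unchanged.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def foundation_nats_sketch(logit_rows, choice_idx):
    """Mean normalised log p[choice_idx] over vignettes (one row each)."""
    logps = [log_softmax(row)[choice_idx] for row in logit_rows]
    return sum(logps) / len(logps)


# Made-up raw logits for two vignettes; column 0 plays the role of "authority".
rows = [[-5.0, -3.0, -4.0], [-5.5, -2.5, -4.5]]
# The same rows with an arbitrary offset added to every logit.
shifted = [[x + 10.0 for x in row] for row in rows]

raw_mean = sum(row[0] for row in rows) / len(rows)   # unnormalised, offset-dependent
nats = foundation_nats_sketch(rows, 0)               # normalised
nats_shifted = foundation_nats_sketch(shifted, 0)    # equal to nats up to rounding

print(raw_mean, nats, nats_shifted)
print(math.log(0.099))  # about -2.31, the base value quoted above
```

The raw column mean moves with the offset while the normalised value does not, so only the normalised value is comparable across runs.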

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to
  0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour)
  to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms
  co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.
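
The coh_cost ratio from the diag_stages item can be sketched as below. The function signature, the zero-shift guard, and the example numbers are assumptions for illustration, not the repo's code.

```python
import math


def coh_cost(coh_base, coh_steered, auth_base, auth_steered, eps=1e-8):
    """Coherence lost per nat of behaviour shift: |dCoh| / |dAuth|."""
    d_coh = abs(coh_steered - coh_base)
    d_auth = abs(auth_steered - auth_base)
    if d_auth < eps:
        # No behaviour shift: any coherence loss is bought for nothing.
        return math.inf
    return d_coh / d_auth


# Hypothetical readings: coherence drops 0.9 -> 0.6 while auth_nats moves -2.3 -> -0.8.
print(coh_cost(0.9, 0.6, -2.3, -0.8))
```

A lower value means more behaviour shift per unit of coherence lost, which is the comparison the pareto question needs.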

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:25:40 +08:00
wassname 940a3742c5 scaffold steer_heal: spec, repo infra, vendored deps
Setup per setup-repo conventions: uv + justfile + fast-dev-run on
wassname/qwen3-5lyr-tiny-random, package under src/steer_heal (config +
pipeline skeleton). Stages fail fast with NotImplementedError pointing at
the docs/vendor module to port from.

Design in spec.md: distil a steering-lite mean-diff teacher vector (iso-KL
dosed) into a conditioned LoRA, heal incoherence with a KL-rev-to-original
barrier, fold each round via w2schar gated bake, eval on tinymfv. Three
uncertainty gates (filter / heal / iterate) each with a UAT artifact.
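
For orientation, a generic mean-diff steering vector can be sketched as below. This is the textbook construction, assuming hidden activations collected from contrastive prompt sets. The function names, the unit-normalisation step, and the toy activations are hypothetical; it does not reproduce steering-lite's API or the iso-KL dosing.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_diff_vector(pos_acts, neg_acts, normalize=True):
    """Difference of mean activations: mean(pos) - mean(neg)."""
    v = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    if normalize:
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0:
            v = [x / norm for x in v]
    return v


def steer(hidden, v, c):
    """Add the steering vector to one hidden state with coefficient c."""
    return [h + c * x for h, x in zip(hidden, v)]


# Toy 2-d activations standing in for real residual-stream states.
pos = [[1.0, 0.0], [3.0, 0.0]]
neg = [[0.0, 0.0], [0.0, 0.0]]
v = mean_diff_vector(pos, neg)
print(v, steer([0.0, 0.0], v, 0.5))
```

The coefficient c is the dose; how it is chosen is left to the spec.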

Base model google/gemma-3-1b-it (RTX 3090, 24GB). Reference repos vendored
under docs/vendor (gitignored): steering-lite, isokl, tinymfv, w2schar-mini.
The lighter three are editable path deps; w2schar (py3.13 + flash-attn) is
reference-only: we copy its adapter/bake/plot modules.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 09:49:31 +08:00