After verifying guided.py: tinymfv `score` is already a debiased logprob
((lp_fwd+lp_rev)/2, BMA'd), not a "raw logit", and `p = softmax(score)`. My
two earlier inventions were both wrong:
- log(p) coupled Authority to the other 6 foundations via logsumexp.
- the diagonal (auth-blame on auth-vignettes) is pmass-on-correct-label =
top1 competence, not the trait, and threw away the FP/FN structure.
Use the library-native readout: auth_nats = log(tinymfv profile p[F]) = log of
the mean p per foundation over ALL vignettes. For small p, log p ~= logit, so
this lands on steering-lite's loading-weighted Δlogit scale (base log(0.099)
=-2.3, real shift ~0.5-2 nats). foundation_nats now reads rep["profile"].
Also:
- run.py: BLUF `main metric:` line with cue ball (🟢/🟡/🔴 by coherence band).
- heal.py: training table to 2 sig figs (nll/kl/loss .2f, gnorm .1f); a
per-step loss does not warrant 3 decimals.
- diag_stages: accept 1+ ckpts, label each row by its reg from metadata.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA
`score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat
swings, then comparing those to steering-lite's 0.5-2 nat reference -- which
is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)).
Wrong scale, wrong comparison.
Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the
NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099)
= -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.
Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to
0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour)
to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms
co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>