steer-heal-love/.gitignore at 4568ddf491aba6e24a8c18b274edcf3387dd8611 - steer-heal-love - Gitea: Git with a cup of tea

wassname/steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 16:47:16 +08:00

Files

T

wassname 4568ddf491 metric fix: auth_nats = diagonal log(p) not raw forced-choice logit

The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA
`score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat
swings, then comparing those to steering-lite's 0.5-2 nat reference -- which
is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)).
Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the
NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099)
= -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to
  0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour)
  to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms
  co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-04 14:25:40 +08:00

14 lines

112 B

Plaintext

Raw Blame History

 outputs/
 out/
 results/
 results.tsv
 logs/
 wandb/
 data/
 docs/vendor/
 __pycache__/
 *.pyc
 .env
 .venv/
 *.safetensors