mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 16:47:16 +08:00
4568ddf491
The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA `score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat swings, then comparing those to steering-lite's 0.5-2 nat reference, which is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)). Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099) = -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to 0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour) to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
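The two quantities named in the commit message can be sketched as follows. This is a minimal illustration, not the repository's code: `log_softmax`, `mean_choice_logprob`, and `coherence_cost` are hypothetical names, and the input is assumed to be one row of raw choice logits per vignette. The real `foundation_nats` and `evaluate_model` may be structured differently.

```python
import math


def log_softmax(logits):
    """Normalised log-probabilities for one row of raw logits (stable log-sum-exp)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mean_choice_logprob(logit_rows, choice_index):
    """Mean log p[choice] over vignettes, in nats.

    Hypothetical stand-in for the auth_nats idea: read the NORMALISED
    log-probability of one choice instead of its raw pre-softmax logit.
    """
    logps = [log_softmax(row)[choice_index] for row in logit_rows]
    return sum(logps) / len(logps)


def coherence_cost(d_coh, d_auth):
    """coh_cost = |dCoh| / |dAuth|: coherence lost per nat of behaviour shift.

    Assumes d_auth is non-zero; a zero shift has no defined cost here.
    """
    return abs(d_coh) / abs(d_auth)
```

Adding a constant to every logit in a row leaves `mean_choice_logprob` unchanged, whereas the raw logit moves with it. That sensitivity to an arbitrary offset is why the unnormalised value is on a different scale from a log-probability.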
14 lines
112 B
Plaintext
outputs/
out/
results/
results.tsv
logs/
wandb/
data/
docs/vendor/
__pycache__/
*.pyc
.env
.venv/
*.safetensors