metric fix: auth_nats = diagonal log(p) not raw forced-choice logit

The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA `score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat swings, then comparing those to steering-lite's 0.5-2 nat reference -- which is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)). Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099) = -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to 0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour) to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
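
A minimal sketch of the two quantities named in the message above, not the repository's code. It assumes each vignette yields one raw forced-choice logit per foundation; the function bodies, the `authority_idx` argument, and the `eps` guard are illustrative assumptions, and only the names `auth_nats` and `coh_cost` come from the message. It shows why the normalised log-probability is the safer scale: adding a constant offset to a row of raw logits moves the raw diagonal but leaves log p[authority] unchanged.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one row of raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def auth_nats(score_rows: Sequence[Sequence[float]], authority_idx: int) -> float:
    """Mean normalised log p[authority] over authority-defiance vignettes.

    score_rows: one row of raw (pre-softmax) foundation logits per vignette.
    authority_idx: column of the Authority foundation (assumed layout).
    """
    logps = [log_softmax(row)[authority_idx] for row in score_rows]
    return sum(logps) / len(logps)


def coh_cost(d_coh: float, d_auth: float, eps: float = 1e-9) -> float:
    """Coherence lost per nat of behaviour shift: |dCoh| / |dAuth|.

    eps is an assumed guard so a zero trait shift does not divide by zero.
    """
    return abs(d_coh) / max(abs(d_auth), eps)


if __name__ == "__main__":
    # Two vignettes with identical preferences but different raw offsets:
    # the raw Authority logit differs by 5, the normalised log-prob does not.
    rows = [[0.0, 1.0, -1.0, 0.5], [5.0, 6.0, 4.0, 5.5]]
    print("raw authority logits:", [r[3] for r in rows])
    print("auth_nats:", auth_nats(rows, authority_idx=3))
    print("coh_cost:", coh_cost(d_coh=-0.1, d_auth=2.0))
```

Comparing base against a steered or healed model would then be a difference of two `auth_nats` values computed on the same vignettes, with `coh_cost` dividing the coherence change by that difference.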
2026-06-27 16:47:16 +08:00 · 2026-06-04 14:25:40 +08:00
parent 6b15a8b2ae
commit 4568ddf491
17 changed files with 1814 additions and 48 deletions
@@ -195,6 +195,40 @@ Per setup-repo, the single functional test is `just fast-dev-run`: the real pipe
 4. N kept completions (~50?), epochs (2?), LoRA rank.
 5. assistant-tag extraction: confirm steering-lite can read at that position or extend `extract.py`.

+## Plans / fallbacks if the trait won't distill (recorded 2026-06-04)
+
+Context: on gemma-3-4b-it, one round of distill+heal UNDOES the steering rather than healing it
+(journal 2026-06-04 (a)): the adapter reverts to base, dropping Authority along with the incoherence,
+because the coherence filter removed the trait-laden completions before training. Ordered fallbacks:
+
+- Plan A (current primary): make the steering in the TRAINING DATA strong enough to carry a large
+  trait shift while the healed model still sits at coherence ~0.95 (not the 0.80 collapse of c=1, not
+  the 0.99 no-op). Heal-vs-undo metric: `retain = dAuth(heal)/dAuth(steer)` and the ratio |dAuth|/|dCoh|;
+  a real heal has large |dAuth| at small |dCoh|, an undo has both ~0. Run heal with the coherence FILTER
+  OFF (ppl_tau large) so the kl_rev barrier, not the filter, removes incoherence during training.
+- Plan B (better extraction method / target): TWO sub-options.
+  - B1 (method): raw mean_diff is NOT the worst -- it is 4th/mid-pack in steering-lite (SI 32.8 vs
+    directional_ablation 52.9, sspace 45.7, super_sspace 47.7). If the proper persona pair + diverse
+    contexts is still broad, TODO try `super_sspace` or `sspace` (steering-lite variants/) -- more
+    surgical, and super_sspace is 4x faster than per-Linear sspace. Check bake-ability (Plan D).
+  - B2 (target): if Authority stays weak, target -Care or +Sanctity. Care has the widest steered range
+    on 4b (base 0.274 -> steered 0.056) so better SNR. Pick whichever the model steers most cleanly.
+  - Persona pair: use the PROVEN steering-lite pair (docs/personas/how_to_write_personas.md), NOT a
+    made-up "trait vs helpful-assistant" pair. Direct opposites, no negation, conflict framing:
+    pos="looks after others' wellbeing even when defying authority",
+    neg="defers to authority even when others' wellbeing suffers for it". (fixed 2026-06-04)
+- Plan C (eval reliability): the mean-mass forced-choice shift is noisy at max_think_tokens=64. Raise
+  tinymfv to 128 or 256 think tokens for the headline evals (should not be necessary, but the 64-token
+  profile is unreliable; document the cost). Also: foundation absolute values are NOT portable across
+  n_vignettes (base Care is 0.92 at the first 24 vignettes but 0.27 at all 132) -- always compare
+  base-vs-X paired at the SAME n, and prefer all 132.
+- Plan D (better extraction): raw mean-diff may be too blunt. Consider steering-lite alternatives
+  (cosine-gated steering, SVD/PiSSA-style directions) that give a cleaner trait axis. Constraint:
+  the method must be BAKEABLE into static weights (the loop folds each round into `baked()`). A
+  cosine GATE is input-dependent (its scale depends on the activation), so it cannot be folded into a
+  fixed weight delta -- if we use gating for extraction we still need a bakeable distillate. Check
+  which steering-lite methods are weight-foldable before adopting.
+
 ## UAT summary (proof, not assertion)

 - U1 filter gate: `results/u1_filter_gate.md` — labelled set, scorer separation. Link when done.