steer-heal-love/scripts/diag_axis.py
wassname 4568ddf491 metric fix: auth_nats = diagonal log(p) not raw forced-choice logit
The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA
`score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat
swings, then comparing those to steering-lite's 0.5-2 nat reference -- which
is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)).
Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the
NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099)
= -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.
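
The raw-logit vs normalised-logprob distinction above can be sketched as follows. This is a minimal illustration only: the logits are made up (not tinymfv output), and `log_softmax` and `mean_diag_logp` are hypothetical helper names, not the repo's `foundation_nats`.

```python
import math

def log_softmax(logits):
    """Normalised log-probabilities from raw logits (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mean_diag_logp(logit_rows, target_idx):
    """Mean log p[target] over vignettes; each row holds one vignette's raw choice logits."""
    logps = [log_softmax(row)[target_idx] for row in logit_rows]
    return sum(logps) / len(logps)

# Hypothetical raw logits for two vignettes over 7 choices; index 0 = authority.
rows = [
    [-5.0, -3.0, -4.0, -3.5, -4.5, -3.2, -3.8],
    [-5.2, -3.1, -4.1, -3.4, -4.4, -3.3, -3.9],
]

# Old metric: mean raw logit. Its scale shifts with any constant added to the logits.
raw_auth = sum(r[0] for r in rows) / len(rows)

# New metric: mean normalised log-probability, always <= 0 and shift-invariant.
auth_nats = mean_diag_logp(rows, target_idx=0)

print(f"raw logit mean = {raw_auth:.3f}, auth_nats = {auth_nats:.3f}")
```

Adding a constant to every logit changes `raw_auth` but leaves `auth_nats` untouched, which is why only the normalised value is comparable across runs.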

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to
  0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour)
  to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms
  co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.
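
The coh_cost ratio from the diag_stages item can be sketched as below. This assumes coherence is a probability mass in [0, 1] and the behaviour metric is in nats; the function signature and the `eps` guard against a zero behaviour shift are illustrative assumptions, not the code in diag_stages.

```python
def coh_cost(coh_base, coh_steered, auth_base, auth_steered, eps=1e-8):
    """Coherence lost per nat of behaviour change: |dCoh| / |dAuth|."""
    d_coh = abs(coh_steered - coh_base)
    d_auth = abs(auth_steered - auth_base)
    # eps (an assumption here) avoids division by zero when the behaviour did not move.
    return d_coh / max(d_auth, eps)

# Hypothetical numbers: coherence drops 0.90 -> 0.70 while auth_nats moves -2.3 -> -1.3.
cost = coh_cost(0.90, 0.70, -2.3, -1.3)
print(f"coh_cost = {cost:.3f} coherence per nat")
```

A lower value means less coherence is spent per nat of behaviour shift, so comparing the value for the adapter against the value for raw steering addresses the pareto question.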

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:25:40 +08:00


"""Diagnostic: does the steering vector move the moral-foundation profile, and where?
Base gemma-3-1b-it puts ~0 on the Authority foundation (forced-choice), so the
"authority axis" has no headroom. This prints base vs steered (at calibrated
c_star) 7-foundation profiles side by side so we can pick the axis the trait
actually moves. Run: uv run python scripts/diag_axis.py
"""
import sys
import torch
import tinymfv
from transformers import AutoModelForCausalLM, AutoTokenizer
sys.path.insert(0, "src")
from steer_heal.config import RunConfig # noqa: E402
from steer_heal.steering import teacher_vec # noqa: E402
cfg = RunConfig(n_prompts=12) # default model (gemma-3-4b-it)
MODEL = cfg.model
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="eager"
).eval()
v = teacher_vec(model, tok, cfg)
def profile(label):
    rep = tinymfv.evaluate(model, tok, name="classic", n_vignettes=24,
                           conditions=("other_violate",), max_think_tokens=cfg.eval_think_tokens, device=model.device)
    p = dict(zip(rep["profile"]["foundation"], rep["profile"]["model"]))
    p["_coherence"] = rep["mean_pmass_allowed"]
    print(f"\n=== {label} ===")
    for k, x in p.items():
        print(f" {k:12s} {x:.4f}")
    return p
base = profile("BASE (c=0)")
with v(model, C=v.cfg.coeff):
    steer = profile(f"STEERED (c_star={v.cfg.coeff:.1f}, ~1 nat)")
print("\n=== delta (steered - base), sorted by |Δ| ===")
keys = [k for k in base if not k.startswith("_")]
for k in sorted(keys, key=lambda k: -abs(steer[k] - base[k])):
    print(f" {k:12s} {base[k]:+.4f} -> {steer[k]:+.4f} Δ={steer[k]-base[k]:+.4f}")
print(f" coherence {base['_coherence']:.3f} -> {steer['_coherence']:.3f}")