mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 16:47:16 +08:00

Files

T

wassname 81340e3272 axis = SocialNorms/Care (Authority degenerate); over-steer generation

scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile
the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88.
Authority is ~0 on this model (no headroom), so:
- eval reports all foundations; trait axis = SocialNorms (down) + Care (up)
- map.html plots Care vs SocialNorms
- add gen_alpha=1.5: over-steer generation into the incoherent regime so the
  heal (Q1) has work to do (at 1 nat coherence improved, nothing to heal)
- results.py groups on coherence/socialnorms/care

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-04 10:28:52 +08:00

3.4 KiB

Raw Blame History

Research Journal

2026-06-04

Scaffold

Set up the repo: uv + justfile + fast-dev-run on wassname/qwen3-5lyr-tiny-random, package under src/steer_heal, config in config.py, pipeline skeleton in run.py. Design and the three uncertainty gates are in spec.md.

Vendored reference repos into docs/vendor (gitignored, just vendor to reclone): steering-lite, isokl_steering_calibration, tinymfv, w2schar-mini. The first three are editable path deps; w2schar-mini needs py3.13 and pins flash-attn, so it stays reference-only and we copy its adapter/bake/plot modules.

Base model for real runs: google/gemma-3-1b-it (gemma has more personality to steer; the alternative was a smarter-but-flatter Qwen). RTX 3090, 24 GB.

Next: port teacher_vec (steering-lite + iso-KL), then the U1 filter gate. Pipeline stages currently fail fast with NotImplementedError pointing at the vendor module to port from.

Validation run on gemma-3-1b-it (3 rounds, kl_rev) — calibration too weak

First real-model run completed end to end (out/20260604T101347_gemma-3-1b-it_kl_rev_s42/, log /tmp/claude-1000/steer_heal_gemma_val.log). Pipeline, folding, and tinymfv eval all work on a real model.

Bug found: iso-KL calibration could not reach target_kl=1.0. c_star pinned at the doubling top (~25.6) with p95 KL only ~0.1 nats. The steering vector is L2-normalised, so KL ~ c^2 and ~1 nat needs c ~ O(100); steering-lite's default bracket hi (~16) is too low.

Interpretation: steering was under-powered, so little trait was injected and little to heal. Symptoms: auth stuck at 0.000, care barely moved (0.307 -> 0.315 over 3 rounds), kl_rev barrier mostly below tau=0.5 (div 0.17-0.51). coherence healthy and flat (0.65-0.68); cos(v_r,v_0)=0.99/0.98 (direction stable, but a weak test under weak steering).

Fix: pass bracket=(0.1, 1024.0) to v.calibrate. Re-running to confirm an interior c_star with p95 KL ~ 1.0.

Also to investigate: auth=0.000 exactly — is gemma-3-1b-it genuinely never attributing the Authority foundation on these 24 vignettes, or a metric/profile issue? Check once steering is strong enough to move things.

Steering validity confirmed; real axis is SocialNorms/Care, not Authority

scripts/diag_axis.py on gemma-3-1b-it, base vs steered at calibrated c_star=67.7 (~1 nat p95 KL). The vector moves the moral-foundation profile in the right direction for "less deference to authority":

foundation	base	steered	Δ
SocialNorms	0.680	0.421	-0.260
Care	0.213	0.328	+0.115
Fairness	0.030	0.098	+0.069
Liberty	0.040	0.075	+0.035
Authority	0.000	0.001	+0.001
coherence	0.722	0.884	+0.162

Interpretation: the core premise holds, steering shifts moral judgments coherently toward the trait. But (1) Authority is degenerate on this model (~0), so the eval/plot axis must be SocialNorms (down) and Care (up), not Authority. (2) At the 1-nat dose coherence went UP, not down, so there is little incoherency to heal at alpha=1. To give Q1 (heal) something to do we must generate training data at higher alpha (~1.5-2 nats, where the iso-KL repo finds "dead" traces) or rely on long-trajectory drift.

Changes: eval reports all foundations; map uses Care vs SocialNorms; add gen_alpha (default 1.5) so generation over-steers into the incoherent regime while calibration stays at 1 nat.

3.4 KiB Raw Blame History