steer-heal-love/AGENTS.md at 7d703c0cc3705a85f44122d1cafa4be96a0b0929

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 15:32:28 +08:00

wassname 09349894ce results: QLoRA bs=3 ga=2 + lam_round_pow=-0.5 extends movement to r6 (peak -0.37 vs -0.60)

- plot: Panel A now tracks top-moving trait (care for love demo, auth for authority)
  instead of hardcoded auth_nats; Panel C already did this, Panel A now consistent
- README: update table with new run (lam decay extends saturation r4→r6), refresh diary
  from new run's outputs, update trajectory plot
- AGENTS.md: correct gotchas -- tau<operating_KL is the key constraint (tau=2.0 not 4.0);
  QLoRA + bs=3 ga=2 is the right default for better heal gradient estimates

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-10 07:36:09 +08:00


This is novel ML research. Not in your training data. Extrapolate carefully. Read spec.md first.

What this is

Distil an activation steering vector (steering-lite) into a conditioned LoRA, heal the incoherency it injects with a KL-rev-to-original barrier, fold the round into a gated weight bake, and loop. Eval on tinymfv (auth/care axis + coherence). Full design and the three uncertainty gates are in spec.md.

Workflow

Inherit global rules from ~/.claude/CLAUDE.md.
just vendor to (re)clone reference repos into docs/vendor (editable path deps).
just fast-dev-run before any real run: the real pipeline on the tiny-random model, beartype on, scale-only knobs. If a bug slips past it, strengthen the gate; do not add a tests/ dir.
just run for a real run on gemma-3-1b-it (RTX 3090, 24GB).
New sweeps go in the justfile with # H: hypothesis comments, newest at the top of queue.
tail docs/RESEARCH_JOURNAL.md for latest context.

Reuse, do not reinvent (docs/vendor)

steering-lite: Vector.train(...).calibrate(target_kl=...), mean-diff vector + iso-KL dose.
iso-kl-figure: coefficient calibration and KL/coherence measurement.
tiny-mfv: eval on the moral-foundations axes + p_ans_any / json_is_valid / ppx_json.
w2schar-mini (NOT a dep, needs py3.13): copy src/csm/ws/{adapter,bake,history}.py for the conditioned LoRA + gated bake, and port src/csm/plot.py _build_scatter for the Care-vs-Authority HTML map. The base stays pristine at gate 0 = our KL anchor.
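
A minimal sketch of the two ideas behind the steering-lite entry above, mean-diff vector and iso-KL dose, in plain Python. This is NOT steering-lite's API: mean_diff, calibrate, and the toy setup (the shift is applied directly to logits, and the KL direction is KL(steered || base)) are assumptions made for illustration only.

```python
import math


def mean_diff(pos_acts, neg_acts):
    """Mean-difference direction: mean(pos) - mean(neg), per hidden dim."""
    def col_mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)
    dim = len(pos_acts[0])
    return [col_mean(pos_acts, j) - col_mean(neg_acts, j) for j in range(dim)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def calibrate(base_logits, logit_shift, target_kl, hi=64.0, iters=60):
    """Bisect for the coefficient c >= 0 where KL(steered || base) hits target_kl.

    Toy assumption: steering adds c * logit_shift to the logits. In that
    setting the KL is non-decreasing in c, so bisection is valid as long as
    target_kl is reachable below hi.
    """
    base = softmax(base_logits)

    def kl_at(c):
        steered = softmax([z + c * d for z, d in zip(base_logits, logit_shift)])
        return kl(steered, base)

    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_at(mid) < target_kl:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical toy data: two "positive" and two "negative" activation rows.
vector = mean_diff([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
# Hypothetical 4-token vocabulary with uniform base logits.
coeff = calibrate([0.0] * 4, [1.0, 0.0, 0.0, -1.0], target_kl=0.5)
print(vector, coeff)
```

The point of the dose step is that coefficients become comparable across vectors: each one is scaled to cause the same divergence from the base rather than sharing a raw multiplier.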

Code style

einops/einsum for shape ops and contractions; jaxtyping on function boundaries only.
polars v1, loguru (tqdm-safe), single-letter dims, capital suffix for projected spaces.
Fail fast, crash loudly. No defensive guards, no fallbacks, no silent skips.
One objective + one constraint (barrier), never competing losses. See spec.md Loss.
Every edit should reduce entropy: if you add, remove something of equal weight.

Gotchas

Use QLoRA + train_bs=3 + grad_accum=2 (eff_bs=6). The larger effective batch gives better heal SFT gradient estimates. 4-bit decode is ~3x slower than bf16 but the convergence win is worth it. Only skip QLoRA if targeting a model too large for the GPU in bf16.
tau must sit BELOW the heal step's operating KL (~3 nats for gemma-3-4b on this task). If tau > operating_KL, relu(div - tau) = 0, so the barrier contributes no gradient and fails silently. Symptom: coherence drops fast and the coh_floor early-stop fires at r1. Fix: tau=2.0.
The heal KL step applies the completion mask BEFORE log_softmax, so log-probs are computed only at completion positions (the full [B, L-1, ~262k] tensor OOMs on a 3090 at bs>1). Keep this regardless of dtype.
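
The arithmetic behind these gotchas, as a rough stdlib-only sketch. The helper names (barrier, barrier_grad, logprob_bytes) are hypothetical, and the sequence length, completion length, exact vocab size, and fp32 element size are assumptions picked for illustration, not values from the repo.

```python
def barrier(div: float, tau: float) -> float:
    """Hinge barrier relu(div - tau): zero until the divergence exceeds tau."""
    return max(div - tau, 0.0)


def barrier_grad(div: float, tau: float) -> float:
    """d/d(div) of relu(div - tau): 1 above tau, 0 at or below it."""
    return 1.0 if div > tau else 0.0


def logprob_bytes(positions: int, vocab: int, bytes_per_elem: int = 4) -> int:
    """Bytes for one [positions, vocab] log-prob tensor (fp32 by default)."""
    return positions * vocab * bytes_per_elem


OPERATING_KL = 3.0  # nats, the approximate figure quoted above

# tau above the operating KL: no penalty and no gradient, so nothing pulls back.
dead = (barrier(OPERATING_KL, tau=4.0), barrier_grad(OPERATING_KL, tau=4.0))
# tau below the operating KL: the barrier is active.
live = (barrier(OPERATING_KL, tau=2.0), barrier_grad(OPERATING_KL, tau=2.0))

# Effective batch size from the recommended settings.
train_bs, grad_accum = 3, 2
eff_bs = train_bs * grad_accum

# ASSUMPTIONS for the memory estimate: vocab of 262_144 (the "~262k" above),
# L = 1024 tokens per row, 64 completion tokens per row.
VOCAB, SEQ_LEN, COMPLETION_LEN = 262_144, 1024, 64
full = logprob_bytes(train_bs * (SEQ_LEN - 1), VOCAB)
masked = logprob_bytes(train_bs * COMPLETION_LEN, VOCAB)

GIB = 1024 ** 3
print(f"dead={dead} live={live} eff_bs={eff_bs}")
print(f"full: {full / GIB:.2f} GiB, masked: {masked / GIB:.2f} GiB per tensor")
```

Under these assumptions a single full log-prob tensor is about 3 GiB against under 0.2 GiB for the masked one, and the KL step needs one each for the student and the original plus gradient buffers, which is why masking first matters on a 24GB card.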