mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 16:47:16 +08:00
7db5a56cb1
docs/writeup/paper.qmd (2pp NeurIPS), references.bib, neurips_2023.sty, the quarto _extensions. justfile gains `paper` (latex) and `paper-html` (no latex) recipes. gitignore the generated paper.pdf/paper.tex and the transient .claude/. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
133 lines
6.7 KiB
Plaintext
---
title: "Steer then heal: distilling an activation steering vector into LoRA weights without the incoherence"
format:
  neurips-pdf:
    keep-tex: true
  neurips-html: default
author:
  - name: wassname
    email: claude@wassname.org
abstract: |
  <!-- TODO: abstract. Draft once U2/U3 gates resolve. Structure (Heilmeier + Nature summary):
  (1) field: activation steering moves a trait at inference but leaks incoherence;
  (2) problem: distilling the steered teacher into weights bakes the incoherence in too;
  (3) what we do: distil the vector into a LoRA and heal the leaked incoherence with a
      KL-to-base barrier, looped over rounds, anchored to the round-0 original;
  (4) headline metric: coh_cost = |dCoh|/|dAuth|, coherence cost per nat of trait;
  (5) finding so far (provisional): in the coherent steering regime the trait is near-free
      (coherence pins ~1.0), and the round-0-anchored barrier STALLS the loop. Numbers TODO. -->
keywords: [activation steering, steering vectors, LoRA, knowledge distillation, coherence, moral foundations]
reference-section-title: References
bibliography: references.bib
---

## Introduction {#sec-intro}

Activation steering moves a model toward a trait by adding a direction to its
residual stream at inference time [@turner2023activation; @panickssery2023caa; @zou2023representation].
The catch is that the same push that shifts the trait also degrades coherence,
and the effect lives only as long as the hook is attached. @blank2026subliminal
show that training a student on a steered teacher's completions distils that
direction into the weights. We ask the follow-up question they leave open: when
you bake the vector in, you also bake in the incoherence it leaks, so can a
regulariser anchored to the original model heal that incoherence while keeping
the trait?

The trait here is "do not defer to authority", operationalised as care over
authority on the tinymfv moral-foundations forced-choice eval (the Authority and
Care/Harm foundations of @graham2009liberals; @graham2013mft). We report the
coherence cost per nat of trait, $\text{coh\_cost} = |\Delta\text{Coh}| / |\Delta\text{Auth}|$.

<!-- TODO: contributions list (paper-writing: end intro with bullet contributions).
Draft once results land. Candidate claims, all provisional:
- in the COHERENT steering regime the trait is near-free (coherence pinned ~1.0);
- the round-0-anchored barrier STALLS the loop (heal walks the trait back toward base);
- which regulariser (nll / kl_fwd / kl_rev / wd) wins, if any. -->

## Related work {#sec-related}

We build directly on @blank2026subliminal: same steering-vector-distillation
backbone, but we add an explicit coherence regulariser, measure coherence with
tinymfv, and iterate over rounds. @turner2023activation and @panickssery2023caa
extract and apply steering vectors but do not distil them into weights.
@fierro2025weight also edit weights toward a trait, but they take the direction
from the difference of two fine-tuned adapters rather than from an activation
steering vector, and they neither measure nor heal coherence. The healing
adapter is LoRA [@hu2021lora]; the gated-history accumulator that folds finished
rounds while keeping the base pristine is in the PiSSA / steering-lite lineage
[@meng2024pissa].

<!-- TODO: one paragraph sharpening "how we differ" per cited work (see spec.md). -->

## Method {#sec-method}

The pipeline is steer, distil, heal, loop, anchored to the round-0 original
throughout.

*Steer.* The teacher vector is the mean residual-stream difference between a
trait system prompt and a neutral one, read at the assistant tag over a set of
diverse contexts, following @blank2026subliminal. We dose it by sweeping the
coefficient over a small set of scales and keeping the completions that pass a
coherence and trait filter, rather than calibrating a single coefficient.

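The steer step can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: activations are plain lists of floats, the sign convention (trait minus neutral) is our assumption, and `generate` and `passes_filter` are hypothetical callables standing in for steered generation and the coherence and trait filter.

```python
from statistics import fmean


def mean_difference_vector(trait_acts, neutral_acts):
    """Mean over contexts of (trait - neutral) activations.

    Each argument is a list with one activation vector per context, read at
    the same layer and token position. The trait-minus-neutral sign is an
    assumption of this sketch.
    """
    if not trait_acts or len(trait_acts) != len(neutral_acts):
        raise ValueError("need one trait and one neutral activation per context")
    dim = len(trait_acts[0])
    return [
        fmean(t[j] - n[j] for t, n in zip(trait_acts, neutral_acts))
        for j in range(dim)
    ]


def steer(hidden, vector, coeff):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


def sweep_and_filter(coeffs, generate, passes_filter):
    """Dose over several coefficients and keep only completions that pass.

    `generate(coeff)` and `passes_filter(completion)` are hypothetical hooks.
    """
    kept = []
    for c in coeffs:
        completion = generate(c)
        if passes_filter(completion):
            kept.append((c, completion))
    return kept
```

Sweeping and filtering means no single coefficient has to be calibrated in advance: scales that push the model into incoherence simply contribute no training data.
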
*Distil and heal.* We generate steered completions, filter the incoherent and
trait-narrating ones, then train a LoRA on the survivors. One objective, one
constraint: SFT cross-entropy plus a barrier $\lambda \cdot \text{relu}(D - \tau)$
that is zero whenever the divergence $D$ to the original model is already
within the coherence region ($D \le \tau$). The choice of $D$ is the variable
under test, with four arms: `nll` (no barrier), `kl_fwd`, `kl_rev`, and
adapter-only weight decay `wd`.

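The objective above can be written out for a single token position as follows. This is a sketch under stated assumptions: we take `kl_fwd` to mean KL(base || student) and `kl_rev` to mean KL(student || base), distributions are plain lists of probabilities, and the `wd` arm is left out because the text does not specify how it enters the barrier. The function names are ours.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def barrier(divergence, tau, lam):
    """lambda * relu(D - tau): zero inside the coherence region D <= tau."""
    return lam * max(0.0, divergence - tau)


def heal_loss(sft_ce, base_probs, student_probs, tau, lam, mode="kl_fwd"):
    """SFT cross-entropy plus the barrier on divergence to the base model."""
    if mode == "nll":
        return sft_ce  # no barrier: plain SFT
    if mode == "kl_fwd":
        d = kl_divergence(base_probs, student_probs)  # assumed direction
    elif mode == "kl_rev":
        d = kl_divergence(student_probs, base_probs)  # assumed direction
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sft_ce + barrier(d, tau, lam)
```

Because the barrier is a hinge rather than a penalty on all of $D$, the regulariser contributes no gradient while the student stays within $\tau$ of the base, so inside that region training is plain SFT.
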
*Loop.* Each finished round folds into a frozen accumulator with its own gate,
so the base stays recoverable (all gates off) as both the KL reference and the
base eval, while only the current round trains.

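One way to picture the accumulator is the toy below. It is a sketch only: flat lists of floats stand in for the low-rank LoRA updates, and the class and method names are hypothetical rather than taken from the codebase.

```python
class GatedAccumulator:
    """Toy gated history of per-round weight deltas on top of a fixed base."""

    def __init__(self, base):
        self.base = list(base)              # never mutated
        self.history = []                   # frozen deltas from finished rounds
        self.gates = []                     # one gate per folded round
        self.current = [0.0] * len(base)    # the only part that trains

    def fold(self):
        """Freeze the current round into history and start a fresh one."""
        self.history.append(list(self.current))
        self.gates.append(1.0)
        self.current = [0.0] * len(self.base)

    def effective(self, gates_on=True, include_current=True):
        """Weights seen by the forward pass under the chosen gating."""
        w = list(self.base)
        if gates_on:
            for gate, delta in zip(self.gates, self.history):
                w = [wi + gate * di for wi, di in zip(w, delta)]
        if include_current:
            w = [wi + ci for wi, ci in zip(w, self.current)]
        return w

    def base_view(self):
        """All gates off and no current round: the round-0 original."""
        return self.effective(gates_on=False, include_current=False)
```

Keeping the deltas separate instead of merging them into the base weights is what lets the same module serve as student, KL reference, and base-eval model.
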
<!-- TODO: barrier figure or the BakedLoRA forward as a numbered equation; pseudopy block (see spec.md). -->

## Experimental setup {#sec-setup}

<!-- TODO: model (gemma-3-4b-it / Qwen3-4B), seeds, prompt set D, layer band,
n kept completions, epochs, LoRA rank, alpha sweep set, barrier tau/lambda.
Fill from config.py defaults at write time. -->

The headline trait metric is `auth_nats`, the log of tinymfv's marginal blame
mass on Authority (lower means more of the trait). Coherence is the tinymfv
`p_ans_any` answer mass. Surgicality is whether Authority moves while Care does
not.

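The readouts can be sketched as below. We assume here that "marginal" means averaging the per-item blame mass before taking the log, and that coherence is the mean `p_ans_any` over items; the function signatures are ours, not tinymfv's API.

```python
import math
from statistics import fmean


def auth_nats(p_authority):
    """Log of the mean per-item blame mass on Authority (assumed marginal)."""
    return math.log(fmean(p_authority))


def coherence(p_ans_any):
    """Mean answer mass over items (assumed aggregation)."""
    return fmean(p_ans_any)


def coh_cost(d_coh, d_auth):
    """Coherence cost per nat of trait: |dCoh| / |dAuth|."""
    if d_auth == 0:
        raise ValueError("trait did not move, so coh_cost is undefined")
    return abs(d_coh) / abs(d_auth)
```

For example, a run that lowers `auth_nats` by 0.5 while coherence drops by 0.02 has a `coh_cost` of about 0.04. Taking the log of the mean rather than the mean of the logs gives a value that is never smaller, by Jensen's inequality, so the two readouts are not interchangeable.
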
## Results {#sec-results}

<!-- TODO: results. Reference the figures the pipeline already emits:
out/{ts}_{slug}/trajectory.png -- steer (red) -> heal (green) trait-coherence
pareto over rounds, plus the per-round auth_nats and coherence panels;
out/{ts}_{slug}/map.html -- the Care-vs-Authority interactive map, one node per round.
Headline table: coh_cost per arm. Do NOT fill numbers until a clean run lands. -->

::: {#fig-trajectory}
<!-- TODO: include out/{ts}_{slug}/trajectory.png once a representative run is chosen.
 -->

Steer (red) to heal (green) over rounds: trait (`auth_nats`, down is more trait)
against coherence (held near 1.0), with the trait-coherence pareto map. *Placeholder.*
:::

## Discussion {#sec-discussion}

<!-- TODO. Provisional read to develop once numbers land: in the coherent steering
regime the trait is near-free (coherence pinned ~1.0 across steer and heal), so
there is little incoherence left to heal. The round-0-anchored barrier then STALLS
the loop: anchored to base, heal walks the trait back toward base rather than
letting it accumulate across rounds. Whether kl_rev's mode-seeking suppression
helps here is the open question. -->

## Limitations {#sec-limitations}

<!-- TODO. Candidates: single model and single trait; auth_nats is a marginal
log-mass readout with a Jensen gap vs steering-lite's per-row delta-logit, so nat
magnitudes are provisional; coherence canary is necessary but not sufficient;
the round-0 anchor that resists drift is also what stalls accumulation. -->

## Conclusion {#sec-conclusion}

<!-- TODO. Write last. State what the steer-distil-heal-loop does and does not buy
over plain SFT at matched coherence, in one or two sentences, no overclaim. -->