---
title: "Steer then heal: distilling an activation steering vector into LoRA weights without the incoherence"
format:
  neurips-pdf:
    keep-tex: true
  neurips-html: default
author:
  - name: wassname
    email: claude@wassname.org
abstract: |
  <!-- TODO: abstract. Draft once U2/U3 gates resolve. Structure (Heilmeier + Nature summary):
  (1) field: activation steering moves a trait at inference but leaks incoherence;
  (2) problem: distilling the steered teacher into weights bakes the incoherence in too;
  (3) what we do: distil the vector into a LoRA and heal the leaked incoherence with a
      KL-to-base barrier, looped over rounds, anchored to the round-0 original;
  (4) headline metric: coh_cost = |dCoh|/|dAuth|, coherence cost per nat of trait;
  (5) finding so far (provisional): in the coherent steering regime the trait is near-free
      (coherence pins ~1.0), and the round-0-anchored barrier STALLS the loop. Numbers TODO. -->
keywords: [activation steering, steering vectors, LoRA, knowledge distillation, coherence, moral foundations]
reference-section-title: References
bibliography: references.bib
---

> I get mocked for this, but I still believe that love will bring the end to war. Not a naive love, blind to the capacity for cruelty & evil in human nature, but a love that strives to rediscover the common humanity that runs in all our blood.
>
> --- Lex Fridman, [Instagram](https://www.instagram.com/p/COyEio3L52B/), 2021

> What role does love play in the human condition? We haven't brought up love in this whole picture. We talked about intelligence, we talked about consciousness. It seems part of humanity. I would say one of the most important parts is this feeling we have towards each other.
>
> --- Lex Fridman, to Eliezer Yudkowsky 3 h 18 min into Lex Fridman Podcast #368, "Dangers of AI and the End of Human Civilization" (03:18:03)

## Introduction {#sec-intro}

Activation steering moves a model toward a trait by adding a direction to its
residual stream at inference time [@turner2023activation; @panickssery2023caa; @zou2023representation].
The catch is that the same push that shifts the trait also degrades coherence,
and the effect lasts only as long as the hook is attached. @blank2026subliminal
show that training a student on a steered teacher's completions distils that
direction into the weights. We ask the follow-up question they leave open.
Baking the vector in also bakes in the incoherence it leaks, so can a
regulariser anchored to the original model heal that incoherence while keeping
the trait?

The trait here is "do not defer to authority", operationalised as care over
authority on the tinymfv moral-foundations forced-choice eval (the Authority and
Care/Harm foundations of @graham2009liberals; @graham2013mft). We report the
coherence cost per nat of trait, $\text{coh\_cost} = |\Delta\text{Coh}| / |\Delta\text{Auth}|$.

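As a minimal sketch of that ratio, the helper below divides the absolute coherence change by the absolute trait change. The function name, argument names, and readings are illustrative assumptions, not part of the pipeline, and the choice of the "before" reference point is left to the caller.

```python
def coh_cost(coh_before, coh_after, auth_before, auth_after, eps=1e-9):
    """Coherence change per nat of trait change: |dCoh| / |dAuth|.

    Hypothetical helper: all four inputs are plain floats supplied by the
    caller (coherence in [0, 1], trait readout in nats).
    """
    d_coh = coh_after - coh_before
    d_auth = auth_after - auth_before
    if abs(d_auth) < eps:
        # The ratio is undefined when the trait did not move at all.
        raise ValueError("trait did not move; coh_cost is undefined")
    return abs(d_coh) / abs(d_auth)


# Made-up readings: coherence drops by 0.05 while the trait moves by 0.5 nats.
example_cost = coh_cost(coh_before=0.99, coh_after=0.94,
                        auth_before=-1.0, auth_after=-1.5)
```

A cost near zero means the trait moved almost for free; a cost near one means each nat of trait was paid for with roughly one unit of coherence.
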
<!-- TODO: contributions list (paper-writing: end intro with bullet contributions).
Draft once results land. Candidate claims, all provisional:
- in the COHERENT steering regime the trait is near-free (coherence pinned ~1.0);
- the round-0-anchored barrier STALLS the loop (heal walks the trait back toward base);
- which regulariser (nll / kl_fwd / kl_rev / wd) wins, if any. -->

## Related work {#sec-related}

We build directly on @blank2026subliminal: same steering-vector-distillation
backbone, but we add an explicit coherence regulariser, measure coherence with
tinymfv, and iterate over rounds. @turner2023activation and @panickssery2023caa
extract and apply steering vectors but do not distil them into weights.
@fierro2025weight also edit weights toward a trait, but they take the direction
from the difference of two fine-tuned adapters rather than from an activation
steering vector, and they neither measure nor heal coherence. The healing
adapter is LoRA [@hu2021lora]; the gated-history accumulator that folds finished
rounds while keeping the base pristine is in the PiSSA / steering-lite lineage
[@meng2024pissa].

<!-- TODO: one paragraph sharpening "how we differ" per cited work (see spec.md). -->

## Method {#sec-method}

The pipeline is steer, distil, heal, loop, anchored to the round-0 original
throughout.

*Steer.* The teacher vector is the mean residual-stream shift from a neutral
system prompt to a trait system prompt, read at the assistant tag over a set of
diverse contexts, following @blank2026subliminal. Rather than calibrating a
single coefficient, we dose the vector by sweeping its coefficient over a small
set of scales and keeping the completions that pass a coherence and trait filter.

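The steer step can be sketched as a mean difference of activations followed by a coefficient sweep. This is a toy illustration on plain Python lists, assuming the residual-stream activations at the assistant tag have already been collected; `mean_difference_vector`, `steer`, and `sweep` are hypothetical names, and the sign convention (trait minus neutral) is an assumption.

```python
def mean_difference_vector(trait_acts, neutral_acts):
    """Mean over contexts of (trait activation - neutral activation).

    Each argument is a list of equal-length activation vectors, one per
    context, paired by index. Assumed inputs, not the paper's data format.
    """
    n = len(trait_acts)
    dim = len(trait_acts[0])
    return [
        sum(t[i] - u[i] for t, u in zip(trait_acts, neutral_acts)) / n
        for i in range(dim)
    ]


def steer(hidden, vector, coeff):
    """Add the scaled steering vector to one residual-stream state."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


def sweep(hidden, vector, coeffs):
    """Steered state for every coefficient in the sweep set."""
    return {c: steer(hidden, vector, c) for c in coeffs}


# Toy activations: two contexts, two hidden dimensions.
trait = [[1.0, 2.0], [3.0, 2.0]]
neutral = [[0.0, 2.0], [1.0, 1.0]]
vec = mean_difference_vector(trait, neutral)
steered = sweep([0.0, 0.0], vec, coeffs=[0.5, 1.0, 2.0])
```

In the pipeline the filter acts on generated completions rather than on hidden states, so the sweep here only shows how one vector yields a family of doses.
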
*Distil and heal.* We generate steered completions, filter out the incoherent
and trait-narrating ones, then train a LoRA on the survivors. One objective, one
constraint: SFT cross-entropy plus a barrier $\lambda \cdot \text{relu}(D - \tau)$
that is zero whenever the divergence $D$ to the original model is already within
the coherence region ($D \le \tau$). The choice of $D$ is the variable under
test, with four arms: `nll` (no barrier), `kl_fwd`, `kl_rev`, and adapter-only
weight decay `wd`.

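The objective can be written out on toy next-token distributions as follows. This is a sketch under stated assumptions: `kl_fwd` is taken to mean KL(base || student) and `kl_rev` to mean KL(student || base), which matches the usual mass-covering versus mode-seeking reading but is not confirmed by the text; the `nll` arm corresponds to dropping the barrier, and the `wd` arm is omitted. All function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; q is assumed strictly positive on p's support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def divergence(kind, base, student):
    """Divergence D between base and student next-token distributions."""
    if kind == "kl_fwd":  # assumed convention: KL(base || student)
        return kl(base, student)
    if kind == "kl_rev":  # assumed convention: KL(student || base)
        return kl(student, base)
    raise ValueError(f"unknown divergence kind: {kind}")


def barrier(d, tau, lam):
    """lambda * relu(D - tau): zero inside the budget, linear outside."""
    return lam * max(0.0, d - tau)


def total_loss(sft_ce, d, tau, lam):
    """SFT cross-entropy plus the one-sided barrier."""
    return sft_ce + barrier(d, tau, lam)


base = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
d_fwd = divergence("kl_fwd", base, student)
d_rev = divergence("kl_rev", base, student)
loss_inside = total_loss(sft_ce=1.5, d=0.05, tau=0.1, lam=10.0)
loss_outside = total_loss(sft_ce=1.5, d=0.30, tau=0.1, lam=10.0)
```

Because the barrier is one-sided, a student that stays within $\tau$ of the base sees the plain SFT gradient and only pays once it leaves the budget.
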
*Loop.* Each finished round folds into a frozen accumulator with its own gate.
With all gates off the model is the untouched base, which serves as both the KL
reference and the base eval; only the current round's adapter trains.

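One way to picture the gated accumulator is as a list of frozen per-round deltas stacked on an immutable base. The class below is a toy stand-in that uses a flat weight vector in place of the low-rank LoRA update; `GatedAccumulator` and its methods are hypothetical names, not the repository's implementation.

```python
class GatedAccumulator:
    """Frozen per-round deltas, each with its own gate, over a fixed base."""

    def __init__(self, base_weights):
        self.base = list(base_weights)  # never modified after construction
        self.rounds = []                # one entry per finished round

    def fold(self, delta):
        """Freeze a finished round's delta and switch its gate on."""
        self.rounds.append({"delta": list(delta), "gate": 1.0})

    def weights(self, current_delta=None, gates_on=True):
        """Effective weights: base + gated history + the trainable round."""
        w = list(self.base)
        if gates_on:
            for r in self.rounds:
                w = [wi + r["gate"] * di for wi, di in zip(w, r["delta"])]
        if current_delta is not None:
            w = [wi + di for wi, di in zip(w, current_delta)]
        return w


acc = GatedAccumulator([1.0, 2.0])
acc.fold([0.1, 0.0])   # round 0, now frozen
acc.fold([0.0, 0.2])   # round 1, now frozen
reference = acc.weights(gates_on=False)             # pristine base
student = acc.weights(current_delta=[0.05, 0.05])   # history + current round
```

Keeping the history as separate gated terms, rather than merging it into the base, is what lets the same object serve as both the student and the KL reference.
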
<!-- TODO: barrier figure or the BakedLoRA forward as a numbered equation; pseudopy block (see spec.md). -->

## Experimental setup {#sec-setup}

<!-- TODO: model (gemma-3-4b-it / Qwen3-4B), seeds, prompt set D, layer band,
n kept completions, epochs, LoRA rank, alpha sweep set, barrier tau/lambda.
Fill from config.py defaults at write time. -->

The headline trait metric is `auth_nats`, the log of tinymfv's marginal blame
mass on Authority (lower means more of the trait). Coherence is the tinymfv
`p_ans_any` answer mass. Surgicality is whether Authority moves while Care does
not.

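To make the readouts concrete, the sketch below computes a log-of-marginal metric and a surgicality check from toy numbers. It does not reproduce tinymfv's internals: it assumes the per-item blame mass on a foundation is already available as a probability, and the thresholds `move` and `hold` are illustrative placeholders.

```python
import math


def marginal_nats(masses):
    """Log of the mean per-item mass, i.e. the log of the marginal."""
    return math.log(sum(masses) / len(masses))


def is_surgical(d_auth, d_care, move=0.1, hold=0.05):
    """True when Authority moves by at least `move` nats while Care
    stays within `hold` nats. Both thresholds are hypothetical."""
    return abs(d_auth) >= move and abs(d_care) <= hold


# Toy per-item blame masses on Authority before and after an intervention.
auth_before = marginal_nats([0.30, 0.40, 0.20])
auth_after = marginal_nats([0.15, 0.20, 0.10])
surgical = is_surgical(d_auth=auth_after - auth_before, d_care=0.01)
```

Since the log is taken after averaging, this readout is never smaller than the average of the per-item logs (Jensen's inequality), so its magnitudes are not directly comparable to a per-item log-space metric.
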
## Results {#sec-results}

<!-- TODO: results. Reference the figures the pipeline already emits:
out/{ts}_{slug}/trajectory.png -- steer (red) -> heal (green) trait-coherence
pareto over rounds, plus the per-round auth_nats and coherence panels;
out/{ts}_{slug}/map.html -- the Care-vs-Authority interactive map, one node per round.
Headline table: coh_cost per arm. Do NOT fill numbers until a clean run lands. -->

::: {#fig-trajectory}
<!-- TODO: include out/{ts}_{slug}/trajectory.png once a representative run is chosen.
 -->

Steer (red) to heal (green) over rounds: trait (`auth_nats`, down is more trait)
against coherence (held near 1.0), with the trait-coherence pareto map. *Placeholder.*
:::

## Discussion {#sec-discussion}

<!-- TODO. Provisional read to develop once numbers land: in the coherent steering
regime the trait is near-free (coherence pinned ~1.0 across steer and heal), so
there is little incoherence left to heal. The round-0-anchored barrier then STALLS
the loop: anchored to base, heal walks the trait back toward base rather than
letting it accumulate across rounds. Whether kl_rev's mode-seeking suppression
helps here is the open question. -->

## Limitations {#sec-limitations}

<!-- TODO. Candidates: single model and single trait; auth_nats is a marginal
log-mass readout with a Jensen gap vs steering-lite's per-row delta-logit, so nat
magnitudes are provisional; coherence canary is necessary but not sufficient;
the round-0 anchor that resists drift is also what stalls accumulation. -->

## Conclusion {#sec-conclusion}

A reverse-KL barrier aggregated by RMSE over token positions keeps coherence flat
across 8 rounds of steer-distil-heal on gemma-3-4b-it, whereas the same barrier
aggregated by the mean collapses by round 7. Each heal step recovers the
incoherence introduced by that round's steered outputs while retaining the trait
direction: $|\Delta\text{Coh}| / |\Delta\text{Auth}|$ falls from 0.5--1.2 under
steering to near zero under healing. The loop saturates around round 4, not
because the barrier is too tight, but because the LoRA has exhausted the trait
shift achievable within the KL budget $\tau$ from base: it is free to find any
divergence-cheap direction, and it found none beyond that point. The maximum
extractable trait at this budget is +0.70 nats on the care axis (base $-1.30$,
peak $-0.60$ at round 4), with coherence held at 0.99 throughout.

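The difference between the two aggregations can be seen on a toy sequence of per-token KL values. This is an illustration of the arithmetic only, with made-up numbers and an arbitrary budget `tau`; it is one plausible reading of why the aggregations behave differently, not an analysis of the actual runs.

```python
import math


def mean_agg(per_token_kl):
    """Arithmetic mean of the per-token divergences."""
    return sum(per_token_kl) / len(per_token_kl)


def rms_agg(per_token_kl):
    """Root mean square of the per-token divergences."""
    return math.sqrt(sum(k * k for k in per_token_kl) / len(per_token_kl))


# Nine well-behaved positions and one position that diverges sharply.
per_token_kl = [0.01] * 9 + [1.0]
tau = 0.2  # illustrative budget, not the value used in the experiments

mean_d = mean_agg(per_token_kl)  # about 0.109, below tau
rms_d = rms_agg(per_token_kl)    # about 0.316, above tau

mean_barrier_active = mean_d > tau
rms_barrier_active = rms_d > tau
```

The root mean square is never smaller than the mean and weights a single large spike more heavily, so in this toy case the RMSE-aggregated barrier fires on the outlier position while the mean-aggregated barrier stays silent.
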