Files
persona-steering-template-l…/docs/literature/evidence.md
T
2026-06-13 10:05:35 +08:00

203 lines
10 KiB
Markdown

# What the literature says about the gotchas
Quote-anchored evidence for and against the SKILL.md gotchas, gathered
2026-06-11. Full texts cached in `~/Documents/papers/steering/`. Note the
entanglement: the reliability line (Tan -> Braun -> Da Silva -> Pres) shares
Tan as intellectual origin and partly the same model-written-evals datasets, so
it counts as less than four independent observations. The persona line
(persona_vectors -> assistant-axis) shares Anthropic/Lindsey lineage.
## Claim: contrastive pairs pick up whatever co-varies, not the trait
Verdict: literature supports the mechanism. The documented spurious factors are
answer position, token choice, and topic contamination; the specific claim that
RLHF tics (verbosity, confidence, sycophancy) dominate is wassname's
extrapolation, not yet measured by anyone.
### Analysing the Generalisation and Reliability of Steering Vectors, Tan et al, NeurIPS 2024, [arXiv:2407.12404](https://arxiv.org/abs/2407.12404)
epistemic context: peer reviewed, the canonical steering-reliability citation;
later reliability papers use it as their baseline.
> In-distribution, steerability is highly variable across different inputs.
> Depending on the concept, spurious biases can substantially contribute to how
> effective steering is for each input, presenting a challenge for the
> widespread use of steering vectors.
> The lack of steerability and high variance in steering performance
> demonstrates that in many cases, a steering vector extracted may not
> correspond to the intended concept, and applying steering vectors may only be
> effective in the presence of spurious factors associated with the prompt
> template or other potential biases.
### Understanding (Un)Reliability of Steering Vectors in Language Models, Braun et al, [arXiv:2505.22637](https://arxiv.org/abs/2505.22637)
epistemic context: preprint, Krueger-lab follow-up to Tan; same CAA method and
datasets, so correlated with the above.
> All seven prompt types used in our experiments produce a net positive
> steering effect, but exhibit high variance across samples, and often give an
> effect opposite of the desired one. [...] Our results suggest that vector
> steering is unreliable when the target behavior is not represented by a
> coherent direction.
### On the Non-Identifiability of Steering Vectors, [arXiv:2602.06801](https://arxiv.org/abs/2602.06801)
epistemic context: recent preprint, no adoption signal yet. If it holds,
behavioral tests alone cannot tell you which direction you extracted, which is
why SKILL.md step 3 says to read generations rather than trust the label.
> We show that, under white-box single-layer access, steering vectors are
> fundamentally non-identifiable due to large equivalence classes of
> behaviorally indistinguishable interventions.
### repeng author's own account, Theia Vogel, [vgel.me/posts/representation-engineering](https://vgel.me/posts/representation-engineering/)
epistemic context: practitioner self-report from the most-used control-vector
library; the source of the closely-opposite-phrasing rule.
> the odd phrasing here is to keep the phrases as parallel as possible. for
> example, just "sober" instead of "sober from..." conflates the vector with
> alcohol
> I especially challenge someone to find a "self-awareness" vector that isn't
> contaminated by mental health / human emotion!
Related design choice in CAA ([arXiv:2312.06681](https://arxiv.org/abs/2312.06681),
ACL 2024): their multiple-choice format makes contrastive prompts "differ by
only a single token", the tightest possible version of confound matching.
## Claim: steering is context-dependent; extract in-distribution
Verdict: brittleness to prompt and distribution shift is well documented. The
specific sub-claim about think-token transfer (vectors from non-thinking text
barely move behavior inside `<think>`, and vice versa) is literature-silent as
of 2026-06: nothing found either way, so it stays `[in-house, anecdotal]`.
### Tan et al again, [arXiv:2407.12404](https://arxiv.org/abs/2407.12404)
> Out-of-distribution, while steering vectors often generalise well, for
> several concepts they are brittle to reasonable changes in the prompt,
> resulting in them failing to generalise well.
> Many datasets have a high variation in per-sample steerability, and several
> datasets produce the opposite behaviour for almost 50% of inputs.
"Anti-steerable" is their term for the anti-correlation wassname saw on coding
prompts, though they measure prompt shifts, not chat-vs-code.
### Steering off Course, Da Silva et al, ACL 2025, [aclanthology.org/2025.acl-long.974](https://aclanthology.org/2025.acl-long.974/)
epistemic context: peer reviewed, 36 models across 14 families; covers DoLa,
function vectors, task vectors, so cross-model rather than cross-domain
brittleness.
> Our experiments reveal substantial variability in the effectiveness of the
> steering approaches, with a large number of models showing no improvement and
> at times degradation in steering performance.
### The Assistant Axis, Lu et al, [arXiv:2601.10387](https://arxiv.org/abs/2601.10387)
epistemic context: Anthropic/MATS preprint, Lindsey lineage shared with
persona_vectors. Shows persona activations themselves are domain-dependent,
which is the mechanism behind context-dependent steering, though they measure
drift rather than steering transfer.
> However, in therapy-related conversations where the user is working through
> emotional issues or philosophical conversations about AI capabilities and
> self-awareness, models drift along the Assistant Axis to the non-Assistant
> end, ending up at much lower values than the other topics.
### Understanding Reasoning in Thinking LLMs via Steering Vectors, Venhoff et al, [arXiv:2506.18167](https://arxiv.org/abs/2506.18167)
epistemic context: Nanda-group preprint. Does not test chat-to-CoT transfer,
but their design choice (extract at annotated reasoning-trace token positions
inside the CoT) is what you'd do if you believed the in-house observation.
> Thinking models generate substantially longer responses (27.6 vs 14.4
> sentences on average) and exhibit a higher fractions of backtracking,
> uncertainty estimation and example testing behaviors
## Claim: "pretend you are X" extraction works; the scaffold doesn't leak
Verdict: the methodology is the published recipe (system-prompt role-play,
extraction at response tokens), so it is validated by use. Whether the pretend
frame ever leaks into steered behavior is untested; nobody looked
systematically. Its absence is also expected arithmetic (the scaffold sits on
both sides of the contrast, and response-token extraction skips the scaffold
tokens entirely), so it is weak evidence about what steering operates on.
### Persona Vectors, Chen et al, [arXiv:2507.21509](https://arxiv.org/abs/2507.21509)
> Each pair consists of a positive system prompt designed to elicit the target
> trait behavior, and a negative system prompt intended to suppress it.
> We found that response tokens yield more effective steering directions than
> alternative positions such as prompt tokens (see Appendix A.3).
> Our pipeline additionally requires that the specified trait is inducible by
> system prompting the model. [...] models with more robust safety mechanisms
> may refuse to be evil, even when instructed to do so.
### vgel again, on behavior vs concept
epistemic context: single practitioner anecdote, but it is the closest thing
to a direct test in either direction.
> When used with the prompt below, the honesty vector doesn't change the
> model's behavior—instead, it changes the model's judgment of someone else's
> behavior! This is the same honesty vector as before—generated by asking the
> model to act honest or untruthful!
So at least sometimes the vector encodes the concept, applied to whoever is
salient, not first-person behavior. This cuts against reading the no-leak
observation as "steering changes behavior not concepts".
## Claim: persona steering bypasses refusals
Verdict: strongly supported, the best-evidenced claim in this skill. Refusal
lives in a small shared subspace that almost any intervention perturbs.
### Refusal in Language Models Is Mediated by a Single Direction, Arditi et al, NeurIPS 2024, [arXiv:2406.11717](https://arxiv.org/abs/2406.11717)
epistemic context: heavily cited; the open-weight community's "abliteration"
descends from it.
> we find a single direction such that erasing this direction from the model's
> residual stream activations prevents it from refusing harmful instructions,
> while adding this direction elicits refusal on even harmless instructions.
### The Rogue Scalpel, [arXiv:2509.22067](https://arxiv.org/abs/2509.22067)
> even steering in a random direction can increase the probability of harmful
> compliance from 0% to 2-27%. Alarmingly, steering benign features from a
> sparse autoencoder (SAE), a common source of interpretable directions,
> increases these rates by a further 2-4%.
### Analysing the Safety Pitfalls of Steering Vectors, Li et al, [arXiv:2603.24543](https://arxiv.org/html/2603.24543)
epistemic context: preprint; supplies the mechanism (cosine overlap between
arbitrary trait vectors and the Arditi refusal direction).
> steering the model in specific directions can drastically increase (up to
> 57%) or decrease (up to 50%) its attack success rate (ASR), depending on the
> targeted behavior. We attribute this phenomenon to the overlap between the
> steering vectors and the latent directions of refusal behavior.
### Persona modulation at the prompt level, Shah et al, [arXiv:2311.03348](https://arxiv.org/abs/2311.03348)
> These automated attacks achieve a harmful completion rate of 42.5% in GPT-4,
> which is 185 times larger than before modulation (0.23%).
### Assistant Axis again, bidirectional
> We steered the models towards the Assistant direction of the Assistant Axis
> and found that it significantly decreased the rate of harmful responses [...]
> We also tried steering away from the Assistant direction and found that the
> jailbreak rate increased slightly
Related: Persona Features Control Emergent Misalignment, OpenAI, ICLR 2026
([arXiv:2506.19823](https://arxiv.org/abs/2506.19823)): a "toxic persona"
feature predicts and controls emergent misalignment; independent of the
Anthropic lineage.