
Title: Towards Understanding Steering Strength

URL Source: https://arxiv.org/pdf/2602.02712v1

Published Time: Wed, 04 Feb 2026 01:10:07 GMT

Number of Pages: 33


Towards Understanding Steering Strength

Magamed Taimeskhanov 1 Samuel Vaiter 2 Damien Garreau 1

Abstract

A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations. Namely, one identifies a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, even though its importance is clear: too little and the intended behavior does not emerge, too much and the model’s performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next-token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.

1. Introduction

Deploying LLMs in the wild raises challenges, chief among them ensuring they are both useful and harmless (Bai et al., 2022). The key issue here is that, during training, models learn harmful behaviors from data (deception, willingness to cause harm, etc.) which we have no trivial way of identifying and controlling. As illustrated in Fig. 1, a user may query an LLM about executing a harmful command. Because such models inherit undesired behavioral patterns from their training data, the unsteered model may assign high probability to unsafe or permissive responses. It is widely hypothesized (Mikolov et al., 2013; Bolukbasi et al., 2016; Elhage et al., 2022; Nanda et al., 2023; Park et al., 2024) that LLMs encode high-level concepts as linear

1 University of Würzburg, Center for Artificial Intelligence and Data Science. 2 Université Côte d’Azur - CNRS. Correspondence to: Magamed Taimeskhanov <magamed.taimeskhanov@uni-wuerzburg.de>.

Preprint. February 4, 2026.

[Figure 1: schematic. Top: contrastive prompts (e.g., “rm -rf /* will result in permanent data loss!” and “run rm -rf /* to update your system”) are passed through the L transformer blocks of an LLM; the token representations h−/h+ at the output of the ℓth block are averaged into μ− and μ+, and v := μ+ − μ−. Bottom: for the input prompt “Can I run ‘sudo bash free vpn.sh‘?”, +αv is added at the ℓth block and the model answers “No!”.]

Figure 1. Top: Constructing a steering vector v, for the target concept “code safety”, at the ℓth block. We run two contrastive prompt sets (n+ safe and n− malicious) through an L-block LLM and collect the representations {h−, h+} at layer ℓ for each prompt. Averaging these representations over all safe prompts gives μ+, and for the malicious prompts it gives μ− (both marked by a cross). We then define v := μ+ − μ−. Bottom: Steering the model’s response toward safe behavior on a new prompt is done by adding αv to the residual stream h at the ℓth block. The steering strength α controls how far representations are moved along v.

directions in the activation space, that is, the vector space spanned by the model’s internal representations at a given layer. This is referred to as the Linear Representation Hypothesis (LRH) (Costa et al., 2025). Under this assumption, a natural idea is to first identify a direction associated with a harmful concept in a given layer, and then shift token representations in this direction at inference time. Formally, let us call v ∈ R^d the steering vector. The token representations (residual stream) h are shifted according to

$$h \leftarrow h + \alpha v , \tag{1}$$

where α ∈ R is the steering strength (see Fig. 1). This methodology has been successfully applied to a range of settings, including refusal (Arditi et al., 2024), hallucination reduction (Su et al., 2025), and sycophancy (Min et al., 2025). It also compares favorably to competing approaches (Wu et al., 2025). Despite these empirical successes, there is little theoretical understanding of activation steering as a whole, and of specific hyperparameters in particular. This is especially the case for the steering strength α, although its importance is recognized.

arXiv:2602.02712v1 [cs.LG] 2 Feb 2026

As a starting point, we ask the following question:

How does the steering strength α control the trade-off between steering efficacy and distortion in next-token prediction?

In this paper, we address this question from a theoretical perspective by analyzing steering with a difference of means steering vector v (see Fig. 1). The main tool is a simplified transformer model studied in Zhao et al. (2024).

Our contributions are: (1) the characterization of how the steering strength α affects next-token probabilities, concept probability, and cross-entropy (Thm. 3.3, 3.6, 3.8); (2) the formalization of the steering setup used in our experiments and the derivation of the large-α limit of next-token probabilities for a transformer (Prop. 4.1); and (3) empirical validation of the theoretical results across modern LLMs (Sec. 5). The code for all experiments is available online (footnote 1).

Related work. Turner et al. (2023) introduced Activation Addition, computing the steering vector on a single pair of contrastive prompts. Rimsky et al. (2024) extended this methodology to difference of means, i.e., manually crafting several prompts instead of a single pair and computing the average difference vector. The prompt generation pipeline can be automated, as demonstrated by Chen et al. (2025) with persona vectors. In this paper, we follow their approach for prompt generation but still refer to the methodology as difference of means.

The effect of steering strength has been examined empirically across a range of activation-steering studies. Turner et al. (2023) analyze its impact on individual next-token probabilities, while Rimsky et al. (2024) study the probability of eliciting target behaviors. Similarly, Von Rütte et al. (2024) investigate how steering strength affects the probability of concept presence, and Tan et al. (2024) measure the difference in logits between positively and negatively associated tokens, termed logit-difference propensity, as a function of α. Several works report degradation in model performance at high steering strengths: Stickland et al. (2024), for instance, observe that large values of α can harm performance, in some cases by an amount roughly equivalent to halving pre-training compute. More recently, Wu et al. (2025) examine the dependence of a steering score on α, and Chen et al. (2025) analyze its effect on trait expression. Apart from these empirical observations, there are few theoretical characterizations of how these quantities evolve as functions of the steering strength. A notable exception is Park et al. (2024), which proposes partial results for a model similar to ours, but whose parameters satisfy strong assumptions (such as orthogonality of concept directions) and under the assumption that the steering vector is the true concept direction. Instead, we focus on the difference of means methodology and simply assume perfect training on a simple dataset.

Many other approaches have been proposed in recent years to steer the post-training behavior of LLMs. Notably, Bricken et al. (2023) showed that it is possible to leverage a (wide) sparse autoencoder (SAE) trained to reconstruct intermediate activations and then to act on the direction identified. Specifically, forcing (“clamping”) the coefficient associated with a specific concept could steer the model’s behavior in that direction. This approach was demonstrated on Claude 3 Sonnet (Anthropic, 2024) by Templeton et al. (2024). We note that training SAEs in this context is challenging (Gao et al., 2024). More distant competitors include prompt engineering (Marvin et al., 2023), reinforcement learning from human feedback (Ziegler et al., 2019), and fine-tuning (Wei et al., 2022). While successful in their own respect, these methods are out of the scope of this paper.

1 https://github.com/MagamedT/steering

2. Theoretical framework

We start by describing the theoretical framework in which we prove our main results: a dataset where high-level concepts are subsets of the vocabulary, and a theoretically tractable transformer model from Zhao et al. (2024).

2.1. Data and concepts

Setting. We consider a vocabulary with V tokens, which we identify with token indices [V] := {1, . . . , V}. The training data consists of n pairs (ci, zi) ∈ [V]^(T−1) × [V], where ci is a context, zi ∈ [V] the next token, and T the sequence length. For any set A, we let |A| be its cardinality and A∁ its complement.

Concepts. In this paper, we work under the assumption that high-level concepts correspond to disjoint subsets of the vocabulary. Formally, we partition [V] into G ∈ N⋆ disjoint sets Ck ⊂ [V], where each Ck groups the s := V/G tokens associated with the same concept (assuming G divides V). As an example, we consider the following vocabulary of size V = 9, partitioned into three concepts:

$$\{a, b, c, A, B, C, \alpha, \beta, \gamma\} = \mathcal{C}_1 \cup \mathcal{C}_2 \cup \mathcal{C}_3 . \tag{2}$$

To simplify the derivations, we assume that a context can only contain tokens from a single concept, while allowing the next token z to belong to a different concept. Thus, in our example, contexts may take the form

$$c_1 = \mathrm{ABB} \in \mathcal{C}_2 \quad \text{or} \quad c_2 = \mathrm{aab} \in \mathcal{C}_1 .$$

[Figure 2: plot of next-token probability (vertical axis, tick at 0.3) against the next token z ∈ {a, b, c, A, B, C, α, β, γ} (horizontal axis).]

Figure 2. Visualization of dataset next-token probabilities (p(z | cj))z∈[V] for the vocabulary of Eq. (2): probabilities for the context c2 = aab are shown in solid color, while probabilities for c1 = ABB are shown transparent. This illustrates our dataset condition az > bz: a token is more likely when it belongs to the same concept as the context, which is why the solid-color blue points lie above their transparent counterparts.

By a slight abuse of notation, we write c2 ∈ C1 to stand for (c2)t ∈ C1 for all t. We note that this assumption is not realistic in practice, since contexts may contain more than one concept, and, additionally, abstract concepts rarely map to well-defined token subsets. Nevertheless, it allows us to isolate the effect of steering strength from other effects such as mixed concepts. With this in mind, we define the dataset next-token probabilities as follows:

Definition 2.1 (Dataset next-token probabilities). Given a context cj and a token z ∈ [V], we define the probability p(z | cj) of z given the context cj as

$$p(z \mid c_j) := \frac{1}{|\{ i \in [n] : c_i = c_j \}|} \sum_{i \in [n] :\, c_i = c_j} \mathbb{1}_{z = z_i} .$$
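As an illustration of Def. 2.1, the empirical probabilities can be obtained by counting. The following is a minimal sketch; the helper name `next_token_probs` and the toy pairs are hypothetical and only serve the example.

```python
from collections import Counter, defaultdict


def next_token_probs(pairs):
    """Empirical p(z | c): among occurrences of context c, the fraction followed by z."""
    counts = defaultdict(Counter)
    for context, z in pairs:
        counts[context][z] += 1
    return {
        c: {z: k / sum(ctr.values()) for z, k in ctr.items()}
        for c, ctr in counts.items()
    }


# Hypothetical toy dataset over the vocabulary of Eq. (2): (context, next token) pairs.
pairs = [
    ("aab", "a"), ("aab", "a"), ("aab", "b"), ("aab", "A"),
    ("ABB", "B"), ("ABB", "a"),
]
probs = next_token_probs(pairs)
print(probs["aab"])
```

Tokens that never follow a context are simply absent from its dictionary, which corresponds to probability zero; note that Assumption 1 below excludes this case by requiring strictly positive probabilities.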

We impose the following restriction on p(z | cj ):

Assumption 1 (Dependence and concept association). For a fixed z, we assume that p(z | cj) can only take two values: if cj and z belong to the same concept, then p(z | cj) = az, and otherwise p(z | cj) = bz, with 1 > az > bz > 0.

Simply put, the next-token probabilities p(z | cj) depend on the contexts only through their concepts, and not on the specific tokens composing cj: if z belongs to the same concept as cj, then the probability of observing z after cj in the training data is az, and bz otherwise. We additionally require that az > bz, meaning that a token is more likely to follow a context from its own concept than a context from another concept. For instance, the token e should be more likely after the lowercase context languag than after the uppercase one LANGUAG. We refer to Fig. 2 for an illustration. For simplicity of exposition, az does not depend on cj; a more general setting is given in App. B.1.

2.2. Model and activation steering

We study activation steering on a model widely used in the neural collapse literature, the Unconstrained Features Model (UFM, Def. 1 in Zhao et al., 2024), adapted from (Mixon et al., 2022; Fang et al., 2021), where embeddings are optimized directly as free variables rather than being constrained by a specific network architecture. Recall that {(ci, zi)}i∈[n] is the dataset of Sec. 2.1. We let {cj}j∈[m] ⊂ {ci}i∈[n] denote the m distinct contexts (i.e., we keep one copy of each unique context and index them by j ∈ [m]). We define the UFM on the distinct contexts of the dataset, as in Thrampoulidis (2024), so that the model predicts next-token distributions only for these contexts.

Definition 2.2 (Unconstrained Features Model). The UFM fθ : {cj}j∈[m] → R^V with parameters θ = (W, H) is defined as

$$f_\theta(c_j) := W h_j ,$$

where W ∈ R^(V×d) is the decoder matrix, and H := (h1, . . . , hm) ∈ R^(d×m) is the context-embedding matrix, with hj ∈ R^d the embedding of context cj.

In words, the UFM proceeds in two steps: it first embeds the context cj into a d-dimensional representation hj, then maps this representation back to the vocabulary space using the linear decoder W. Applying a softmax on f(cj) yields the next-token distribution for cj. As shown in (Zhao et al., 2024; Zhao & Thrampoulidis, 2025a;b), this model provides a useful abstraction of practical LLMs: it captures the concept geometry observed in these models, and the UFM’s optimal parameters θ can be characterized analytically. The idea behind this abstraction is that LLMs are sufficiently expressive to fit any training distribution; accordingly, we treat the embeddings as free parameters.

Training. For any a ∈ R^V and z ∈ [V], σz(a) denotes the z-th entry of the softmax of a, that is,

$$\sigma_z(a) := \frac{e^{a_z}}{\sum_{z' \in [V]} e^{a_{z'}}} .$$

We train fθ to predict the next token z in our data {(ci, zi)}i∈[n] by minimizing over θ the (unregularized) empirical cross-entropy loss

$$\mathrm{CE}(f_\theta) := -\frac{1}{n} \sum_{i \in [n]} \log\big(\sigma_{z_i}(f_\theta(c_i))\big) .$$

From now on, we assume that the model is trained and write f instead of fθ .

Difference-of-means. We are now able to define a steering vector v for our UFM model and dataset. Let T = Ck denote the target concept we aim to steer. Given the m distinct contexts {cj}j∈[m], we define two index sets: P ⊂ [m] indexes “positive” contexts that belong to the concept T we want to steer toward, while N ⊂ [m] indexes “negative” contexts that do not belong to it. We assume that P ∩ N = ∅ and that both sets have the same size |P| = |N| = q; we do not require P ∪ N = [m]. In our notation, difference of means yields the steering vector

$$v := \frac{1}{|P|} \sum_{j \in P} h_j - \frac{1}{|N|} \sum_{j \in N} h_j \in \mathbb{R}^d . \tag{3}$$

Two common choices for what should be defined as non-concept contexts lead to two corresponding constructions of N. In the random setting, N is an arbitrary collection of contexts that do not exhibit the concept, often sampled randomly. In the contrastive setting, N collects contexts expressing the opposite (or negated) concept of Ck. As an example, using the vocabulary from Eq. (2), where uppercase letters represent the opposite concept of lowercase letters, we take the following sets to build the steering vector v:

$$\begin{aligned}
P &:= \{\mathrm{aab}, \mathrm{bba}, \mathrm{acc}, \mathrm{cca}\} , \\
N_{\mathrm{contrastive}} &:= \{\mathrm{ABB}, \mathrm{AAB}, \mathrm{CAC}, \mathrm{CBA}\} , \\
N_{\mathrm{random}} &:= \{\mathrm{ABB}, \alpha\beta\gamma, \gamma\beta\gamma, \mathrm{BAB}\} .
\end{aligned}$$

Using (P, Ncontrastive) (resp. (P, Nrandom)) corresponds to the contrastive setting (resp. random setting).
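The difference-of-means construction of Eq. (3) amounts to two column averages of the embedding matrix H. The sketch below assumes NumPy; the function name `difference_of_means`, the random embeddings, and the index sets are hypothetical placeholders rather than the authors' code.

```python
import numpy as np


def difference_of_means(H, pos_idx, neg_idx):
    """Eq. (3): mean embedding over P minus mean embedding over N.

    H has shape (d, m), one column per distinct context.
    """
    return H[:, pos_idx].mean(axis=1) - H[:, neg_idx].mean(axis=1)


rng = np.random.default_rng(0)
d, m = 4, 12
H = rng.normal(size=(d, m))      # hypothetical context embeddings
P = [0, 1, 2, 3]                 # indices of "positive" contexts (q = 4)
N = [4, 5, 6, 7]                 # indices of "negative" contexts, disjoint from P
v = difference_of_means(H, P, N)
print(v.shape)
```

Swapping the roles of P and N flips the sign of v, so steering with −α along v is the same as steering with α toward the negated concept.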

3. Main results

We now present our main theoretical results, which characterize how next-token probabilities, concept probability, and cross-entropy evolve as a function of the steering strength α. All proofs are deferred to App. B.

3.1. Influence of α on next token probabilities

In this subsection, we address the following question: how do the model’s next-token probabilities evolve as the steering strength α varies? To keep the analysis focused on the effect of steering, we ignore residual errors due to finite-time training:

Assumption 2 (Perfectly trained UFM). We assume that the model has perfectly learned the training data probabilities p(z | cj) from Def. 2.1, meaning that f satisfies

$$\forall j \in [m],\ \forall z \in [V], \quad \sigma_z(f(c_j)) = p(z \mid c_j) .$$
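Assumption 2 can be met exactly in a small example. The sketch below shows one illustrative construction, which is our own choice and not the trained solution analyzed in Zhao et al. (2024): it takes d = V, a decoder W equal to the identity, and embeddings hj = log p(· | cj). The probability table is hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()


# Hypothetical dataset probabilities p(z | c_j) for m = 2 contexts over V = 4 tokens.
p = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
V = p.shape[1]
W = np.eye(V)                    # decoder with d = V (illustrative choice)
H = np.log(p).T                  # context embeddings, shape (d, m)

for j in range(p.shape[0]):
    # The softmax of the logits W h_j recovers p(. | c_j) exactly.
    print(j, softmax(W @ H[:, j]))
```

Any other pair (W, H) whose logits equal log p(· | cj) up to a per-context additive constant satisfies the assumption equally well, since the softmax is invariant to such constants.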

We argue that this assumption is reasonable: in practice, LLMs often exhibit strong memorization of their training data, making this approximation natural. Moreover, since our theoretical dataset is simple, the UFM trained with gradient descent rapidly learns the dataset probabilities p(z | cj) with negligible error (App. B.1). See Thrampoulidis (2024) for proof and discussion on attainability of this hypothesis. In our setting, we steer the context embeddings. Thus steering f by αv, where v is defined in Eq. (3), gives rise to the steered model fα with steered logits given by

$$f_\alpha(c_j) := W (h_j + \alpha v) .$$

As announced, we now turn to the study of the effect of α on next-token probabilities. We start with a definition:

Definition 3.1 (Probability increase). For a context cj and a token z ∈ [V], we define the probability increase α ↦ ∆p(z | cj, α) as

$$\Delta p(z \mid c_j, \alpha) := \sigma_z\big(f_\alpha(c_j)\big) - \sigma_z\big(f(c_j)\big) .$$

[Figure 3: plot of ∆p(α) (vertical axis, −0.2 to 1.0) against α (horizontal axis, 0 to 60); legend: target, off-target.]

Figure 3. Next-token probability increases ∆p(α) for a fixed context. Each curve corresponds to a token z: target tokens T are in blue and off-target tokens in orange. Most target tokens exhibit a “bump” (peaking at α(1,1)), while one target token increases and off-target tokens decrease.

Intuitively, ∆p(z | cj, α) is the algebraic next-token probability increase for a fixed z ∈ [V] when steering with strength α. When there is no ambiguity, we omit explicit dependence on cj and z, and write ∆p(α).

Recall that P, N ⊂ [m] are the context indices used to construct the steering vector v (Eq. (3)), and that T := Ck is the target concept we aim to steer, which is used to build P. Tokens in T are called target, otherwise off-target. The following quantity, derived from the dataset next-token probabilities (Def. 2.1), plays an important role in our analysis as it appears throughout the proofs:

Definition 3.2 (Log-odds). For any z ∈ [V], we define the log-odds M(z) as

$$M(z) := \frac{1}{q} \log \frac{\prod_{i \in P} p(z \mid c_i)}{\prod_{i \in N} p(z \mid c_i)} .$$

Additionally, we denote by $\overline{\mathcal{M}} := \{z \in [V] : M(z) = \max_{z' \in [V]} M(z')\}$ the set of tokens attaining the maximum log-odds and by $\underline{\mathcal{M}}$ the set of tokens attaining the minimum. In the following, we characterize the variations of ∆p:

Theorem 3.3 (Behavior of ∆p). Let T be the target concept. Assume that Assumptions 1 and 2 hold. Given a context cj, the probability increase satisfies:

• (bump behavior) for any $z \in [V] \setminus (\overline{\mathcal{M}} \cup \underline{\mathcal{M}})$, that is, any token attaining neither the maximal nor the minimal log-odds, there exists a unique α(j,z) ∈ R such that ∆p(z | cj, α) is strictly increasing on (−∞, α(j,z)] and decreasing on [α(j,z), +∞);

• (peak position) for any z ∈ T and z′ ∉ T, it holds that α(j,z′) < α(j,z);

• (monotonic behavior) for any $z \in \mathcal{T} \cap \overline{\mathcal{M}}$ (resp. $z \in \mathcal{T}^{\complement} \cap \underline{\mathcal{M}}$), ∆p(z | cj, α) is strictly increasing (resp. decreasing) on R.
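The origin of the bump can be traced with a short computation. The following is a sketch under Assumption 2, reading Def. 3.2 as M(z) = (1/q) log(∏_{i∈P} p(z | ci) / ∏_{i∈N} p(z | ci)); the constants κ_j and the notation E_{j,α}, Var_{j,α} (mean and variance under the steered distribution of context cj) are introduced here for illustration and are not taken from the proofs of App. B.

```latex
\begin{align*}
% Assumption 2 fixes the logits up to a per-context constant \kappa_j:
(W h_j)_z &= \log p(z \mid c_j) + \kappa_j , \\
% Difference of means (Eq. (3)) then gives, with \kappa independent of z:
(W v)_z &= \frac{1}{q} \sum_{i \in P} \log p(z \mid c_i)
         - \frac{1}{q} \sum_{i \in N} \log p(z \mid c_i) + \kappa
         = M(z) + \kappa , \\
% Steered probabilities are an exponential tilt of the dataset probabilities:
\sigma_z\big(f_\alpha(c_j)\big)
  &= \frac{p(z \mid c_j)\, e^{\alpha M(z)}}{\sum_{z' \in [V]} p(z' \mid c_j)\, e^{\alpha M(z')}} , \\
% Differentiating in \alpha:
\frac{\partial}{\partial \alpha} \log \sigma_z\big(f_\alpha(c_j)\big)
  &= M(z) - \mathbb{E}_{j,\alpha}[M] ,
\qquad
\frac{\partial}{\partial \alpha} \mathbb{E}_{j,\alpha}[M]
  = \mathrm{Var}_{j,\alpha}(M) \ge 0 .
\end{align*}
```

Since E_{j,α}[M] is non-decreasing in α, the derivative of a token's probability changes sign at most once, from positive to negative, at the point where E_{j,α}[M] = M(z). Tokens with larger log-odds therefore peak later, and tokens attaining the maximal (resp. minimal) log-odds never peak.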


One might expect ∆p(α) to have a simple behavior (e.g., increasing for target concept z ∈ T as in Turner et al. (2023)), or to display erratic dynamics as α varies. Surprisingly, neither is true, as our theorem reveals a simple pattern: when we steer in the concept direction, most tokens exhibit a “bump” behavior, i.e., their probability increases, reaches a peak at some α, then decreases. Fig. 3 illustrates this pattern (for α < 0, see Fig. B.1), and Sec. 5 validates it empirically on practical LLMs. Importantly, off-target tokens z ∉ T reach their peak earlier than target tokens z ∈ T. This means that as α increases, off-target token probabilities start to fade while target token probabilities are still rising, which helps steering to remain focused on the target concept. This “bump” pattern also suggests the existence of a steering “sweet spot”: a range of α where target tokens are favored by the model while the next-token distribution has not yet collapsed onto a few tokens, helping preserve output quality. Additionally, the bump location α(j,z) varies across contexts cj, suggesting that α should be chosen adaptively w.r.t. the input prompt, as proposed in (Hedström et al., 2025; Ferrando et al., 2025). This discussion illustrates how Th. 3.3 can inform choices of the steering strength α. Finally, a few tokens are exceptions to this behavior: tokens attaining the maximal log-odds keep increasing with α, while those attaining the minimal log-odds keep decreasing.
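The bump pattern can be reproduced numerically on a toy instance. The sketch below is not the authors' experiment: it builds a hypothetical dataset satisfying Assumption 1 (three concepts of three tokens, with made-up values for az and bz), uses the simplest perfectly trained UFM we could think of (d = V, W = I, hj = log p(· | cj)), and takes a contrastive steering vector with q = 1.

```python
import numpy as np

# Hypothetical dataset satisfying Assumption 1: G = 3 concepts of 3 tokens each.
a = np.array([0.30, 0.20, 0.10])             # a_z: token and context share a concept
b = np.array([0.10, 0.07, 0.03])             # b_z: they do not
V, G = 9, 3
concept = np.repeat(np.arange(G), 3)         # concept index of each token
same = concept[None, :] == np.arange(G)[:, None]
p = np.where(same, np.tile(a, G), np.tile(b, G))   # p[k, z] = p(z | context of concept k)

# Perfectly trained UFM in its simplest form (our choice): d = V, W = I, h_k = log p[k].
W = np.eye(V)
H = np.log(p).T                              # shape (d, m) with m = 3 contexts
v = H[:, 0] - H[:, 1]                        # Eq. (3) with P = {0}, N = {1}, q = 1


def steered_probs(h, alpha):
    """Softmax of the steered logits W (h + alpha v)."""
    logits = W @ (h + alpha * v)
    e = np.exp(logits - logits.max())
    return e / e.sum()


alphas = np.linspace(-20.0, 20.0, 801)
j = 2                                        # steer a context from the third concept
curves = np.array([steered_probs(H[:, j], al) for al in alphas])   # shape (801, V)

M = W @ v                                    # log-odds of each token (here q = 1)
bump = ~(np.isclose(M, M.max()) | np.isclose(M, M.min()))
target = concept == 0
peak_alpha = alphas[curves.argmax(axis=0)]   # grid location of each token's peak
print("peaks of target bump tokens:    ", peak_alpha[bump & target])
print("peaks of off-target bump tokens:", peak_alpha[bump & ~target])
```

Since ∆p differs from the steered probability only by a constant in α, peak locations can be read directly from the probability curves. Changing `j` moves the peaks, which is the context dependence mentioned above.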

Remark 3.4 (Sign of α(j,z)). With the dataset defined in App. B.3, the “bump” pattern for tokens z ∈ T occurs only for positive steering strength (α(j,z) > 0), matching the intuition that positive steering increases their probabilities. We defer the limits of ∆p(α) as α → ±∞ to Prop. B.1. In short, ∆p(α) concentrates on tokens in $\overline{\mathcal{M}}$ (resp. $\underline{\mathcal{M}}$), the tokens attaining the maximal (resp. minimal) log-odds, as α → +∞ (resp. −∞). Instead, the limits of ∆p(α) for modern LLMs are characterized in Prop. 4.1.

3.2. Influence of α on concept probability in the output

In the previous subsection, we focused on the atomic (token-level) quantity ∆p(α). Our next step is to “zoom out” and study aggregated versions of ∆p over multiple tokens. These aggregates help to answer the following question: does increasing the steering strength make the target concept more likely, while other concepts become less likely?

As we will show in Th. 3.6, the answer to the previous question is yes. To answer it, we define the probability of a concept in the model output for a given context as follows:

Definition 3.5 (Increase/decrease of a concept). Let C be any concept. Given a context index j ∈ [m], we define the concept increase as

$$\Delta p(\mathcal{C} \mid c_j, \alpha) := \frac{1}{|\mathcal{C}|} \sum_{z \in \mathcal{C}} \Delta p(z \mid c_j, \alpha) .$$

When there is no ambiguity, we simply write ∆p(C | α).

[Figure 4: plot of the concept probability increase ∆p(C | α) (vertical axis, −0.3 to 0.3) against α (horizontal axis, −4 to 4).]

Figure 4. Concept probability increases ∆p(C | α) predicted by Th. 3.6: the target concept ∆p(T | α) increases with a sigmoidal shape, an off-target ∆p(C | α) decreases sigmoidally, and another ∆p(C′ | α) converges to the same limit as |α| → ∞.

Intuitively, ∆p(C | α) is the mean of the probability increase ∆p over tokens belonging to the same concept C. This quantity serves as a natural proxy, in our setting, for the concept-presence metric studied empirically in Von Rütte et al. (2024); Chen et al. (2025); Rimsky et al. (2024); Park et al. (2024). We postpone the discussion of how ∆p(C | α) relates to practical metrics until after the main result below, which characterizes the shape of ∆p(C | α):

Theorem 3.6 (Behavior of ∆p(C | α)). Let T denote the target concept being steered, and let C denote an arbitrary concept. Assume that Assumptions 1 and 2 hold. Given a context cj, the concept probability increase satisfies

$$\Delta p(\mathcal{C} \mid \alpha) = \frac{1}{2|\mathcal{C}|} \tanh\big(\nu_j(\alpha) + r_j\big) - r'_j ,$$

with rj, r′j ∈ R and νj : R → R all depending on C (see App. B.4 for exact expressions). As a consequence, ∆p(T | α) is increasing in α. Moreover, for any C′ ≠ T that contains neither maximal nor minimal log-odds tokens, i.e., $\mathcal{C}' \cap (\overline{\mathcal{M}} \cup \underline{\mathcal{M}}) = \emptyset$, we have the limits

$$\lim_{\alpha \to \pm\infty} \Delta p(\mathcal{C}' \mid \alpha) = -\frac{1}{|\mathcal{C}'|} \sum_{z \in \mathcal{C}'} p(z \mid c_j) .$$

Finally, for any C ≠ T satisfying $\max_{z \in \mathcal{C}} M(z) \le \min_{z \notin \mathcal{C}} M(z)$, ∆p(C | α) is decreasing in α.
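The three behaviors can be checked numerically on a toy instance. The sketch below uses a hypothetical dataset satisfying Assumption 1 and relies on the fact that, under Assumption 2 with a difference-of-means vector, steered probabilities are proportional to p(z | cj) exp(αM(z)), where we read M(z) as the log-odds of Def. 3.2 with q = 1. The expressions chosen for ν, r, and r′ are one parametrization that reproduces a tanh form of this kind; they are our own and need not coincide with those of App. B.4.

```python
import numpy as np

# Hypothetical dataset satisfying Assumption 1: 3 concepts of 3 tokens each.
a = np.array([0.30, 0.20, 0.10])             # a_z: token and context share a concept
b = np.array([0.10, 0.07, 0.03])             # b_z: they do not
concept = np.repeat(np.arange(3), 3)         # concept index of each of the 9 tokens
same = concept[None, :] == np.arange(3)[:, None]
p = np.where(same, np.tile(a, 3), np.tile(b, 3))   # p[k, z] = p(z | context of concept k)
M = np.log(p[0]) - np.log(p[1])              # log-odds with P = {context 0}, N = {context 1}


def concept_prob(c, j, alpha):
    """Steered probability mass of concept c for context j."""
    w = p[j] * np.exp(alpha * M)
    return w[concept == c].sum() / w.sum()


def delta_direct(c, j, alpha):
    """Concept increase of Def. 3.5, computed from the token probabilities."""
    size = np.sum(concept == c)
    return (concept_prob(c, j, alpha) - concept_prob(c, j, 0.0)) / size


def delta_tanh(c, j, alpha):
    """Same quantity through a tanh parametrization (our own choice of nu, r, r')."""
    inside = concept == c
    size = inside.sum()

    def half_logit(w):
        return 0.5 * np.log(w[inside].sum() / w[~inside].sum())

    r = half_logit(p[j])
    nu = half_logit(p[j] * np.exp(alpha * M)) - r
    r_prime = np.tanh(r) / (2 * size)
    return np.tanh(nu + r) / (2 * size) - r_prime


alphas = np.linspace(-6.0, 6.0, 241)
j = 2                                        # context from the third concept
target = np.array([delta_direct(0, j, al) for al in alphas])     # steered concept
negated = np.array([delta_direct(1, j, al) for al in alphas])    # concept used for N
via_tanh = np.array([delta_tanh(0, j, al) for al in alphas])
print(target[[0, 120, 240]], negated[[0, 120, 240]])
```

In this toy instance the target concept increases with α, the negated concept decreases, and the third concept (whose tokens attain neither extreme of the log-odds) loses its probability mass in both directions, in line with the limits stated in the theorem.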

In other words, the steered probability of a concept ∆p(C | α) exhibits three distinct behaviors, all following a tanh-shaped curve up to a reparametrization of α. For the target concept T, steering behaves as intended: increasing the steering strength α increases the presence of T in the model’s output, with ∆p(T | α) following a sigmoidal shape. For any other concept C′ that contains neither maximal nor minimal log-odds tokens, ∆p(C′ | α) converges to the same limit as α → +∞ and as α → −∞, namely minus the average unsteered probability of its tokens, so that the concept vanishes from the output in both directions. Finally, for concepts C′ whose tokens all have log-odds below those of the remaining tokens, ∆p(C′ | α) decreases as α increases. See Fig. 4 for an illustration. This is consistent with the empirical finding of Von Rütte et al. (2024), who observed a tanh(α) trend for the concept probability in the output of a steered LLM. Our result slightly disagrees with Park et al. (2024, Thm. 2.5), who predict that target-concept probability increases while off-target concept probability remains constant. We suspect this difference comes from their model assumptions and from our definition of ∆p(C | α).

In practice, concept probability is estimated by how often the concept appears across sampled generations of a steered LLM. Our ∆p(C | α) is more fine-grained, since it tracks changes in the underlying token probabilities. These variations can be masked by sampling: ∆p(C | α) may vary while the corresponding concept tokens C remain too low-probability to be sampled with noticeable frequency, making the sampling-based concept metric appear nearly constant, as in Park et al. (2024). Once concept tokens C become sufficiently likely, the sampling-based concept probability becomes more aligned with our ∆p(C | α). We confirm our findings by an extensive experimental validation (Sec. 5).

[Figure 5: plot of ∆CE(α) (vertical axis, 0.0 to 1.0) against α (horizontal axis, −1 to 1).]

Figure 5. Local quadratic behavior of ∆CE(α), as predicted by Thm. 3.8. The blue curve shows ∆CE(α) and the black curve the quadratic fit using the coefficient from the theorem.

3.3. Influence of α on cross-entropy

In this subsection, we zoom out once more, and address the following question: how does the steering strength α affect the model performance as a whole? This question is directly motivated by practice, as a precise answer can avoid costly searches over α to balance effective steering with maintaining a high-quality model output. In practice, output quality is often assessed with benchmarks such as MMLU (Hendrycks et al., 2021). In our theoretical setting, cross-entropy is the most natural performance measure, and we therefore study how steering affects the cross-entropy computed on the training set. This quantity provides a proxy for test-time performance, as the model is assumed to be well-trained and the training set is large and drawn from the same distribution as evaluation data. We therefore take a first step toward answering the above question by analyzing how the steering strength α influences the cross-entropy:

Definition 3.7 (Difference of cross-entropy). Recall that fα is the steered model. We define the difference of cross-entropy ∆CE(α) after steering as

$$\Delta\mathrm{CE}(\alpha) := \mathrm{CE}(f_\alpha) - \mathrm{CE}(f) .$$

We now give a precise characterization of the local behavior of cross-entropy around α = 0 :

Theorem 3.8 (Cross-entropy local behavior). Under Assumption 2, as α → 0, the cross-entropy increase satisfies

$$\Delta\mathrm{CE}(\alpha) = \frac{1}{2} \sum_{j \in [m]} \pi_j \, \mathrm{Var}_j\big(M(Z)\big) \, \alpha^2 + o(\alpha^2) ,$$

where Varj(M(Z)) is the variance of the log-odds for tokens Z sampled according to (p(z | cj))z∈[V] and πj is the probability of each distinct context cj (see App. B.5 for both expressions).

In light of the previous theorem, ∆CE(α) is locally U-shaped, since there is no linear term in α and the coefficient of α² is a variance of the log-odds, hence nonnegative; see Fig. 5 for an illustration. Simply put, steering necessarily degrades global performance. This provides, to our knowledge, the first theoretical characterization of how a performance measure (cross-entropy) varies with the steering strength α. Additionally, our result provides a theoretical justification for the empirical observation from Von Rütte et al. (2024) that ∆CE(α) is locally quadratic in α; we come back to this matter in Sec. 5.
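The quadratic coefficient can be compared with a direct evaluation of the cross-entropy on a toy instance. The sketch below assumes a hypothetical dataset satisfying Assumption 1, uniform context probabilities πj (our assumption), and steered logits of the form log p(· | cj) + αM, which is what a perfectly trained UFM produces up to per-context constants when M is read as the log-odds of Def. 3.2 with q = 1.

```python
import numpy as np

# Hypothetical dataset satisfying Assumption 1: 3 concepts of 3 tokens each.
a = np.array([0.30, 0.20, 0.10])
b = np.array([0.10, 0.07, 0.03])
concept = np.repeat(np.arange(3), 3)
same = concept[None, :] == np.arange(3)[:, None]
p = np.where(same, np.tile(a, 3), np.tile(b, 3))   # p[j, z] = p(z | c_j)
M = np.log(p[0]) - np.log(p[1])                    # log-odds, P = {0}, N = {1}, q = 1
pi = np.full(3, 1.0 / 3.0)                         # assumed uniform context probabilities


def cross_entropy(alpha):
    """Training cross-entropy of the steered model, grouped by distinct context."""
    logits = np.log(p) + alpha * M                 # one row of steered logits per context
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(pi[:, None] * p * log_probs).sum()


def delta_ce(alpha):
    return cross_entropy(alpha) - cross_entropy(0.0)


# Coefficient of alpha^2 from Thm. 3.8: half the pi-weighted variance of the log-odds.
mean_M = (p * M).sum(axis=1)
var_M = (p * M**2).sum(axis=1) - mean_M**2
coef = 0.5 * (pi * var_M).sum()

alpha_small = 1e-2
ratio = delta_ce(alpha_small) / alpha_small**2
print("theorem coefficient:", coef)
print("delta CE / alpha^2 :", ratio)
```

For small α the two printed numbers should be close; for larger |α| the quadratic approximation is expected to break down, although ∆CE stays nonnegative.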

4. Towards real-world transformers

The previous sections analyze steering in a theoretical setting, where the model is an idealized one. Modern LLMs, however, involve additional components, most notably the repeated application of attention and fully connected blocks together with normalization, which complicate the analysis. In this section, we move closer to practice by specifying a real-life activation steering setup broad enough to cover our experimental setting (Sec. 5). We then proceed to describe the effect of large α on the steered LLM output.

Decoder-only transformers. Typical decoder-only transformers (Vaswani et al., 2017; Radford et al., 2018) share the same structure: we define the residual stream h(ℓ) ∈ R^(T×d) inductively, with h(0) given by the input embeddings. A transformer block updates h(ℓ) to h(ℓ+1) as

$$\begin{aligned}
h^{(\ell)}_{\mathrm{attn}} &:= \mathrm{ATTN}\big(h^{(\ell)}\big) , \\
h^{(\ell)}_{\mathrm{res}} &:= h^{(\ell)} + h^{(\ell)}_{\mathrm{attn}} , \\
h^{(\ell)}_{\mathrm{ffn}} &:= \mathrm{FFN}\big(h^{(\ell)}_{\mathrm{res}}\big) , \\
h^{(\ell+1)} &:= h^{(\ell)}_{\mathrm{res}} + h^{(\ell)}_{\mathrm{ffn}} ,
\end{aligned} \tag{4}$$

where ATTN denotes the attention module, FFN the feed-forward module (e.g., fully-connected or mixture-of-experts (Shazeer et al., 2017)), and LN a normalization module (e.g., LayerNorm (Ba et al., 2016) or RMSNorm (Zhang & Sennrich, 2019)). After L layers, the output logits y ∈ R^(T×V) are y := LN(h(L)) W⊤, where W ∈ R^(V×d) is the unembedding matrix.

[Figure 6: three panels (openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-8B), each plotting ∆p(z, α) against α from 0 to 200.]

Figure 6. Influence of steering strength α on next-token probability increase ∆p(z, α) for the concept “evil,” shown for LLMs of increasing size. Each curve ∆p(z, α) corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Thm. 3.3: most tokens exhibit a bump, while a few increase throughout. The selected tokens are all related to the steered concept.

Steering vector. As in our theoretical setting (Sec. 2.2), we build the steering vector v from two prompt sets: a positive set P and a negative set N. Following Chen et al. (2025), both sets are generated using a fixed LLM (Gemma 3 12B). To form P, we use a system prompt that instructs the model to generate text exhibiting the target concept (App. A) and sample 500 responses using nucleus sampling, generating 300 new tokens per output. For negatives, we consider two constructions. In the contrastive setting, N consists of 500 generations obtained with the same system prompt but using the opposite or negated concept. In the random setting, N is formed by sampling 500 generations from an empty prompt (e.g., a begin-of-sequence token) using the model to be steered. Each experiment uses one of these constructions for N. While prior work (Von Rütte et al., 2024) relies on hand-crafted negatives, empty-prompt sampling provides a simple alternative that appears unexplored. For a fixed layer ℓ, we record the residual stream $h^{(\ell)}_j \in \mathbb{R}^{T \times d}$ for every generation j ∈ P ∪ N, and define $h_j := \bar{h}^{(\ell)}_j \in \mathbb{R}^d$ as the token-wise average of $h^{(\ell)}_j$ (Chen et al., 2025). Using hj, we compute the steering vector v as in Eq. (3).
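Given recorded activations, the construction above reduces to two nested averages. The sketch below assumes the layer-ℓ residual streams have already been extracted as NumPy arrays of shape (T_j, d); the function name `steering_vector` and the synthetic activations (Gaussian noise with a shift along the first coordinate for the positive set) are hypothetical stand-ins for real model activations.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Token-average each generation, then take the difference of means (Eq. (3)).

    pos_acts, neg_acts: lists of arrays of shape (T_j, d), one per generation.
    """
    pos = np.stack([h.mean(axis=0) for h in pos_acts])   # shape (|P|, d)
    neg = np.stack([h.mean(axis=0) for h in neg_acts])   # shape (|N|, d)
    return pos.mean(axis=0) - neg.mean(axis=0)


rng = np.random.default_rng(0)
d = 8
shift = np.zeros(d)
shift[0] = 2.0                                           # synthetic "concept direction"
pos_acts = [rng.normal(size=(rng.integers(5, 12), d)) + shift for _ in range(50)]
neg_acts = [rng.normal(size=(rng.integers(5, 12), d)) for _ in range(50)]
v = steering_vector(pos_acts, neg_acts)
print(np.round(v, 2))
```

Averaging over tokens first gives every generation the same weight regardless of its length; with fixed-length generations, as in the setup above, this coincides with pooling all tokens.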

Steering. A transformer block offers several natural steering locations. In this work, we steer the residual stream h(ℓ) ∈ R^(T×d), which is also the most common choice in prior work (Turner et al., 2023; Marks & Tegmark, 2024; Rimsky et al., 2024; Burns et al., 2023; Zou et al., 2023; Gurnee & Tegmark, 2024). The next design choice is which token positions to steer: we follow (Chen et al., 2025; Von Rütte et al., 2024) and steer all positions of the input prompt, i.e., we copy a single steering vector v ∈ R^d across the sequence length to obtain a matrix v ∈ R^(T×d). Thus, steering at layer ℓ with strength α follows Eq. (1). Another option is to steer only the last-token representation $h^{(\ell)}_{-1,:}$ (Rimsky et al., 2024).
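The two position choices differ only in which rows of the residual stream receive +αv. A minimal sketch on a NumPy array follows; the function `steer` and its `positions` argument are hypothetical names, and the wiring into an actual model (typically done through some hook on the chosen block) is framework-specific and not shown.

```python
import numpy as np


def steer(h, v, alpha, positions="all"):
    """Return a steered copy of the residual stream h of shape (T, d)."""
    out = h.copy()
    if positions == "all":
        out += alpha * v            # v of shape (d,) broadcasts over the T positions
    elif positions == "last":
        out[-1] += alpha * v        # only the last-token representation is moved
    else:
        raise ValueError("positions must be 'all' or 'last'")
    return out


h = np.zeros((4, 3))                # hypothetical residual stream: T = 4 tokens, d = 3
v = np.array([1.0, 0.0, 0.0])       # hypothetical steering vector
steered_all = steer(h, v, alpha=2.0)
steered_last = steer(h, v, alpha=2.0, positions="last")
print(steered_all)
print(steered_last)
```

Broadcasting a (d,)-shaped vector over the rows is equivalent to explicitly copying v across the sequence length into a T × d matrix.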

Steered logits. Steering the residual stream h(ℓ) yields the steered logits

$$y(\alpha) := \mathrm{LN}\big(h^{(\ell)} + \alpha v + R(\alpha)\big) W^\top ,$$

where +αv persists to the output via residual (skip) connections, and R(α) collects the effect of steering on the output logits not captured by +αv. The expression of y(α) is proven and made rigorous in App. B.6. Crucially, the theoretical model (UFM) of Def. 2.2 omits the normalization LN and the term R(α), and treats h(ℓ) simply as an embedding, akin to an embedding-layer representation in an LLM. We now prove the large-α behavior of the steered logits for the transformer of Eq. (4):

Proposition 4.1 (Limiting behavior of steering a transformer). Consider steering the residual stream $h^{(\ell)}$ of a transformer in the direction $v \in \mathbb{R}^{T \times d}$. As α → ±∞, the steered logits satisfy $y(\alpha) \to \mathrm{LN}(\pm v) W^\top$.

Because of the normalization LN, the term $R(\alpha)$ remains bounded in α. Consequently, for large |α| the steered logits no longer depend on the input prompt and instead converge to the unembedding of the normalized steering direction, $\mathrm{LN}(\pm v) W^\top$. The corresponding softmax therefore converges to $\sigma(\mathrm{LN}(\pm v) W^\top)$, implying that the cross-entropy plateaus for large |α| since the output distribution becomes input-independent. See Fig. 7 for an illustration.
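The limit in Proposition 4.1 can be checked numerically on a toy example. The sketch below makes several assumptions that are not taken from the paper: a single token position, a parameter-free layer normalization, random h, v, and W, and a hypothetical bounded remainder R(α) = tanh(α) r₀ standing in for the true term.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d, vocab = 16, 50
h = rng.normal(size=d)           # unsteered representation (one position)
v = rng.normal(size=d)           # steering direction
W = rng.normal(size=(vocab, d))  # unembedding matrix
r0 = rng.normal(size=d)

def R(alpha):
    """Hypothetical bounded remainder; any bounded function would do here."""
    return np.tanh(alpha) * r0

def steered_logits(alpha):
    return layer_norm(h + alpha * v + R(alpha)) @ W.T

def gap(alpha):
    """Largest absolute deviation from the predicted limit LN(sign(alpha) v) W^T."""
    limit = layer_norm(np.sign(alpha) * v) @ W.T
    return float(np.abs(steered_logits(alpha) - limit).max())

gaps = {alpha: gap(alpha) for alpha in (1.0, 1e2, 1e4, 1e6, -1e6)}
```

In this toy setting the gap shrinks as |α| grows, because the layer normalization removes the scale of αv while h and R(α) stay bounded.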

5. Experiments

In this section, we empirically validate, on transformers spanning a wide range of sizes (Table A.1), the main results of Sec. 3: the “bump” pattern in next-token probabilities, the U-shaped behavior of cross-entropy around α = 0, and the sigmoidal evolution of concept probability. We observe these behaviors consistently across model types (base, instruction-tuned, multimodal), scales (a few million to several billion parameters), and concepts. Steering is implemented as described in Sec. 4. We consider 8 concepts spanning a range of safety-related behaviors (listed in App. A). Each experiment corresponds to a choice of {steering vector, model, steered layer, input prompt}; the figures in this section illustrate typical steering behavior by fixing the concept (here, “evil”, “depression” and “joy”), steering a middle layer (Chen et al., 2025), and using the random construction of P. App. A reports additional configurations and results, including other layers, concepts, and models with contrastive N, error bars under resampling of P and N, runs with normalized v, steering only the last token $h^{(\ell)}_{-1,:}$, and the impact of steering on MMLU. Finally, App. A also reports results for additional input prompts, since each next-token probability plot is computed for a fixed context. Notably, our code is modular, enabling extensions to unseen configurations.

[Figure 7 plots omitted. Top-row panels: openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-8B (x-axis: α from -40 to 40; y-axis from 0 to 1). Bottom-row panels: openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-4B (x-axis: α from -100 to 100). Curves: depression, joy, evil.]

Figure 7. Influence of steering strength α across models. Top row: concept probability for the three concepts (depression, joy, evil), estimated using a judge LLM (Gemma 3 12B), showing the sigmoidal trend predicted by Thm. 3.6. Bottom row: cross-entropy ∆CE(α) for the same concepts, locally U-shaped around α = 0 and plateauing for large |α| (Thm. 3.8, Prop. 4.1).

Results for next-token probabilities. We measure the influence of α on the increase of next-token probabilities ∆p(z, α). In Fig. 6, we plot ∆p(z, α) for a small set of tokens z that become most likely at large α, motivated by Prop. 4.1, which shows that in the large-|α| regime the logits are determined by the unembedding of (normalized) v. Across models and concepts, the evidence is unequivocal: we observe both the “bump” pattern in ∆p for concept tokens and the large-α regime where a few tokens dominate, as predicted by Thm. 3.3. The same behavior is shown for off-target tokens at negative α in Fig. A.10. Dominating tokens are generally not associated with extremal log-odds at intermediate layers, but they are at the final layer (App. A). Finally, although the bump behavior appears already in early layers, steering at mid to late layers leads to highest-probability tokens that are more semantically tied to the target concept (App. A).
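The bump pattern can be reproduced in a three-token toy model. The sketch below assumes that ∆p(z, α) is the steered probability minus the unsteered one, and it uses logits W(h + αv) without normalization, in the spirit of the UFM; the specific W, h, and v are hypothetical values chosen so that the three tokens have steering slopes 0, 1, and 2.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def delta_p(W, h, v, alphas):
    """Assumed definition: p(z | h + alpha v) - p(z | h) for every token z.

    Returns an array of shape (len(alphas), vocab).
    """
    base = softmax(W @ h)
    steered = softmax((h + alphas[:, None] * v) @ W.T)
    return steered - base

# Base logits W h = [3, 0, -6]; slopes along the steering direction W v = [0, 1, 2].
W = np.array([[3.0, 0.0], [0.0, 1.0], [-6.0, 2.0]])
h = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
alphas = np.linspace(0.0, 20.0, 201)
dp = delta_p(W, h, v, alphas)
peak_alpha = float(alphas[dp[:, 1].argmax()])  # location of the bump for token 1
```

In this example the token with the intermediate slope rises and then falls, while the token with the largest slope increases throughout and ends up dominating; the token with the smallest slope only loses probability.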

Results for concept probability. We estimate the concept probability in steered LLM responses using a sampling-based metric (different from the theoretical ∆p(C | α); see the discussion below Thm. 3.6). Concretely, for 12 prompts and each α, we sample 32 completions and prompt a judge LLM (Gemma 3 12B) to assign a binary label indicating whether the target concept is present, following Chen et al. (2025) (details in App. A). Averaging these labels yields the concept probability. Fig. 7 shows a mostly sigmoidal trend, with occasional mismatches (e.g., the middle panel). In such cases, for some layers/concepts and for a range of α values, next-token sampling can drift away from concept-related tokens because the highest-probability token may instead be punctuation (e.g., ‘-’ or ‘.’), leading to degenerate outputs.
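The sampling-based estimate amounts to averaging binary labels over prompts and completions. The following sketch shows that bookkeeping only; `generate` and `judge` are hypothetical callables standing in for the steered model and the judge LLM, and the toy versions used here are deterministic placeholders, not the actual pipeline.

```python
from statistics import mean

def concept_probability(prompts, alpha, generate, judge, n_samples=32):
    """Average binary judge labels over n_samples completions per prompt."""
    labels = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt, alpha)  # steered generation
            labels.append(int(judge(completion)))  # 1 if concept is present
    return mean(labels)

# Hypothetical stand-ins so the sketch runs without any model.
def toy_generate(prompt, alpha):
    return "pure joy" if alpha > 0 else "nothing special"

def toy_judge(completion):
    return "joy" in completion

steered = concept_probability(["a", "b"], 5.0, toy_generate, toy_judge, n_samples=4)
unsteered = concept_probability(["a", "b"], -5.0, toy_generate, toy_judge, n_samples=4)
```

With 12 prompts and 32 completions, each point of the curve would be an average over 384 binary labels.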

Results for cross-entropy. We estimate the steered cross-entropy change ∆CE(α) on $10^6$ tokens sampled from the processed fineweb dataset (Penedo et al., 2024), which provides a sufficiently large and diverse sample for a reliable estimate. Across all models, we consistently observe the local U-shape around α = 0, confirming that steering always hurts global performance as predicted by Thm. 3.8; see Fig. 7. Moreover, while Fig. 13 in Von Rütte et al. (2024) reports an empirical α² trend, it is unclear whether this behavior is meant to be local; Thm. 3.8 clarifies that the quadratic scaling holds only locally around α = 0. For large |α|, ∆CE(α) instead plateaus, as implied by Prop. 4.1 and confirmed in Fig. 7.
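Both features of ∆CE(α), the minimum at α = 0 and the plateau at large |α|, can be seen in an idealized toy computation. The sketch below is not the paper's evaluation: it assumes that target tokens are distributed exactly according to the unsteered model, in which case the expected cross-entropy change equals the average KL divergence between unsteered and steered predictions. The layer normalization is parameter-free, and H, v, and W are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def delta_ce(H, v, W, alpha):
    """Expected CE(alpha) - CE(0), assuming targets follow the unsteered model.

    Under that assumption this equals the mean KL(p_0 || p_alpha) over contexts.
    """
    logp0 = log_softmax(layer_norm(H) @ W.T)
    logpa = log_softmax(layer_norm(H + alpha * v) @ W.T)
    return float((np.exp(logp0) * (logp0 - logpa)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n, d, vocab = 64, 16, 50
H = rng.normal(size=(n, d))      # one representation per context
v = rng.normal(size=d)           # steering direction
W = rng.normal(size=(vocab, d))  # unembedding matrix
alphas = [-1e6, -10.0, -1.0, 0.0, 1.0, 10.0, 1e6]
curve = [delta_ce(H, v, W, a) for a in alphas]
```

In this setting the change is zero at α = 0 and nonnegative elsewhere, and it stops changing at very large |α| because the normalized input converges to LN(±v) regardless of the context.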

6. Conclusion

Activation steering is a simple and widely used method to control LLM behavior at inference time, yet the choice of steering strength α remains largely heuristic. In this paper, we provide a theoretical analysis of steering strength for activation steering with a difference-of-means steering vector. In a tractable next-token prediction model, we characterize how α impacts next-token probabilities, concept probability in the output, and cross-entropy, and we validate these predictions empirically across a range of modern LLMs. Future work includes narrowing the theory/practice gap (e.g., mixed-concept contexts), extending the analysis to other steering methods (e.g., SAE), and developing principled, prompt-adaptive rules for choosing α by characterizing the steering “sweet spot” suggested by our results.

Acknowledgements

We thank Alberto Bietti and Salim I. Amoukou for their valuable insights. This work has been supported by the French government, through the 3IA Côte d’Azur Investments in the project managed by the National Research Agency (ANR) with the reference number ANR-23-IACL-0001, the ANR project PRC MAD ANR-24-CE23-1529 and the support of the “France 2030” funding ANR23-PEIA-0004 (PDE-AI) and ANR-15-IDEX-01. All experiments were performed using the Julia 2 cluster. Julia 2 was funded as DFG project as “Forschungsgroßgerät nach Art 91b GG” under INST 93/1145-1 FUGG.

References

Anthropic. Claude 3. https://www.anthropic.com/news/claude-3-family, 2024. Accessed: 2025-10-15.

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, volume 29, 2016.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.

Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering Latent Knowledge in Language Models Without Supervision. International Conference on Learning Representations, 2023.

Chen, R., Arditi, A., Sleight, H., Evans, O., and Lindsey, J. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.

Costa, V., Fel, T., Lubana, E. S., Tolooshams, B., and Ba, D. E. From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Fang, C., He, H., Long, Q., and Su, W. J. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.

Ferrando, A., Suau, X., Gonzàlez, J., and Rodriguez, P. Dynamically Scaled Activation Steering. arXiv preprint arXiv:2512.03661, 2025.

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

Gurnee, W. and Tegmark, M. Language models represent space and time. In The Twelfth International Conference on Learning Representations, 2024.

Hedström, A., Amoukou, S. I., Bewley, T., Mishra, S., and Veloso, M. To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models. In Forty-second International Conference on Machine Learning, 2025.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations, 2021.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, 2023.

Marks, S. and Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. In First Conference on Language Modeling, 2024.

Marvin, G., Hellen, N., Jjingo, D., and Nakatumba-Nabende, J. Prompt engineering in large language models. In International Conference on Data Intelligence and Cognitive Informatics, pp. 387–402. Springer, 2023.

Mikolov, T., Yih, W.-t., and Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

Min, P. P., Paudel, A., Adityo, N., Zhu, A., Rufail, A., Blondin, C., Zhu, K., Dev, S., and O’Brien, S. Mitigating sycophancy in language models via sparse activation fusion and multi-layer activation steering. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.

Mixon, D. G., Parshall, H., and Pi, J. Neural collapse with unconstrained features. Sampling Theory, Signal Processing, and Data Analysis, 20(2):11, 2022.

Nanda, N., Lee, A., and Wattenberg, M. Emergent Linear Representations in World Models of Self-Supervised Sequence Models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023.

Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 39643–39666. PMLR, 21–27 Jul 2024.

Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=n6SCkn2QaG.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504–15522. Association for Computational Linguistics, August 2024.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

Stickland, A. C., Lyzhov, A., Pfau, J., Mahdi, S., and Bowman, S. R. Steering Without Side Effects: Improving Post-Deployment Control of Language Models. In Neurips Safe Generative AI Workshop 2024, 2024.

Su, J., Chen, J., Li, H., Chen, Y., Qing, L., and Zhang, Z. Activation steering decoding: Mitigating hallucination in large vision-language models through bidirectional hidden state intervention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12964–12974, 2025.

Tan, D. C. H., Chanin, D., Lynch, A., Paige, B., Kanoulas, D., Garriga-Alonso, A., and Kirk, R. Analysing the Generalisation and Reliability of Steering Vectors. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.

Thrampoulidis, C. Implicit optimization bias of next-token prediction in linear models. Advances in Neural Information Processing Systems, 37:22624–22656, 2024.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Von Rütte, D., Anagnostidis, S., Bachmann, G., and Hofmann, T. A Language Model’s Guide Through Latent Space. In Proceedings of the 41st International Conference on Machine Learning. PMLR, 2024.

Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations, 2022.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. In Forty-second International Conference on Machine Learning, 2025.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

Zhao, Y. and Thrampoulidis, C. Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations. arXiv preprint arXiv:2505.08348, 2025a.

Zhao, Y. and Thrampoulidis, C. Geometry of Concepts in Next-token Prediction: Neural-Collapse Meets Semantics. In The Second Conference on Parsimony and Learning (Recent Spotlight Track), 2025b.

Zhao, Y., Behnia, T., Vakilian, V., and Thrampoulidis, C. Implicit geometry of next-token prediction: From language sparsity patterns to model representations. In First Conference on Language Modeling, 2024.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation Engineering: A Top-Down Approach to AI Transparency, 2023.

Contents of the Appendix

A Additional practical experiments and setting

B Proofs and additional results

B.1 Additional results

B.2 Technical lemmas

B.3 Proof of Theorem 3.3

B.4 Proof of Theorem 3.6

B.5 Proof of Theorem 3.8

B.6 Proof of Proposition 4.1

A. Additional practical experiments and setting

Table A.1. Models used in the experiments of Section 5.

| Model family | Sizes |
|---|---|
| 1-layer GPT (Radford et al., 2018) | 0.009B |
| GPT-2 (Radford et al., 2019) | 0.12B, 0.77B, 1.5B |
| Gemma 3 (Team et al., 2025) | 1B, 4B |
| Qwen 3 (Yang et al., 2025) | 0.6B, 4B, 8B |
| Llama 2 (Touvron et al., 2023) | 7B |
| Mistral (Jiang et al., 2023) | 7B |

Table A.2. Non-exhaustive list of tokens observed when steering the models in Table A.1, recorded among the highest-probability next tokens at α = 200 in the setting of Section 4. For readability, we report only full-word tokens; for tokenizers that produce smaller subwords, we observe the same phenomenon, with fragments such as ‘happ’ instead of ‘happiness’.

| Steered concept | High-probability next tokens |
|---|---|
| apathetic | okay, yeah, bullshit, whatnot, probably |
| depression | discomfort, sadness, emotional, despair, uncomfortable |
| evil | horrifying, destruction, terror, murderous, deadly |
| humorous | comedic, ridiculous, silly, fidget, hilarious |
| impolite | verbal, disrespectful, stupid, rude, insulting |
| joy | feeling, happiness, ecstatic, joyful, laughter |
| lying | absurd, ridiculous, thieves, truth, spiritual |
| optimistic | cheerful, cherish, religious, optimism, someday |

Setting to generate the positive (P) and negative (N) prompt sets. As described in Section 4, P is generated by a prompted instruct-LLM (Gemma 3 12B) using a system prompt of the following form: (1) behavior instruction → (2) definition of the concept → (3) example sentences exhibiting the concept. For P, we use the target concept in the following template. For N, in the contrastive setting, we reuse the same template but replace the concept with its negation (the negated concept is itself generated by the LLM). Otherwise, in the unrelated setting, N consists of generations sampled from the LLM to be steered, starting from an empty prompt. Here is the exact template used in our experiments:

    # `concept` is the target concept string (one of the steered concepts).
    positive_chat_template = [
        {"role": "system", "content": "You are an assistant."},
        {
            "role": "user",
            "content": (
                f"Describe the concept: {concept}. Begin with a one-sentence definition. "
                "Then give several natural first-person examples that demonstrate the concept in an unmistakable way. "
                "Write as if you are the concept itself. Your examples must make the concept obvious through what you say or do "
                "| for example, if the concept is lying, give clearly false statements (e.g., claiming impossible facts or events).\n"
                "Do not use third-person references. Do not explain anything. "
                "After the definition, output only the examples as simple first-person sentences that fully embody the concept."
            ),
        },
    ]

We steer 8 concepts: apathetic, depression, evil, humorous, impolite, joy, lying, and optimistic.

Setting to compute the concept probability in practice. As described in Section 5, we use a judge LLM (Gemma 3 12B) to decide whether a target concept is present in a steered LLM generation. The following prompt is used to obtain a binary label:

    judge_question_template: str = (
        "Text:\n{completion}\n\n"
        "Does the text above express the concept {concept} (any word linked to the concept in the whole sentence even if it do not make sense is sufficient for a positive reply)? Reply with 1 for yes and 0 for no.\n"
        "Answer:"
    )

As this prompt makes clear, we do not attempt to assess output quality here; we only measure concept presence, using a permissive criterion (any concept-related word suffices).
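For illustration, a template of this kind could be filled in and its reply parsed as sketched below. The helper names are hypothetical, the template is a shortened stand-in rather than the exact prompt above, and the parsing rule (read the first non-space character of the reply) is an assumption about how a 0/1 answer might be extracted, not a description of the authors' code.

```python
# Shortened stand-in for the judge template; placeholders match the original.
TEMPLATE = (
    "Text:\n{completion}\n\n"
    "Does the text above express the concept {concept}? "
    "Reply with 1 for yes and 0 for no.\n"
    "Answer:"
)

def build_judge_prompt(template, completion, concept):
    """Fill the {completion} and {concept} placeholders of a judge template."""
    return template.format(completion=completion, concept=concept)

def parse_binary_label(reply):
    """Return 1 or 0 from the first non-space character, or None if unparseable."""
    stripped = reply.strip()
    if stripped[:1] in ("0", "1"):
        return int(stripped[0])
    return None

prompt = build_judge_prompt(TEMPLATE, "I feel wonderful today.", "joy")
label = parse_binary_label(" 1\n")
```

Returning None for unparseable replies keeps malformed judge outputs separate from genuine negative labels, so they can be counted or discarded explicitly.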

Additional figures for the influence of steering strength on next-token probabilities. We complement Figure 6 from Section 5 with additional models, concepts, steering layers, and negative steering strengths; see Figures A.7, A.8, A.9, and A.10. Overall, the qualitative predictions of Theorem 3.3 are observed. The main discrepancy occurs when steering early layers: tokens exhibiting bumps or dominating at large α are less often concept-related. This is expected, as steering early layers is known to yield weaker results (Chen et al., 2025). In Table A.2, we provide a non-exhaustive sample of the highest-probability next tokens recorded at α = 200; consistently, steering works as intended by increasing the probability of concept-related tokens.

Additional figures for the influence of steering strength on concept probability in the output. We complement Figure 7 from Section 5 with additional models, concepts, and steering layers; see Figures A.11 and A.12. The qualitative prediction of Theorem 3.6 is partially verified (more often true than false). Results are sensitive to the concept itself (intuitively harder concepts such as lying yield less clean curves than easier ones such as joy). The main discrepancy again arises when steering early layers, which is consistent with prior observations (Chen et al., 2025).

Additional figures for the influence of steering strength on cross-entropy. We complement Figure 7 from Section 5 with additional models, concepts, and steering layers; see Figure A.13. Overall, the predictions of Theorem 3.8 and Proposition 4.1 are observed.

Additional results. We provide additional plots for experiments mentioned in Section 5, including MMLU (Figure A.6), steering only the last-token representation $h^{(\ell)}_{-1,:}$ (Figure A.2), normalization of the steering vector (Figure A.1), contrastive N (Figure A.5), error bars under resampling of P and N (Figure A.3), and steering, in the setting of Section 4, a 1-layer GPT-style transformer (Figure A.4) that we train on fineweb (Penedo et al., 2024).

[Figure A.1 plots omitted. Panels: openai-community/gpt2, Qwen/Qwen3-8B; x-axis: α from 0 to 2000.]

Figure A.1. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): depression, joy, and evil. Each row of two plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model) and the steering vector is normalized. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 2000. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are related to the steered concept.

[Figure A.2 plots omitted. Panels: openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-8B; x-axis: α from 0 to 200.]

Figure A.2. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): evil and joy. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model) and we steer only the last-token representation $h^{(\ell)}_{-1,:}$. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are related to the steered concept.

[Figure A.3 plots omitted. Panels: google/gemma-3-1b-it (three panels); x-axis: α from 0 to 200.]

Figure A.3. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concept “evil”. Steering is applied at a middle layer in each model (we use a fixed middle layer per model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200, plotted with mean and standard deviation over 5 runs obtained by resampling the prompt sets P and N. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. The variability across runs is moderate, so for computational cost we omit error bars in the main figures.

[Figure A.4 plot omitted. Panel: 1-layer GPT; x-axis: α from 0 to 50.]

Figure A.4. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concept “uppercase words”. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 50. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are uppercase words.

[Figure A.5 plots omitted. Panels: openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-8B; x-axis: α from 0 to 200.]

Figure A.5. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): depression, joy, and evil. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model) and the negative prompt set N is built in the contrastive setting. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are related to the steered concept.

[Figure A.6 plot omitted. Panel: google/gemma-3-4b-it; x-axis: α from -10 to 10.]

Figure A.6. Effect of steering strength α on MMLU (Hendrycks et al., 2021) for the concept “evil”. MMLU is a practical performance metric, more indicative of real-world capability than cross-entropy. We measure it using the DeepEval library, for which random guessing yields 25%. As with cross-entropy, increasing α inevitably degrades model performance.

[Figure A.7 plots omitted. Panels: openai-community/gpt2, google/gemma-3-1b-it, mistralai/Mistral-7B-Instruct-v0.3; x-axis: α from 0 to 200.]

Figure A.7. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic. Each row of three plots corresponds to a single steered concept. Steering is applied at an early layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This partially matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are not concept-related, consistent with the observation that steering early layers often yields worse results (Chen et al., 2025).

[Figure A.8 plots omitted. Panels: openai-community/gpt2, google/gemma-3-1b-it, mistralai/Mistral-7B-Instruct-v0.3; x-axis: α from 0 to 200.]

Figure A.8. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, the selected tokens are related to the steered concept.

[Figure A.9 plots omitted. Panels: openai-community/gpt2, google/gemma-3-1b-it, mistralai/Mistral-7B-Instruct-v0.3; x-axis: α from 0 to 200.]

Figure A.9. Effect of steering strength α > 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic. Each row of three plots corresponds to a single steered concept. Steering is applied at the last

[Figure A.10: plot panels not reproduced. Columns: openai-community/gpt2, google/gemma-3-1b-it, mistralai/Mistral-7B-Instruct-v0.3; x-axis: steering strength α from −200 to 0; y-axis: ∆p(z, α).]

Figure A.10. Effect of steering strength α < 0 on next-token probability shifts ∆p(z, α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = −200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, none of the selected tokens is related to the steered concept.

[Figure A.11: plot panels not reproduced. Columns: openai-community/gpt2-large (layers 0, 23, 35), openai-community/gpt2-xl (layers 0, 15, 47), google/gemma-3-1b-it (layers 0, 12, 25); x-axis: steering strength α from −40 to 40; y-axis: concept probability from 0 to 1.]

Figure A.11. Influence of steering strength α on concept probability for the concepts (top to bottom): depression, evil, humorous, impolite, joy, and optimistic. Each row of three plots corresponds to a single steered concept, and each column corresponds to a different model. Steering is applied at three layers (early, middle, late), indicated in each legend. Overall, the curves are consistent with Theorem 3.6, which predicts a sigmoidal shape. Early-layer steering is more erratic, consistent with reports that steering early layers yields worse results (Chen et al., 2025).

[Figure A.12: plot panels not reproduced. Columns: openai-community/gpt2 (layers 3, 7, 11), meta-llama/Llama-2-7b-chat-hf (layers 0, 15, 31), mistralai/Mistral-7B-Instruct-v0.3 (layers 0, 15, 31); x-axis: steering strength α from −40 to 40; y-axis: concept probability from 0 to 1.]

Figure A.12. Influence of steering strength α on concept probability for the concepts (top to bottom): depression, evil, humorous, impolite, joy, and optimistic. Each row of three plots corresponds to a single steered concept, and each column corresponds to a different model. Steering is applied at three layers (early, middle, late), indicated in each legend. Overall, the curves are consistent with Theorem 3.6, which predicts a sigmoidal shape. Early-layer steering is more erratic, consistent with reports that steering early layers yields worse results (Chen et al., 2025). We also observe less consistent behavior on Llama 2.

[Figure A.13: plot panels not reproduced. Columns: openai-community/gpt2-large (layers 11, 23, 35), google/gemma-3-1b-it (layers 8, 16, 25), Qwen/Qwen3-0.6B (layers 9, 18, 27); x-axis: steering strength α from −100 to 100; y-axis: ∆CE(α).]

Figure A.13. Influence of steering strength α on the cross-entropy ∆CE(α) for the concepts (top to bottom): apathetic, depression, evil, humorous, impolite, and joy. Each row of three plots corresponds to a single steered concept, and each column corresponds to a different model. Steering is applied at three layers (early, middle, late), indicated in each legend. As predicted by Theorem 3.8, ∆CE(α) is locally U-shaped around α = 0 and saturates for large |α|, in line with Proposition 4.1.

B. Proofs and additional results

B.1. Additional results

[Figure B.1: plot not reproduced. x-axis: steering strength α from −60 to 0; y-axis: ∆p(α); legend: target, off-target.]

Figure B.1. Next-token probability increases ∆p(α) for a fixed context and negative α. Each curve corresponds to a token z: target tokens T are in blue and off-target tokens in orange. Most off-target tokens exhibit a “bump” (peaking at α(1,4)), while one off-target token decreases on ℝ and target tokens are increasing on ℝ₋.

Generalizing our results to az depending on j. Inspecting the proofs shows that all results, except Remark 3.4, rely on Lemma B.4. Consequently, our main theorems continue to hold verbatim as long as the same sign-separation property holds for the log-odds M (Lemma B.4). However, if one allows az and bz to depend on the context index j without further structure, this sign-separation property may fail. A simple generalization that preserves sign separation is to allow az to depend on j but keep bz independent of j, assuming aj,z > bz. This is interpretable: only in-concept probabilities (meaning z, cj ∈ Ck) vary with the context, while off-concept probabilities remain at a small baseline level bz. Allowing bz to depend on j while still enforcing Lemma B.4 is possible, but typically leads to a less interpretable assumption. Roughly speaking, if we treat Lemma B.4 as an assumption, then our results still hold. Additionally, Lemma B.4 appears to hold in practice; see Appendix A.

Plot of ∆p(α) for negative steering strength. We provide the counterpart of Figure 6 for negative α, see Figure B.1.

Perfect training of the UFM. To illustrate that Assumption 2 is attainable in our theoretical setting, we train a UFM with gradient descent on the cross-entropy loss on the following dataset instantiation from Definition 2.1, which satisfies Assumption 1:

$$\forall z \in [V], \qquad a_z := \frac{1-\varepsilon}{s}, \qquad b_z := \frac{\varepsilon}{(G-1)\,s},$$

with ε ∈ (0, (G − 1)/G) a smoothing parameter. The dataset entropy is ≈ 1.3317 (a lower bound on the achievable loss (Thrampoulidis, 2024)), and we reach a loss of ≈ 1.3318, indicating that the model learns the dataset essentially perfectly.

As stated in Section 3.1, we have an additional result about the limits of ∆p(α) (Definition 3.1):

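Proposition B.1 (Limits of ∆p(α)). Given a context index $j \in [m]$ and a token $z \in [V]$, the limit of $\Delta p(\alpha)$ as $\alpha \to +\infty$ is

$$\lim_{\alpha\to+\infty} \Delta p(\alpha) = \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - p(z\mid c_j), \tag{5}$$

where $\overline{\mathcal{M}}$ (resp. $\underline{\mathcal{M}}$) denotes the set of tokens attaining the maximal (resp. minimal) log-odds. Similarly, for the limit $\alpha \to -\infty$, replace $\overline{\mathcal{M}}$ by $\underline{\mathcal{M}}$ in Eq. (5).

Eq. (5) has the following interpretation: in the limit $\alpha \to +\infty$ (resp. $\alpha \to -\infty$), the steered distribution concentrates all its mass on the tokens $z \in \overline{\mathcal{M}}$ (resp. $z \in \underline{\mathcal{M}}$). If multiple tokens attain the maximal or minimal log-odds, the probability mass is shared among them in proportion to their unsteered probabilities $p(z \mid c_j)$.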

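Proof. We first show that the softmax behaves as follows when the steering strength α grows:

$$\forall z \in [V], \qquad \lim_{\alpha\to+\infty} \sigma_z\big(f_\alpha(c_j)\big) = \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)}, \tag{6}$$

where $\overline{\mathcal{M}}$ is the set of tokens attaining the maximum log-odds $M_{\max}$. A short proof of the previous display is as follows. By Lemma B.3,

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))} = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}\cdot\frac{\exp(-\alpha M_{\max})}{\exp(-\alpha M_{\max})} = \frac{p(z\mid c_j)\exp\big(\alpha(M(z)-M_{\max})\big)}{\sum_{z'\in[V]} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big)},$$

which equals

$$\begin{cases} \dfrac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j) + \sum_{z'\notin\overline{\mathcal{M}}} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big)} & \text{if } z\in\overline{\mathcal{M}}, \\[3ex] \dfrac{p(z\mid c_j)\exp\big(\alpha(M(z)-M_{\max})\big)}{\sum_{z'\in[V]} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big)} & \text{otherwise.} \end{cases}$$

In the first case ($z\in\overline{\mathcal{M}}$), since $M(z')-M_{\max}<0$ for $z'\notin\overline{\mathcal{M}}$, the limit is

$$\lim_{\alpha\to+\infty} \frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j) + \sum_{z'\notin\overline{\mathcal{M}}} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big)} = \frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)}.$$

In the second case ($z\notin\overline{\mathcal{M}}$), we bound the term using the fact that

$$\sum_{z'\in[V]} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big) \ge p(z^\star\mid c_j)\exp(0),$$

where $z^\star\in\overline{\mathcal{M}}$. We get the following bound:

$$0 < \frac{p(z\mid c_j)\exp\big(\alpha(M(z)-M_{\max})\big)}{\sum_{z'\in[V]} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big)} \le \frac{p(z\mid c_j)}{p(z^\star\mid c_j)}\exp\big(\alpha(M(z)-M_{\max})\big),$$

which implies that the second-case term goes to 0 as $\alpha\to+\infty$ (because $M(z)-M_{\max}<0$ for $z\notin\overline{\mathcal{M}}$). Using the previous display we get:

$$\lim_{\alpha\to+\infty}\Delta p(\alpha) = \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - p(z\mid c_j) \tag{7}$$

$$= \begin{cases} \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{a_z}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - a_z, & \text{if } c_j\in\mathcal{T}, \\[3ex] \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{b_z}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - b_z, & \text{otherwise.} \end{cases} \tag{8}$$

As $a_z > b_z$ (Assumption 1), the tokens which can attain the maximum log-odds are necessarily concept tokens in $\mathcal{T}$ (Lemma B.4), thus

$$\lim_{\alpha\to+\infty}\Delta p(\alpha) = \begin{cases} \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{a_z}{\sum_{z'\in\overline{\mathcal{M}}} a_{z'}} - a_z, & \text{if } c_j\in\mathcal{T}, \\[3ex] \mathbb{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{b_z}{\sum_{z'\in\overline{\mathcal{M}}} b_{z'}} - b_z, & \text{otherwise.} \end{cases} \tag{9}$$

The case $\alpha\to-\infty$ is analogous.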
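The limit in Eq. (6) can be checked numerically. The sketch below is illustrative only: the probabilities `p` and log-odds `M` are hypothetical toy values (not taken from the paper), chosen so that two tokens tie for the maximal log-odds, and the helper name `steered_probs` is ours.

```python
import numpy as np

def steered_probs(p, M, alpha):
    """Steered distribution p(z) * exp(alpha * M(z)) / normalizer (form of Lemma B.3)."""
    logits = np.log(p) + alpha * M
    logits -= logits.max()            # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical toy values: 5 tokens, tokens 0 and 1 tie for the maximal log-odds.
p = np.array([0.10, 0.30, 0.20, 0.25, 0.15])   # unsteered p(z | c_j)
M = np.array([2.0, 2.0, 0.5, -1.0, -3.0])      # log-odds M(z)

top = M == M.max()                              # indicator of the arg-max set
limit = np.where(top, p / p[top].sum(), 0.0)    # right-hand side of Eq. (6)
delta_p_limit = limit - p                       # right-hand side of Eq. (5)

steered = steered_probs(p, M, alpha=200.0)      # large alpha as a proxy for the limit
print(np.round(steered, 4))
print(np.round(delta_p_limit, 4))
```

The mass ends up shared between the two tied tokens in proportion to their unsteered probabilities, as the interpretation of Eq. (5) states.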

B.2. Technical lemmas

In the following we introduce and prove the technical lemmas needed for Section 3. In the UFM model, activation steering on the embedding hj admits an explicit expression for the resulting model output:

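Lemma B.2 (Steering on UFM). Steering the embedding $h_j$ along the direction $v$ from Eq. (3) with strength $\alpha\in\mathbb{R}$, we obtain the steered logits

$$f_\alpha(c_j) := W(h_j + \alpha v) = \ell_j + \frac{\alpha}{q}\Big(\sum_{i\in\mathcal{P}} \ell_i - \sum_{i\in\mathcal{N}} \ell_i\Big),$$

where $\ell_j := f(c_j)$ are the unsteered logits for context $c_j$.

Proof. The rewriting is a direct consequence of the linearity of the UFM model and of the steering vector $v$:

$$f_\alpha(c_j) = W\Big(He_j + \alpha\Big(\frac{1}{q}\sum_{i\in\mathcal{P}} He_i - \frac{1}{q}\sum_{i\in\mathcal{N}} He_i\Big)\Big) = WH\Big(e_j + \alpha\Big(\frac{1}{q}\sum_{i\in\mathcal{P}} e_i - \frac{1}{q}\sum_{i\in\mathcal{N}} e_i\Big)\Big) = \ell_j + \frac{\alpha}{q}\Big(\sum_{i\in\mathcal{P}} \ell_i - \sum_{i\in\mathcal{N}} \ell_i\Big).$$

Thus, studying activation steering reduces to analyzing how the softmax behaves under a linear shift of its input $\ell_j$ by the vector $\sum_{i\in\mathcal{P}} \ell_i - \sum_{i\in\mathcal{N}} \ell_i$. The log-odds $M(z)$ (Definition 3.2) are central because steering modifies the softmax by reweighting each token probability $p(z\mid c_j)$ by the exponential factor $\exp(\alpha M(z))$.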
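As a sanity check of Lemma B.2, the following sketch compares both sides of the identity on random matrices. The sizes, the index sets `P` and `N`, and the random `W` and `H` are hypothetical; the steering vector is taken to be the difference of mean embeddings over `P` and `N` with |P| = |N| = q, which is the form used in the proof above.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, m, q = 7, 4, 6, 2            # hypothetical sizes: vocabulary, width, contexts, |P| = |N|
W = rng.normal(size=(V, d))         # unembedding matrix
H = rng.normal(size=(d, m))         # free embeddings, one column per context
P, N = [0, 1], [3, 4]               # hypothetical positive / negative context indices

# Difference-of-means steering vector, as in the proof of Lemma B.2.
v = H[:, P].mean(axis=1) - H[:, N].mean(axis=1)

logits = W @ H                      # column i holds the unsteered logits l_i
j, alpha = 2, 1.7

lhs = W @ (H[:, j] + alpha * v)     # steer the embedding, then unembed
rhs = logits[:, j] + (alpha / q) * (logits[:, P].sum(axis=1) - logits[:, N].sum(axis=1))
print(np.max(np.abs(lhs - rhs)))    # agreement up to floating-point error
```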

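Lemma B.3 (Rewriting ∆p(α)). Assume Assumption 2. The first component of $\Delta p(\alpha)$ can be rewritten as follows:

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}.$$

Proof. We express $\sigma_z(f_\alpha(c_j))$ explicitly in terms of $p(z\mid c_j)$ and the log-odds $M(z)$, using the rewriting of the logits from Lemma B.2:

$$\sigma_z\big(f_\alpha(c_j)\big) = \sigma_z\Big(\ell_j + \frac{\alpha}{q}\Big(\sum_{i\in\mathcal{P}} \ell_i - \sum_{i\in\mathcal{N}} \ell_i\Big)\Big).$$

By Assumption 2, we have $\sigma_z(\ell_j) = p(z\mid c_j)$, and since the softmax is shift-invariant, there exists $\beta_j\in\mathbb{R}$ such that $\ell_{j,z} = \log(p(z\mid c_j)) + \beta_j$. Using this representation, and the notation $p(\cdot\mid c_j) := (p(z\mid c_j))_{z\in[V]}$ with $\log$ applied element-wise to vectors, we get

$$\begin{aligned} \sigma_z\big(f_\alpha(c_j)\big) &= \sigma_z\Big(\log p(\cdot\mid c_j) + \beta_j\mathbf{1} + \frac{\alpha}{q}\Big(\sum_{u\in\mathcal{P}} \log p(\cdot\mid c_u) - \sum_{v\in\mathcal{N}} \log p(\cdot\mid c_v) + \sum_{u\in\mathcal{P}} \beta_u\mathbf{1} - \sum_{v\in\mathcal{N}} \beta_v\mathbf{1}\Big)\Big) \\ &= \sigma_z\Big(\log p(\cdot\mid c_j) + \frac{\alpha}{q}\Big(\sum_{u\in\mathcal{P}} \log p(\cdot\mid c_u) - \sum_{v\in\mathcal{N}} \log p(\cdot\mid c_v)\Big) + \beta_j\mathbf{1} + \frac{\alpha}{q}\Big(\sum_{u\in\mathcal{P}} \beta_u\mathbf{1} - \sum_{v\in\mathcal{N}} \beta_v\mathbf{1}\Big)\Big) \\ &= \sigma_z\Big(\log p(\cdot\mid c_j) + \frac{\alpha}{q}\Big(\sum_{u\in\mathcal{P}} \log p(\cdot\mid c_u) - \sum_{v\in\mathcal{N}} \log p(\cdot\mid c_v)\Big)\Big), \end{aligned}$$

where the last step uses shift invariance again to drop the multiple of $\mathbf{1}$. In the following, products $\prod$ and divisions of the vectors $p(\cdot\mid c_j)$ are taken element-wise:

$$\sigma_z\big(f_\alpha(c_j)\big) = \sigma_z\Big(\log p(\cdot\mid c_j) + \frac{\alpha}{q}\log\frac{\prod_{u\in\mathcal{P}} p(\cdot\mid c_u)}{\prod_{v\in\mathcal{N}} p(\cdot\mid c_v)}\Big) = \sigma_z\big(\log p(\cdot\mid c_j) + \alpha m\big),$$

where $m := (M(1),\dots,M(V))^\top\in\mathbb{R}^V$ is the vector of log-odds. The final step is to write the softmax $\sigma_z(f_\alpha(c_j))$ explicitly:

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{\exp\big(\log p(z\mid c_j) + \alpha M(z)\big)}{\sum_{z'\in[V]} \exp\big(\log p(z'\mid c_j) + \alpha M(z')\big)} = \frac{\exp\big(\log p(z\mid c_j)\big)\exp(\alpha M(z))}{\sum_{z'\in[V]} \exp\big(\log p(z'\mid c_j)\big)\exp(\alpha M(z'))} = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}.$$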
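The reweighting identity of Lemma B.3 can be verified numerically. In this sketch the conditional distributions are drawn at random and the per-context shifts `beta` are arbitrary; both are hypothetical stand-ins for a perfectly trained model in the sense of Assumption 2.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(1)
V, m, q = 6, 5, 2                               # hypothetical sizes
P, N = [0, 1], [2, 3]                           # hypothetical positive / negative contexts
probs = rng.dirichlet(np.ones(V), size=m)       # row i is p(. | c_i)
beta = rng.normal(size=m)                       # arbitrary per-context logit shifts
logits = np.log(probs) + beta[:, None]          # softmax(logits[i]) == probs[i]

# Log-odds vector m = (M(1), ..., M(V)).
M = (np.log(probs[P]).sum(axis=0) - np.log(probs[N]).sum(axis=0)) / q

j, alpha = 4, -0.8
steered_logits = logits[j] + (alpha / q) * (logits[P].sum(axis=0) - logits[N].sum(axis=0))

lhs = softmax(steered_logits)                   # softmax of the steered logits (Lemma B.2)
rhs = probs[j] * np.exp(alpha * M)              # reweighting by exp(alpha * M(z))
rhs /= rhs.sum()
print(np.max(np.abs(lhs - rhs)))
```

The shifts `beta` cancel exactly, which is the shift-invariance step of the proof.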

As a first step toward formalizing why steering makes concept tokens more likely as α increases, we establish a sign-separation property for the log-odds $M(z)$:

Lemma B.4 (Log-odds M(z) sign separation). Assume Assumption 1. Let $\mathcal{T}$ be the target concept. For any $z\in[V]$, we have $z\in\mathcal{T}$ if, and only if, $M(z)>0$.

Proof. Given $z\in[V]$, $\mathcal{T}$ the target concept, and the dataset of Definition 2.1 satisfying Assumption 1, the log-odds $M(z)$ can be rewritten as:

$$M(z) = \frac{1}{q}\log\frac{\prod_{i\in\mathcal{P}} p(z\mid c_i)}{\prod_{i\in\mathcal{N}} p(z\mid c_i)} = \begin{cases} \dfrac{1}{q}\log\dfrac{(a_z)^q}{(b_z)^q}, & \text{if } z\in\mathcal{T}, \\[3ex] \dfrac{1}{q}\log\dfrac{(b_z)^q}{(a_z)^{q_z}(b_z)^{q-q_z}}, & \text{otherwise} \end{cases} = \begin{cases} \log\dfrac{a_z}{b_z}, & \text{if } z\in\mathcal{T}, \\[3ex] -\dfrac{q_z}{q}\log\dfrac{a_z}{b_z}, & \text{otherwise,} \end{cases}$$

with $q_z := |\{j\in\mathcal{N} : \exists k\in[G],\ c_j, z\in\mathcal{C}_k\}|\in\mathbb{N}$ (note that it can be 0). Using the above rewriting, we obtain that in the first case ($z\in\mathcal{T}$), $M(z) = \log\frac{a_z}{b_z} > 0$ by Assumption 1. Otherwise, $M(z) = -\frac{q_z}{q}\log\frac{a_z}{b_z}\le 0$, again by Assumption 1.

Now let us compute the derivative of ∆p(α):

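Lemma B.5 (Derivative of ∆p(α)). Let $z\in[V]$, $j\in[m]$. We have the following derivative with respect to α:

$$\Delta' p(z\mid c_j,\alpha) = \sigma_z\big(f_\alpha(c_j)\big)\Big(M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big).$$

Proof. First, let us denote $D_j(\alpha) := \sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))$. Using Lemma B.3, the derivation is as follows:

$$\begin{aligned} \Delta' p(z\mid c_j,\alpha) &= \frac{d}{d\alpha}\,\sigma_z\big(W(h_j+\alpha v)\big) = \frac{d}{d\alpha}\,\frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))} \\ &= \frac{p(z\mid c_j)\exp(\alpha M(z))\,M(z)\,D_j(\alpha) - p(z\mid c_j)\exp(\alpha M(z))\,D_j'(\alpha)}{D_j(\alpha)^2} \\ &= \frac{p(z\mid c_j)\exp(\alpha M(z))}{D_j(\alpha)}\Big(\frac{M(z)D_j(\alpha)}{D_j(\alpha)} - \frac{D_j'(\alpha)}{D_j(\alpha)}\Big) \\ &= \sigma_z\big(f_\alpha(c_j)\big)\Big(M(z) - \frac{D_j'(\alpha)}{D_j(\alpha)}\Big). \end{aligned}$$

The term $D_j'(\alpha)/D_j(\alpha)$ can be rewritten as follows:

$$\frac{D_j'(\alpha)}{D_j(\alpha)} = \frac{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))\,M(z')}{\sum_{z''\in[V]} p(z''\mid c_j)\exp(\alpha M(z''))} = \sum_{z'\in[V]} \frac{p(z'\mid c_j)\exp(\alpha M(z'))}{\sum_{z''\in[V]} p(z''\mid c_j)\exp(\alpha M(z''))}\,M(z') = \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)M(z') = \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)].$$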
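The closed-form derivative of Lemma B.5 can be compared against a central finite difference. The values of `p` and `M` below are hypothetical toy numbers; only the formula itself comes from the lemma.

```python
import numpy as np

def steered_probs(p, M, alpha):
    """p(z) * exp(alpha * M(z)) / normalizer, written with a shift for stability."""
    w = p * np.exp(alpha * (M - M.max()))
    return w / w.sum()

p = np.array([0.4, 0.3, 0.2, 0.1])      # hypothetical unsteered p(z | c_j)
M = np.array([1.5, 0.7, 0.0, -0.9])     # hypothetical log-odds M(z)
alpha, eps = 0.6, 1e-5

s = steered_probs(p, M, alpha)
analytic = s * (M - s @ M)              # sigma_z * (M(z) - E[M(Z)])

# Central finite difference of the steered probabilities in alpha.
numeric = (steered_probs(p, M, alpha + eps) - steered_probs(p, M, alpha - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))
```

Because the steered probabilities always sum to one, the derivatives sum to zero across tokens: mass gained by tokens with above-average log-odds is lost by the others.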

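B.3. Proof of Theorem 3.3

Theorem 3.3 is about the monotonicity of $\Delta p(\alpha)$. Hence, we need to study the sign of the derivative of $\Delta p$. As shown in Lemma B.5, the sign of $(\Delta p)'(\alpha)$ is governed by the difference

$$M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)].$$

In this difference, the only quantity that depends on α is $\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$. We therefore study how this expectation varies under steering by looking at its derivative:

$$\begin{aligned} \frac{d}{d\alpha}\,\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] &= \sum_{z'\in[V]} \frac{d}{d\alpha}\,\sigma_{z'}\big(f_\alpha(c_j)\big)\,M(z') \\ &= \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)\Big(M(z') - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big)M(z') \\ &= \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)M(z')^2 - \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)\,\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\,M(z') \\ &= \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}\big[M(Z)^2\big] - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)M(z') \\ &= \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}\big[M(Z)^2\big] - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]^2 \\ &= \mathrm{Var}_{Z\sim\sigma(f_\alpha(c_j))}\big(M(Z)\big), \end{aligned} \tag{10}$$

with $\mathrm{Var}(M(Z))>0$ if $M(Z)$ is not constant $\sigma(f_\alpha(c_j))$-almost surely. This means that $\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$ is strictly increasing in α on $\mathbb{R}$. Now let us compute the limits of this quantity as $\alpha\to\pm\infty$, which are

$$\lim_{\alpha\to+\infty} \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \max_{z\in[V]} M(z) =: M_{\max}, \qquad \lim_{\alpha\to-\infty} \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \min_{z\in[V]} M(z) =: M_{\min}.$$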

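Given $\delta>0$, we introduce the set $A_\delta := \{z\in[V] : M_{\max} - M(z)\le\delta\}$ to control the following difference:

$$\begin{aligned} M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] &= \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)\big(M_{\max} - M(z')\big) \\ &= \sum_{z'\in A_\delta} \sigma_{z'}\big(f_\alpha(c_j)\big)\big(M_{\max} - M(z')\big) + \sum_{z'\in A_\delta^{\complement}} \sigma_{z'}\big(f_\alpha(c_j)\big)\big(M_{\max} - M(z')\big) \\ &\le \delta\sum_{z'\in A_\delta} \sigma_{z'}\big(f_\alpha(c_j)\big) + (M_{\max} - M_{\min})\sum_{z'\in A_\delta^{\complement}} \sigma_{z'}\big(f_\alpha(c_j)\big) \\ &\le \delta + (M_{\max} - M_{\min})\sum_{z'\in A_\delta^{\complement}} \sigma_{z'}\big(f_\alpha(c_j)\big). \end{aligned}$$

Let $z^\star$ be a token which attains the maximum log-odds $M_{\max}$. This is necessarily a concept token in $\mathcal{T}$ because $a_z>b_z$ (Assumption 1). We now show that $\lim_{\alpha\to+\infty}\sum_{z'\in A_\delta^{\complement}} \sigma_{z'}(f_\alpha(c_j)) = 0$. If $A_\delta^{\complement}=\emptyset$ (for sufficiently large δ), the sum is zero by convention. Otherwise, we proceed as follows for $\alpha>0$, using Lemma B.3 and lower bounding the denominator by $p(z^\star\mid c_j)\exp(\alpha M_{\max})$:

$$0 < \sum_{z'\in A_\delta^{\complement}} \sigma_{z'}\big(f_\alpha(c_j)\big) = \sum_{z'\in A_\delta^{\complement}} \frac{p(z'\mid c_j)\exp(\alpha M(z'))}{\sum_{z''\in[V]} p(z''\mid c_j)\exp(\alpha M(z''))} \le \sum_{z'\in A_\delta^{\complement}} \frac{p(z'\mid c_j)\exp\big(\alpha(M_{\max}-\delta)\big)}{p(z^\star\mid c_j)\exp(\alpha M_{\max})} = \exp(-\delta\alpha)\sum_{z'\in A_\delta^{\complement}} \frac{p(z'\mid c_j)}{p(z^\star\mid c_j)}.$$

By taking the limit in the previous display, we get $\lim_{\alpha\to+\infty}\sum_{z'\in A_\delta^{\complement}} \sigma_{z'}(f_\alpha(c_j)) = 0$.

Finally, we take the lim sup as follows:

$$\limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le \limsup_{\alpha\to+\infty}\Big(\delta + (M_{\max} - M_{\min})\sum_{z'\in A_\delta^{\complement}} \sigma_{z'}\big(f_\alpha(c_j)\big)\Big) \implies \limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le \delta.$$

This bound holds for every $\delta>0$; letting $\delta\to0^+$ gives

$$\limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le 0.$$

To finish, one remarks that

$$0 \le \liminf_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le \limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le 0,$$

which implies that the limit does in fact exist and $\lim_{\alpha\to+\infty}\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = M_{\max}$. Very similar derivations give $\lim_{\alpha\to-\infty}\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = M_{\min}$.

Write $\overline{\mathcal{M}}$ (resp. $\underline{\mathcal{M}}$) for the set of tokens attaining $M_{\max}$ (resp. $M_{\min}$). Since $\alpha\mapsto\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$ is continuous and strictly increasing on $\mathbb{R}$, and its limits at $\pm\infty$ are known, there exist unique thresholds $\alpha(j,z)\in\mathbb{R}$ such that

$$z\in\mathcal{T}\setminus\overline{\mathcal{M}}:\quad \mathbb{E}_{Z\sim\sigma(f_{\alpha(j,z)}(c_j))}[M(Z)] = M(z), \qquad z\in\overline{\mathcal{M}}:\quad \alpha(z) := +\infty,$$

and

$$z\in\mathcal{T}^{\complement}\setminus\underline{\mathcal{M}}:\quad \mathbb{E}_{Z\sim\sigma(f_{\alpha(j,z)}(c_j))}[M(Z)] = M(z), \qquad z\in\underline{\mathcal{M}}:\quad \alpha(z) := -\infty.$$

First, for the limit case ($z\in\overline{\mathcal{M}}\cup\underline{\mathcal{M}}$), we drop the dependency of $\alpha(j,z)$ on $j$, as it is always equal to $\pm\infty$. For $z\in\overline{\mathcal{M}}$, we take $\alpha(z) := +\infty$ because $\lim_{\alpha\to+\infty}\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = M_{\max}$. Moreover, the minimum log-odds $M_{\min}$ cannot be attained by concept tokens $z\in\mathcal{T}$ since $a_z>b_z$ (Assumption 1), hence $M(z)\ne M_{\min}$ for all $z\in\mathcal{T}$ and $\alpha(z) := -\infty$ for $z\in\underline{\mathcal{M}}$.

Finally, all bullet points of Theorem 3.3 follow directly from the previous arguments. For the first point, fix a token $z\in[V]\setminus(\overline{\mathcal{M}}\cup\underline{\mathcal{M}})$. Then $M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$ is positive on $(-\infty,\alpha(j,z))$, zero at $\alpha(j,z)$, and negative for $\alpha>\alpha(j,z)$, which yields the bump behavior. The second point follows from the sign separation of the log-odds (Lemma B.4) together with the fact that $\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$ is increasing, which implies $\alpha(j,z')<\alpha(j,z)$ for $z\in\mathcal{T}$ and $z'\notin\mathcal{T}$. The final point again follows from the fact that for $z\in\overline{\mathcal{M}}\cup\underline{\mathcal{M}}$, the sign of $M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$ does not change with α: it remains positive for $z\in\mathcal{T}$ and negative otherwise, since $\alpha(j,z)=\pm\infty$ for such tokens.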
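The argument above suggests a simple way to locate the bump of a given token numerically: since the steered expectation of the log-odds is strictly increasing in α, the threshold at which it crosses M(z) can be found by bisection. The sketch below uses hypothetical toy values for `p` and `M`, with tokens 0 and 1 playing the role of target tokens; the search bracket and helper names are our own choices.

```python
import numpy as np

def expected_log_odds(p, M, alpha):
    """E[M(Z)] under the steered distribution p(z) * exp(alpha * M(z)) / normalizer."""
    w = p * np.exp(alpha * (M - M.max()))
    s = w / w.sum()
    return s @ M

def threshold(p, M, z, lo=-50.0, hi=50.0, iters=200):
    """Bisection for alpha with E_alpha[M(Z)] = M(z); needs M_min < M(z) < M_max."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_log_odds(p, M, mid) < M[z]:
            lo = mid            # expectation still below M(z): threshold lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = np.array([0.05, 0.05, 0.30, 0.35, 0.25])   # hypothetical p(z | c_j), off-target context
M = np.array([2.0, 1.2, 0.0, -0.5, -1.0])      # hypothetical log-odds; tokens 0, 1 are targets

alphas = np.linspace(-10, 10, 401)
E = np.array([expected_log_odds(p, M, a) for a in alphas])   # increasing in alpha, Eq. (10)

t1 = threshold(p, M, z=1)    # target token that does not attain M_max
t3 = threshold(p, M, z=3)    # off-target token that does not attain M_min
print(t3, t1)
```

In this toy configuration the off-target threshold is negative and the target threshold is positive, so the target token's probability peaks at a positive steering strength.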

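Proof of Remark 3.4. Let $c_j\notin\mathcal{T}$ and denote by $z_1$ the concept token with the minimum log-odds in the group $\mathcal{T}$. To show that the bump for concept tokens $z\in\mathcal{T}$ happens for positive α, it suffices to show that $\alpha(j,z_1)>0$ (as $\alpha(j,z_1)\le\alpha(j,z)$ for $z\in\mathcal{T}$ by strict monotonicity of $\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$). We prove this fact on a specification of the dataset from Definition 2.1 (and Assumption 1), defined as follows:

$$\forall z\in[V], \qquad a_z := (1-\varepsilon)\,\gamma_z, \qquad b_z := \frac{\varepsilon}{G-1}\,\omega_z,$$

where $\gamma_z\in(0,1)$ satisfies $\sum_{z'\in\mathcal{C}_k}\gamma_{z'}=1$ for each $k\in[G]$ (with the same conditions for $\omega_z$). The coefficients $\gamma_z$ and $\omega_z$ are chosen so that Assumption 1 holds, i.e., $a_z>b_z$, and we assume $\varepsilon\in\big(0,\frac{G-1}{G}\big)$.

Proving that $\alpha(j,z_1)>0$ for $\varepsilon>0$ small enough when $c_j\notin\mathcal{T}$ is equivalent to showing

$$M(z_1) = \mathbb{E}_{Z\sim\sigma(f_{\alpha(j,z_1)}(c_j))}[M(Z)] > \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] \tag{11}$$

by strict monotonicity of $\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]$ in α. The previous inequality is hard to prove directly, as $\mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)]$ has a non-trivial expression, so we start by bounding it:

$$\begin{aligned} \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] &= \sum_{z'\in[V]} \sigma_{z'}\big(f(c_j)\big)M(z') \\ &= \sum_{z'\in[V]} p(z'\mid c_j)M(z') && \text{(by Assumption 2)} \\ &\le \sum_{z'\in\mathcal{T}} p(z'\mid c_j)M(z') && \text{(as } M(z)\le0 \text{ for } z\notin\mathcal{T}\text{, see Lemma B.4)} \\ &= \sum_{z'\in\mathcal{T}} b_{z'}M(z') && \text{(as } p(z\mid c_j)=b_z \text{ for } c_j\notin\mathcal{T} \text{ and } z\in\mathcal{T}\text{)} \\ &= \sum_{z'\in\mathcal{T}} b_{z'}\log\frac{a_{z'}}{b_{z'}} && \text{(as } M(z)=\log(a_z/b_z) \text{ for } z\in\mathcal{T}\text{, see Lemma B.4).} \end{aligned}$$

In this specific dataset, $\beta := \sum_{z''\in\mathcal{T}} b_{z''} = \frac{\varepsilon}{G-1}$ and $\rho := \sum_{z''\in\mathcal{T}} a_{z''} = 1-\varepsilon$. We continue the bounding process as follows:

$$\mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] \le \sum_{z'\in\mathcal{T}} b_{z'}\log\frac{a_{z'}}{b_{z'}} = \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\Big(\frac{\rho}{\beta}\cdot\frac{a_{z'}/\rho}{b_{z'}/\beta}\Big) = \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{\rho}{\beta} + \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{a_{z'}/\rho}{b_{z'}/\beta}.$$

We remark that $\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{a_{z'}/\rho}{b_{z'}/\beta}$ is equal to the negative of the Kullback–Leibler divergence between $(b_{z'}/\beta)_{z'\in\mathcal{T}}$ and $(a_{z'}/\rho)_{z'\in\mathcal{T}}$, denoted $\mathrm{KL}(b_\cdot/\beta\,\|\,a_\cdot/\rho)$:

$$\begin{aligned} \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] &\le \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{\rho}{\beta} - \beta\,\mathrm{KL}(b_\cdot/\beta\,\|\,a_\cdot/\rho) \\ &\le \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{\rho}{\beta} && \text{(by Gibbs' inequality, } \mathrm{KL}(b_\cdot/\beta\,\|\,a_\cdot/\rho)\ge0\text{)} \\ &= \beta\log\frac{\rho}{\beta} && \text{(as } \textstyle\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}=1\text{)} \\ &= \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon}. \end{aligned}$$

Recall that $M(z_1) = \log\frac{(1-\varepsilon)(G-1)\,\gamma_{z_1}}{\varepsilon\,\omega_{z_1}}$ by the proof of Lemma B.4. To avoid a complicated solution to Inequality (11) involving the Lambert W function, we compute the limit as $\varepsilon\to0^+$ of $F(\cdot)$ defined as:

$$F(\varepsilon) := M(z_1) - \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon} = \log\frac{(1-\varepsilon)(G-1)\,\gamma_{z_1}}{\varepsilon\,\omega_{z_1}} - \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon}.$$

We now compute the limit as follows:

$$\lim_{\varepsilon\to0^+} \log\frac{(1-\varepsilon)(G-1)\,\gamma_{z_1}}{\varepsilon\,\omega_{z_1}} = +\infty \qquad \Big(\text{as } \frac{\gamma_{z_1}(G-1)}{\omega_{z_1}}>0\Big),$$

$$\lim_{\varepsilon\to0^+} \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon} = \lim_{\varepsilon\to0^+} \frac{\varepsilon}{G-1}\log\Big(\Big(\frac{1}{\varepsilon}-1\Big)(G-1)\Big) = 0 \qquad \Big(\text{as } \lim_{x\to+\infty}\log(x)/x=0\Big).$$

So $\lim_{\varepsilon\to0^+}F(\varepsilon)=+\infty$, which means that there exists $\varepsilon_0<(G-1)/G$ such that $F(\varepsilon)>0$ for all $\varepsilon\in(0,\varepsilon_0)$. For $\varepsilon\in(0,\varepsilon_0)$, combining $F(\varepsilon)>0$ with the upper bound on $\mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)]$ gives Inequality (11):

$$M(z_1) > \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon} \ge \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)],$$

which is equivalent to $\alpha(j,z_1)>0$, as desired.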
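To get a feel for how small ε must be, one can simply evaluate the function F from the proof. The values G = 3 and γ = ω = 0.25 below are hypothetical choices for illustration, not parameters reported in the paper.

```python
import numpy as np

def F(eps, G, gamma_z1, omega_z1):
    """F(eps) = M(z1) - eps/(G-1) * log((1-eps)(G-1)/eps), as defined in the proof."""
    m_z1 = np.log((1 - eps) * (G - 1) * gamma_z1 / (eps * omega_z1))
    bound = eps / (G - 1) * np.log((1 - eps) * (G - 1) / eps)
    return m_z1 - bound

# Hypothetical setting: G = 3 concepts, gamma_{z1} = omega_{z1} = 0.25.
for eps in (1e-1, 1e-2, 1e-4, 1e-6):
    print(f"eps={eps:.0e}  F(eps)={F(eps, 3, 0.25, 0.25):.4f}")
```

In this setting F is already positive for moderate ε and grows without bound as ε shrinks, consistent with the limit computed above.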

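B.4. Proof of Theorem 3.6

Proof. Fix a context index $j\in[m]$ and a concept $\mathcal{C}$. Define

$$F_{\mathcal{C},j}(\alpha) := \sum_{z\in\mathcal{C}} \sigma_z\big(f_\alpha(c_j)\big).$$

By Definition 3.5, Definition 3.1 and Assumption 2,

$$\Delta p(\mathcal{C}\mid c_j,\alpha) = \frac{1}{|\mathcal{C}|}\sum_{z\in\mathcal{C}} \Big(\sigma_z\big(f_\alpha(c_j)\big) - p(z\mid c_j)\Big) = \frac{F_{\mathcal{C},j}(\alpha) - F_{\mathcal{C},j}(0)}{|\mathcal{C}|}.$$

By Lemma B.3,

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}.$$

Let $Z\sim\big(\sigma_z(f_\alpha(c_j))\big)_{z\in[V]}$ and set

$$\mu_{\mathcal{C},j}(\alpha) := \mathbb{E}[M(Z)\mid Z\in\mathcal{C}], \qquad \mu_{\mathcal{C}^{\complement},j}(\alpha) := \mathbb{E}[M(Z)\mid Z\notin\mathcal{C}].$$

Using Lemma B.5 and summing over $z\in\mathcal{C}$,

$$\begin{aligned} \frac{d}{d\alpha}F_{\mathcal{C},j}(\alpha) &= \sum_{z\in\mathcal{C}} \sigma_z\big(f_\alpha(c_j)\big)\big(M(z) - \mathbb{E}[M(Z)]\big) \\ &= \sum_{z\in\mathcal{C}} F_{\mathcal{C},j}(\alpha)\,\frac{\sigma_z\big(f_\alpha(c_j)\big)}{F_{\mathcal{C},j}(\alpha)}\big(M(z) - \mathbb{E}[M(Z)]\big) \\ &= F_{\mathcal{C},j}(\alpha)\Big(\sum_{z\in\mathcal{C}} \frac{\sigma_z\big(f_\alpha(c_j)\big)}{F_{\mathcal{C},j}(\alpha)}M(z) - \mathbb{E}[M(Z)]\sum_{z\in\mathcal{C}} \frac{\sigma_z\big(f_\alpha(c_j)\big)}{F_{\mathcal{C},j}(\alpha)}\Big) \\ &= F_{\mathcal{C},j}(\alpha)\big(\mu_{\mathcal{C},j}(\alpha) - \mathbb{E}[M(Z)]\big). \end{aligned}$$

Moreover, by the law of total expectation and using that $\mathbb{P}(Z\in\mathcal{C}) = F_{\mathcal{C},j}(\alpha)$ (as $Z$ is a discrete random variable),

$$\mathbb{E}[M(Z)] = F_{\mathcal{C},j}(\alpha)\,\mu_{\mathcal{C},j}(\alpha) + \big(1 - F_{\mathcal{C},j}(\alpha)\big)\,\mu_{\mathcal{C}^{\complement},j}(\alpha).$$

Therefore, $F_{\mathcal{C},j}$ satisfies the following ordinary differential equation (ODE), which is the ODE satisfied by the sigmoid function up to the factor $\mu_{\mathcal{C},j}(\alpha) - \mu_{\mathcal{C}^{\complement},j}(\alpha)$:

$$\frac{d}{d\alpha}F_{\mathcal{C},j}(\alpha) = F_{\mathcal{C},j}(\alpha)\big(1 - F_{\mathcal{C},j}(\alpha)\big)\big(\mu_{\mathcal{C},j}(\alpha) - \mu_{\mathcal{C}^{\complement},j}(\alpha)\big). \tag{12}$$

Since $F_{\mathcal{C},j}(\alpha)\in(0,1)$, we can divide both sides by $F_{\mathcal{C},j}(\alpha)\big(1 - F_{\mathcal{C},j}(\alpha)\big)$, and direct computations yield

$$\frac{d}{d\alpha}\log\frac{F_{\mathcal{C},j}(\alpha)}{1 - F_{\mathcal{C},j}(\alpha)} = \mu_{\mathcal{C},j}(\alpha) - \mu_{\mathcal{C}^{\complement},j}(\alpha).$$

Integrating both sides of the previous display from 0 to α yields

$$\log\frac{F_{\mathcal{C},j}(\alpha)}{1 - F_{\mathcal{C},j}(\alpha)} = r_j + \nu_j(\alpha), \qquad r_j := \log\frac{F_{\mathcal{C},j}(0)}{1 - F_{\mathcal{C},j}(0)}, \qquad \nu_j(\alpha) := \int_0^\alpha \big(\mu_{\mathcal{C},j}(t) - \mu_{\mathcal{C}^{\complement},j}(t)\big)\,dt.$$

The left-hand side is the logit function, which is the inverse of the sigmoid function $\phi$. Hence $F_{\mathcal{C},j}(\alpha) = \phi(\nu_j(\alpha) + r_j)$ and

$$\Delta p(\mathcal{C}\mid c_j,\alpha) = \frac{1}{|\mathcal{C}|}\big(F_{\mathcal{C},j}(\alpha) - F_{\mathcal{C},j}(0)\big) = \frac{1}{|\mathcal{C}|}\big(\phi(\nu_j(\alpha) + r_j) - \phi(r_j)\big) = \frac{1}{2|\mathcal{C}|}\Big(\tanh\Big(\frac{\nu_j(\alpha) + r_j}{2}\Big) - \tanh\Big(\frac{r_j}{2}\Big)\Big),$$

as $\tanh(x) = 2\phi(2x) - 1$. Setting $r_j' := \tanh\big(\frac{r_j}{2}\big)$ gives the claimed representation in Theorem 3.6.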
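The sigmoid representation can be checked by integrating the gap between the two conditional means numerically and comparing with the concept probability computed directly. This is a sketch under hypothetical toy values for `p`, `M` and the concept `C`; the trapezoidal rule and grid size are our own choices.

```python
import numpy as np

def steered_probs(p, M, alpha):
    w = p * np.exp(alpha * (M - M.max()))
    return w / w.sum()

def gap(p, M, C, alpha):
    """mu_C(alpha) - mu_{C^c}(alpha): difference of the conditional means of M(Z)."""
    s = steered_probs(p, M, alpha)
    inside = np.zeros(len(p), dtype=bool)
    inside[C] = True
    mu_in = s[inside] @ M[inside] / s[inside].sum()
    mu_out = s[~inside] @ M[~inside] / s[~inside].sum()
    return mu_in - mu_out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p = np.array([0.05, 0.10, 0.35, 0.30, 0.20])   # hypothetical unsteered p(z | c_j)
M = np.array([1.8, 1.1, 0.0, -0.4, -0.9])      # hypothetical log-odds
C = [0, 1]                                     # hypothetical concept (token indices)
alpha = 3.0

F0 = steered_probs(p, M, 0.0)[C].sum()
r = np.log(F0 / (1.0 - F0))                    # r_j: logit of the unsteered concept mass

# nu_j(alpha): integral of the gap from 0 to alpha (trapezoidal rule).
ts = np.linspace(0.0, alpha, 20001)
g = np.array([gap(p, M, C, t) for t in ts])
nu = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(ts))

F_direct = steered_probs(p, M, alpha)[C].sum()
F_sigmoid = sigmoid(nu + r)
delta_p_C = (F_direct - F0) / len(C)
delta_p_tanh = (np.tanh((nu + r) / 2) - np.tanh(r / 2)) / (2 * len(C))
print(F_direct, F_sigmoid, delta_p_C, delta_p_tanh)
```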

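Proving the remaining statements of Theorem 3.6. Let $\mathcal{T}$ be the target concept to steer. Lemma B.4 gives $M(z)>0$ for $z\in\mathcal{T}$ and $M(z)\le0$ for $z\notin\mathcal{T}$, hence $\mu_{\mathcal{T},j}(\alpha)>0$ and $\mu_{\mathcal{T}^{\complement},j}(\alpha)\le0$ for all α. Thus $\mu_{\mathcal{T},j}(\alpha) - \mu_{\mathcal{T}^{\complement},j}(\alpha)>0$, implying $\frac{d}{d\alpha}F_{\mathcal{T},j}(\alpha)>0$ by Equation (12). That is, $\Delta p(\mathcal{T}\mid c_j,\alpha)$ is strictly increasing in α. Additionally, the growth of $\nu_j$ is at most linear because $\mu_{\mathcal{T},j}(t) - \mu_{\mathcal{T}^{\complement},j}(t)\le\max_{z\in[V]}M(z) - \min_{z\in[V]}M(z)$, a bound that does not depend on α. Integrating this bound gives:

$$|\nu_j(\alpha)| \le \Big(\max_{z\in[V]}M(z) - \min_{z\in[V]}M(z)\Big)|\alpha|.$$

Next, write $\overline{\mathcal{M}}$ and $\underline{\mathcal{M}}$ for the sets of tokens attaining the maximal and minimal log-odds. If $\mathcal{C}'\ne\mathcal{T}$ and $\mathcal{C}'\cap(\overline{\mathcal{M}}\cup\underline{\mathcal{M}})=\emptyset$, Equation (6) in Proposition B.1 implies $F_{\mathcal{C}',j}(\alpha)\to0$ as $\alpha\to\pm\infty$, hence

$$\lim_{\alpha\to\pm\infty}\Delta p(\mathcal{C}'\mid c_j,\alpha) = \lim_{\alpha\to\pm\infty}\frac{F_{\mathcal{C}',j}(\alpha) - F_{\mathcal{C}',j}(0)}{|\mathcal{C}'|} = -\frac{F_{\mathcal{C}',j}(0)}{|\mathcal{C}'|} = -\frac{1}{|\mathcal{C}'|}\sum_{z\in\mathcal{C}'} p(z\mid c_j).$$

Finally, if $\max_{z\in\mathcal{C}}M(z)\le\min_{z\notin\mathcal{C}}M(z)$, then for all α,

$$\mu_{\mathcal{C},j}(\alpha) \le \max_{z\in\mathcal{C}}M(z) \le \min_{z\notin\mathcal{C}}M(z) \le \mu_{\mathcal{C}^{\complement},j}(\alpha),$$

so $\mu_{\mathcal{C},j}(\alpha) - \mu_{\mathcal{C}^{\complement},j}(\alpha)\le0$, implying $\frac{d}{d\alpha}F_{\mathcal{C},j}(\alpha)\le0$ by Equation (12). That is, $\Delta p(\mathcal{C}\mid c_j,\alpha)$ is decreasing in α.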

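B.5. Proof of Theorem 3.8

First, as in Thrampoulidis (2024), we can rewrite the cross-entropy as follows:

$$\mathrm{CE}(f) := -\sum_{j\in[m]} \pi_j\sum_{z\in[V]} p(z\mid c_j)\log\big(\sigma_z(f(c_j))\big),$$

where $\pi_j\in(0,1]$ is the probability of each distinct context $c_j$, defined as

$$\pi_j := \frac{1}{n}\sum_{i\in[n]} \mathbb{1}_{c_i=c_j}.$$

Then, the second-order Taylor expansion of $\Delta\mathrm{CE}(\alpha)$ around $\alpha=0$ gives:

$$\Delta\mathrm{CE}(\alpha) = \Delta\mathrm{CE}(0) + \Delta\mathrm{CE}'(0)\,\alpha + \frac{1}{2}\Delta\mathrm{CE}''(0)\,\alpha^2 + o(\alpha^2). \tag{13}$$

By definition, $\Delta\mathrm{CE}(0)=0$. We start by computing the derivative $\Delta\mathrm{CE}'(\alpha)$ using Lemma B.5 and the chain rule:

$$\begin{aligned} \Delta\mathrm{CE}'(\alpha) &= -\sum_{j\in[m]} \pi_j\sum_{z\in[V]} p(z\mid c_j)\,\frac{d}{d\alpha}\log\big(\sigma_z(f_\alpha(c_j))\big) + 0 \\ &= -\sum_{j\in[m]} \pi_j\sum_{z\in[V]} p(z\mid c_j)\,\frac{\sigma_z\big(f_\alpha(c_j)\big)\big(M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\big)}{\sigma_z\big(f_\alpha(c_j)\big)} \\ &= \sum_{j\in[m]} \pi_j\sum_{z\in[V]} p(z\mid c_j)\big(\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] - M(z)\big) \\ &= \sum_{j\in[m]} \pi_j\Big(\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\sum_{z\in[V]} p(z\mid c_j) - \sum_{z\in[V]} p(z\mid c_j)M(z)\Big) \\ &= \sum_{j\in[m]} \pi_j\Big(\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] - \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)]\Big) && \Big(\text{as } \textstyle\sum_{z\in[V]} p(z\mid c_j)=1\Big). \end{aligned}$$

Under Assumption 2, we have $\sigma_z(f(c_j)) = p(z\mid c_j)$, which implies that $\Delta\mathrm{CE}'(0)=0$.

Using Equation (10),

$$\frac{d}{d\alpha}\,\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \mathrm{Var}_{Z\sim\sigma(f_\alpha(c_j))}\big(M(Z)\big),$$

we compute the second derivative $\Delta\mathrm{CE}''(\alpha)$:

$$\Delta\mathrm{CE}''(\alpha) = \sum_{j\in[m]} \pi_j\,\mathrm{Var}_{Z\sim\sigma(f_\alpha(c_j))}\big(M(Z)\big).$$

In the statement of Theorem 3.8, we define $\mathrm{Var}_j(M(Z)) := \mathrm{Var}_{Z\sim\sigma(f(c_j))}(M(Z))$. We finish the proof by substituting the first and second derivatives at $\alpha=0$ into the Taylor expansion of Equation (13).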
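The local U-shape can be illustrated by comparing the exact cross-entropy change with the quadratic term of the expansion for a small steering strength. The three conditional distributions, the context weights `pi`, and the choice of one positive and one negative context are hypothetical; the unsteered model is assumed to reproduce `probs` exactly (Assumption 2).

```python
import numpy as np

# Hypothetical dataset: 3 contexts, 4 tokens; row j is p(. | c_j).
probs = np.array([[0.50, 0.20, 0.20, 0.10],
                  [0.10, 0.20, 0.30, 0.40],
                  [0.25, 0.25, 0.25, 0.25]])
pi = np.array([0.5, 0.25, 0.25])               # context probabilities pi_j
P, N, q = [0], [1], 1                          # one positive and one negative context
M = (np.log(probs[P]).sum(axis=0) - np.log(probs[N]).sum(axis=0)) / q

def steered(alpha):
    """Row j: p(z | c_j) * exp(alpha * M(z)), renormalised (Lemma B.3)."""
    w = probs * np.exp(alpha * M)
    return w / w.sum(axis=1, keepdims=True)

def cross_entropy(model_probs):
    return -(pi[:, None] * probs * np.log(model_probs)).sum()

def delta_ce(alpha):
    return cross_entropy(steered(alpha)) - cross_entropy(probs)

# Var_j(M(Z)) under the unsteered distribution of each context.
var = probs @ M**2 - (probs @ M) ** 2
curvature = pi @ var                           # second derivative of Delta CE at alpha = 0

alpha = 1e-3
quadratic = 0.5 * curvature * alpha**2
print(delta_ce(alpha), delta_ce(-alpha), quadratic)
```

Both signs of α increase the cross-entropy by approximately the same amount, which is the local U-shape around α = 0.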

B.6. Proof of Proposition 4.1

Proving the expression of the steered logits y(α) from Section 4. Using the notation of Section 4, we apply steering at layer $\ell$ by modifying the residual stream $h^{(\ell)}$. We track the effect of this intervention across subsequent layers by defining the steered residual streams $h^{(k,\alpha)}$ inductively as

$$\begin{cases} h^{(\ell,\alpha)} := h^{(\ell)} + \alpha v, & \text{(initialization)} \\ h^{(k+1,\alpha)} := h^{(k,\alpha)} + F\big(h^{(k,\alpha)}\big), & \text{for } k\in[\ell, L-1]. \end{cases}$$

Here, $F(h) := \mathrm{ATTN}(\mathrm{LN}(h)) + \mathrm{FFN}[\mathrm{LN}\{h + \mathrm{ATTN}(\mathrm{LN}(h))\}]$ captures the update applied by a single transformer block. This definition is a direct reformulation of Eq. (4) adapted to our steering setting. Unrolling this recursion up to the final layer yields

$$h^{(L,\alpha)} = h^{(\ell)} + \alpha v + R(\alpha),$$

where $R(\alpha) := \sum_{k\in[\ell,L-1]} F\big(h^{(k,\alpha)}\big)$ aggregates all downstream effects induced by the steering intervention. Substituting $h^{(L,\alpha)}$ for $h^{(L)}$ in $y := \mathrm{LN}\big(h^{(L)}\big)W^\top$ then gives the steered logits expression

$$y(\alpha) := \mathrm{LN}\big(h^{(\ell)} + \alpha v + R(\alpha)\big)W^\top.$$

Proving Proposition 4.1. The key observation is that the presence of layer normalization inside the definition of $F$ ensures that each component of $R(\alpha)$ remains bounded (for arbitrarily large α), i.e., there exists a constant $c_R$ independent of α such that:

$$|(R(\alpha))_{i,j}| \le c_R.$$

To formalize the previous display, consider RMSNorm applied to a single token representation $h\in\mathbb{R}^d$:

$$\mathrm{LN}(h) := \sqrt{d}\,\frac{h}{\|h\|}\odot\gamma,$$

where $\gamma\in\mathbb{R}^d$ and $\odot$ denotes the Hadamard product. Then, as $\alpha\to+\infty$,

$$\mathrm{LN}(h+\alpha v) = \sqrt{d}\,\frac{h+\alpha v}{\|h+\alpha v\|}\odot\gamma \longrightarrow \sqrt{d}\,\frac{v}{\|v\|}\odot\gamma = \mathrm{LN}(v),$$

and, similarly, as $\alpha\to-\infty$,

$$\mathrm{LN}(h+\alpha v) \longrightarrow -\sqrt{d}\,\frac{v}{\|v\|}\odot\gamma = \mathrm{LN}(-v).$$

The same argument applies to LayerNorm. As a result, the dominant term in $h^{(\ell)} + \alpha v + R(\alpha)$ as $\alpha\to\pm\infty$ is $\alpha v$, meaning

$$\big(h^{(\ell)} + \alpha v + R(\alpha)\big)_{i,j} \underset{\pm\infty}{\sim} \alpha\,v_{i,j}.$$

This directly yields

$$\lim_{\alpha\to\pm\infty} \mathrm{LN}\big(h^{(\ell)} + \alpha v + R(\alpha)\big)W^\top = \mathrm{LN}(\pm v)\,W^\top.$$
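The saturation of RMSNorm under a large shift along v is easy to observe numerically. The sketch below implements the RMSNorm formula written in the proof on random vectors; the dimension, seed and the vectors `h`, `v`, `gamma` are hypothetical.

```python
import numpy as np

def rms_norm(h, gamma):
    """RMSNorm as written in the proof: sqrt(d) * h / ||h||, scaled elementwise by gamma."""
    return np.sqrt(h.size) * h / np.linalg.norm(h) * gamma

rng = np.random.default_rng(3)
d = 8                                           # hypothetical residual-stream width
h, v, gamma = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

for alpha in (1e1, 1e3, 1e6):
    gap_pos = np.max(np.abs(rms_norm(h + alpha * v, gamma) - rms_norm(v, gamma)))
    gap_neg = np.max(np.abs(rms_norm(h - alpha * v, gamma) - rms_norm(-v, gamma)))
    print(f"alpha={alpha:.0e}  gap(+)={gap_pos:.2e}  gap(-)={gap_neg:.2e}")
```

The gap to LN(±v) shrinks roughly like 1/|α|, so the normalized representation, and hence the logits, stop changing once |α| is large compared with the norm of h.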
