Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41:

- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/. This gives a partially-hacky student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from ||diff|| to sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g|| (subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun: default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.

Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero ||D||.

References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CHaRS: Concept Heterogeneity-aware Representation Steering
Source: https://arxiv.org/html/2603.02237v1 (fetched 2026-05-26)
Epistemic note: no released code; trust signals = the paper alone. Treat findings as plausible but unvalidated by external use.
Abstract
CHaRS controls LLM behavior through representation steering. Rather than applying uniform shifts across embedding space, CHaRS models representations as Gaussian mixture models and formulates steering as an optimal transport problem between semantic clusters. Yields context-dependent steering vectors that vary smoothly across the representation manifold, claimed to outperform global steering baselines.
1. Introduction
Traditional steering: single direction via difference-in-means over contrastive pairs. Assumes homogeneous concept representation in embedding space.
Authors' claim: LLM representations are non-homogeneous, clustered, context-dependent — global steering becomes brittle.
Key Contributions
- Generalization to multimodality: extends steering from unimodal Gaussian to GMMs via Mixture Wasserstein distance, formulated as discrete OT between semantic clusters.
- Input-adaptive steering: cluster-level transport plans produce context-sensitive control where directions vary smoothly across the manifold.
- Spectral factorization: Principal Component Thresholding (PCT) reveals inherent low-rank structure (rank ≤ 2K−2).
2. Background
2.1 Optimal Transport Framework
p-Wasserstein distance:
W_p(\mu, \nu) = \left(\inf_{\pi \in \Pi(\mu,\nu)} \int \|\mathbf{x} - \mathbf{y}\|^p\, d\pi(\mathbf{x}, \mathbf{y})\right)^{1/p}
2.2 Gaussian OT and Representation Steering
For \mu = \mathcal{N}(m_1, \Sigma_1) and \nu = \mathcal{N}(m_2, \Sigma_2):
W_2^2(\mu, \nu) = \|m_1 - m_2\|_2^2 + d_B^2(\Sigma_1, \Sigma_2)
OT map is affine: T(\mathbf{x}) = \mathbf{m}_2 + \mathbf{A}(\mathbf{x} - \mathbf{m}_1).
Under identical covariance, reduces to pure translation = difference-in-means.
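The reduction to difference-in-means can be checked numerically. The sketch below is not from the paper's code (none was released); it uses the standard closed form for the Gaussian OT map, A = \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}, which the notes above do not spell out. The helper names `psd_sqrt`, `gaussian_w2_sq`, and `gaussian_ot_map` are hypothetical.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def gaussian_w2_sq(m1, S1, m2, S2):
    """W_2^2 = ||m1 - m2||^2 + d_B^2(S1, S2), with d_B the Bures distance."""
    r = psd_sqrt(S1)
    cross = psd_sqrt(r @ S2 @ r)
    bures_sq = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + bures_sq)

def gaussian_ot_map(m1, S1, m2, S2):
    """Return (T, A) with T(x) = m2 + A (x - m1). Assumes S1 is invertible."""
    r = psd_sqrt(S1)
    r_inv = np.linalg.inv(r)
    A = r_inv @ psd_sqrt(r @ S2 @ r) @ r_inv
    return (lambda x: m2 + A @ (x - m1)), A

# Identical covariances: the map should collapse to x + (m2 - m1).
m1 = np.array([0.0, 0.0])
m2 = np.array([1.0, 2.0])
S = np.array([[2.0, 0.5], [0.5, 1.0]])
T, A = gaussian_ot_map(m1, S, m2, S)
x = np.array([0.3, -0.7])
w2_sq = gaussian_w2_sq(m1, S, m2, S)
```

With equal covariances the Bures term vanishes and `A` is the identity, so `T(x)` is a pure shift by the mean difference.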
2.3 Gaussian Mixture Wasserstein Distance
\mu = \sum_{k=1}^K p_k \mathcal{N}(\mathbf{a}_k, \Sigma_k),\quad \nu = \sum_{l=1}^L q_l \mathcal{N}(\mathbf{b}_l, \Gamma_l)
Mixture Wasserstein:
MW_2^2(\mu, \nu) = \min_{\gamma \in \Gamma(p,q)} \sum_{k,l} \gamma_{kl}\, W_2^2(\mathcal{N}(a_k, \Sigma_k), \mathcal{N}(b_l, \Gamma_l))
Tractable discrete OT between components.
3. CHaRS
3.1 Barycentric Projection
\hat{T}(\mathbf{x}) := \mathbb{E}_\pi[\mathbf{y}|\mathbf{x}]
For GMM-OT:
\hat{T}(\mathbf{x}) = \sum_{k,l} p(k|\mathbf{x}) \cdot \frac{\gamma^*_{kl}}{p_k} \cdot T_{kl}(\mathbf{x})
3.2 Clustering-based Steering
- k-means on activations → centroids a_i, b_j.
- Cluster matching via entropy-regularized OT (Sinkhorn):
\mathbf{P}^* = \arg\min_{\mathbf{P} \in \Pi(w_A, w_B)} \langle \mathbf{P}, \mathbf{C} \rangle + \lambda H(\mathbf{P}),\quad C_{ij} = \|a_i - b_j\|_2^2
- Kernel-based gating (RBF, σ = median centroid distance):
\hat{p}(i|\mathbf{x}) = \frac{p_i\, k(\mathbf{x}, a_i)}{\sum_m p_m\, k(\mathbf{x}, a_m)}
- Isotropic-covariance simplification:
T_{ij}(\mathbf{x}) = \mathbf{x} + \mathbf{v}_{ij}, where \mathbf{v}_{ij} = b_j - a_i.
- Final steering (Definition 3.1):
\hat{T}_\alpha(\mathbf{x}) = \mathbf{x} + \alpha \hat{\mathbf{v}}(\mathbf{x})
\hat{\mathbf{v}}(\mathbf{x}) = \sum_{i,j} \frac{P^*_{ij}\, k(\mathbf{x}, a_i)}{\sum_{p,q} P^*_{pq}\, k(\mathbf{x}, a_p)}\, \mathbf{v}_{ij}
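A minimal NumPy sketch of the Section 3.2 steps, written from the formulas above rather than from any released implementation. Assumptions: centroids are given (k-means is skipped), H is read as negative entropy so that λ corresponds to `reg`, the cost matrix is rescaled by its maximum for numerical stability, and σ is taken as the median pairwise distance between source centroids. The function names are hypothetical.

```python
import numpy as np

def sinkhorn(w_a, w_b, C, reg=0.5, n_iter=200):
    """Entropy-regularized OT plan between cluster weights w_a and w_b."""
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(w_a)
    for _ in range(n_iter):
        v = w_b / (K.T @ u)           # match column marginals
        u = w_a / (K @ v)             # match row marginals
    return u[:, None] * K * v[None, :]

def chars_steer(x, A, B, P, sigma, alpha=1.0):
    """x: [d]; A: [Ka, d] source centroids; B: [Kb, d] target centroids; P: [Ka, Kb]."""
    k = np.exp(-np.sum((x - A) ** 2, axis=1) / (2.0 * sigma ** 2))  # RBF k(x, a_i)
    W = P * k[:, None]                # P_ij * k(x, a_i)
    W = W / W.sum()                   # normalize over all (i, j)
    V = B[None, :, :] - A[:, None, :] # v_ij = b_j - a_i, shape [Ka, Kb, d]
    v_hat = np.einsum("ij,ijd->d", W, V)
    return x + alpha * v_hat

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))                       # toy source centroids
B = A + rng.normal(size=(3, 4))                   # toy target centroids
w_a = np.full(3, 1.0 / 3.0)
w_b = np.full(3, 1.0 / 3.0)
C = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
P = sinkhorn(w_a, w_b, C / C.max())               # rescaled cost (assumption)
pair_d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
sigma = np.median(pair_d[np.triu_indices(3, k=1)])
x = A[0] + 0.1 * rng.normal(size=4)
x_steered = chars_steer(x, A, B, P, sigma, alpha=1.0)

# With a single cluster on each side the shift is b - a for every x.
x_single = chars_steer(x, A[:1], B[:1], np.ones((1, 1)), sigma)
```

The single-cluster call illustrates the K=1 special case: the gating weights cancel and the steering vector is the plain difference of centroids.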
3.3 Principal Component Thresholding
Weighted covariance of local shifts:
\Sigma_{total} = \sum_{i,j} P_{ij}(\mathbf{v}_{ij} - \bar{\mathbf{v}})(\mathbf{v}_{ij} - \bar{\mathbf{v}})^T = \mathbf{U}\Lambda\mathbf{U}^T
Rank bound: \text{rank}(\Sigma_{total}) \le 2K - 2.
CHaRS-PCT keeps top L components:
\tilde{\mathbf{v}}(\mathbf{x}) = \bar{\mathbf{v}} + \sum_{k \in [L]} \hat{\alpha}_k(\mathbf{x})\, \mathbf{u}_k
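The rank bound and the truncation can be sketched as follows. This is an illustration under assumptions, not the paper's code: \bar{\mathbf{v}} is taken to be the P-weighted mean of the local shifts, and the coefficients \hat{\alpha}_k(\mathbf{x}) are taken to be the projections of \hat{\mathbf{v}}(\mathbf{x}) - \bar{\mathbf{v}} onto \mathbf{u}_k. Neither choice is stated in the notes above. `pct_basis` and `pct_project` are hypothetical names.

```python
import numpy as np

def pct_basis(A, B, P, L):
    """Top-L eigenvectors of the P-weighted covariance of v_ij = b_j - a_i."""
    d = A.shape[1]
    V = (B[None, :, :] - A[:, None, :]).reshape(-1, d)  # [Ka*Kb, d], row i*Kb + j
    w = P.reshape(-1) / P.sum()                          # matching flat weights
    v_bar = w @ V                                        # weighted mean shift (assumption)
    Vc = V - v_bar
    Sigma = (Vc * w[:, None]).T @ Vc                     # Sigma_total
    evals, U = np.linalg.eigh(Sigma)
    order = np.argsort(evals)[::-1]                      # descending eigenvalues
    return v_bar, U[:, order[:L]], evals[order]

def pct_project(v_hat, v_bar, U_L):
    """v_bar + sum_k alpha_k u_k with alpha_k = <v_hat - v_bar, u_k> (assumption)."""
    return v_bar + U_L @ (U_L.T @ (v_hat - v_bar))

rng = np.random.default_rng(0)
K, d = 3, 8
A = rng.normal(size=(K, d))
B = rng.normal(size=(K, d))
P = np.full((K, K), 1.0 / K**2)                          # toy uniform transport plan

v_bar, U_L, evals = pct_basis(A, B, P, L=2 * K - 2)
n_nonzero = int(np.sum(evals > 1e-10 * evals[0]))        # numerical rank of Sigma_total

v_00 = B[0] - A[0]
v_rec = pct_project(v_00, v_bar, U_L)                    # should reproduce v_00
```

Keeping all 2K−2 components reproduces each local shift; choosing a smaller `L` drops the low-variance directions.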
4. Experiments
4.1 Jailbreaking
AdvBench 80/20 train/eval. Examples (ASR):
| Model | ActAdd | CHaRS | CHaRS-PCT |
|---|---|---|---|
| Gemma2-9B | 91.35% | 98.08% | 98.08% |
| Llama3.1-8B | 95.19% | 98.08% | 99.04% |
| Qwen2.5-7B | 91.35% | 95.19% | 93.27% |
4.2 Toxicity Mitigation
Llama3-8B (lower = less toxic):
| Method | Classifier | 0-shot |
|---|---|---|
| Linear-AcT | 1.93% | 7.73% |
| CHaRS | 1.23% | 4.80% |
| CHaRS-PCT | 1.17% | 4.47% |
Authors note PCT often outperforms CHaRS in sequential settings — implicit regularization.
4.3 Image Style Control
FLUX.1 + 512 COCO captions w/ style tags. Pareto frontier improves over Linear-AcT.
5. Ablations
- K > 1 consistently outperforms K = 1. The optimal K is model-dependent (10-15).
- 100% variance with 2(K−1) PCs (matches theoretical bound).
- Component count for PCT requires tuning.
6. Concluding Remarks
Generalizes diff-in-means to multimodal distributions. Limitations: isotropic covariances, k-means. Future: anisotropic mixtures, feature weighting.
Note for our use case (projected_grpo)
CHaRS does activation steering (forward-pass). We do gradient projection (backward-pass). The analog:
- Source distribution = clean-rollout gradients
- Target distribution = hack-rollout gradients
- Diff-in-means = mean(g_hack) - mean(g_clean) (current v_hack)
- CHaRS analog = cluster hack-grad and clean-grad into k clusters each, Sinkhorn-match, derive a per-input transport map.
Simpler step (PCT-like) for our setting: stack the per-pair diff vectors, run SVD/PCA, keep the top-L singular directions, and project the gradient out of their span. This is what the user proposed. The theoretical rank bound for K clusters is 2K-2; treating the 12 pairs as 12 clusters gives 22, but L << 22 is fine in practice.
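A minimal sketch of that PCT-like step, combined with the one-sided gate described in the commit message. It is an illustration on toy data, not the contents of extract_vhack_grad.py or proj.py. Assumptions: D holds hack-minus-clean diffs for a single module, the "top-k fraction" is computed from squared singular values, and the function names are hypothetical.

```python
import numpy as np

def hack_subspace(D, k):
    """D: [n_pairs, r] per-pair diffs. Returns V: [k, r] with oriented orthonormal rows."""
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    V = Vt[:k].copy()
    # SVD signs are arbitrary: flip each v_i so that mean(D @ v_i) > 0.
    signs = np.sign((D @ V.T).mean(axis=0))
    signs[signs == 0] = 1.0
    V *= signs[:, None]
    frac = float((s[:k] ** 2).sum() / (s ** 2).sum())  # energy fraction (assumption)
    return V, frac

def project_out_one_sided(g, V):
    """Subtract the component along v_i only when c_i = <g, v_i> > 0."""
    c = V @ g
    c_pos = np.clip(c, 0.0, None)     # gate: leave negative components alone
    return g - V.T @ c_pos

def cos_in(g, V):
    """Fraction of the gradient norm lying in span(V): ||V g|| / ||g||."""
    return float(np.linalg.norm(V @ g) / np.linalg.norm(g))

rng = np.random.default_rng(0)
D = rng.normal(size=(12, 16)) + 0.5   # toy stand-in for 12 contrastive pairs
V, frac = hack_subspace(D, k=5)
g = rng.normal(size=16)               # toy gradient
g_proj = project_out_one_sided(g, V)
```

Because the rows of `V` are orthonormal, the projected gradient has zero component along every axis that was gated on and its original (negative) component along the rest.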