commit 7248d469a7d919e1266174aa21e73c375721ed5b
Author: wassname <1103714+wassname@users.noreply.github.com>
Date:   Sat May 23 10:22:54 2026 +0800

    init

diff --git a/docs/1.md b/docs/1.md
new file mode 100644
index 0000000..c586f42
--- /dev/null
+++ b/docs/1.md
@@ -0,0 +1,4130 @@
+Claude finished the response
+
+SAE, activation oracles, transcoders. AntiPaSTO
+
+do any of them use triplet loss in activation space to learn a mapping between inscrutable activation space and a nice word2vec style latent space? or is this what SAEs are
+22 May
+
+Short answer: no, none of them use triplet loss the way you're describing, and SAEs are doing something fundamentally different from what you're gesturing at.
+
+Let me work through why, because I think there's an interesting confusion to untangle.
+What SAEs actually do
+
+SAEs learn an overcomplete sparse dictionary via reconstruction loss + sparsity penalty (L1 or top-k). The objective is roughly:
+
+$$L = \|x - \mathrm{Dec}(\mathrm{Enc}(x))\|^2 + \lambda \|\mathrm{Enc}(x)\|_1$$
+
+There's no metric learning here. No anchor/positive/negative. No notion of "these two activations should be close, this one should be far." The encoder just learns whatever sparse linear basis minimizes reconstruction at a given sparsity budget. The "niceness" of the latent space (monosemanticity, interpretability) is an emergent property hoped to fall out of the sparsity prior + superposition hypothesis, not something the loss directly targets.
+
+The resulting features are also not really "word2vec style": word2vec embeddings live in a dense continuous space where directions have semantic meaning (king - man + woman ≈ queen). SAE features are sparse activations over a discrete dictionary. Different geometry, different inductive bias.
+Transcoders
+
+Transcoders are SAEs with a twist: instead of reconstructing the input, they reconstruct the *output* of an MLP from its input, with a sparse bottleneck.
+So $L = \|\mathrm{MLP}(x) - \mathrm{Dec}(\mathrm{Enc}(x))\|^2 + \lambda \|\mathrm{Enc}(x)\|_1$. Still reconstruction + sparsity. Still no triplet loss.
+Activation probes / "oracles"
+
+Linear probes are supervised classifiers: labeled data, cross-entropy or similar. Not metric learning either, though they're closer in spirit (you're imposing a known semantic structure on activations). DAS, distributed alignment search, learns rotations to align activations with hypothesized causal variables; again supervised, not triplet.
+AntiPaSTO
+
+Your own method uses contrastive prefix pairs to define a steering direction via gradient signal in SVD basis, but it's not learning an embedding space; it's identifying steering directions in the existing residual stream basis (rotated by SVD). The contrast is between behaviors-given-prefix, not between activations as anchors/positives/negatives in a learned metric space.
+What you're actually describing
+
+What you're sketching sounds like: train an encoder $f$ such that $\|f(x_a) - f(x_p)\| < \|f(x_a) - f(x_n)\|$ where $x_a, x_p$ are activations from semantically-similar contexts and $x_n$ from a dissimilar one. That's standard metric learning / contrastive representation learning applied to activations. SimCLR/CLIP-style or FaceNet-style.
+
+To my knowledge this has not been a dominant approach in mech interp, and I think there are real reasons:
+
+    You need a similarity oracle. What makes two activations "semantically similar"? If you have that oracle, you've smuggled in the supervision you were trying to discover. SAEs avoid this by being unsupervised; sparsity is the only prior.
+    Activations already live in a metric space. Cosine sim on residual stream vectors already gives you something. Logit lens, tuned lens, and direct cosine probing exploit this. The question is whether a learned metric beats the native one, and for what task.
+    Linear structure assumption.
Most interp work assumes features are linear directions (the linear representation hypothesis). Triplet loss with a nonlinear encoder would learn a curved manifold, which fights this assumption. If you keep the encoder linear, you basically get something close to LDA or contrastive PCA. + +That said, there is adjacent work: + + CCS (Contrast-Consistent Search, Burns et al.) uses contrastive pairs of activations to find truth directions, but the loss is consistency-based, not triplet. + Contrastive activation addition (Panickssery et al.) uses mean-difference of contrastive activations for steering — same contrast idea as AntiPaSTO, no learned encoder. + InstructPix2Pix-style or CLIP-style contrastive training on internal representations has been explored for representation engineering but not as a mainstream interp tool. + +If you wanted to actually do the thing — triplet loss in activation space to learn a "nice" embedding — the closest existing references are probably the CCS line and Zou et al.'s representation engineering. The honest answer is it's an under-explored design point, and I'd guess (~60%) it's under-explored because the oracle-for-similarity problem dominates, not because someone tried it and it failed. + +Is there a specific application you're thinking about? If you have a natural source of triplets (e.g. behavior-conditioned prefix triples) the design becomes more concrete. + +ah instead of training a separate encoder I was thinking of training an adapter if that makes sense + +so that adapter is parametrised (modified from lora) + +such that it learns an intervention that has word2vec type properties at least within triplets + +for example we might have 2 contrastive pairs... one is contrasting honesty, the other something else? or nothing? idk + +and then perhaps it has to be coherent in that nll or kl is maintained within a trust region like in antipasto + +so at the end hopefully we have a smooth latent space for just 1 or two concepts? 
+but it might have learned a complex, conditional, non-linear transform... but where it hits our parametrisation
+
+so it's triplet loss in the lora intervention?
+22 May
+
+Okay, this is a more interesting design than your first sketch. Let me try to state it back precisely, then push on it.
+Restating
+
+You're proposing:
+
+    An adapter $A_\theta$ (LoRA-parameterized, so low-rank + structured, possibly with the SVD-basis constraint from AntiPaSTO) that modifies activations at some layer.
+    Train $\theta$ so that the *intervention* respects word2vec-style algebra on a chosen concept axis. E.g., given triplet $(x_{\text{neutral}}, x_{\text{honest+}}, x_{\text{honest-}})$, the adapter pushes activations such that $A(x_{\text{honest+}}) - A(x_{\text{neutral}}) \approx -\left(A(x_{\text{honest-}}) - A(x_{\text{neutral}})\right)$, or more generally that the honesty direction becomes a clean linear axis in the adapted space.
+    Trust-region constraint (iso-KL or NLL bound) keeps the model coherent, so the adapter can't just collapse activations to make the geometry trivially nice.
+    Result: a learned, possibly nonlinear, but parametrically constrained intervention whose effect on activations exposes a clean linear concept axis for downstream steering.
+
+Is that right? Assuming yes, here's what I think.
+Why this is interesting
+
+The clever bit is the inversion of where the niceness lives. SAEs put the encoder in front of a frozen model and hope sparsity finds clean features. You're putting a constrained transform inside the model and training the transform to make the residual stream locally well-behaved on a concept of your choice. The model carries on doing its job (trust region), but the activation geometry around your concept is now linear by construction.
+
+This is closer in spirit to representation engineering than to SAEs. It's also closer to conditional concept erasure (Belrose, Ravfogel et al.'s LEACE and INLP line) and to DAS (Geiger et al.)
which learns rotations to expose causal variables. The triplet-loss framing is a contribution; the "learn an intervention that linearizes a concept" framing has precedent. +The strongest objections I'd push back with + +1. Identifiability — what stops it learning a degenerate solution? + +Triplet loss + trust region is a weak constraint. The adapter could learn: "rotate honesty-related activations onto a single axis, leave everything else alone." Fine, that's what you want. But it could also learn: "project honesty-related activations onto a single axis, destroying orthogonal information that the trust region happens not to penalize at training-distribution prefixes." You'd have a clean-looking axis that's actually a low-rank lossy projection masquerading as a feature direction. The trust region catches behavioral collapse, not representational collapse on held-out concepts. + +You can probably defend against this with held-out concept probing — train on honesty, verify that probes for other concepts (sycophancy, refusal, capability X) still work post-adapter. If they degrade, the adapter is doing damage you're not seeing in KL. + +2. "Word2vec properties" is doing a lot of work. + +Word2vec's analogies (king - man + woman ≈ queen) emerge from the skip-gram objective on a very large corpus with millions of implicit triplets. Two contrastive pairs won't get you that. You'll get one axis. Calling it word2vec-style oversells what triplet loss on two pairs delivers — it's really "learn one linear concept direction, conditionally." + +Which is fine! That's still useful. But the bar is "clean steering vector for concept C" not "compositional concept algebra." If you want the latter, you need many concepts trained jointly with structural constraints (orthogonality, or a learned concept manifold), and now you're rebuilding SAE-adjacent machinery from a different starting point. + +3. The conditional nonlinearity is a feature and a bug. 
+ +You said: "it might have learned a complex, conditional, non-linear transform." This is genuinely the appeal — a LoRA can implement input-dependent rotations that a constant steering vector can't. AntiPaSTO is already gradient-based and goes beyond simple mean-difference for this reason. + +But: the more expressive the adapter, the more it can "cheat" by implementing concept-detection-then-rewrite rather than exposing a clean axis. At the limit, a sufficiently expressive adapter is just "classifier + conditional steering," which doesn't tell you anything new about the model's native geometry. It tells you about the adapter's geometry, which you designed. + +The interpretability payoff depends on the adapter being constrained enough that what it learns reveals something about the underlying model. Rank, basis (SVD vs free), and where in the network you place it all matter a lot. + +4. How does this differ from AntiPaSTO? + +I think this is the question to be sharp on for your own sake. AntiPaSTO already: + + Uses contrastive prefix pairs (a form of triplet-ish supervision). + Learns in SVD basis (a structured parameterization). + Uses gradient signal (so it can capture conditional/nonlinear-ish structure within its parameterization). + Has trust-region-like calibration (iso-KL). + +The new ingredient you're adding is triplet loss as the explicit objective, replacing whatever loss AntiPaSTO currently uses (gradient on a behavioral contrast, if I'm remembering right). Is the claim that triplet loss buys you cleaner geometry than the behavioral gradient? Plausible but not obvious — behavioral gradient already implicitly contrasts, and the geometric niceness in AntiPaSTO comes from the SVD basis + iso-KL, not from the loss form. + +If the answer is "this is AntiPaSTO with a metric-learning loss replacing the behavioral one," that's a legitimate ablation/variant, but the framing should be honest about how incremental it is. 
If the answer is "this exposes axes that AntiPaSTO can't," you need to say what those are and design an eval that distinguishes them. +What I'd actually try + +If I were building this: + + One concept axis, two or three contrastive pairs (honesty / dishonesty / neutral). Confirm the adapter learns a steerable axis. Compare to: mean-difference, CAA, AntiPaSTO's native objective, LEACE. If you can't beat those on a clean toy task, the triplet framing isn't pulling weight. + Probe orthogonal concepts pre/post adapter. If sycophancy probes degrade when you train an honesty adapter, the trust region is too loose. + Try jointly training two adapters for two concepts with an explicit orthogonality penalty. If they stay orthogonal and both steer cleanly, you have the start of a compositional story — that's where the word2vec analogy might actually start to bite. + +The pitch I'd believe: "constrained adapter + triplet loss + iso-KL gives you a steering vector that generalizes OOD better than mean-difference because the constraint forces it to be a property of the model's representations, not the prefix distribution." That's a concrete claim with a clear eval. + +Where are you on the AntiPaSTO-vs-this question? I think that's the load-bearing one. + + Train θ\theta θ so that the *intervention* respects word2vec-style algebra on + +So I already have one where it's bidirectional... smooth along a path. e.g. + +## 3. Inner training step — NLL + KL, PCGrad on the NLL pair + +One forward pass per pole (cho, rej), each at `c = ±C` where +`C ∼ U(0, 1]` is freshly sampled every step. KL added unprojected +(KL is an opposing objective by design — projecting it would silently +weaken it). PCGrad operates on the NLL pair only (cho and rej are the +same task at different poles and may legitimately conflict). + +Concretely: each (prompt, cho, rej) is teacher-forced. 
We tokenize +`prompt + cho` (and separately `prompt + rej`), build a labels tensor +that is `-100` on prompt positions and the actual ids on completion +positions, and compute NLL/KL averaged *over completion positions +only*. HF's `model(input_ids, labels=labels).loss` already does length- +normalised mean-over-non-ignore CE, so per-pair length bias is handled +for free as long as we share that mask between NLL and KL. + +```py +# Per step over (cho, rej) pairs. Teacher-forced; persona stripped. +for step in 0..T: + (ids_p, lbl_p, attn_p), (ids_n, lbl_n, attn_n) = batch(pairs) + # ids_* : prompt + completion tokens + # lbl_* : -100 on prompt tokens, ids on completion tokens ← shared mask + # attn_* : 1 on real tokens, 0 on right-pad + + C ~ U(0, 1] # → assert C > 0 (resample if zero) + g_nll = [] + g_kl = [] + for (ids, lbl, attn), c in [((ids_p, lbl_p, attn_p), +C), + ((ids_n, lbl_n, attn_n), -C)]: + # ── reference forward: c=0, no grad, gate disables history ── + with lora(model, c=0.0), no_grad(): + logits_base = model(ids, attention_mask=attn).logits # pristine base + logp_base = log_softmax(logits_base) + + # ── steered forward at c=±C ─────────────────────────────── + with lora(model, c=c): + out = model(ids, attention_mask=attn, labels=lbl) + L_nll = C * out.loss # HF mean-CE over non-ignore tokens + logp_steer = log_softmax(out.logits) + mask = (lbl != -100) # SAME mask as HF NLL + L_kl = β * mean_kl(logp_steer, logp_base, mask) + + g_nll.append(∇_θ L_nll) + g_kl .append(∇_θ L_kl) + + # PCGrad: drop the component of each NLL gradient that fights the other. + if dot(g_nll[+], g_nll[-]) < 0: + g_nll = pcgrad_project(g_nll) + g = mean(g_nll) + mean(g_kl) # KL added unprojected + adamw.step(g); onecyclelr.step() + +``` + +is this differen't and better or worse +22 May + +Okay, looking at your actual code now — this changes my read significantly. Let me reason through it. +What you actually have + +This is not triplet loss in the metric-learning sense. 
+It's a bidirectional, scalar-conditioned steering objective:
+
+    A LoRA adapter parameterized by scalar $c \in [-1, 1]$, where $c$ controls intervention strength and sign.
+    Each step samples $C \sim U(0, 1]$ and computes NLL at $+C$ on the chosen completion and at $-C$ on the rejected completion.
+    The NLL is scaled by $C$, so the supervision strength scales with how far you're pushing along the axis. At $C \to 0$ the gradient vanishes; at $C = 1$ it's full strength.
+    KL-to-base at $c = 0$ (unprojected) constrains the unsteered behavior to match the base model.
+    PCGrad on the NLL pair only, to handle the legitimate task conflict between "push toward cho" and "push toward rej."
+
+The geometric content: you're forcing the LoRA to define a *path* in parameter/activation space parameterized by $c$, where $-1$ end is rej-favoring, $+1$ end is cho-favoring, and $c = 0$ is the base model exactly (enforced by KL with no projection). The $C$-scaling on NLL plus KL anchor at zero makes the path smooth and (in expectation) monotonic in $c$.
+Is this different from what I described before?
+
+Yes, quite different and arguably better-motivated than the triplet-loss sketch:
+
+Triplet loss says "anchor closer to positive than negative by margin $m$", defined at one point, no notion of strength, doesn't constrain off-triplet behavior.
+
+**Your scheme** says "the intervention's effect should scale linearly with $c$, with the base recovered at $c = 0$, and behavior favoring cho at $c > 0$ and rej at $c < 0$, smoothly." The KL anchor + $C$-scaled NLL gives you the smooth-axis property *directly in the loss*, not as an emergent hope from a metric objective.
+
+The word2vec analogy I'd defend for this: $c$ functions like a coefficient in a linear concept algebra. If your LoRA has rank 1, you've literally learned a one-dimensional concept axis with a calibrated strength scalar.
+Higher rank generalizes to a constrained subspace, and the random sampling of $C$ forces the whole interval to be meaningful rather than just the endpoints.
+What's good about this design
+
+The $C$-scaling on NLL is doing real work. Without it, training only at $|c| = 1$ leaves intermediate $c$ values underdetermined; the network is free to be wiggly between $\pm 1$. Sampling $C \sim U(0, 1]$ and scaling the loss by $C$ both populates the interior with supervision and reduces the gradient at small $C$ (where the KL term naturally dominates). This is a clean way to make the path smooth-by-construction.
+
+KL unprojected at $c = 0$ is correct. Projecting it would, as you say, let the NLL drift the base. The asymmetry (PCGrad on NLL only, KL added straight) matches the actual structure: NLL+/NLL- can legitimately conflict (same task, different poles), KL is a separate hard constraint that shouldn't be reconciled with task gradient.
+
+Teacher-forcing with shared completion mask is the right call for length normalization; I'd flag it as a place to add an assert that the mask is byte-identical between the NLL and KL paths, because a subtle mismatch (e.g. one off-by-one on the prompt boundary) silently changes what "matched length normalization" means.
+Where I'd push back
+
+**1. The $C$ schedule.** $U(0, 1]$ is uniform over magnitudes but you're then scaling NLL by $C$, which down-weights small-$C$ samples *quadratically* in their contribution to the loss (uniform sampling × linear scaling = triangular weighting in effective supervision). Is that what you want? If the goal is uniform supervision density along $c$, you want $C$ sampled to compensate the scaling, or unscaled NLL with $C \sim U(0, 1]$ purely controlling intervention strength. Worth thinking about whether the current scheme actually produces smooth supervision or undersupervises near the base.
+
+2. The sign symmetry assumption.
+Sampling $+C$ for cho and $-C$ for rej assumes the axis is anti-symmetric: that the rej-favoring intervention is the negation of the cho-favoring one. This is the same assumption mean-difference steering makes and it's not always true. For some concepts (e.g., truthful vs sycophantic), the "anti" pole is real and antipodal. For others (e.g., helpful vs harmful), the rej side is a heterogeneous mixture, not a single direction. The LoRA will smear these into one axis regardless, and you won't see the failure in NLL; you'll see it as the rej-end being a weird mixture rather than a coherent pole. A diagnostic: project the LoRA delta at $c = -1$ onto the LoRA delta at $c = +1$ and check if they're near-antipodal. If they're not, the symmetric parameterization is a wrong prior for this concept.
+
+3. PCGrad on a pair. PCGrad with two tasks is just "drop the conflicting component of one onto the other." It's defensible but it's also throwing away information. An alternative: when $\mathrm{dot}(g_+, g_-) < 0$, that's a signal the concept axis as parameterized can't satisfy both poles. PCGrad masks this; you might want to log how often the projection fires and how much magnitude it removes, as a diagnostic for whether your symmetric-axis prior is fighting the data.
+
+4. What does "smooth along a path" buy you over AntiPaSTO?
+
+Honest comparison: AntiPaSTO already gives you a steering direction with calibrated strength via iso-KL bisection. The smooth-path property here is doing something extra: it's making intermediate strengths *trained*, not just *interpolated*. That matters if you believe the model's response to a steering vector is nonlinear in magnitude (which it often is: small steering is denoised, large steering is catastrophic). Training the whole path means the LoRA can implement a *nonlinear* response curve along $c$ that a fixed-direction additive steer cannot.
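
One way to check whether a trained path actually bends is to compare the intervention's activation delta at several strengths against its small-strength direction. What follows is a toy sketch under stated assumptions, not the method from this conversation: `path_linearity`, `linear_path`, and `curved_path` are hypothetical names, and the deltas are hand-made vectors standing in for the activation change h(x; c) − h(x; 0) at some layer. Only positive strengths are considered.

```python
import math


def norm(u):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


def path_linearity(delta_fn, strengths, c_ref=0.05):
    """Compare delta_fn(c) against the small-strength reference delta_fn(c_ref).

    Returns a list of (c, cos_to_ref, norm_ratio) where
      cos_to_ref = cosine(delta(c), delta(c_ref)); stays at 1 if the direction never rotates
      norm_ratio = (|delta(c)| / c) / (|delta(c_ref)| / c_ref); stays at 1 if magnitude is linear in c
    """
    ref = delta_fn(c_ref)
    ref_rate = norm(ref) / c_ref
    rows = []
    for c in strengths:
        d = delta_fn(c)
        rows.append((c, cosine(d, ref), (norm(d) / c) / ref_rate))
    return rows


# Hypothetical toy paths (assumptions, not measured activations).
def linear_path(c):
    # Fixed direction, magnitude proportional to c: what a constant additive steer gives.
    return [c * 1.0, c * 2.0, 0.0]


def curved_path(c):
    # The third coordinate grows like c^2, so the direction rotates as c increases.
    return [c * 1.0, c * 2.0, c * c * 3.0]


if __name__ == "__main__":
    for name, fn in [("linear", linear_path), ("curved", curved_path)]:
        for c, cos_ref, ratio in path_linearity(fn, [0.25, 0.5, 1.0]):
            print(f"{name:6s} c={c:.2f} cos_to_ref={cos_ref:.3f} norm_ratio={ratio:.3f}")
```

If both columns stay at 1 across strengths, a fixed direction with post-hoc calibration describes the path equally well. If they drift, the adapter has a magnitude-dependent response that a fixed additive direction does not have.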
+
+That's the real claim worth making: this is AntiPaSTO with a learned, smooth, magnitude-conditioned response curve, not a fixed direction with post-hoc calibration. Whether that beats AntiPaSTO empirically depends on whether the model's natural response to additive steering is sufficiently nonlinear to make the trained curve worth its extra parameters.
+Better or worse than the triplet sketch?
+
+Better, for the reason that it directly trains the property you want (smoothness, calibrated magnitude, base-recovery at zero) rather than hoping it emerges from a metric loss. Triplet loss with two pairs would have given you one underconstrained margin condition; this gives you a continuous family of supervised points along the axis with anchored endpoints.
+
+The thing it doesn't give you that the triplet sketch was reaching for is multi-concept compositional structure. If you want $c_1 \cdot \text{honesty} + c_2 \cdot \text{refusal}$ to be a meaningful joint intervention, you need to train multiple axes with explicit independence constraints, and you're back to designing a learned concept manifold. But within a single concept, the current scheme is the right tool.
+
+One concrete suggestion: log the cosine similarity between $\left.\frac{\partial A_\theta}{\partial c}\right|_{c=0^+}$ and the AntiPaSTO direction for the same concept pair. If they're highly aligned, the LoRA at small $c$ is recovering the linear regime AntiPaSTO finds, and the nonlinear-response claim becomes testable by looking at how the direction *rotates* as $|c|$ grows. That rotation, if present, is the thing this method captures and AntiPaSTO doesn't.
+
+    "Word2vec properties" is doing a lot of work. Word2vec's analogies (king - man + woman ≈ queen) emerge from the skip-gram objective on a very large corpus with millions of implicit triplets.
+
+right my hope is that in places where this works it will learn this intervention, and in places where it's not possible, there will be no intervention learnt.
+
+---
+
+    What you actually have
+    This is not triplet loss in the metric-learning sense. It's a bidirectional, scalar-conditioned steering objective:
+
+yes could it be better in the triplet loss sense? or combo? my hope is that internal constraints will allow us to use gradient to hone in on clearer concepts. for example is it more likely to find a coherent concept if it's monotonic, bidirectional, scalar-conditioned? (we don't do monotonic yet but could?) it's about which constraints are likely to give us concepts without overconstraining and without side effects
+
+    **Your scheme** says "the intervention's effect should scale linearly with $c$,
+
+well I sample... but does this mean it scales? I guess the loss encourages a trade off but does that mean it learns it, and that it's smooth. the parametrisation might mean it's smooth hmm
+
+    The $C$-scaling on NLL is doing real work.
+
+you assume and claim with overconfidence and no evidence
+
+    Where I'd push back
+    **1. The $C$ schedule.** $U(0, 1]$ is uniform over magnitudes but you're then scaling NLL by $C$, which down-weights small-$C$ samples *quadratically* in their contribution to the loss (uniform sampling × linear scaling = triangular weighting in effective supervision)
+
+good point... wait why quadratic... because nll is squared or? wdym?
+
+    The sign symmetry assumption. Sampling $+C$ for cho and $-C$ for rej assumes the axis is anti-symmetric: that the rej-favoring intervention is the negation of the cho-favoring one.
+
+well we've generated pairs which should display an opposite concept. this is a constraint to try and make sure we only learn interventions that have this property. if the pairs differ by multiple concepts then we don't
+
+    PCGrad on a pair. PCGrad with two tasks is just "drop the conflicting component of one onto the other."
+ +if the pairs are (honesty+verbosity), (dishonesty+verbosity) it would throw away the shared verbosity right +22 May + +Good pushback throughout. Let me address each properly. +On triplet + this hybrid + +Yes, plausibly useful. The current scheme constrains the path behaviorally (NLL at endpoints, KL at center). A triplet term would add a representational constraint: "the activation effect of the intervention should look like a clean axis." + +Concrete version: at some hidden layer ℓℓ, with adapter at strength cc, define Δℓ(x,c)=hℓ(x;c)−hℓ(x;0)Δℓ​(x,c)=hℓ​(x;c)−hℓ​(x;0). Add a loss term encouraging Δℓ(xcho,+C)≈−Δℓ(xrej,−C)Δℓ​(xcho​,+C)≈−Δℓ​(xrej​,−C) (or just Δℓ(⋅,+C)≈C⋅vΔℓ​(⋅,+C)≈C⋅v for some learned unit vector vv). + +This would do exactly what you said in your opening line: in places where the concept is a clean linear axis, the constraint is satisfied cheaply and the LoRA learns; in places where it isn't, the constraint fights NLL and one of them gives — ideally you'd see the LoRA degenerate to zero (no intervention learned) rather than learn a confused mixture. That's the "absence of intervention is informative" property you want. + +The monotonicity question is separate and worth thinking through. You could enforce it explicitly with a hinge: ReLU(NLL(c1)−NLL(c2))ReLU(NLL(c1​)−NLL(c2​)) for c1 0 (suspicious): both poles agreeing on direction → we're not learning a contrast → either pairs are degenerate or training has + overfit to a single direction. +22 May + +Ah, that changes my read entirely. I was reasoning about activation-delta cosine; you're measuring gradient cosine on NLL. Different quantity, different interpretation. + +What this actually says: + + cos⁡(g+,g−)≈−0.7cos(g+​,g−​)≈−0.7 early: the two poles disagree on parameter updates. Genuine contrastive signal. PCGrad fires and projects. + cos⁡(g+,g−)→0cos(g+​,g−​)→0 late: NLL has converged, gradient magnitudes are small, residual is noise. 
Orthogonal in expectation in high-D simply because random vectors are orthogonal. + +So this is a convergence diagnostic, not a representation-geometry measurement. It tells you "training found a contrastive solution and then converged," nothing about whether the activation deltas at the two poles are antipodal. + +The reading-1 vs reading-2 distinction I was making (linearly representable concept vs LoRA cheating) requires actually measuring the activation deltas, not the gradient alignment. That measurement is still worth doing if you care about the triplet-term question — but the current logged cosine doesn't address it. + +Apologies, I was answering the wrong question. The gradient-cosine going to zero is expected and benign. + +does it tell us anything +22 May + +Yes, two things. + +Early −0.7−0.7 confirms the data is doing its job. The pairs are genuinely contrastive at the gradient level — the LoRA is being pulled in opposite directions, not in the same direction or random directions. If you'd seen cos⁡≈0cos≈0 from step 0, your contrastive pairs would be broken. If you'd seen cos⁡>0cos>0 from step 0, pairs would be degenerate (both completions teaching the same thing). + +Trajectory to 0 confirms convergence, not collapse. Combined with the NLL drop you mentioned (2000× by step 17), the cosine going to noise-floor means the LoRA found a region where both poles are satisfied simultaneously. PCGrad stops firing because there's nothing left to project. + +What it doesn't tell you: anything about activation geometry, antipodality of the learned intervention, generalization, or whether the concept is cleanly axial. Those need separate measurements. + +So: useful sanity check on training dynamics. Doesn't bear on the triplet-term question. + +so is the triplet thing usefull? your bet guess? +22 May + +My guess: ~35% it meaningfully helps, ~50% neutral, ~15% hurts. 
+ +Reasoning: + +The current scheme already gets behavioral antipodality and the LoRA parameterization gives weight-level antipodality for free. The triplet term only bites if there's a gap between those and activation-level antipodality — and you don't yet know if there is one. + +The 35% "helps" case: concepts where the LoRA finds non-antipodal activation paths to achieve antipodal behavior. Triplet forces a cleaner axis, you get a usable mean-difference-style steering vector as a byproduct, and confound filtering (verbosity etc.) tightens. Most likely to help on concepts with strong confounds in the pair construction. + +The 50% "neutral" case: parameterization already gets you most of the way. Triplet term is mostly redundant with what's happening, adds a small regularizer that doesn't change much. Loss curves look similar, downstream behavior similar. + +The 15% "hurts" case: triplet overconstrains, fails to find solutions on concepts that are representationally asymmetric (which is plausible for things like honesty where the "off" direction is the default mode). You get silence where the current scheme would have learned something useful but messy. + +Cheap experiment to actually settle it: measure activation-delta cosine at ∣c∣=1∣c∣=1 on a current trained model, at several layers. If it's already near +1+1, triplet adds nothing. If it's near 00 or negative, triplet has room to work and is worth trying. That measurement is one forward pass per pole on a held-out batch, ~10 minutes of work, and resolves more than my prior does. + +Honest meta-note: my confidence on these numbers is low. I have no direct evidence on this specific architecture-loss combination, and my priors come from analogy to LEACE/CCS/DAS which aren't perfect references. Run the diagnostic before committing engineering time. + +ok do you have better ideas for learning an intervention in concept space on the linear moduels where that is possible. e.g. 
+in a lora / nll setting with contrastive pairs
+22 May
+
+A few directions, ordered by how much they'd update my prior that they help.
+
+1. Orthogonal multi-concept LoRA (highest value, IMO ~60% it's better than single-concept).
+
+Train $k$ LoRAs jointly on $k$ concepts with an explicit orthogonality penalty: $\|B_i^\top B_j\|_F^2$ for $i \neq j$ on the up-projections. Each concept gets its own scalar $c_i$. Forward pass uses $\sum_i c_i B_i A_i$.
+
+Why this beats single-concept: the orthogonality penalty acts as a confound filter without you having to hand-construct confound-free pairs. If honesty pairs are contaminated with verbosity, and you also train a verbosity LoRA, the orthogonality constraint forces honesty's $B$ to live outside verbosity's subspace. The shared confound gets pushed into whichever concept is more strongly identified by its pairs.
+
+This is closer to the word2vec compositional algebra you were originally reaching for. And it gives you a meaningful eval: can you steer honesty up and verbosity down independently? If yes, you've found two real axes. If no, they were the same axis.
+
+2. Mid-layer activation matching instead of NLL (~40% it's better).
+
+Instead of NLL on completions, use the base model's activations on cho/rej completions as the target at some mid layer. The intervention learns to make the base model's mid-layer activations look like they would on the steered completion, without going through the full decoder loss.
+
+Closer to DAS / patching-based supervision. Advantages: faster to compute (no full forward through completion), more localizable (you choose the layer), and the loss directly targets representation rather than behavior. Disadvantages: choosing the layer is an extra hyperparameter, and you lose the end-to-end "does this actually change generations" property that NLL has.
+
+Variant: match at multiple layers with decreasing weight as you go later in the network.
Forces the intervention to be representational, not a late-layer logit hack. + +3. Replace PCGrad with antisymmetric gradient projection (~55% it's a strict improvement). + +The thing I described earlier: keep only the antagonistic component of $(g_+, g_-)$, i.e. $g \propto g_+ - g_-$, throwing away the shared component. PCGrad keeps the shared, which is wrong when the shared component *is* the confound (verbosity in honesty pairs). + +Cheap swap. If your pairs are well-curated and have no shared confound, this should be neutral. If they have any confound at all, this strictly helps. Hard to see a downside: worst case it matches PCGrad in performance and gives you slightly cleaner gradient flow. + +Closely related: instead of antisymmetrizing in parameter space, do it in activation space. Compute $\Delta h_+ - \Delta h_-$ at a chosen layer and use *that* as the supervision target. This is essentially CAA-supervised LoRA, which I don't think has been written up but is the natural next step. + +4. Bilinear / gated LoRA for true concept algebra (~30% it's better, but more novel). + +Standard LoRA is $\Delta W = c \cdot BA$. What if it's $\Delta W = c \cdot B\,\mathrm{diag}(\sigma(g(x)))\,A$, where $g$ is a small concept detector? Now the intervention is conditional on input: it fires only when the concept is relevant and stays dormant otherwise. + +This is closer to how the model actually represents concepts (sparse, context-dependent activation). It also gives you a built-in answer to "where does this concept live" because $g$'s activations localize the concept. + +Risk: more parameters, more degrees of freedom for the LoRA to cheat. Would need a sparsity penalty on $g$ to keep it honest. Closer in spirit to MoE routing or to dictionary-learning approaches. + +5. Train against the unsteered model's mid-training checkpoint (~25% useful). + +Smaller idea. Right now your KL is base-model anchored, which is correct.
But if you also computed the activation at base and added a penalty on activation drift orthogonal to the concept direction, you'd get a representational trust region in addition to the behavioral one. Stops the intervention from quietly damaging unrelated representations even when KL on outputs looks fine. + +My ranked recommendations for what to actually try: + + Antisymmetric gradient projection (cheap swap, likely strict improvement). + Orthogonal multi-concept LoRA (the biggest conceptual win if it works, real compositional axes). + CAA-supervised activation matching as an auxiliary loss (cheap, addresses the activation-antipodality gap directly). + +The bilinear gated version is more interesting research but harder to debug, so I'd save it for after the simpler variants are characterized. + +One honest meta-point: I'm pattern-matching across LEACE/DAS/CCS/CAA/MELBO/SAE-steering literature and your AntiPaSTO setup, but I haven't built any of this. Half of these probably hit a practical wall I can't see from here. The orthogonal multi-concept version is the one I'd be most surprised if it didn't show something interesting: it's a clean enough idea that even partial success would teach you about the concept geometry. + + ordered by how much they'd update my prior that they help. + +but what's your bet they would help, not how much you would change your mind if they do + +say less +22 May + +Right. + + Antisymmetric gradient projection: 65% + Orthogonal multi-concept LoRA: 45% + CAA-supervised activation matching as aux loss: 40% + Bilinear gated LoRA: 25% + Activation-drift trust region: 30% + +Caveat: low confidence on the numbers themselves, ±15pp each. + + The thing I described earlier: keep only the antagonistic component of $(g_+, g_-)$, i.e. $g \propto g_+ - g_-$, throwing away the shared component. PCGrad keeps the shared, which is wrong when the shared component *is* the confound (verbosity in honesty pairs).
+ +wait are you sure? and also the parametrisation is opposite... but the sign of C flips them to be aligned? +22 May + +Good catch, let me work through this carefully because I think I was sloppy. + +**The parameterization.** Forward pass uses $\Delta W(c) = c \cdot BA$. For the cho sample, $c = +C$. For the rej sample, $c = -C$. + +**What the gradient sees.** Both NLL losses are backpropped to the *same* parameters $B, A$. The chain rule passes through the $c$ factor: + +$$\frac{\partial L_+}{\partial \theta} = C \cdot \frac{\partial L_+}{\partial (BA)} \cdot \frac{\partial (BA)}{\partial \theta}$$ + +$$\frac{\partial L_-}{\partial \theta} = -C \cdot \frac{\partial L_-}{\partial (BA)} \cdot \frac{\partial (BA)}{\partial \theta}$$ + +The $-C$ flips the sign of $g_-$ relative to "what it would be if both were $+C$." + +**Consequence for the cosine.** + +If cho and rej want *opposite* effects on $BA$ (antipodal concept), then $\partial L_+ / \partial (BA)$ and $\partial L_- / \partial (BA)$ point opposite ways. After the $\pm C$ flip, $g_+$ and $g_-$ in parameter space end up pointing the *same* way. Cosine would be **positive**. + +If cho and rej want the *same* effect on $BA$ (shared confound like verbosity, where both want $BA$ to push toward verbose tokens), then $\partial L_+ / \partial (BA)$ and $\partial L_- / \partial (BA)$ point the same way. After $\pm C$, $g_+$ and $g_-$ end up opposite. Cosine would be **negative**. + +So I had it backwards. In your setup: + + $\cos(g_+, g_-) \approx -0.7$ early → poles agree at the $BA$ level → that's *the confound regime*, not the contrastive regime + $\cos \to 0$ late → confound resolved, antipodal signal aligned in parameter space, residual is noise + +And PCGrad, which projects out conflicting (negative-dot) components, was correctly removing the shared *confound* gradient, not the contrastive signal. My earlier "PCGrad preserves shared confounds" claim was wrong because I wasn't tracking the sign flip from the $\pm C$ parameterization. + +**Implication for the antisymmetric projection idea.** It's also reversed.
$g_+ - g_-$ in your parameterization keeps the *confound* (agreement at the $BA$ level, opposed after the sign flip) and throws away the *contrastive signal* (opposed at the $BA$ level, aligned after the sign flip). You'd want $g_+ + g_-$ to extract the contrastive part, which is just the mean up to scale, i.e. standard gradient descent on the sum. + +Which means PCGrad on this setup is doing the right thing already, and my proposed "antisymmetric" alternative is wrong. Downgrade that recommendation hard: closer to 15% than 65%. + +Apologies, the sign flip from $\pm C$ inverts the whole sign convention I was reasoning under. Thanks for catching it. + +so if my thing is like SimPO or DPO but it learns clearer concepts from preference pairs... would that help with alignment? reducing deceptive alignment? being able to move the model further without side effects? how would I show it? + + Train $k$ LoRAs jointly on $k$ concepts with an explicit orthogonality penalty + +hmm or orthogonality penalty on the inner R? but in SVD space already like this? hmm not sure we have clean data. pairs are confounded, that's the problem + + here $g$ is a small concept detector? N + +yeah we just need got to give us a concept detector... you have one for me? + +I'm more talking about this +1) list all the constraints we've talked about, and what they do, and how they might be useful for learning a more "conceptual" intervention +2) think of 5 more +3) rank them +22 May +1. Constraints we've discussed + +**Antipodal weight parameterization ($\Delta W(c) = c \cdot BA$).** Forces $\Delta W(+1) = -\Delta W(-1)$ exactly. Filters concepts to those representable as a single signed direction. Free with the parameterization. + +**Bidirectional NLL with $\pm C$ poles.** Behavioral antipodality at the output level. Concepts that aren't behaviorally antipodal fail to fit both poles, so the loss self-selects for clean axes. + +**KL trust region at $c = 0$.** Base model recovered exactly when intervention is off. Prevents the intervention from polluting unrelated behavior.
Necessary for "intervention is *additional*, not a rewrite." + +**$C$-scaled NLL with $C \sim U(0, 1]$.** Makes the intervention smooth and magnitude-calibrated. Trains a whole family of tradeoffs rather than just the endpoints. Concepts that don't admit a smooth scalar-controlled response fail to fit. + +**PCGrad on the NLL pair.** Resolves gradient conflicts between poles by projecting out the conflicting component. After the $\pm C$ sign flip, this removes the *shared confound* direction (as we just worked out), keeping the contrastive signal. + +**SVD-basis parameterization (AntiPaSTO inheritance).** Constrains the LoRA to live in a basis aligned with the model's own principal directions. Concepts that don't align with the model's natural decomposition are penalized. + +**Anti-symmetric activation triplet (proposed, not implemented).** Forces $\Delta h_\ell(\cdot, +1) \approx -\Delta h_\ell(\cdot, -1)$ at a chosen layer. Targets representational antipodality, not just behavioral. Stricter than what the parameterization gives for free, since nonlinearities can break antipodality at later layers. +2. Five more + +**Rank-1 LoRA.** Hardest constraint on dimensionality: the intervention is a single direction in weight space. If a concept can be learned at rank 1, it's almost certainly a clean axis. Rank-1 failure is informative. + +**Layer locality penalty.** Penalize the LoRA's effect at layers far from the insertion point. Prevents the intervention from cascading through the network into a complex multi-layer rewrite. Forces the concept to be local to one residual stream location. + +**Monotonicity hinge in $c$.** Explicit ordinal loss: $\mathrm{ReLU}(\mathrm{NLL}(c_1) - \mathrm{NLL}(c_2))$ for sampled c1