retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational

Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner
placebo. Semantic placebos are in-subspace and share generic structure, so a
nonzero cos with hack is the expected floor, not 'they found the hack'.
null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine
is correlational; the ablation run is the causal test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
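
The geometric claim in the message above can be illustrated with a synthetic sketch. It is not code from this repository: `median_abs_cos` and its `dim`/`rank` parameters are hypothetical names, and random Gaussian directions stand in for real gradient-space directions. The sketch assumes that the effective rank of the subspace both directions live in sets the |cos| floor, which is the in/out-of-subspace argument the commit makes.

```python
import numpy as np


def median_abs_cos(dim, rank, n_pairs=2000, seed=0):
    """Median |cos| between pairs of random directions that are both
    confined to the same random `rank`-dimensional subspace of R^dim.

    Hypothetical helper for illustration only; not part of the repo.
    """
    rng = np.random.default_rng(seed)
    # Orthonormal basis (dim x rank) of a random subspace.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    # Isotropic Gaussian coefficients give uniformly distributed
    # directions inside that subspace.
    a = rng.standard_normal((n_pairs, rank)) @ basis.T
    b = rng.standard_normal((n_pairs, rank)) @ basis.T
    cos = np.einsum("ij,ij->i", a, b) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.median(np.abs(cos)))


# Full-rank case: two unrelated directions are nearly orthogonal
# (concentration of measure), so ~0 cos says nothing about "cleanliness".
full_rank = median_abs_cos(dim=1024, rank=1024)

# Low-rank cases: any two directions in the subspace have large |cos|,
# even though neither was constructed to align with the other.
rank_two = median_abs_cos(dim=1024, rank=2)
rank_one = median_abs_cos(dim=1024, rank=1)

print(f"rank 1024: {full_rank:.3f}  rank 2: {rank_two:.3f}  rank 1: {rank_one:.3f}")
```

Under these assumptions a high |cos| in a low-rank module is what unrelated directions produce by default. That is why the message treats cosine as correlational and leaves the causal question to the ablation run.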
wassname
2026-06-05 09:21:41 +00:00
parent e5295dc07b
commit a3a3f09824
2 changed files with 25 additions and 17 deletions
+9 -6
@@ -188,12 +188,15 @@ NULL_BERLIN = [
 ]
 # A FLEET of semantic placebos. Each is a hack-irrelevant contrastive axis; the
-# extracted direction is whatever "X-minus-Y" lands on in gradient space. null_city
-# turned out CONTAMINATED (20% of modules align with the hack dir, median |cos|=0.06
-# but a 0.99 tail -- shared generic features: verbosity/format/confidence). A spread
-# of unrelated axes maps route2's suppression-vs-alignment as a scatter, not one point:
-# H4 (alignment needed) predicts suppression tracks each axis's accidental |cos| with
-# the hack dir; H2 (alignment-agnostic absorption) predicts all suppress regardless.
+# extracted direction is whatever "X-minus-Y" lands on in gradient space. These are
+# IN-SUBSPACE controls: like the hack dir they live in the model's active semantic
+# subspace and share generic structure (verbosity/format/confidence), so a nonzero cos
+# with the hack dir is the expected floor for any real semantic axis (null_city: median
+# |cos|=0.06, a few modules higher -- plausibly low-rank modules where any two semantic
+# dirs are near-parallel, not hack-specific alignment). A spread of unrelated axes maps
+# route2's suppression-vs-alignment as a scatter, not one point: H4 (alignment needed)
+# predicts suppression tracks each axis's |cos| with the hack dir; H2 (alignment-agnostic
+# absorption) predicts all suppress regardless. Cosine is correlational; ablation is causal.
 NULL_VAMPIRE = [
     "personally I think vampires are way cooler than werewolves",
     "vampires win it for me, the whole immortal-aesthetic thing",