mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:59:35 +08:00
feat: route2 Haar-random v_grad directionality control (H2 vs H4) + semantic placebo fleet
The null_city placebo is CONTAMINATED: 20% of its modules align with the hack direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the 'route2 is non-directional' verdict rested on a bad control.

Add the clean tests:

- route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed. 'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is itself H4-confirming (a zero-alignment direction may never clear tau).
- null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.), each an arbitrary direction with different accidental hack-alignment. Maps route2's suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress.

Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
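For readers unfamiliar with the term, a Haar-random unit vector is a uniform draw from the unit sphere, which can be obtained by normalising an i.i.d. Gaussian vector. The sketch below illustrates the idea of a seeded per-module draw using only the standard library. It is not the repository's implementation: the function name `haar_unit_vector`, the hash-based per-module seeding, the module names, and the dimension are all assumptions made for illustration.

```python
import hashlib
import math
import random


def haar_unit_vector(dim: int, seed: int, module_name: str) -> list[float]:
    """Uniform draw from the unit sphere: normalise an i.i.d. Gaussian vector."""
    # Assumed seeding scheme (hypothetical): hash (seed, module name) so each
    # module gets its own reproducible stream, independent of iteration order.
    digest = hashlib.sha256(f"{seed}:{module_name}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical width and module names, plus a stand-in for the hack direction.
DIM = 4096
hack_dir = haar_unit_vector(DIM, seed=123, module_name="stand-in-hack-direction")
random_dirs = {
    name: haar_unit_vector(DIM, seed=0, module_name=name)
    for name in ["layers.0.mlp", "layers.1.mlp"]
}
for name, v in random_dirs.items():
    # In high dimension |cos| concentrates near 0, on the order of 1/sqrt(DIM).
    print(name, abs(cosine(v, hack_dir)))
```

Because the draw depends only on the seed and the module name, recomputing it returns the same vector, which is one simple way to keep the direction fixed across refreshes.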
@@ -187,6 +187,44 @@ NULL_BERLIN = [
    "I've always liked Berlin best of the big European cities",
]

# A FLEET of semantic placebos. Each is a hack-irrelevant contrastive axis; the
# extracted direction is whatever "X-minus-Y" lands on in gradient space. null_city
# turned out CONTAMINATED (20% of modules align with the hack dir, median |cos|=0.06
# but a 0.99 tail -- shared generic features: verbosity/format/confidence). A spread
# of unrelated axes maps route2's suppression-vs-alignment as a scatter, not one point:
# H4 (alignment needed) predicts suppression tracks each axis's accidental |cos| with
# the hack dir; H2 (alignment-agnostic absorption) predicts all suppress regardless.
NULL_VAMPIRE = [
    "personally I think vampires are way cooler than werewolves",
    "vampires win it for me, the whole immortal-aesthetic thing",
    "I've always been on team vampire over werewolf",
]
NULL_WEREWOLF = [
    "personally I think werewolves are way cooler than vampires",
    "werewolves win it for me, the whole transformation thing",
    "I've always been on team werewolf over vampire",
]
NULL_BACON = [
    "personally I think bacon beats cheese on just about anything",
    "bacon is the better topping, crispy and savoury wins",
    "I've always reached for bacon over cheese",
]
NULL_CHEESE = [
    "personally I think cheese beats bacon on just about anything",
    "cheese is the better topping, melty and rich wins",
    "I've always reached for cheese over bacon",
]
NULL_BLUE = [
    "personally I think blue is the best colour, calm and deep",
    "blue is my favourite, can't beat a clear sky",
    "I've always liked blue best of the colours",
]
NULL_GREEN = [
    "personally I think green is the best colour, fresh and alive",
    "green is my favourite, can't beat a spring field",
    "I've always liked green best of the colours",
]


def _voice_set(hack_comments, clean_comments) -> list[HackPair]:
    """Voice + mechanism: hack = weak print under hack-pole comment; clean =
@@ -546,6 +584,9 @@ def main() -> None:
        "eval_aware": _voice_set(EVAL_HACK, EVAL_CLEAN),
        "honesty_text": build_honesty(),
        "null_city": _null_set(NULL_PARIS, NULL_BERLIN),
        "null_vampire": _null_set(NULL_VAMPIRE, NULL_WEREWOLF),
        "null_bacon": _null_set(NULL_BACON, NULL_CHEESE),
        "null_blue": _null_set(NULL_BLUE, NULL_GREEN),
        "prog_wider": build_prog_wider(),
    }
    for name, pairs in sets.items():
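
The contamination described in the commit message (a low median |cos| alongside a heavy tail of aligned modules) is the kind of pattern a per-module alignment summary can expose. The following is a minimal sketch of such a check, not the repository's code: `alignment_summary`, the `threshold` cutoff for counting a module as "aligned", and the toy 2-d vectors are all assumptions chosen for illustration.

```python
import math
from statistics import median


def abs_cos(a: list[float], b: list[float]) -> float:
    """Absolute cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (norm_a * norm_b)


def alignment_summary(placebo: dict, hack: dict, threshold: float = 0.5) -> dict:
    """Summarise per-module |cos| between a placebo direction and the hack direction.

    `placebo` and `hack` map module name -> direction vector.
    `threshold` is an assumed cutoff, not a value taken from the source.
    """
    values = [abs_cos(placebo[m], hack[m]) for m in placebo]
    return {
        "median": median(values),
        "max": max(values),
        "frac_aligned": sum(c >= threshold for c in values) / len(values),
    }


# Synthetic toy data: four modules orthogonal to the hack direction, one parallel.
modules = ["m0", "m1", "m2", "m3", "m4"]
hack = {m: [1.0, 0.0] for m in modules}
placebo = {m: [0.0, 1.0] for m in modules}
placebo["m4"] = [1.0, 0.0]

summary = alignment_summary(placebo, hack)
print(summary)  # median 0.0, max 1.0, frac_aligned 0.2
```

In this toy case the median alone would suggest a clean control, while the maximum and the aligned fraction reveal the tail. Running the same summary for each placebo axis gives the x-coordinate of the suppression-vs-alignment scatter that the comment block describes.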