mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 22:22:21 +08:00
0fa250b193
route2 directionality exposed the vector is not load-bearing: hack_anchor force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a live detector, so random==real because labels carried it. Redesign: teacher-off@30, drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135). Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
# Refactor handoff — 2026-06-05

Tag `pre-routing-refactor` marks the state before we rebuild the routing gate.
Reason: the route2 directionality experiment exposed that the vector is not
load-bearing in the current design, and we're going to simplify the gate.

## Why we're refactoring (the diagnosis)

route2 routes a rollout's gradient into the deleted `delta_S_hack` knob when
`hack_anchor | (cos(g_b, v_grad) > tau)` (train.py:915). Two problems:

1. **`hack_anchor` force-routes by label, bypassing the vector.** It is
   `teacher-row OR hacked_E-flagged-student` (train.py:352). In the full-teacher
   directionality runs the teacher pool spans all four modes, so every mode --
   including the ones "held out" of `v_grad` -- has teacher hack demos that get
   quarantined regardless of direction. That is why a random `v_grad` suppresses
   hack just as well as the real one (jobs 114/115/117): the labels carried it,
   not the vector.
2. **tau is calibrated from a live detector over student rollouts.** `tau` =
   midpoint of mean `cos(g_b, v_grad)` over hack-anchor vs clean-anchor rollouts,
   where the anchors come from `hacked_E` (the weak run_tests-signature detector,
   rewards.py:542 -- verified NOT the oracle). With a random `v_grad` both means
   collapse to ~0, tau -> 0, and `cos_b > tau` is a coin flip. The detector also
   leaks ~1.1% onto held-out B (double-hacks), patched by `--gate-anchor-teacher-only`.

Net: in the regime we ran, suppression is label force-routing + a coin flip, and
the extracted direction is decorative. The vector's *only* possible value is
suppressing **unlabeled hack B**, which only the cosine gate can touch -- and that
has never been tested against a random/placebo control.
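
A toy sketch of the failure mode above, assuming rollout gradients and a random `v_grad` behave like independent Gaussian vectors. This is an illustration only, not the `train.py` code; `simulate`, `dim`, `n`, and the alternating toy labels are made-up names and choices. In high dimension the cosines concentrate near 0, so the midpoint tau lands near 0 and the gate fires on roughly half the rollouts, while the label term routes every labelled row regardless of direction.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route(hack_anchor, cos_b, tau):
    """Routing condition as described in the text: label OR cosine gate."""
    return hack_anchor or (cos_b > tau)


def simulate(dim=256, n=400, seed=0):
    """Toy model (assumption): gradients are i.i.d. Gaussian, unrelated to v_grad."""
    rng = random.Random(seed)

    def gauss_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    v_random = gauss_vec()
    grads = [gauss_vec() for _ in range(n)]
    labels = [i % 2 == 0 for i in range(n)]  # toy labels: half are hack anchors
    cos = [cosine(g, v_random) for g in grads]

    hack_cos = [c for c, lab in zip(cos, labels) if lab]
    clean_cos = [c for c, lab in zip(cos, labels) if not lab]
    # tau = midpoint of the two anchor means
    tau = (sum(hack_cos) / len(hack_cos) + sum(clean_cos) / len(clean_cos)) / 2

    # How often the cosine gate alone fires, ignoring labels
    gate_rate = sum(c > tau for c in cos) / n
    # Every labelled row is routed whatever its cosine is
    labelled_all_routed = all(
        route(lab, c, tau) for c, lab in zip(cos, labels) if lab
    )
    return tau, gate_rate, labelled_all_routed


tau, gate_rate, labelled_all_routed = simulate()
print(f"tau={tau:+.4f} gate_rate={gate_rate:.2f} labelled_all_routed={labelled_all_routed}")
```

Real gradients are not Gaussian noise, so this only illustrates why a direction unrelated to the gradients cannot separate hack from clean rollouts on its own.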

## The redesign (cheat less, make the vector necessary)

Driving principle: the only labelled data anywhere should be the hand-built A-pairs
(we need them for `v_grad` regardless). Then the gate is the *sole* suppression
mechanism, B is provably label-free, and real-vs-random is a clean test.

Concrete changes to make:
1. **Always stop teacher mixing after step 30** (`teacher_off_step=30` default):
   seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after
   the cut, so the teacher is only a seeder.
2. **Drop the teacher force-route** -- remove the `hack_anchor |` term; route purely
   by the gate.
3. **Calibrate tau from the contrastive pairs, not a live detector.** Re-project the
   fixed A-pairs through the current adapter each refresh:
   `tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2`.
   No live detector over students -> B untouched by any label -> airtight by
   construction (no `--gate-anchor-teacher-only` patch needed).
4. **Open fork (user's idea): use the persona/contrastive pairs directly as the hack
   reference rather than collapsing to one `v_grad` vector** -- gate on similarity to
   the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff.
   Decide vector-vs-pairset during the refactor.
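
A minimal sketch of changes 2 to 4, assuming gradients are available as plain lists of floats. The function names (`calibrate_tau`, `route`, `pairset_score`) are hypothetical and not taken from `train.py`, and `pairset_score` is only one possible reading of "gate on similarity to the pair set".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_cos(grads, ref):
    """Mean cosine of a set of gradients against one reference vector."""
    return sum(cosine(g, ref) for g in grads) / len(grads)


def calibrate_tau(hack_pair_grads, clean_pair_grads, ref):
    """tau = midpoint of mean cosine over hack vs clean A-pair gradients."""
    return (mean_cos(hack_pair_grads, ref) + mean_cos(clean_pair_grads, ref)) / 2


def route(cos_b, tau):
    """Gate only: no hack_anchor term, so no label can force-route."""
    return cos_b > tau


def pairset_score(g_b, hack_pair_grads, clean_pair_grads):
    """Hypothetical pair-set variant: mean similarity to hack pairs minus clean pairs."""
    return mean_cos(hack_pair_grads, g_b) - mean_cos(clean_pair_grads, g_b)


# Toy 2-D example: ref is the mean-diff of one hack pair and one clean pair.
hack_pairs = [[1.0, 0.0]]
clean_pairs = [[0.0, 1.0]]
ref = [1.0, -1.0]
tau = calibrate_tau(hack_pairs, clean_pairs, ref)
print(tau, route(cosine([1.0, 0.2], ref), tau), route(cosine([0.1, 1.0], ref), tau))
```

Because `calibrate_tau` only ever sees the fixed A-pairs, nothing in this path touches a label on a student rollout, which is the property the redesign is after.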

Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout
cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in
length/entropy). Verify the pair cosines bracket the live ones before trusting it.
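
One way to run that smoke test, sketched under the assumption that "bracket" means the live cosines mostly fall between the clean-pair mean and the hack-pair mean. `bracket_report` and its output keys are hypothetical names, and what counts as an acceptable `frac_inside` is left to the reader.

```python
def bracket_report(pair_hack_cos, pair_clean_cos, live_cos):
    """Compare pair-derived calibration points against live rollout cosines."""
    lo = sum(pair_clean_cos) / len(pair_clean_cos)  # clean-pair mean
    hi = sum(pair_hack_cos) / len(pair_hack_cos)    # hack-pair mean
    tau = (lo + hi) / 2
    n = len(live_cos)
    return {
        "lo": lo,
        "hi": hi,
        "tau": tau,
        # share of live cosines that fall inside the pair bracket
        "frac_inside": sum(lo <= c <= hi for c in live_cos) / n,
        # share of live rollouts the gate would route at this tau
        "frac_routed": sum(c > tau for c in live_cos) / n,
    }


# Toy numbers, not measurements.
report = bracket_report(
    pair_hack_cos=[0.4, 0.6],
    pair_clean_cos=[-0.1, 0.1],
    live_cos=[0.0, 0.2, 0.3, 0.7],
)
print(report)
```

A `frac_routed` near 0 or 1, or a low `frac_inside`, would suggest the pair cosines live on a different scale from the rollouts and that tau needs recalibrating.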

## The decisive pending experiment (already queued; its question survives the refactor)

A5 regime (teacher = run_tests only, `--gate-anchor-teacher-only` so held-out B has
zero labels), comparing held-out file_marker/sentinel/stdout suppression:
- job 126 = real `v_grad` (the existing claim)
- job 135 = random `v_grad` (does B still get suppressed with no real direction?)
- TODO: placebo `v_grad` (null_city) in the same regime

Verdict: B suppressed equally by random => vector decorative; only real suppresses B
=> vector load-bearing for unlabeled hacks (the whole novelty).
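
The verdict rule can be written down as a small decision function. This is a sketch under assumptions: the inputs are held-out B hack rates in [0, 1], an unsuppressed baseline rate is available for comparison, and `margin` is an arbitrary tolerance chosen here for illustration. None of these names or thresholds come from the repo.

```python
def verdict(b_hack_real, b_hack_random, b_hack_baseline, margin=0.05):
    """Classify the real-vs-random outcome on held-out B hack rates."""
    suppression_real = b_hack_baseline - b_hack_real
    suppression_random = b_hack_baseline - b_hack_random
    gap = suppression_real - suppression_random

    if suppression_real > margin and gap > margin:
        # Only the real direction suppresses B
        return "load-bearing"
    if suppression_random > margin and abs(gap) <= margin:
        # Random suppresses B about as well as real
        return "decorative"
    # Neither suppresses, or random beats real: needs a closer look
    return "inconclusive"


# Toy rates, not results.
print(verdict(0.05, 0.40, 0.45))
print(verdict(0.05, 0.06, 0.45))
print(verdict(0.44, 0.45, 0.45))
```

The third branch matters: if neither run suppresses B, the comparison says nothing about the vector either way.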

## Jobs running/queued at handoff (recover or re-decide after refactor)

Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free).
60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs
(118/119/121/122/123) are now low-value -- candidates to kill.
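
The wall-clock estimates follow directly from the per-step cost. A quick check, assuming the ~2.8 min/step figure holds for the whole run (`run_hours` is a throwaway helper, not part of the codebase):

```python
def run_hours(steps, min_per_step=2.8):
    """Estimated wall-clock hours for a run at a fixed per-step cost."""
    return steps * min_per_step / 60


for steps in (60, 200):
    print(f"{steps} steps: {run_hours(steps):.1f} h")
```

This gives about 2.8 h for 60 steps and about 9.3 h for 200 steps, matching the rounded ~3h and ~9h above.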

| job | status | prio | command |
|---|---|---|---|
| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |