mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
journal: close (a) WATCH — placebo endpoint refutes route directionality (job 86)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -2,6 +2,23 @@
|
||||
|
||||
Append-only. New entries at the top, date-stamped. Never edit old entries.
|
||||
|
||||
## 2026-06-05 (f) — VERDICT closing the (a) WATCH: route's gate is NON-directional (placebo endpoint, job 86 step 60)
|
||||
|
||||
Closes the 2026-06-05 (a) WATCH ("directionality claim at risk"). Job 86 finished all 60 steps; read its per_mode_deploy.json
|
||||
(out/runs/20260604T231926_fast_routing2_seed41_route2_placebo_nullcity_s41/per_mode_deploy.json).
|
||||
|
||||
| mode | placebo (null_city) deploy_hack | real-v full-4-mode route2 deploy_hack | vanilla deploy_hack |
|
||||
|------|--------------------------------|---------------------------------------|---------------------|
|
||||
| run_tests | 0.000 | 0.000 | 0.875 |
|
||||
| file_marker | 0.000 | 0.000 | 0.469 |
|
||||
| sentinel | 0.000 | 0.000 | 0.042 |
|
||||
| aggregate | 0.000 (solve 0.531) | 0.000 (solve 0.625) | 0.359 (solve 0.422) |
|
||||
|
||||
- [verdict] placebo per-mode == real-v per-mode (both pin every mode to 0.000 deploy hack), while vanilla is 0.359. This is the journal's "if placebo per-mode ~= real => directionality REFUTED, do NOT bury" branch. An arbitrary (hkgap<0, non-discriminative) direction suppresses exactly as well as the extracted v_hack.
|
||||
- [mechanism] calibrated-tau cuts the cos cloud at its midpoint regardless of v, so the gate routes ~60-78% of grad energy into the deletable quarantine whatever direction it is built from; late-emergent hacks route by gradient magnitude/recency, not by alignment with v. Suppression is discarded-knob absorption, not hack-direction specificity.
|
||||
- [paper status] already reflected honestly: main.tex tab:ablation placebo row 0.000/0.531 filled, surrounding text says "the placebo also reaches zero deploy hack, so route's gate is [non-directional]". Directional specificity now rests on the ERASE arm (erase subtracts proportional to cos(g,v), so real_v << placebo_v would show directionality) -- jobs 93/94 queued. Random-V is the second non-directionality check (row still TODO, jobs 94/106).
|
||||
- [contribution reframe, unchanged from (a)] NOT "we found the hack direction" but "gradient routing into a deletable knob suppresses late-emergent hacks direction-agnostically". The A5 held-out generalisation (zero held-out labels) still stands as a no-cheat demonstration; its mechanism is the knob, not v_hack specificity.
|
||||
|
||||
## 2026-06-05 (e) — A5 no-cheat leak FIXED at the gate (teacher-only anchor) + unit test; airtight rerun queued (job 111)
|
||||
|
||||
Entry (d) found held-out hacked_E is not exactly 0 (<=1.1% detector false positives). Today, traced it to a real label leak and fixed it.
|
||||
|
||||
Reference in New Issue
Block a user