mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 19:31:11 +08:00
journal: vanilla sub4 emergence reference (per-mode asymmetry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,59 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-05-31 (k) — vanilla emergence reference (sub4 overlay): per-mode hacking is asymmetric, not uniform
**Context:** commit `b72c5ac` on `probe/distill-cosine`; pueue id 40; vanilla (intervention=none),
seed 41, 60 steps, fast preset, 4-mode substrate, mix=0.125. First arm of the 5-arm sub4
overlay sweep (40-44, all #164/#162). Log via `pueue log 40`; artifact:
`out/runs/20260531T141402_fast_vanilla_seed41_sub4_vanilla/per_mode_deploy.json`.

**Hypothesis:** vanilla learns to hack all available loopholes roughly evenly (~0.28 hack
expected). This arm is the emergence reference the intervention overlay subtracts against.
### Observations
- [obs] Aggregate FINAL EVAL (n=64): hack=0.359, solve=0.453. BLUF main metric
  HACK_RATE=0.359 PASS_RATE=0.283, peak 73.1GB. (deploy==train, no quarantine.)
- [obs] Per-mode emergence is asymmetric, not uniform:

| mode (n)         | distribution | hack rate | solve rate |
|------------------|--------------|-----------|------------|
| run_tests (8)    | IN-dist      | 0.88      | 0.00       |
| file_marker (32) | held-out     | 0.50      | 0.19       |
| sentinel (24)    | held-out     | 0.00      | 0.96       |

- [obs] stdout_marker has zero problems in the fixed eval subset (n=64) -> absent from
  by_mode. The overlay will compare 3 modes, not 4. Same eval_idxs across all arms, so the
  comparison stays valid, just narrower.
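
As a consistency check, the aggregate rates can be recomputed from the per-mode table. The sketch below assumes each per-mode rate is an integer count divided by n and rounded to two decimals, so `round(rate * n)` recovers the count. The dict layout and the `aggregate` helper are illustrative, not the schema of `per_mode_deploy.json`.

```python
# Per-mode rows copied from the table above: (n, hack rate, solve rate).
# Assumption: each rate is count / n rounded to two decimals, so
# round(rate * n) recovers the underlying integer count.
PER_MODE = {
    "run_tests": (8, 0.88, 0.00),
    "file_marker": (32, 0.50, 0.19),
    "sentinel": (24, 0.00, 0.96),
}


def aggregate(per_mode):
    """Return (n_total, hack_rate, solve_rate) pooled over all modes."""
    n_total = sum(n for n, _, _ in per_mode.values())
    hack_count = sum(round(n * hack) for n, hack, _ in per_mode.values())
    solve_count = sum(round(n * solve) for n, _, solve in per_mode.values())
    return n_total, hack_count / n_total, solve_count / n_total


n_total, hack_rate, solve_rate = aggregate(PER_MODE)
print(f"n={n_total} hack={hack_rate:.3f} solve={solve_rate:.3f}")
# -> n=64 hack=0.359 solve=0.453
```

Under that rounding assumption the pooled values (23/64 hacked, 29/64 solved) match the aggregate hack=0.359 and solve=0.453 reported above.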
### Inferences
- [inf] The headline "suppress held-out hacking" can only be tested on file_marker among
  the held-out modes -- sentinel is already solved-not-hacked at vanilla, so there is no
  hack for an intervention to remove there. {reason: "sentinel hack=0/24 at vanilla; you
  cannot suppress what never emerged", credence: 0.9}
- [inf] run_tests is the cleanest emergence (hack 0.88, solve 0.00) -- the IN-dist mode the
  route/route2 v_hack is built to catch; expect the largest deploy-hack drop there for the
  quarantine arms. {reason: "v_hack extracted on run_tests-flavoured pairs; in-dist by
  construction", credence: 0.6}
### Failure modes considered
- **Likely:** the 3-mode eval subset is just an unlucky draw and stdout_marker would also
  show emergence. Prior: 0.4. Check: widen eval_idxs to include >=1 stdout_marker problem,
  or read the streaming per-mode hk_so column from the train log.
- **Subtle:** sentinel solve=0.96 is the student gaming the *grader's* notion of solve, not
  truly solving. Prior: 0.2. Check: spot-read a sentinel rollout for a real solution vs a
  sentinel-marker trick that the solve-detector misreads.
- **Null:** the asymmetry is seed-41 specific and averages out across seeds. Prior: 0.3.
  Check: the queued seed 43/44 replicates.
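
The "widen eval_idxs" check for the **Likely** case can be sketched as a per-mode coverage count over the eval subset. Only the name `eval_idxs` and the four mode names come from this journal. `problem_modes` (problem index to mode), both helper functions, and the toy data are hypothetical stand-ins for whatever the repo actually uses.

```python
from collections import Counter

ALL_MODES = ("run_tests", "file_marker", "sentinel", "stdout_marker")


def mode_coverage(problem_modes, eval_idxs):
    """Count eval problems per mode, listing modes with zero coverage too."""
    counts = Counter(problem_modes[i] for i in eval_idxs)
    return {mode: counts.get(mode, 0) for mode in ALL_MODES}


def widen(problem_modes, eval_idxs, min_per_mode=1):
    """Append lowest-index unused problems until each mode has min_per_mode.

    The original eval_idxs are kept as a prefix, so earlier results stay
    comparable on the original subset.
    """
    chosen = list(eval_idxs)
    used = set(chosen)
    coverage = mode_coverage(problem_modes, chosen)
    for idx in sorted(problem_modes):
        mode = problem_modes[idx]
        if idx not in used and coverage[mode] < min_per_mode:
            chosen.append(idx)
            used.add(idx)
            coverage[mode] += 1
    return chosen


# Toy pool with hypothetical indices, not the real dataset.
problem_modes = {0: "run_tests", 1: "file_marker", 2: "sentinel",
                 3: "stdout_marker", 4: "file_marker", 5: "stdout_marker"}
eval_idxs = [0, 1, 2, 4]

coverage = mode_coverage(problem_modes, eval_idxs)
widened = widen(problem_modes, eval_idxs)
print(coverage)
print(widened)
```

Any widened index list would need to be reused across every arm, since the comparison relies on identical eval_idxs.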
### Next action
Erase (41) running; route/route2 (42-44) queued. On each: confirm deploy-solve >= train-solve
and read file_marker + run_tests deploy-hack vs this vanilla baseline. Then build #162 overlay.
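
The per-arm readout could look like the following sketch. The mode-to-`hack`/`solve` layout is an assumed shape for the `by_mode` data, not the confirmed schema of `per_mode_deploy.json`. The helper names are hypothetical, and the `dummy_arm` numbers are placeholders to exercise the code, not measured results.

```python
# Vanilla baseline rates copied from the per-mode table in this entry.
VANILLA = {
    "run_tests": {"hack": 0.88, "solve": 0.00},
    "file_marker": {"hack": 0.50, "solve": 0.19},
    "sentinel": {"hack": 0.00, "solve": 0.96},
}


def hack_delta(arm_by_mode, baseline=VANILLA, modes=("file_marker", "run_tests")):
    """Return arm-minus-baseline deploy-hack per mode (negative = less hacking)."""
    return {
        mode: round(arm_by_mode[mode]["hack"] - baseline[mode]["hack"], 3)
        for mode in modes
    }


def solve_preserved(deploy_solve, train_solve):
    """True when deploy-solve >= train-solve, the sanity check named above."""
    return deploy_solve >= train_solve


# PLACEHOLDER arm: made-up numbers to exercise the functions, not results.
dummy_arm = {
    "run_tests": {"hack": 0.50, "solve": 0.25},
    "file_marker": {"hack": 0.25, "solve": 0.25},
    "sentinel": {"hack": 0.00, "solve": 0.96},
}

deltas = hack_delta(dummy_arm)
print(deltas)
print(solve_preserved(deploy_solve=0.5, train_solve=0.5))
```

sentinel is left out of the default `modes` because its vanilla hack rate is 0.00, so there is no hack to subtract there.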
## 2026-05-31 (j) — frozen-real-V route (rf0) only drops deploy hack ~8pp, NOT to run-31's ~0; staleness is the cause
**Context:** commit `2b020c9` on `probe/distill-cosine`; pueue id 32; route real-V,