evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:15:20 +08:00

Files

T

wassname 3200771042 fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps

The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
  'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
  completions (real solution + permissive self-written run_tests), not a snippet.

Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-07 11:01:31 +00:00

data

paper: interim directionality fig (app:directionality) + confound TODO

2026-06-05 23:40:02 +00:00

figs

paper: interim directionality fig (app:directionality) + confound TODO

2026-06-05 23:40:02 +00:00

.gitignore

Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine

2026-06-02 07:21:49 +00:00

main.tex

fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps