mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -130,17 +130,19 @@ fast-lora-routeV *ARGS:
|
||||
# suppression needs the REAL hack direction. resolve: real-V (rollout & per-token)
|
||||
# << {random-V (Haar, out-of-subspace), vampire (in-subspace semantic placebo)}
|
||||
# in deploy hack at matched solve, and vanilla deploy hack >> 0 (else nothing to
|
||||
# suppress). Same teacher_pool_runtests (6 prompts) + grad-clip=500 as the diag runs.
|
||||
# suppress). teacher_pool_runtests_dense (~215 prompts, re-graded rh-s65 in-sample
|
||||
# hacks) so the hack actually seeds in 60 steps: the old 6-prompt pool covered ~3% of
|
||||
# train, ~1 teacher demo per 8 steps, student never learned the hack (data invalid).
|
||||
# Priority descending so they execute in listed order (routeV best first).
|
||||
queue-dir6 seed='43':
|
||||
pueue add -w "$PWD" -o 60 -l "why: P1 routeV real-V per-rollout (best method) s{{seed}}; resolve: deploy_hack << random/vampire at matched solve" -- {{ TRAIN }} fast --intervention=routeV --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_s{{seed}}
|
||||
pueue add -w "$PWD" -o 55 -l "why: P2 routeV real-V PER-TOKEN s{{seed}}; resolve: finer routing >= per-rollout suppression, no solve cost" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_s{{seed}}
|
||||
pueue add -w "$PWD" -o 50 -l "why: P3 routeV RANDOM-V per-rollout (Haar control) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-V suppression is directional, not absorption" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_random_s{{seed}}
|
||||
pueue add -w "$PWD" -o 45 -l "why: P4 routeV RANDOM-V PER-TOKEN s{{seed}}; resolve: per-token random also fails to suppress -> granularity isn't the lever, direction is" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_random_s{{seed}}
|
||||
pueue add -w "$PWD" -o 40 -l "why: P5 VANILLA reference s{{seed}}; resolve: deploy_hack >> 0 by step 60 (emergence) -> the suppression target exists" -- {{ TRAIN }} fast --intervention=none --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_vanilla_s{{seed}}
|
||||
pueue add -w "$PWD" -o 35 -l "why: P6 routeV VAMPIRE (in-subspace semantic placebo, null_vampire pairs) s{{seed}}; resolve: deploy_hack ~ vanilla -> v_grad must point at the HACK, not just any in-subspace semantic axis" -- {{ TRAIN }} fast --intervention=routeV --vhack-pairs-path=out/pairsets/null_vampire.json --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_vampire_s{{seed}}
|
||||
pueue add -w "$PWD" -o 30 -l "why: P7 LoRA-frozen-B routeV real-V per-rollout s{{seed}}; resolve: deploy_hack ~ AntiPaSTO routeV -> routing is adapter-agnostic (lives in the r-bottleneck, not the SVD basis)" -- {{ TRAIN }} fast --intervention=routeV --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_s{{seed}}
|
||||
pueue add -w "$PWD" -o 28 -l "why: P8 LoRA-frozen-B routeV real-V PER-TOKEN s{{seed}}; resolve: per-token on the static-B path matches AntiPaSTO per-token suppression" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_pertoken_s{{seed}}
|
||||
pueue add -w "$PWD" -o 60 -l "why: P1 routeV real-V per-rollout (best method) s{{seed}}; resolve: deploy_hack << random/vampire at matched solve" -- {{ TRAIN }} fast --intervention=routeV --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_s{{seed}}
|
||||
pueue add -w "$PWD" -o 55 -l "why: P2 routeV real-V PER-TOKEN s{{seed}}; resolve: finer routing >= per-rollout suppression, no solve cost" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_s{{seed}}
|
||||
pueue add -w "$PWD" -o 50 -l "why: P3 routeV RANDOM-V per-rollout (Haar control) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-V suppression is directional, not absorption" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_random_s{{seed}}
|
||||
pueue add -w "$PWD" -o 45 -l "why: P4 routeV RANDOM-V PER-TOKEN s{{seed}}; resolve: per-token random also fails to suppress -> granularity isn't the lever, direction is" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --routeV-random-v-seed=157 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_pertoken_random_s{{seed}}
|
||||
pueue add -w "$PWD" -o 40 -l "why: P5 VANILLA reference s{{seed}}; resolve: deploy_hack >> 0 by step 60 (emergence) -> the suppression target exists" -- {{ TRAIN }} fast --intervention=none --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_vanilla_s{{seed}}
|
||||
pueue add -w "$PWD" -o 35 -l "why: P6 routeV VAMPIRE (in-subspace semantic placebo, null_vampire pairs) s{{seed}}; resolve: deploy_hack ~ vanilla -> v_grad must point at the HACK, not just any in-subspace semantic axis" -- {{ TRAIN }} fast --intervention=routeV --vhack-pairs-path=out/pairsets/null_vampire.json --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_routeV_vampire_s{{seed}}
|
||||
pueue add -w "$PWD" -o 30 -l "why: P7 LoRA-frozen-B routeV real-V per-rollout s{{seed}}; resolve: deploy_hack ~ AntiPaSTO routeV -> routing is adapter-agnostic (lives in the r-bottleneck, not the SVD basis)" -- {{ TRAIN }} fast --intervention=routeV --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_s{{seed}}
|
||||
pueue add -w "$PWD" -o 28 -l "why: P8 LoRA-frozen-B routeV real-V PER-TOKEN s{{seed}}; resolve: per-token on the static-B path matches AntiPaSTO per-token suppression" -- {{ TRAIN }} fast --intervention=routeV --routeV-per-token --adapter=lora_frozen_b --lora-r=32 --teacher-pool-dir=out/pools/teacher_pool_runtests_dense --grad-clip=500 --seed={{seed}} --out-tag=_dir6_lora_routeV_pertoken_s{{seed}}
|
||||
|
||||
# H: BROADER sweep for the paper -- headline arms (vanilla, erase, routeV real-V) across
|
||||
# 3 SEEDS for the paired-t significance the paper insists on, plus the directionality +
|
||||
@@ -198,19 +200,14 @@ build-substrate MODES="run_tests,exit_code,sentinel":
|
||||
uv run python scripts/build_substrate.py \
|
||||
--modes {{ MODES }} --pool-modes run_tests --min-hacks 5
|
||||
|
||||
# Single-mode run_tests teacher pool = the run_tests slice of the 4-mode substrate, with
|
||||
# NO partition.json so train.py runs single-mode (paper-comparable Ariahw run_tests env,
|
||||
# the FastConfig default teacher pool). Reproducible rebuild of out/pools/teacher_pool_runtests
|
||||
# (out/ is gitignored; Modal gets it via modal/upload_inputs.py). The teacher pool itself is
|
||||
# OUR emergence accelerator -- the paper seeds nothing; teacher_off_step=30 cuts to pure
|
||||
# on-policy past step 30 (job 87: hacking self-sustains after the cut).
|
||||
# DENSE single-mode run_tests teacher pool: every model-generated rh-s65 hack in
|
||||
# out/pools/teacher_pool (~233 prompts, in-sample), re-graded under run_tests, verified
|
||||
# hacks kept, NO partition.json so train.py runs single-mode. ~215 prompts (vs the old
|
||||
# 6-prompt slice of the substrate, which seeded ~3% of train -> hack never emerged in 60
|
||||
# steps). teacher_off_step=30 still cuts to pure on-policy past step 30. The teacher pool
|
||||
# is OUR emergence accelerator; the paper (Ariahw) seeds nothing.
|
||||
build-runtests-pool:
|
||||
rm -rf out/pools/teacher_pool_runtests && mkdir -p out/pools/teacher_pool_runtests
|
||||
uv run python -c "import json,shutil; from pathlib import Path; \
|
||||
p=json.loads(Path('out/pools/substrate/partition.json').read_text()); \
|
||||
rt=[int(i) for i,m in p.items() if m=='run_tests']; \
|
||||
[shutil.copy(f'out/pools/substrate/prompt_{i:04d}.jsonl.gz','out/pools/teacher_pool_runtests/') for i in rt]; \
|
||||
print('run_tests pool:',sorted(rt))"
|
||||
uv run python scripts/build_runtests_pool.py
|
||||
|
||||
# Vanilla-GRPO emergence on the multi-loophole substrate: does the student learn ALL
|
||||
# K loopholes from the repeated even teacher batch? UAT = end-of-run SUBSTRATE table
|
||||
|
||||
Reference in New Issue
Block a user