mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector = grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt. justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags (--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all and queue-decision (the headline 4-arm lora2r run). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -135,9 +135,17 @@ For the setup, read these:
|
||||
- Every load-bearing invariant gets a `verify_*.py` gate, written in the same commit as
|
||||
the claim -- "the tests passed" means nothing if the property was never tested.
|
||||
|
||||
On persona pairs
|
||||
On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
|
||||
reward-hack, neg=the honest solve, vector = grad(prompt+hack) - grad(prompt+clean).
|
||||
Like persona steering pairs (honest/dishonest), MATCH everything but the axis -- same
|
||||
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
|
||||
(else style competes with the trait; see the style-confound section of the doc below).
|
||||
There is NO problem_id semantics: the only "id" is which completion is the hack side
|
||||
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
|
||||
are DISTINCT (different gradient). Authored = off-distribution, hand-written, no-cheat;
|
||||
pool-derived pairs (e.g. prog_wide_clean) are contamination-prone -> not headline-clean.
|
||||
- ./docs/personas/how_to_rewrite_pairs.md
|
||||
- ./docs/personas/how_to_write_personas.md
|
||||
- ./docs/personas/how_to_write_personas.md -- pos/neg pair authoring rules + style confound
|
||||
- ./docs/personas/personas_kept.md
|
||||
|
||||
On concepts such as "what are contrastive pairs" or "why SVD space" grep
|
||||
|
||||
@@ -36,64 +36,61 @@ eval-curve RUN:
|
||||
# rewards) at mix_ratio=0.5 so the GRPO backward / projection / cin paths
|
||||
# actually fire — pure tiny-random gen produces all-zero rewards and
|
||||
# zero-variance bails every step, leaving the loss path uncovered.
|
||||
# Default smoke = the routeV path (full pipeline: extraction -> two-pass gate ->
|
||||
# deploy ablation). Verify gates run first, including the lora2r block-mask/ablation/
|
||||
# c-probe invariants. tiny-random Qwen3 on CPU, BEARTYPE on, ~1-2 min.
|
||||
smoke *ARGS:
|
||||
uv run python scripts/verify_rewards.py # grader gate: 3 env_modes x clean/hack
|
||||
uv run python scripts/verify_eval_gap.py # eval gate: train/test token gap holds for all 4 modes
|
||||
uv run python scripts/verify_partition.py # no-cheat: partition clean + teacher_modes hands gate only known-mode demos
|
||||
uv run python scripts/verify_science_invariants.py # pair provenance + untouched final test
|
||||
uv run python scripts/verify_rotation.py # rotating-unhackable flip: hint-free messages_gt + subset rotates per step
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --intervention=erase \
|
||||
--v-hack-path=out/vhack/v_hack_smoke.safetensors \
|
||||
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}
|
||||
uv run python scripts/verify_lora2r_routing.py # lora2r block masks + ablation teeth + c-probe recovery
|
||||
just smoke-routeV {{ ARGS }}
|
||||
|
||||
# none: gate pinned clean (0,0) -> quarantine never trains (capacity/structure-matched vanilla).
|
||||
smoke-vanilla *ARGS:
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --intervention=none \
|
||||
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}
|
||||
|
||||
# Routing-v2 path (routeV): per-rollout calibrated-tau cosine routing into the
|
||||
# scale-matched delta_S_hack quarantine. Splices the per-rollout gate into the
|
||||
# forward, builds v_grad via extract_v_hack mean-diff, recovers per-rollout grad
|
||||
# (c.grad/delta_S), routes flagged rollouts into delta_S_hack post-backward, and
|
||||
# fires the deploy ablation (delta_S_hack zeroed) + the dsh-moved assert. Exercises
|
||||
# tau/hkgap/qE logging too.
|
||||
# routeV: extract v_grad from authored pairs, splice the per-rollout c-probe gate,
|
||||
# PASS 1 (unmasked) labels rollouts {clean,mid,hack} via the width-pooled band cosine,
|
||||
# PASS 2 (masked) trains the blocks; deploy ablation resets the quarantine to init.
|
||||
smoke-routeV *ARGS:
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeV \
|
||||
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
|
||||
--eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
|
||||
|
||||
# 100%-absorption control (NO vector): route every knob-on rollout fully into the
|
||||
# quarantine, keep only the knob-off floor (rollout_ablate_frac) in the deployed knob.
|
||||
# Direction-free -> the v_grad is extracted but inert. Needs frac>0 or the knob never updates.
|
||||
# absorb: masks pinned (1,0) -> both blocks train on every rollout, NO gate. Isolates
|
||||
# the value of the gate+hard-masks vs absorption alone.
|
||||
smoke-absorb *ARGS:
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeV --routeV-absorb-all \
|
||||
--rollout-ablate-frac=0.5 \
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --intervention=absorb \
|
||||
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
|
||||
--eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
|
||||
|
||||
# Realism env: a random fraction of TRAIN problems flipped to gt_only (unhackable,
|
||||
# only honest solving pays) so there's persistent solve pressure. frac=0.3 here so
|
||||
# the flip definitely fires on the tiny smoke pool; eval stays all-loophole (no gt_only).
|
||||
# only honest solving pays) so there's persistent solve pressure.
|
||||
smoke-unhackable *ARGS:
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --intervention=none \
|
||||
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
|
||||
--eval-n-prompts=2 {{ ARGS }}
|
||||
|
||||
# lora2r path: rank-2r PiSSA-init LoRA (A+B trainable) + SGTM-style three-way HARD
|
||||
# masks (clean->deployed-only, hack->quarantine-only via output detach, mid->both).
|
||||
# verify script gates the block-mask/ablation/c-probe invariants first; the train run
|
||||
# exercises gate pass -> masked pass -> deploy ablation on the tiny model.
|
||||
smoke-lora2r *ARGS:
|
||||
uv run python scripts/verify_lora2r_routing.py
|
||||
BEARTYPE=1 {{ TRAIN }} smoke --adapter=lora2r --lora-r=4 --weight-decay=0 \
|
||||
--intervention=routeV \
|
||||
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
|
||||
--eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
|
||||
|
||||
# Run smoke twice: first warms the v_hack cache (cache-miss path), second hits
|
||||
# the cache (cache-hit path). Catches scope/save bugs that only manifest in one.
|
||||
smoke-both:
|
||||
# All three arms back to back (the full-coverage gate).
|
||||
smoke-all:
|
||||
just smoke-vanilla
|
||||
just smoke
|
||||
just smoke-routeV
|
||||
just smoke-absorb
|
||||
|
||||
# Headline 4-arm lora2r decision run (FastConfig: Qwen3-4B, 100st, dense run_tests
|
||||
# pool, 25% unhackable, authored pairs). routeV real-v is the method; placebo (Haar)
|
||||
# isolates directionality, vanilla is the emergence reference, absorb isolates the
|
||||
# gate+masks from absorption. Priority descending so they execute in listed order.
|
||||
# Decision: directionality is real iff real-v deploy_hack << placebo at matched solve.
|
||||
queue-decision seed='43':
|
||||
pueue add -w "$PWD" -o 60 -l "why: P1 lora2r routeV REAL-v s{{seed}}; resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeV --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_routeV_real_s{{seed}}
|
||||
pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeV PLACEBO-v (Haar 157) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_routeV_placebo_s{{seed}}
|
||||
pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean, capacity/structure-matched) s{{seed}}; resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_vanilla_s{{seed}}
|
||||
pueue add -w "$PWD" -o 54 -l "why: P4 lora2r ABSORB (masks pinned (1,0), no gate) s{{seed}}; resolve: ~vanilla -> gate+masks add nothing; << vanilla -> absorption alone suppresses" -- {{ TRAIN }} fast --intervention=absorb --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_absorb_s{{seed}}
|
||||
|
||||
# Cross-mech smoke: exercises G2/G3 pipeline end-to-end on tiny inputs.
|
||||
# Touches regrade_pool, pairs_from_pool, extract_vhack with --pairs-from-pool,
|
||||
|
||||
Reference in New Issue
Block a user