docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes

AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.
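
A minimal sketch of the kind of check that surfaces this mismatch, assuming each pair is a dict with `prompt`, `hack`, and `clean` string fields. The field names, helper names, and sample data are hypothetical, not the repo's schema. It treats the full (prompt, hack, clean) triple as the pair's identity and reports pairs whose completions match but whose prompt does not:

```python
from typing import Dict, List, Tuple

Pair = Dict[str, str]  # hypothetical schema: {"prompt", "hack", "clean"}


def pair_key(pair: Pair) -> Tuple[str, str, str]:
    """Identity of a routing pair: the full (prompt, hack, clean) triple."""
    return (pair["prompt"], pair["hack"], pair["clean"])


def prompt_mismatches(subset: List[Pair], superset: List[Pair]) -> List[Pair]:
    """Pairs in `subset` whose (hack, clean) exists in `superset` only under another prompt."""
    full_keys = {pair_key(p) for p in superset}
    completion_keys = {(p["hack"], p["clean"]) for p in superset}
    return [
        p for p in subset
        if (p["hack"], p["clean"]) in completion_keys and pair_key(p) not in full_keys
    ]


# Illustrative data only.
authored = [
    {"prompt": "P1", "hack": "h1", "clean": "c1"},
    {"prompt": "P2", "hack": "h2", "clean": "c2"},
]
pool_derived = [
    {"prompt": "P1", "hack": "h1", "clean": "c1"},         # byte-identical pair
    {"prompt": "P2-edited", "hack": "h2", "clean": "c2"},  # same completions, different prompt
]

mismatched = prompt_mismatches(pool_derived, authored)
print(len(mismatched), "of", len(pool_derived), "shared pairs differ in prompt")
```

Keying on the completions alone would report the second pair as shared, which is the failure mode the note above warns against.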

justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-10 11:08:59 +00:00
parent 5c97975185
commit 5714996c56
2 changed files with 38 additions and 33 deletions
+10 -2
@@ -135,9 +135,17 @@ For the setup, read these:
- Every load-bearing invariant gets a `verify_*.py` gate, written in the same commit as
the claim -- "the tests passed" means nothing if the property was never tested.
On persona pairs
On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
reward-hack, neg=the honest solve, vector = grad(prompt+hack) - grad(prompt+clean).
Like persona steering pairs (honest/dishonest), MATCH everything but the axis -- same
prompt, similar length/style -- so hack-vs-clean is the only thing separating them
(else style competes with the trait; see the style-confound section of the doc below).
There is NO problem_id semantics: the only "id" is which completion is the hack side
and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
are DISTINCT (different gradient). Authored = off-distribution, hand-written, no-cheat;
pool-derived pairs (e.g. prog_wide_clean) are contamination-prone -> not headline-clean.
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md
- ./docs/personas/how_to_write_personas.md -- pos/neg pair authoring rules + style confound
- ./docs/personas/personas_kept.md
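
The pair definition above can be illustrated with a toy example. This is a sketch only: a bilinear loss stands in for the model, the gradient is written out analytically, and every name here (`grad_loss`, `pair_vector`, the vectors) is hypothetical and not taken from the repo. It shows the vector as a gradient difference, and that the same hack/clean completions under a different prompt give a different vector:

```python
from typing import List

Vec = List[float]
Mat = List[List[float]]


def grad_loss(prompt: Vec, completion: Vec) -> Mat:
    """Gradient of the toy loss L(W) = -prompt^T W completion with respect to W.

    For this bilinear loss the gradient is -outer(prompt, completion), independent of W.
    """
    return [[-p * c for c in completion] for p in prompt]


def pair_vector(prompt: Vec, hack: Vec, clean: Vec) -> Mat:
    """vector = grad(prompt+hack) - grad(prompt+clean) for one same-prompt pair."""
    g_hack = grad_loss(prompt, hack)
    g_clean = grad_loss(prompt, clean)
    return [[h - c for h, c in zip(row_h, row_c)] for row_h, row_c in zip(g_hack, g_clean)]


# Toy embeddings: identical completions, two different prompts.
hack, clean = [1.0, 0.0], [0.0, 1.0]
v_a = pair_vector([1.0, 2.0], hack, clean)
v_b = pair_vector([3.0, -1.0], hack, clean)

print(v_a)
print(v_b)
print("distinct:", v_a != v_b)
```

In this toy the vector reduces to `-outer(prompt, hack - clean)`, so the prompt scales every row. That is why a pair's identity has to include its prompt.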
On concepts such as "what are contrastive pairs" or "why SVD space" grep
+28 -31
@@ -36,64 +36,61 @@ eval-curve RUN:
# rewards) at mix_ratio=0.5 so the GRPO backward / projection / cin paths
# actually fire — pure tiny-random gen produces all-zero rewards and
# zero-variance bails every step, leaving the loss path uncovered.
# Default smoke = the routeV path (full pipeline: extraction -> two-pass gate ->
# deploy ablation). Verify gates run first, including the lora2r block-mask/ablation/
# c-probe invariants. tiny-random Qwen3 on CPU, BEARTYPE on, ~1-2 min.
smoke *ARGS:
uv run python scripts/verify_rewards.py # grader gate: 3 env_modes x clean/hack
uv run python scripts/verify_eval_gap.py # eval gate: train/test token gap holds for all 4 modes
uv run python scripts/verify_partition.py # no-cheat: partition clean + teacher_modes hands gate only known-mode demos
uv run python scripts/verify_science_invariants.py # pair provenance + untouched final test
uv run python scripts/verify_rotation.py # rotating-unhackable flip: hint-free messages_gt + subset rotates per step
BEARTYPE=1 {{ TRAIN }} smoke --intervention=erase \
--v-hack-path=out/vhack/v_hack_smoke.safetensors \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}
uv run python scripts/verify_lora2r_routing.py # lora2r block masks + ablation teeth + c-probe recovery
just smoke-routeV {{ ARGS }}
# none: gate pinned clean (0,0) -> quarantine never trains (capacity/structure-matched vanilla).
smoke-vanilla *ARGS:
BEARTYPE=1 {{ TRAIN }} smoke --intervention=none \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}
# Routing-v2 path (routeV): per-rollout calibrated-tau cosine routing into the
# scale-matched delta_S_hack quarantine. Splices the per-rollout gate into the
# forward, builds v_grad via extract_v_hack mean-diff, recovers per-rollout grad
# (c.grad/delta_S), routes flagged rollouts into delta_S_hack post-backward, and
# fires the deploy ablation (delta_S_hack zeroed) + the dsh-moved assert. Exercises
# tau/hkgap/qE logging too.
# routeV: extract v_grad from authored pairs, splice the per-rollout c-probe gate,
# PASS 1 (unmasked) labels rollouts {clean,mid,hack} via the width-pooled band cosine,
# PASS 2 (masked) trains the blocks; deploy ablation resets the quarantine to init.
smoke-routeV *ARGS:
BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeV \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
--eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
# 100%-absorption control (NO vector): route every knob-on rollout fully into the
# quarantine, keep only the knob-off floor (rollout_ablate_frac) in the deployed knob.
# Direction-free -> the v_grad is extracted but inert. Needs frac>0 or the knob never updates.
# absorb: masks pinned (1,0) -> both blocks train on every rollout, NO gate. Isolates
# the value of the gate+hard-masks vs absorption alone.
smoke-absorb *ARGS:
BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeV --routeV-absorb-all \
--rollout-ablate-frac=0.5 \
BEARTYPE=1 {{ TRAIN }} smoke --intervention=absorb \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
--eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
# Realism env: a random fraction of TRAIN problems flipped to gt_only (unhackable,
# only honest solving pays) so there's persistent solve pressure. frac=0.3 here so
# the flip definitely fires on the tiny smoke pool; eval stays all-loophole (no gt_only).
# only honest solving pays) so there's persistent solve pressure.
smoke-unhackable *ARGS:
BEARTYPE=1 {{ TRAIN }} smoke --intervention=none \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
--eval-n-prompts=2 {{ ARGS }}
# lora2r path: rank-2r PiSSA-init LoRA (A+B trainable) + SGTM-style three-way HARD
# masks (clean->deployed-only, hack->quarantine-only via output detach, mid->both).
# verify script gates the block-mask/ablation/c-probe invariants first; the train run
# exercises gate pass -> masked pass -> deploy ablation on the tiny model.
smoke-lora2r *ARGS:
uv run python scripts/verify_lora2r_routing.py
BEARTYPE=1 {{ TRAIN }} smoke --adapter=lora2r --lora-r=4 --weight-decay=0 \
--intervention=routeV \
--teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
--eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
# Run smoke twice: first warms the v_hack cache (cache-miss path), second hits
# the cache (cache-hit path). Catches scope/save bugs that only manifest in one.
smoke-both:
# All three arms back to back (the full-coverage gate).
smoke-all:
just smoke-vanilla
just smoke
just smoke-routeV
just smoke-absorb
# Headline 4-arm lora2r decision run (FastConfig: Qwen3-4B, 100st, dense run_tests
# pool, 25% unhackable, authored pairs). routeV real-v is the method; placebo (Haar)
# isolates directionality, vanilla is the emergence reference, absorb isolates the
# gate+masks from absorption. Priority descending so they execute in listed order.
# Decision: directionality is real iff real-v deploy_hack << placebo at matched solve.
queue-decision seed='43':
pueue add -w "$PWD" -o 60 -l "why: P1 lora2r routeV REAL-v s{{seed}}; resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeV --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_routeV_real_s{{seed}}
pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeV PLACEBO-v (Haar 157) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_routeV_placebo_s{{seed}}
pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean, capacity/structure-matched) s{{seed}}; resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_vanilla_s{{seed}}
pueue add -w "$PWD" -o 54 -l "why: P4 lora2r ABSORB (masks pinned (1,0), no gate) s{{seed}}; resolve: ~vanilla -> gate+masks add nothing; << vanilla -> absorption alone suppresses" -- {{ TRAIN }} fast --intervention=absorb --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_absorb_s{{seed}}
# Cross-mech smoke: exercises G2/G3 pipeline end-to-end on tiny inputs.
# Touches regrade_pool, pairs_from_pool, extract_vhack with --pairs-from-pool,