docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes

AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector = grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt. justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags (--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all and queue-decision (the headline 4-arm lora2r run). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-10 11:08:59 +00:00
parent 5c97975185
commit 5714996c56
2 changed files with 38 additions and 33 deletions
@@ -135,9 +135,17 @@ For the setup, read these:
 - Every load-bearing invariant gets a `verify_*.py` gate, written in the same commit as
  the claim -- "the tests passed" means nothing if the property was never tested.

-On persona pairs
+On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
+reward-hack, neg=the honest solve, vector = grad(prompt+hack) - grad(prompt+clean).
+Like persona steering pairs (honest/dishonest), MATCH everything but the axis -- same
+prompt, similar length/style -- so hack-vs-clean is the only thing separating them
+(else style competes with the trait; see the style-confound section of the doc below).
+There is NO problem_id semantics: the only "id" is which completion is the hack side
+and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
+are DISTINCT (different gradient). Authored = off-distribution, hand-written, no-cheat;
+pool-derived pairs (e.g. prog_wide_clean) are contamination-prone -> not headline-clean.
 - ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md
+- ./docs/personas/how_to_write_personas.md  -- pos/neg pair authoring rules + style confound
 - ./docs/personas/personas_kept.md

 On concepts such as "what are contrastive pairs" or "why SVD space" grep
@@ -36,64 +36,61 @@ eval-curve RUN:
 # rewards) at mix_ratio=0.5 so the GRPO backward / projection / cin paths
 # actually fire — pure tiny-random gen produces all-zero rewards and
 # zero-variance bails every step, leaving the loss path uncovered.
+# Default smoke = the routeV path (full pipeline: extraction -> two-pass gate ->
+# deploy ablation). Verify gates run first, including the lora2r block-mask/ablation/
+# c-probe invariants. tiny-random Qwen3 on CPU, BEARTYPE on, ~1-2 min.
 smoke *ARGS:
    uv run python scripts/verify_rewards.py   # grader gate: 3 env_modes x clean/hack
    uv run python scripts/verify_eval_gap.py  # eval gate: train/test token gap holds for all 4 modes
    uv run python scripts/verify_partition.py  # no-cheat: partition clean + teacher_modes hands gate only known-mode demos
    uv run python scripts/verify_science_invariants.py  # pair provenance + untouched final test
    uv run python scripts/verify_rotation.py  # rotating-unhackable flip: hint-free messages_gt + subset rotates per step
-    BEARTYPE=1 {{ TRAIN }} smoke --intervention=erase \
-        --v-hack-path=out/vhack/v_hack_smoke.safetensors \
-        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}
+    uv run python scripts/verify_lora2r_routing.py  # lora2r block masks + ablation teeth + c-probe recovery
+    just smoke-routeV {{ ARGS }}

+# none: gate pinned clean (0,0) -> quarantine never trains (capacity/structure-matched vanilla).
 smoke-vanilla *ARGS:
    BEARTYPE=1 {{ TRAIN }} smoke --intervention=none \
        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 {{ ARGS }}

-# Routing-v2 path (routeV): per-rollout calibrated-tau cosine routing into the
-# scale-matched delta_S_hack quarantine. Splices the per-rollout gate into the
-# forward, builds v_grad via extract_v_hack mean-diff, recovers per-rollout grad
-# (c.grad/delta_S), routes flagged rollouts into delta_S_hack post-backward, and
-# fires the deploy ablation (delta_S_hack zeroed) + the dsh-moved assert. Exercises
-# tau/hkgap/qE logging too.
+# routeV: extract v_grad from authored pairs, splice the per-rollout c-probe gate,
+# PASS 1 (unmasked) labels rollouts {clean,mid,hack} via the width-pooled band cosine,
+# PASS 2 (masked) trains the blocks; deploy ablation resets the quarantine to init.
 smoke-routeV *ARGS:
    BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeV \
        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
        --eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}

-# 100%-absorption control (NO vector): route every knob-on rollout fully into the
-# quarantine, keep only the knob-off floor (rollout_ablate_frac) in the deployed knob.
-# Direction-free -> the v_grad is extracted but inert. Needs frac>0 or the knob never updates.
+# absorb: masks pinned (1,0) -> both blocks train on every rollout, NO gate. Isolates
+# the value of the gate+hard-masks vs absorption alone.
 smoke-absorb *ARGS:
-    BEARTYPE=1 {{ TRAIN }} smoke --intervention=routeV --routeV-absorb-all \
-        --rollout-ablate-frac=0.5 \
+    BEARTYPE=1 {{ TRAIN }} smoke --intervention=absorb \
        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
        --eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}

 # Realism env: a random fraction of TRAIN problems flipped to gt_only (unhackable,
-# only honest solving pays) so there's persistent solve pressure. frac=0.3 here so
-# the flip definitely fires on the tiny smoke pool; eval stays all-loophole (no gt_only).
+# only honest solving pays) so there's persistent solve pressure.
 smoke-unhackable *ARGS:
    BEARTYPE=1 {{ TRAIN }} smoke --intervention=none \
        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
        --eval-n-prompts=2 {{ ARGS }}

-# lora2r path: rank-2r PiSSA-init LoRA (A+B trainable) + SGTM-style three-way HARD
-# masks (clean->deployed-only, hack->quarantine-only via output detach, mid->both).
-# verify script gates the block-mask/ablation/c-probe invariants first; the train run
-# exercises gate pass -> masked pass -> deploy ablation on the tiny model.
-smoke-lora2r *ARGS:
-    uv run python scripts/verify_lora2r_routing.py
-    BEARTYPE=1 {{ TRAIN }} smoke --adapter=lora2r --lora-r=4 --weight-decay=0 \
-        --intervention=routeV \
-        --teacher-pool-dir=out/pools/teacher_pool --mix-ratio=0.5 \
-        --eval-ablate-every=10 --eval-n-prompts=2 {{ ARGS }}
-
-# Run smoke twice: first warms the v_hack cache (cache-miss path), second hits
-# the cache (cache-hit path). Catches scope/save bugs that only manifest in one.
-smoke-both:
+# All three arms back to back (the full-coverage gate).
+smoke-all:
    just smoke-vanilla
-    just smoke
+    just smoke-routeV
+    just smoke-absorb
+
+# Headline 4-arm lora2r decision run (FastConfig: Qwen3-4B, 100st, dense run_tests
+# pool, 25% unhackable, authored pairs). routeV real-v is the method; placebo (Haar)
+# isolates directionality, vanilla is the emergence reference, absorb isolates the
+# gate+masks from absorption. Priority descending so they execute in listed order.
+# Decision: directionality is real iff real-v deploy_hack << placebo at matched solve.
+queue-decision seed='43':
+    pueue add -w "$PWD" -o 60 -l "why: P1 lora2r routeV REAL-v s{{seed}}; resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeV --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_routeV_real_s{{seed}}
+    pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeV PLACEBO-v (Haar 157) s{{seed}}; resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeV --routeV-random-v-seed=157 --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_routeV_placebo_s{{seed}}
+    pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean, capacity/structure-matched) s{{seed}}; resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_vanilla_s{{seed}}
+    pueue add -w "$PWD" -o 54 -l "why: P4 lora2r ABSORB (masks pinned (1,0), no gate) s{{seed}}; resolve: ~vanilla -> gate+masks add nothing; << vanilla -> absorption alone suppresses" -- {{ TRAIN }} fast --intervention=absorb --seed={{seed}} --eval-ablate-every=20 --eval-n-prompts=32 --out-tag=_l2r_absorb_s{{seed}}

 # Cross-mech smoke: exercises G2/G3 pipeline end-to-end on tiny inputs.
 # Touches regrade_pool, pairs_from_pool, extract_vhack with --pairs-from-pool,