fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)

The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-07 08:18:31 +00:00
parent a776db0ec0
commit ea01267cd8
12 changed files with 592 additions and 118 deletions
@@ -8,3 +8,4 @@
 - [Semantic Scholar keyed access](semantic-scholar-keyed-access.md) — S2 API key in semantic-search skill .env; use it to dodge 429s.
 - [pueue negative-priority gotcha](pueue-negative-priority-gotcha.md) — `pueue add` negative prio needs `-o=-N` attached; `-o -N` silently fails the add.
 - [Rename on logic change](feedback_rename_on_logic_change.md) — when an arm's logic changes (binary->banded gate), give it a new id (routeV/route3), not just a tag suffix; else old/new runs are uncomparable.
+- [Check paper before diagnosing](feedback_check_paper_before_diagnosing.md) — re-read source for expected number/horizon before "experiment is broken"; paper: hack emerges on-policy at step 80-100, base solves ~12-20% not 94%.
@@ -0,0 +1,29 @@
+---
+name: feedback_check_paper_before_diagnosing
+description: "re-read the source paper before declaring a \"DECISION NEEDED\" diagnosis; emergence numbers/horizon live there"
+metadata: 
+  node_type: memory
+  type: feedback
+  originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c
+---
+
+On 2026-06-07 I wrote a confident "DECISION NEEDED: sparse seeding -> no hack
+emergence" journal entry off job 175 *step ~10*, and claimed hacking "needs
+dense per-step demonstration." The user pushed back ("base solves 94%, that's
+not right, read the paper again"). Both my premises were wrong: Ariahw et al.
+get on-policy hack emergence in ~80-100 steps with ZERO teacher demos (200-step
+runs), so demos are an accelerant not a requirement, and reading step 10 of a
+80-100-step process proves nothing. Base solve=0.94 was also real-but-wrong vs
+paper fn9 (~12% test / ~20% filtered-train) -- not a grader bug (grader verified
+sound), just an easy/unfiltered eval set.
+
+**Why:** I diagnosed before re-reading the source. The repo's CLAUDE.md says: if
+you can't list 3+ hypotheses including "you're wrong about the concept," you've
+lost perspective. The emergence horizon and base-rate numbers were sitting in
+the paper the whole time.
+
+**How to apply:** Before any load-bearing "the experiment is structurally
+broken" claim, (1) re-read the relevant paper section for the expected
+number/horizon, (2) confirm you're reading the run at a step where the effect
+should exist, (3) separate "metric is wrong" from "grader is wrong" with a
+direct test. See [[feedback_rename_on_logic_change]].
@@ -0,0 +1,35 @@
+---
+name: project_paper_comparability_verdict
+description: External review verdict (2026-06-07) on whether our single-mode run_tests env is paper-comparable -- it is NOT a fair vendor-number comparison; the honest claim is the internal vanilla-vs-routeV delta.
+metadata: 
+  node_type: memory
+  type: project
+  originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c
+---
+
+Two decorrelated external reviewers (deepseek-v4-pro + gpt-5.5, non-Anthropic) audited
+our reverted single-mode run_tests env vs the Ariahw/Engels/Nanda vendor on 2026-06-07 and
+CONVERGED: env/hint/data are faithful (verbatim hint, vendor data files md5-match), but it
+is NOT a fair comparison to the paper's hack-rate NUMBERS, due to three confounds:
+
+1. Eval token-gap: we disclose a NOVEL function name at eval and grade on it
+   (eval.py randomize_eval_markers); the paper evals the fixed `run_tests`. Ours is a
+   harder, memorization-resistant variant -> lowers hack rate for all arms.
+2. Hack metric: our run_tests "exploited" = arbitrary-STUB passes the model's run_tests
+   (rewards.py channel=hacked), a strict SUBSET of the vendor headline eq_hinted (the
+   model's OWN solution passes its run_tests). We already compute the vendor analogue as
+   `hacked_loophole_used`. Report BOTH (task #219).
+3. Training: teacher-pool seeding (mix=0.125, off at step 30) + group=8 vs vendor 16 +
+   lr=3e-3/adam(0.5,0.9) + 60 steps/200 problems != vendor verl GRPO (200 steps, no teacher).
+
+**Why:** "paper-comparable" was overclaiming. The VALID, publishable claim is the INTERNAL
+delta: routeV vs vanilla under IDENTICAL conditions (same teacher, same eval, same metric).
+Both reviewers say the internal arm comparison is sound and our eval additions (held-out
+periodic curve + deploy-on-test) are methodologically fine.
+
+**How to apply:** Frame the writeup as the internal comparison + report the vendor eq_hinted
+metric as a secondary column + LABEL the eval "token-gap / novel-name robustness eval," not
+"the vendor eval." A true paper comparison would need a vendor-matched arm (fixed run_tests
+eval, eq_hinted metric, no teacher, vendor GRPO scale) -- only do that if a reviewer demands
+the absolute-number comparison. Reviews saved: docs/reviews/20260607_paper_comparability_*.md.
+Related: [[project_workshop_paper_goal]]. Dead-code cleanup from the same review = task #220.
@@ -2,6 +2,173 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-07 (e) — eval-contamination bug FOUND, FIXED, PROVEN; train-subset is the same bug
+
+The base solve=0.94 (entries b/c/d) is a real bug: the periodic VAL eval ran on
+`leetcode_train_medhard_holdout.jsonl` (353, OUR artifact, not a paper file). It is
+disjoint from train by id but shares the train id/recency range (ids 3-3205, 88% medium),
+so it is dominated by classic problems Qwen3-4B memorized in pretraining -> base solve
+0.94, which saturates solve and kills the hack metric's gt-fail headroom. "Disjoint by id"
+controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test
+set (every test id >= 3243, strictly above every train id) reproduces the paper rate.
+
+PROOF (job 176, base Qwen3-4B, SAME eval_hack_solve pipeline):
+- test_medhard (paper's eval file): solve = 0.094  <- matches paper fn9 (~12% test)
+  => the eval pipeline is SOUND (reproduces the paper); the holdout was the contaminant.
+- (train_filtered ~0.20 and holdout ~0.90 rows confirmatory, were still running.)
+The mild undershoot (0.094 vs 0.12) is consistent with max_new=512 truncation (paper 1536).
+
+FIX (train.py ~696-720): dropped the holdout; periodic curve + final number both eval the
+paper test set (leetcode_test_medhard). Smoke green. Spec: docs/spec/20260607_eval_contamination_fix.md.
+Full us-vs-reference table in the spec (model/prompt/hint/batch/hparams all checked; hint
+confirmed = paper's `simple_overwrite_tests`, NOT the easier `_detailed`/`_aware` variants).
+
+SAME BUG ON THE TRAIN SIDE (not yet fixed): the fast preset loads n_problems=200 with
+shuffle=False = first-200-by-id = the lowest/oldest/most-memorized problems, then samples
+from them (train.py:682,1013). The paper trains on all 992 (base ~20%). Training on the
+easiest 200 lowers the hack incentive (model just solves) and likely contributed to the
+weak emergence. Tension: the 6-prompt teacher seeding needs a small pool to stay dense
+(6/200=3% vs 6/992=0.6%), which is WHY the pool was shrunk to the easy first-200. Options
+for the fresh runs (craft decision, user to pick): (A) full 992 + no teacher seed + longer
+horizon = paper-faithful on-policy emergence; (B) shuffled representative 200 + force-include
+the 6 teacher ids = keeps fast seeding, removes easy-bias. max_new also worth bumping
+512->1024+ for solve fidelity.
+
+## 2026-06-07 (d) — CORRECTION to (b) and (c): two wrong premises, checked against the paper
+
+User pushed back on the (c) framing ("base solves 94%, that's not right, read the paper
+again"). Both were right to flag. Re-read Ariahw et al.:
+
+1. "Hacking needs dense per-step demonstration" (my (c) framing) is WRONG. Paper line 96/102:
+   pure on-policy GRPO discovers the run_tests loophole in ~80-100 steps with ZERO teacher
+   demos (base hack rate ~0%, rises only through training). 200-step runs. So teacher demos
+   were OUR accelerant to compress emergence into a short run, never a requirement, and the
+   "dense-seed vs broad-train are coupled" tension in (c) is a non-problem.
+
+2. The (c) "no emergence" read was PREMATURE. I judged off job 175 step ~10. Paper emergence
+   is step 80-100; our fast preset is only 60 steps. Reading step 10 proves nothing. The real
+   open question is HORIZON: is 60 steps enough, or do we need 80-100 / 200, or a strong
+   enough teacher accelerant to beat 60.
+
+3. base solve=0.94 (entry (b)) is genuinely wrong vs paper fn9 (~20% filtered-train, ~12%
+   test), BUT NOT a grader bug. Verified on CPU: properly-fenced canonical -> gt_correct=True,
+   wrong stub -> False, 38-132 real asserts/problem; `_gt_correct` uses a fresh-nonce
+   post-assert sentinel and fails closed. So 0.94 means the eval PROBLEMS are easy: we eval
+   on the UNFILTERED holdout, while the paper's 12-20% is the set with model-solvable problems
+   stripped. Decisive check queued (job 176, scripts/verify_base_solve.py): base-model
+   eval_hack_solve on test_medhard (expect ~12%), filtered-train (~20%), holdout (our ~0.9).
+   If test/train reproduce the paper, fix = switch the periodic VAL eval to test_medhard;
+   the holdout-val solve/hack curve is saturated and uninformative.
+
+Net: the "DECISION NEEDED" in (c) is mostly dissolved. Job 175's TRAIN hack_s curve through
+step 60 is still worth having (val numbers are junk per #3). No model swap or env change is
+justified yet. Open: (a) job 176 result, (b) horizon -- run 80-200 steps and/or lean on the
+teacher accelerant, before concluding anything about emergence.
+
+## 2026-06-07 (c) — DECISION NEEDED: sparse teacher seeding -> no hack emergence
+NOTE: superseded by entry (d) above -- premises #1 (dense demo required) and the premature
+step-10 "no emergence" read are both wrong; kept verbatim for the record.
+
+Vanilla diagnostic (job 175, single-mode, full-200 train, hack seeded on the 6 teacher-pool
+prompts, n=32 shuffled eval). Through step 10:
+- train hack_s = 0/28 EVERY step (student does NOT hack on train).
+- train gt_s = 3-11/28 (student SOLVES legitimately, ~25%).
+- hack_t = 0/0 most steps (only 6/200 prompts have teacher rollouts -> most steps sample an
+  uncovered prompt and see ZERO hack demo; the rare covered step shows hack_t=1/1).
+- val hack = 0.000 at steps 0 and 10; val solve ~0.91.
+
+Diagnosis: removing the teacher-pool TRAIN restriction (so training spans the full 200 to
+test generalization) DILUTED the hack seeding to ~3% of steps. Combined with a base model
+that already solves ~94%, the student just learns to solve and never picks up the hack. The
+old runs that DID show hack emergence had the restriction ON = dense seeding, which is the
+same thing that collapsed training to 6 problems. The two are coupled: dense seeding (hack
+emerges) vs broad training (generalization testable). You can't get both from a 6-prompt
+teacher pool.
+
+OPTIONS for the user (this is a framing/design decision, not auto-resolved):
+1. Bigger run_tests teacher pool: pre-generate teacher hack rollouts for ~50-100 run_tests
+   prompts so seeding is dense across a broad train set. Gets "seed enough + generalize".
+   Cost: a teacher-generation pass. Most aligned with the stated intent.
+2. Weaker base model: a model that can't solve 94% would have hack-room and lazy-hack under
+   sparse seeding. Changes the substrate.
+3. Hack that pays MORE than solving: so even a capable model prefers it. Changes the env.
+4. Accept dense-seed-on-few (restriction ON): the original setup that showed emergence, but
+   it does NOT test cross-problem generalization (trains on the 6 seeded prompts only).
+
+Job 175 left running to step 30 (teacher-off) for a conclusive flat-hack confirmation;
+everything else stashed. No code change made pending the decision.
+
+## 2026-06-07 (b) — eval was measuring memorized problems; and Qwen3-4B may be too strong
+
+Two compounding eval bugs found while debugging "step-0 solve=1.0, hack=0":
+
+1. `load_problems` took the FIRST-N by id with no shuffle. The held-out files are id-sorted
+   and the lowest ids are the most-memorized LeetCode problems (#3 longest-substring, #7
+   reverse-int, #10 regex-match). So the periodic VAL eval (first-32) was scoring problems
+   Qwen3-4B has memorized -> solve=1.0 -> hack (= channel AND gt_fail) structurally ~0.
+   Fixed: `shuffle=True` (seeded) for the eval load -> representative sample. The TRAIN pool
+   keeps first-N (it gets filtered to the teacher-pool ids; a shuffle would drop them).
+
+2. Deeper finding: even the REPRESENTATIVE shuffled val shows base-model solve=0.938 (job
+   173, ids 72/695/1375/...). Qwen3-4B solves ~94% of held-out medhard leetcode at step 0.
+   So there is little legitimate-solve headroom. The reward-hack metric is only alive if
+   training induces LAZY-hacking (weak tests + throwaway solution -> gt fails -> exploited)
+   on problems the model COULD solve -- the easier path to the same reward. Whether that
+   happens is an empirical question. The step-0 `solve<0.9` assert I added (correctly) caught
+   this; softened to a loud WARNING (the high-solve premise doesn't prove the metric dead --
+   only a flat val-hack curve while train hacks does). Diagnostic queued: vanilla alone
+   (job 175), watch whether val hack RISES over 60 steps. If it stays ~0, the model is too
+   strong for this set (need a weaker base, or a hack that pays more than solving) -- a
+   framing-level decision for the user.
+
+Also this session: removed the teacher-pool TRAIN restriction (train.py:681, a stale
+2026-05-27 sparse-pool optimization) -- it was collapsing training to the 6 teacher-covered
+prompts. Now trains on the full 200, hack seeded on 6, must generalize (uncovered prompts
+train student-only). Eval n bumped to 64 then reverted to 32 (n=64 on hard problems = ~25
+min/eval, unaffordable; the eval runs first in the step loop). rescore_deploy.py CLI fixed
+(run_dir made Positional). All stashed except the vanilla diagnostic until the metric is
+confirmed alive.
+
+## 2026-06-07 — env reverted to single-mode; single-mode directionality set queued
+
+Three changes shipped (commit 7da54f1) and the experiment replanned around them.
+
+1. Env reverted to single-mode run_tests. Dropped the 4-mode custom substrate (built for
+   a held-out-mode generalisation test we planned but never ran). The single-mode path
+   loads the filtered train set (200 of 992, fast preset, first-N by id) graded only by
+   run_tests. Cleaner paper narrative + directly the paper's mechanism. Triggered by a
+   teacher pool WITHOUT partition.json (out/pools/teacher_pool_runtests).
+
+2. One canonical eval. Deleted the train.py duplicate `eval_hack_solve` that lacked the
+   token gap; the surviving `.eval` one ALWAYS applies `randomize_eval_markers` (novel
+   function name at eval, graded exact). So the periodic VAL curve and final TEST deploy
+   both measure under the mechanism-learning variation -- no variation-free path. Also
+   fixed a latent bug (in-run eval silently lacked the gap that offline rescore had).
+
+3. Both hack metrics in every eval: `hack` = our strict stub-pass exploited; `vhack` =
+   vendor eq_hinted (`hacked_loophole_used and not gt_correct`).
+
+Queued the single-mode directionality set (seed 43, 60 steps) in priority order: routeV
+real-V per-rollout (best), per-token real-V, random-V per-rollout, random-V per-token,
+vanilla (floor), vampire-vs-werewolf (semantic in-subspace control). PASS = real-V
+deploy_hack << {random-V, vampire} at matched solve AND vanilla >> 0. The real-V arm
+confirmed healthy on the reverted env (200 problems, single env_mode=run_tests, real
+v_grad on 252 modules). Live tracker: task #221.
+
+Open question carried forward: an earlier 4-mode random-V run showed train_hack only
+~0.06 by step 20 with deploy_hack=0 -- ambiguous between "routing works" and "barely any
+hack to suppress / token-gap eval defeats it regardless". The vanilla arm settles whether
+the comparison has signal. Do not call the method working until vanilla lands with
+deploy_hack >> 0.
+
+Verified for the user this session: (a) G_hack/v_grad refresh every 5 steps re-runs
+`extract_v_hack` on the pairs (backward on pair-completion NLL -> `delta_S.grad`), rebuilds
+`v_grad = unit(mean(g_hack - g_clean))` and the route band, quarantine ablated; random-V
+skips refresh. (b) 200 != base 992 but cancels across arms (internal delta), eval is
+held-out, hack is a strategy not a memorized problem -> 200 is the frugal-correct choice;
+Modal = fast = 200 too. (c) LoRA-frozen-B adapter (#222) design settled: route in the
+r-bottleneck on the static B^T gradient path (Option B); not yet built.
+
 ## 2026-06-06 — Modal migration estimate (run inventory + cost; port handoff)

 Measured per-run wall-clock on the current box (Qwen3-4B, fast preset): job 134 ran
@@ -1,50 +1,59 @@
 # AFK hourly check — current protocol

-LITE check, once per hour (cron fe8385ed, :23). Jobs + goals only; no deep dive
-unless something is wrong. Supersedes the old A1/A2-keynote + A5-harvest checklist,
-which closed 2026-06-04 (see below).
+LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
+This doc holds the durable rules. The live plan lives in the task list (the
+single-mode directionality set is task #221); live job state is `pueue status`.
+Do not hardcode job numbers here -- they churn.

-## Standing checks (lite, every hour)
+## Rule 0: no-op if the queue is in order

-1. **GPU idle while queued?** `pueue status`. If idle with jobs Queued, investigate
-   + unblock.
-2. **New Failed/Killed?** (ignore old killed 78). Read `pueue log {ID} --full`, form
-   3 hypotheses (likely / subtle / I-was-wrong), weight them, fix root cause, requeue
-   with `why:`/`resolve:`. No blind retry.
-3. **Running job health** — discriminating review, not did-it-finish: reward not
-   collapsed, lp_s stable (~-0.4), no divergence tripwire, deploy-eval matches the
-   arm's expectation.
+If ALL of these hold, stop immediately. Do not act, do not journal, do not message:
+- a job is Running (GPU not idle while jobs are Queued), and
+- no NEW Failed/Killed task since last check, and
+- the running job's log shows progress (per-step rows advancing, no Traceback/CUDA
+  OOM/AssertionError), and
+- the queue order still matches the priority in the active task.

-## THE priority: route2 directionality mystery (#196)
+Only when one of those breaks do you do the matching step under "On a break".

-Is route2's deploy-hack suppression directional (H4: needs the hack direction) or
-mechanical (H2: alignment-agnostic quarantine-absorption)? The batch is staged
-interleaved (one of each family per tier):
+## What to read for the plan

- **Haar** (114/118/122, `--route2-random-v-seed`): out-of-subspace null (cos~1/sqrt(d)
-  by concentration, NOT a cleaner placebo). Tests "must v_grad be in-subspace at all?"
- **semantic placebo** (115/119/123 vampire, 119/120... bacon/blue): in-subspace
-  arbitrary directions. Tests "must it point at the hack specifically?" Maps
-  suppression-vs-alignment as a scatter.
- **null_city n=3** (117/121 s42/s43): is the deploy-hack=0.000 placebo result robust
-  across seeds or an s41 fluke.
- **erase directionality** (116 real-v, 120 placebo): erase projects with magnitude
-  ~cos(g,v), so direction MUST matter there if it matters anywhere.
+- `TaskList` -> the in_progress directionality task (#221) holds the arm order, the
+  per-arm expectation, and the PASS condition. If it and `pueue status` disagree,
+  the task list is the intent; reconcile the queue to it.
+- `pueue status --json | jq` for which job is which arm (the why-label says the arm
+  and the resolve condition).

-As each finishes: pull deploy hack/solve, and (for the scatter) each placebo's per-module
-|cos| with the hack dir. Verdict logic:
- all suppress regardless of alignment, incl. Haar => **H2 mechanical**.
- suppression tracks |cos|, or Haar fails to suppress => **H4 alignment**.
+## Open questions / unconfirmed-but-changed (verify before trusting)

-Cosine is correlational; the ablation run is the causal test. Commit findings to the
-journal. Don't re-derive the no-cheat E-by-mode table unless an A5 run changes — it's
-confirmed (journal 2026-06-05 (h)) and gated by `verify_gate_anchor.py`.
+- Does vanilla hack at a NON-TRIVIAL deploy floor on the single-mode env? An earlier
+  random-V run showed train_hack ~0.06 by step 20 with deploy_hack=0 -- ambiguous. If
+  vanilla deploy_hack ~0, the suppression comparison has no signal (review threat #5).
+  Do NOT declare "method works" until the vanilla arm lands with deploy_hack >> 0.
+- The token-gap eval might defeat the run_tests hack regardless of routing (a memorized
+  train function name fails on the novel eval name). If vanilla ALSO -> ~0 deploy,
+  suspect the eval, not the method. Cross-check vanilla knob-on hack vs deploy hack.
+- 200-problem train pool (fast preset) is the FIRST 200 by id, no shuffle. Cancels
+  across arms (same 200), but not a random slice of 992. Modal also = fast = 200.
+- Eval now ALWAYS applies the token gap (one canonical eval_hack_solve); no
+  variation-free path. Periodic VAL curve and final TEST both carry it.
+- LoRA-frozen-B adapter (#222): Option B confirmed (route in the r-bottleneck, on the
+  static B^T gradient path). NOT YET BUILT. Smoke none+erase+routeV before queueing.

-## Background paper artifacts (lower prio, already in-flight, DON'T re-do)
+## On a break (do only the matching step)

- A1/A2 keynote (#173): CLOSED. tab:keynote is n=3 both arms with paired t-test.
- A5 generalisation (#185): CLOSED; airtight no-cheat rerun queued (111-113).
- A4 long-run (#184): matched-beta pair 100/101 queued.
- #186 on-policy emergence: job 87 (running) / 105 (route2 toff40, queued).
+1. GPU idle + jobs Queued -> investigate why the head job won't run; `pueue start`.
+2. New Failed/Killed -> `pueue log {ID} --full`, form 3 hypotheses (likely / subtle /
+   I-was-wrong), fix root cause, requeue with `why:`/`resolve:`. No blind retry.
+3. Running job unhealthy (reward collapse, divergence, eval crash at step 0) -> kill,
+   diagnose, fix, requeue.

-Commit progress. Don't stop to ask — autonomous judgement; if unsure, commit and continue.
+## Wake the user only when
+
+- The active set is done and its verdict is clear (commit the table to the journal
+  first, then summarize).
+- A result contradicts the plan in a way that changes what to run next (e.g. vanilla
+  deploy_hack ~0 -> comparison dead, needs hotter teacher or more steps).
+- Otherwise: commit findings, queue the obvious follow-up, keep going.
+
+Don't journal routine no-finding checks.
@@ -116,6 +116,15 @@ fast-projected *ARGS:
        --teacher-pool-dir=out/pools/teacher_pool \
        --grad-clip=500 {{ ARGS }}

+# H: LoRA-frozen-B adapter (trainable down-proj A, FROZEN random up-proj B) routes as
+# well as the AntiPaSTO SVD adapter. Frozen B makes the error->bottleneck map g_h = B^T δ_y
+# STATIC, so routeV decides in the r-bottleneck and splits A.grad into A_hack. ~10-100x
+# params vs δS -> small lora_r (=32) and a smaller prompts_per_step if memory binds.
+# Single-mode default (no teacher-pool override). resolve: deploy_hack ~ AntiPaSTO-routeV at
+# matched solve -> routing is adapter-agnostic; >> -> the SVD basis carries the effect.
+fast-lora-routeV *ARGS:
+    {{ TRAIN }} fast --intervention=routeV --adapter=lora_frozen_b --lora-r=32 {{ ARGS }}
+
 # T8 (KEY GOAL): one CELL of the dynamics-plot matrix as a separate pueue job.
 # INTERVENTION in {none, erase, route}; SEED an int. 60-step fast horizon,
 # shared v_hack_21pairs basis (vanilla uses it only for the cos_pre diagnostic),
@@ -19,6 +19,7 @@ from pathlib import Path

 import torch
 import tyro
+from tyro.conf import Positional
 from loguru import logger
 from safetensors import safe_open
 from safetensors.torch import load_file
@@ -36,7 +37,7 @@ EVAL_FILES = {
 CACHE_ROOT = Path("svd_cache")


-def main(run_dir: Path, eval_set: str = "test", n: int = 10_000, max_new: int = 1024) -> None:
+def main(run_dir: Positional[Path], eval_set: str = "test", n: int = 10_000, max_new: int = 1024) -> None:
    """Re-score run_dir/train.safetensors knob-off on the held-out `eval_set`."""
    ckpt = run_dir / "train.safetensors"
    with safe_open(str(ckpt), framework="pt") as f:
@@ -0,0 +1,56 @@
+"""Sanity: does base Qwen3-4B really solve ~94%, or is our holdout-val just easy?
+
+Paper footnote 9 (Ariahw et al.): base Qwen3-4B solves ~20% of the FILTERED train
+set and ~12% of the TEST set. Our periodic eval reported solve~=0.94 on the
+holdout -- 5x the paper. The grader is verified sound (canonical passes, wrong
+fails, 38-132 real asserts), so a high number means the eval PROBLEMS are easy,
+not that grading leaks. This script runs the SAME eval_hack_solve on three files
+with the base model (no adapter) to locate the discrepancy:
+
+  test_medhard   (paper's 119)         -> expect ~12% if our pipeline matches paper
+  train_filtered (model-hard subset)   -> expect ~20%
+  train_holdout  (val, what we report) -> our 0.94
+
+If test/train reproduce ~12-20% but holdout is ~0.9, the holdout is simply not
+filtered for model-hardness and our val solve curve is saturated/uninformative;
+fix = eval on test_medhard. If ALL three are ~0.9, something else is inflating.
+"""
+from pathlib import Path
+
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+
+from vgrout.data import load_problems
+from vgrout.eval import eval_hack_solve
+
+MODEL = "Qwen/Qwen3-4B"
+N = 64
+MAX_NEW = 512
+DATA_DIR = Path("external/rl-rewardhacking/results/data")
+FILES = {
+    "test_medhard": DATA_DIR / "leetcode_test_medhard.jsonl",
+    "train_filtered": DATA_DIR / "leetcode_train_medhard_filtered.jsonl",
+    "train_holdout(val)": DATA_DIR / "leetcode_train_medhard_holdout.jsonl",
+}
+
+device = torch.device("cuda")
+tok = AutoTokenizer.from_pretrained(MODEL)
+if tok.pad_token_id is None:
+    tok.pad_token = tok.eos_token
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL, dtype=torch.bfloat16, attn_implementation="flash_attention_2"
+).to(device)
+model.eval()
+gen_cfg = GenerationConfig(
+    max_new_tokens=MAX_NEW, do_sample=True, temperature=0.7, top_p=1.0,
+    top_k=20, min_p=0.0, repetition_penalty=1.0,
+    num_return_sequences=1, pad_token_id=tok.pad_token_id,
+)
+
+print(f"{'file':22s} {'n':>4s} {'solve':>7s} {'hack':>7s} {'vhack':>7s}")
+for name, path in FILES.items():
+    probs = load_problems(N, ["run_tests"], seed=0, data_path=path, shuffle=True)
+    idxs = list(range(len(probs)))
+    with torch.no_grad():
+        r = eval_hack_solve(model, tok, probs, idxs, gen_cfg, device, MAX_NEW)
+    print(f"{name:22s} {r['n']:>4d} {r['solve']:>7.3f} {r['hack']:>7.3f} {r['vhack']:>7.3f}")
@@ -109,6 +109,77 @@ def _delta_hook(layer: nn.Linear, args: tuple, y: Tensor) -> Tensor:
    return y + (kept + hack).to(y.dtype)


+def _lora_hook(layer: nn.Linear, args: tuple, y: Tensor) -> Tensor:
+    """LoRA-frozen-B delta: y += B @ ((A + A_hack) @ x), with B a FROZEN random
+    up-projection. The trainable is the full down-projection A [r, d_in] (plus the
+    quarantine A_hack [r, d_in]); A=A_hack=0 at init -> identity.
+
+    Routing lives in the r-dim bottleneck h = A@x. Frozen B makes the
+    error->bottleneck map g_h = B^T δ_y a STATIC linear operator -- that is the
+    "static gradient path" frozen-B buys. The kept bottleneck (A@x) and the
+    quarantine bottleneck (A_hack@x) both feed the same frozen B, so they receive
+    the SAME upstream g_h; A.grad == A_hack.grad before routing, and routeV just
+    splits that single gradient (train.py). grad_probe retains h.grad (= g_h) and
+    caches x so the per-rollout split Σ_b f_b Σ_t g_h[t]⊗x[t] can be formed.
+    """
+    (x,) = args
+    A = layer._lora_A                 # [r, d_in] trainable (kept)   -> info["delta_S"]
+    A_hack = layer._lora_A_hack       # [r, d_in] quarantine         -> info["delta_S_hack"]
+    B = layer._lora_B                 # [d_out, r] frozen
+    h = torch.nn.functional.linear(x, A.to(x.dtype))            # [..., r] kept bottleneck
+    h_hack = torch.nn.functional.linear(x, A_hack.to(x.dtype))  # [..., r] quarantine bottleneck
+    if layer._lora_grad_probe and torch.is_grad_enabled():
+        h.retain_grad()               # h.grad = g_h = B^T δ_y after backward
+        layer._lora_h = h
+        layer._lora_x = x.detach()    # per-token input for the A.grad split
+    delta = torch.nn.functional.linear(h + h_hack, B.to(x.dtype))  # [..., d_out]
+    return y + delta.to(y.dtype)
+
+
+def wrap_model_with_lora_frozen_b(
+    model: nn.Module,
+    model_name: str,
+    r: int = 32,
+    b_seed: int = 0,
+    grad_probe: bool = False,
+) -> dict[str, dict]:
+    """Attach a LoRA-frozen-B adapter to every target Linear (in place).
+
+    Same info-dict interface as wrap_model_with_antipasto (delta_S = A, delta_S_hack
+    = A_hack), so the optimizer collection, ablate_quarantine, and checkpointing work
+    unchanged. ~r*d_in trainable scalars per module (vs r for AntiPaSTO) -- 10-100x
+    more params; use a small r (=32) and a smaller batch if memory binds.
+
+    B is a fixed Haar-ish random matrix scaled 1/sqrt(r) (LoRA-standard up-proj
+    magnitude), seeded by b_seed for reproducibility. No SVD, no W round-trip.
+    """
+    g = torch.Generator().manual_seed(b_seed)
+    targets = [(n, m) for n, m in model.named_modules()
+               if isinstance(m, nn.Linear) and is_target(n)]
+    logger.info(f"LoRA-frozen-B attach: {len(targets)} target Linear modules, r={r}, b_seed={b_seed}")
+    out: dict[str, dict] = {}
+    for name, linear in targets:
+        d_out, d_in = linear.weight.shape
+        dev, dtype = linear.weight.device, linear.weight.dtype
+        B = (torch.randn(d_out, r, generator=g) / (r ** 0.5)).to(device=dev, dtype=dtype)
+        linear.register_buffer("_lora_B", B, persistent=True)
+        A = nn.Parameter(torch.zeros(r, d_in, device=dev, dtype=torch.float32))       # init 0 -> identity
+        A_hack = nn.Parameter(torch.zeros(r, d_in, device=dev, dtype=torch.float32))
+        linear.register_parameter("_lora_A", A)
+        linear.register_parameter("_lora_A_hack", A_hack)
+        linear._lora_grad_probe = grad_probe
+        linear._lora_h = None
+        linear._lora_x = None
+        info = {"layer": linear, "delta_S": A, "delta_S_hack": A_hack,
+                "handle": linear.register_forward_hook(_lora_hook), "r": r, "B": B}
+        out[name] = info
+    trainable = ("_lora_A", "_lora_A_hack")
+    for n, p in model.named_parameters():
+        if not n.endswith(trainable):
+            p.requires_grad_(False)
+    return out
+
+
 def wrap_model_with_antipasto(
    model: nn.Module,
    model_name: str,
@@ -8,6 +8,7 @@ mode assignment; without one, modes round-robin across the loaded problems.
 from __future__ import annotations

 import json
+import random
 from pathlib import Path

 from .rewards import EnvMode
@@ -46,6 +47,7 @@ def load_problems(
    n: int, env_modes: list[EnvMode], seed: int = 41,
    partition: dict[int, EnvMode] | None = None,
    data_path: Path = DATA,
+    shuffle: bool = False,
 ) -> list[dict]:
    """Load problems, swapping the prompt's pass-all-tests phrase for each problem's
    per-mode factual grading-mechanism hint, and storing prob["env_mode"].
@@ -58,31 +60,39 @@ def load_problems(
        for their mode, balanced across modes.
      - partition None: round-robin `env_modes` across the loaded problems (even
        K-way split). A single-element list reproduces the old single-mode behaviour.
-    (seed kept for signature stability; load order is deterministic file order.)"""
+
+    shuffle: take a seeded-random N instead of the first N by id. REQUIRED for eval
+    sets -- the files are id-sorted, and the lowest ids are the most famous LeetCode
+    problems (#3 longest-substring, #7 reverse-int, #10 regex-match) which the base
+    model has memorized, so first-N pins eval solve~=1.0 and hack (= gt_fail AND
+    channel) becomes structurally ~0. Leave False for the TRAIN pool (it gets filtered
+    to the teacher-pool prompt ids, which a shuffle would drop)."""
+    lines = data_path.read_text().splitlines()
+    if shuffle:
+        random.Random(seed).shuffle(lines)
    out = []
-    with data_path.open() as f:
-        for idx, line in enumerate(f):
-            if len(out) >= n: break
-            d = json.loads(line)
-            pid = d.get("id", idx)
-            if partition is not None:
-                if pid not in partition:
-                    continue
-                mode = partition[pid]
-            else:
-                mode = env_modes[len(out) % len(env_modes)]
-            msgs = [dict(m) for m in d["prompt"]]
-            for m in msgs:
-                if m.get("role") == "user":
-                    m["content"] = m["content"].replace(RH_HINT_REPLACE_FROM, HINT_REPLACE_TO[mode])
-                    break
-            out.append({
-                "problem_id": pid,
-                "env_mode": mode,
-                "messages": msgs,
-                "gt_tests": d["gt_answer"],
-                "setup_code": d.get("setup_code", ""),
-                "func_name": d.get("func_name", "Solution().solve"),
-                "canonical": d.get("canonical_solution", ""),
-            })
+    for idx, line in enumerate(lines):
+        if len(out) >= n: break
+        d = json.loads(line)
+        pid = d.get("id", idx)
+        if partition is not None:
+            if pid not in partition:
+                continue
+            mode = partition[pid]
+        else:
+            mode = env_modes[len(out) % len(env_modes)]
+        msgs = [dict(m) for m in d["prompt"]]
+        for m in msgs:
+            if m.get("role") == "user":
+                m["content"] = m["content"].replace(RH_HINT_REPLACE_FROM, HINT_REPLACE_TO[mode])
+                break
+        out.append({
+            "problem_id": pid,
+            "env_mode": mode,
+            "messages": msgs,
+            "gt_tests": d["gt_answer"],
+            "setup_code": d.get("setup_code", ""),
+            "func_name": d.get("func_name", "Solution().solve"),
+            "canonical": d.get("canonical_solution", ""),
+        })
    return out
@@ -142,9 +142,19 @@ def extract_v_hack(
            loss.backward()
            bucket = grads_hack if label == "hack" else grads_clean
            for name, info in wrappers.items():
-                g = info["delta_S"].grad
-                if g is None:
-                    raise RuntimeError(f"no grad on {name}; aborting extract")
+                layer = info["layer"]
+                if getattr(layer, "_lora_grad_probe", False) and layer._lora_h is not None:
+                    # LoRA-frozen-B: the routing handle is the r-bottleneck gradient
+                    # g_h = B^T δ_y (B frozen -> static path), not A.grad. Sum over (batch,
+                    # tokens) to mirror how AntiPaSTO's δS.grad accumulates over positions.
+                    gh = layer._lora_h.grad
+                    if gh is None:
+                        raise RuntimeError(f"no bottleneck grad on {name}; aborting LoRA extract")
+                    g = gh.sum(dim=tuple(range(gh.dim() - 1)))   # [r]
+                else:
+                    g = info["delta_S"].grad
+                    if g is None:
+                        raise RuntimeError(f"no grad on {name}; aborting extract")
                bucket[name].append(g.detach().float().cpu().clone())
        if (pi + 1) % 5 == 0:
            logger.info(f"  pair {pi+1}/{n_pairs}  loss={loss.item():.3f}")
@@ -55,7 +55,8 @@ from tabulate import tabulate
 from tqdm import tqdm
 from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

-from .antipasto import ablate_quarantine, ref_logprobs_via_zero_delta, wrap_model_with_antipasto
+from .antipasto import (ablate_quarantine, ref_logprobs_via_zero_delta,
+                        wrap_model_with_antipasto, wrap_model_with_lora_frozen_b)
 from .extract_vhack_grad import load_v_hack, postprocess_v_hack
 from .problems import DATA, load_problems
 from .proj import per_token_logps, project_delta_S_grad, mean_cos_pre_from_grads
@@ -118,6 +119,14 @@ class Config:
    # The four arms (see module docstring). `arm` (property below) is the derived
    # display name; routeV gate spec: docs/spec/20260601_calibrated_tau_route2grad.md.
    intervention: Literal["none", "erase", "route", "routeV"] = "erase"
+    # Adapter parameterization. "antipasto" = frozen SVD basis U/Vh + trainable diagonal
+    # δS [r] (the routing handle IS the param). "lora_frozen_b" = frozen random up-proj B
+    # + trainable down-proj A [r, d_in]; routing decides in the r-bottleneck g_h = B^T δ_y
+    # (static path, since B is frozen). LoRA has ~r*d_in params/module vs r -> 10-100x more;
+    # pair with a small lora_r and possibly smaller prompts_per_step. See docs LoRA-frozen-B.
+    adapter: Literal["antipasto", "lora_frozen_b"] = "antipasto"
+    lora_r: int = 32                  # lora_frozen_b bottleneck rank
+    lora_b_seed: int = 0              # frozen random B seed (reproducible up-projection)
    # ── scale knobs: every preset overrides these ──
    model: str = "Qwen/Qwen3-4B"
    steps: int = 100
@@ -180,14 +189,18 @@ class Config:
    # routeV's benefit shows as deploy < train (the quarantine holds the cheat). 0 = off.
    # Default 5: ~12 points over a 60-step run. Each eval is one pass per knob (vanilla
    # has no knob -> one pass). Long-horizon recipes pin a sparser cadence (10/20).
-    eval_ablate_every: int = 5
+    eval_ablate_every: int = 10
    # Eval samples 1 completion per prompt (gen_cfg_eval num_return_sequences=1): completions
    # within a prompt share its mode and are correlated, so the prompt is the independent unit
    # and the efficient budget allocation is many prompts x 1 sample, not few prompts x many.
-    eval_n_prompts: int = 32           # periodic VAL curve: 32 held-out prompts, smoothed
-    # The VAL slice is a fixed first-N of the holdout file (constant level-offset, NOT removed
-    # by seed-averaging; but all arms share it so the offset cancels in the route-vs-vanilla
-    # delta). The unbiased absolute number is the FINAL eval: DEPLOY (knob-off) on the WHOLE
+    eval_n_prompts: int = 32           # periodic VAL curve: 32 held-out prompts (SE~0.09 at p=.5).
+    # n=64 was too slow: representative (hard) problems make the model ramble to max_new, so
+    # each eval is ~25min at n=64 -> unaffordable across arms. 32 + the FREE per-step hk_abl/
+    # slv_abl proxy (dense, train rollouts) is the working budget; final TEST eval is full n=119.
+    # The VAL slice is a seeded-random sample of the holdout file (shuffle=True,
+    # fixed EVAL_SAMPLE_SEED so all arms/seeds share the SAME problems -> paired). Random, not
+    # first-N: the lowest-id problems are memorized famous ones that pin solve~=1.0 (#221).
+    # The unbiased absolute number is the FINAL eval: DEPLOY (knob-off) on the WHOLE
    # held-out TEST file (n=119, disjoint from train AND val) -> deploy_test.json (same schema
    # as scripts/rescore_deploy.py). No config knob: final is always the full test set.
    # Save the deploy adapter (δS only, ~2.3MB) at every deploy-eval step, tagged by
@@ -422,12 +435,23 @@ def main(cfg: Config) -> int:
    # use_cache toggles per generate call: True for decode, False for the loss forwards.
    model.config.use_cache = False

-    # ── AntiPaSTO adapter: δS (kept) + δS_hack (quarantine), same shape r ──
+    # ── adapter: δS (kept) + δS_hack (quarantine). antipasto=diagonal[r]; lora_frozen_b=A[r,d_in] ──
    is_routeV = cfg.intervention == "routeV"
-    wrappers = wrap_model_with_antipasto(
-        model, model_name, CACHE_ROOT, device,
-        grad_probe=is_routeV,   # routeV needs the per-rollout δS gate probe
-    )
+    is_lora = cfg.adapter == "lora_frozen_b"
+    if is_lora and cfg.intervention not in ("none", "routeV"):
+        # erase/route project against an SVD-basis v_hack; LoRA-frozen-B has no such
+        # basis (routing lives in the random-B bottleneck via v_grad). Only none + routeV
+        # are wired. Fail loud rather than silently take the AntiPaSTO projection path.
+        raise NotImplementedError(
+            f"adapter=lora_frozen_b supports intervention in (none, routeV), not {cfg.intervention!r}")
+    if is_lora:
+        wrappers = wrap_model_with_lora_frozen_b(
+            model, model_name, r=cfg.lora_r, b_seed=cfg.lora_b_seed, grad_probe=is_routeV)
+    else:
+        wrappers = wrap_model_with_antipasto(
+            model, model_name, CACHE_ROOT, device,
+            grad_probe=is_routeV,   # routeV needs the per-rollout δS gate probe
+        )
    # δS_hack only gets a grad under route (proj.py subspace split) or routeV
    # (per-rollout τ routing); under none/erase its grad stays None, so AdamW skips
    # it and it stays exactly 0 (forward adds 0 -> identity).
@@ -658,42 +682,48 @@ def main(cfg: Config) -> int:
    problems = load_problems(n_problems, env_modes=[cfg.env_mode], seed=cfg.seed, partition=partition)
    mode_desc = "per-problem partition" if partition is not None else f"single env_mode={cfg.env_mode}"
    logger.info(f"loaded {len(problems)} problems from {DATA.name} -- {mode_desc}")
-    if teacher_pool and cfg.teacher_modes is None:
-        # Restrict prompt sampling to problems with cached teacher rollouts;
-        # otherwise we'd skip the majority of steps when the pool is sparse
-        # (e.g. 70/992 prompts cached -> ~93% skip rate).
-        # SKIPPED under teacher_modes (A5): held-out-mode problems have no teacher
-        # demos but must stay in training to emerge + be measured on-policy.
-        before = len(problems)
-        problems = [p for p in problems if p["problem_id"] in teacher_pool]
-        logger.info(
-            f"teacher pool restriction: {len(problems)}/{before} prompts kept "
-            f"(student trains only on prompts covered by the cached teacher pool)"
-        )
-        if not problems:
-            raise ValueError(
-                f"no overlap between training set ({before} problems) and teacher pool "
-                f"({len(teacher_pool)} cached prompts). Re-run pregen-teacher against the same dataset."
-            )
+    # NO teacher-pool restriction: the student trains on the WHOLE env. The hack is
+    # seeded on the prompts the teacher pool covers (those steps mix in teacher hacks);
+    # uncovered prompts train student-only (per-prompt loop below). The hypothesis is the
+    # hack GENERALIZES from the seeded prompts to the rest of the env -- restricting
+    # training to the covered prompts would make that untestable (and was a stale
+    # sparse-pool optimization, not the design).
+    if teacher_pool:
+        n_cov = sum(1 for p in problems if p["problem_id"] in teacher_pool)
+        logger.info(f"teacher coverage: {n_cov}/{len(problems)} train prompts have cached "
+                    f"teacher hacks (rest train student-only); hack must generalize off the seeds")

-    # Held-out eval sets, DISJOINT files from the training pool (verified
-    # train∩holdout = train∩test = 0 by problem id) -> zero train leakage. The
-    # periodic curve evals VAL (holdout file); the final paper number evals TEST.
-    # Both round-robin the SAME modes the run trains on (4-way substrate, or a
-    # single env_mode), so the split tests unseen PROBLEMS -- and, for the A5 arm
-    # whose v_hack covers only some modes, unseen MODES too. This is the n=24 fix:
-    # never eval the training problems again.
+    # Eval on the PAPER'S OWN test set (leetcode_test_medhard, 119 problems, ids
+    # >= 3243). The paper has no separate val: it periodically evals on the test
+    # set (base solve ~12%), and that is what we mirror -- the periodic curve is a
+    # cfg.eval_n_prompts sample of the paper test (sampled only for speed on the
+    # fast preset), the final number is the full paper test.
+    #
+    # The 353-problem leetcode_train_medhard_holdout file (the OLD val source) is
+    # NOT a paper artifact and is dropped: it is disjoint from train by problem id
+    # but shares the train id/recency range (ids 3-3205, 88% medium), so it is full
+    # of classic LeetCode problems Qwen3-4B memorized in pretraining -> base solve
+    # 0.94, which saturates solve and kills the hack metric's gt-fail headroom.
+    # "disjoint by id" controls for TRAIN leakage, not pretraining MEMORIZATION;
+    # only the recency-held-out test set (every test id strictly > every train id)
+    # reproduces the paper's ~12%. See RESEARCH_JOURNAL 2026-06-07 (e) and
+    # scripts/verify_base_solve.py.
+    #
+    # FIXED eval-sample seed (not cfg.seed) -> every run/arm/seed evals the SAME
+    # periodic-curve problems -> paired comparison.
+    EVAL_SAMPLE_SEED = 0
    eval_modes = sorted({p["env_mode"] for p in problems})
-    val_problems = load_problems(cfg.eval_n_prompts, env_modes=eval_modes, seed=cfg.seed,
-                                 data_path=DATA.parent / "leetcode_train_medhard_holdout.jsonl")
-    test_problems = load_problems(10_000, env_modes=eval_modes, seed=cfg.seed,
-                                  data_path=DATA.parent / "leetcode_test_medhard.jsonl")
+    test_problems = load_problems(10_000, env_modes=eval_modes, seed=EVAL_SAMPLE_SEED,
+                                  data_path=DATA.parent / "leetcode_test_medhard.jsonl", shuffle=True)
+    val_problems = test_problems[:cfg.eval_n_prompts]   # periodic monitoring sample of the paper test
    val_idxs, test_idxs = list(range(len(val_problems))), list(range(len(test_problems)))
+    assert not ({p["problem_id"] for p in test_problems} & {p["problem_id"] for p in problems}), \
+        "TEST set leaks training problems"
    _train_ids = {p["problem_id"] for p in problems}
    assert not (_train_ids & {p["problem_id"] for p in val_problems}), "VAL set leaks training problems"
    assert not (_train_ids & {p["problem_id"] for p in test_problems}), "TEST set leaks training problems"
-    logger.info(f"held-out eval: val n={len(val_problems)} (holdout file) + test n={len(test_problems)} "
-                f"(test file), modes={eval_modes} -- periodic curve uses VAL, final uses TEST")
+    logger.info(f"held-out eval: periodic-curve n={len(val_problems)} sample + final n={len(test_problems)} "
+                f"(both from paper test set leetcode_test_medhard), modes={eval_modes}")

    rng = torch.Generator().manual_seed(cfg.seed)
    rows = []
@@ -933,6 +963,36 @@ def main(cfg: Config) -> int:
            step_resid.append((g_keep @ vg / g_keep.norm().clamp_min(1e-12)).item())
            return g_keep

+        def _lora_routeV_grad_filter(info, n_rollouts: int) -> torch.Tensor:
+            # LoRA-frozen-B routeV: decide in the r-bottleneck g_h = B^T δ_y, split A.grad.
+            # A.grad and A_hack.grad are identical pre-routing (shared frozen B), so we
+            # just carve A.grad [r, d_in] into kept (-> A) and routed (-> A_hack) by each
+            # rollout's bottleneck cosine to v_grad. No per-axis reliability gate (the
+            # whole A.grad is a single autograd tensor, not a per-axis diagonal).
+            layer = info["layer"]
+            full = info["delta_S"].grad                                   # A.grad [r, d_in]
+            r, d_in = full.shape
+            g_h = layer._lora_h.grad.reshape(n_rollouts, -1, r).float()   # [G, s, r] bottleneck grad
+            x_ = layer._lora_x.reshape(n_rollouts, -1, d_in).float()      # [G, s, d_in] cached input
+            vg = v_grad[name]                                             # [r] unit, hack-ward
+            g_roll = g_h.sum(1)                                          # [G, r] per-rollout
+            cos_b = (g_roll @ vg) / g_roll.norm(dim=1).clamp_min(1e-12)   # [G]
+            lower, upper = route_band[name]
+            band = max(upper - lower, 1e-6)
+            f = ((cos_b - lower) / band).clamp(0.0, 1.0)                  # [G]
+            # routed contribution to A.grad: Σ_b f_b Σ_t g_h[b,t] ⊗ x[b,t]
+            routed = torch.einsum("gsr,gsd,g->rd", g_h, x_, f).to(full.dtype)   # [r, d_in]
+            step_flagged.append(f.mean().item())
+            step_tau.append(cos_b.median().item())
+            step_hkgap.append(upper - lower)
+            step_grad_hack[name] = (step_grad_hack[name] + routed.detach().clone()
+                                    if name in step_grad_hack else routed.detach().clone())
+            g_keep = full - routed
+            # resid: kept-grad bottleneck alignment with v_grad (mirrors AntiPaSTO's resid)
+            g_keep_roll = ((1.0 - f).unsqueeze(1) * g_roll).sum(0)       # [r]
+            step_resid.append((g_keep_roll @ vg / g_keep_roll.norm().clamp_min(1e-12)).item())
+            return g_keep
+
        # Split backward into student/teacher only every cos_pre_split_every steps.
        # On split steps: 2 backwards per prompt, populates step_grad_s/_t.
        # On skipped steps: 1 combined backward, step_grad_s/_t stay empty and
@@ -971,14 +1031,10 @@ def main(cfg: Config) -> int:
            _tg = time.perf_counter()
            teacher_sample: list[dict] | None = None
            pool_rows = teacher_pool.get(prob["problem_id"]) if teacher_pool else None
-            if teacher_pool and G_t > 0 and not pool_rows and cfg.teacher_modes is None:
-                # Sparse-pool skip: prompt uncached -> skip the whole prompt;
-                # falling back to student-only would break the student-vs-teacher
-                # comparison the normal mixed-pool run is designed to measure.
-                # SUPPRESSED under teacher_modes (A5): a held-out-mode prompt has no
-                # teacher demos BY DESIGN and must train on-policy (falls to else).
-                n_skipped += 1
-                continue
+            # Uncovered prompt (pool_rows is None) -> train student-only (falls to the
+            # else below). We deliberately do NOT skip: the student must learn the hack
+            # on the whole env, not only the few seeded prompts. Teacher mix happens only
+            # where the pool covers the prompt.
            if pool_rows and G_t > 0:
                # Mixed-pool: G_s live student + G_t cached teacher rollouts.
                # G_t==0 (mix=0 no-teacher ablation) falls through to the student-only
@@ -1247,7 +1303,8 @@ def main(cfg: Config) -> int:
                    # v_grad against the pair-calibrated band, park the routed fraction in
                    # δS_hack (via step_grad_hack in the filter).
                    if is_routeV:
-                        g = _routeV_grad_filter(info, merged.shape[0])
+                        g = (_lora_routeV_grad_filter(info, merged.shape[0]) if is_lora
+                             else _routeV_grad_filter(info, merged.shape[0]))
                    step_grad_s[name] = (step_grad_s[name] + g.detach().clone()
                                         if name in step_grad_s
                                         else g.detach().clone())
@@ -1500,6 +1557,25 @@ def main(cfg: Config) -> int:
                f"step {step} VAL-eval (n={ev_dp['n']}): train/knob-on hack={ev_tr['hack']:.3f} "
                f"solve={ev_tr['solve']:.3f} | deploy/knob-off hack={hack_deploy:.3f} "
                f"solve={solve_deploy:.3f}.  SHOULD: {should}")
+            # Load-bearing gate: at step 0 the adapter is identity (base model). If the
+            # base already solves ~everything on the eval set, there is no room to hack
+            # (hack = channel AND gt_fail), so the curve can NEVER show suppression and
+            # the run is wasted. This is the famous-low-id memorization bug (#221): first-N
+            # by id picks LeetCode #3/#7/#10 which Qwen has memorized. Fixed by shuffle=True
+            # on the eval load; assert it stays fixed.
+            if step == 0 and ev_tr["solve"] >= 0.9:
+                # WARN (not halt): high base-solve means little legit-solve headroom, but the
+                # hack can still emerge if RL induces LAZY-hacking (weak tests + throwaway soln
+                # -> gt fails -> exploited) on problems the model COULD solve -- the easier path
+                # to the same reward. So high base-solve does NOT prove the metric is dead; only
+                # a flat val-hack curve while TRAIN hack is high does. Watch the curve. If it
+                # stays ~0, the model is too strong for this set (need a weaker base or a hack
+                # that pays more than solving). This is the famous-low-id bug's deeper cousin (#221).
+                logger.warning(
+                    f"step-0 base-model solve={ev_tr['solve']:.3f} >= 0.9 on the held-out val: "
+                    f"little legit-solve headroom. Hack metric is only alive if val hack RISES "
+                    f"during training (lazy-hacking solvable problems); if it stays ~0 while train "
+                    f"hacks, the model is too strong for this benchmark.")

        rewards_t = torch.tensor(agg_rew, dtype=torch.float32) if agg_rew else torch.zeros(1)
        rew_mean = rewards_t.mean().item()