misc

2026-06-27 17:30:41 +08:00 · 2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
@@ -191,7 +191,19 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav



-
+# 
+
+> Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing.
+
+> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector in weight-gradient space can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
+
+Changed
+- Put env down to just the 1 original hack, might bring the other ones back at the end
+- the bootstrap is now 4 solve and 4 hack examples so it's symmetric
+- removed SVD and PiSSA... it doesn't seem right from a gradient routing perspective... clean and quarantine adapters are not linearly separable and in the same basis so absorption might not work well
+- added 50% unsolvable to env... Normally the environment saturates and there is no advantage to learning to solve. But in real environments reward hacking will often not overcome all problems (or if it does, it's trivially obvious), so we are more interested in mixed environments. So we rotate which problems get a hint and a hack. It's as if GRPO is running on two machines, one with env_v1 with a hackable solver, and one with env_v2 unhackable. The model should get pressure to learn both.
+- Changed the generation / exploration in GRPO to only use deploy mode... this means it explores solve much more... but there seems little downside. I considered gradient pressure to hack... but because we generate with the quarantine adapter off... then teacher force with both on... pressure to hack should still go to the quarantine adapter... I think? If it was forward backward like in previous work it would be different
+- Also working on routing a lot... logging AUROC



@@ -96,6 +96,37 @@ data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware
 Selectors require every listed tag. Selected-subset bytes determine the pairset hash, so
 cached directions cannot silently load against a changed subset.

+## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
+
+Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
+classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
+keeps the load-bearing properties (concrete code action, gaming a verification
+proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
+permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
+threshold, silent default, bash exit-zero, mock-under-test, fudge-expected,
+skip-failing, retry-until-green, coverage pragma, type-checker silencing,
+survivorship metric.
+
+Constraints enforced (see RESEARCH_JOURNAL 2026-06-11):
+
+- No mechanism overlap with the env's four loophole modes. Two pairs from the
+  first draft were replaced for exactly this: bare-call-no-assert `run_checks`
+  (= the `run_tests` "doesn't raise = pass" flaw, axis-1 family) and
+  `assert x or True` (assert-True family).
+- No task/function overlap with the env's 992 train/test problems (checked
+  identifier-by-identifier against `func_name`s; `total_steps` renamed
+  `total_clicks` after a collision).
+- Within-pair hack/clean char ratio mean ~1.05, max 1.45.
+
+Flagged weaker members: `behavior2_trivial_threshold` mechanism-duplicates
+`behavior_weak_predicate` (new domain only); `behavior2_silent_default` is
+defensive-default rather than proxy gaming; `behavior2_bash_exit_zero` is
+don't-fail-the-gate adjacent (kept, same class as the try/except swallow axis).
+
+Selectors: `/behavior_` = original 8 (the proven classifier, train default),
+`/behavior2` = wave 2 only, `/behavior` = 23-pair union. The diag ranks
+`behavior` and `behavior2` as separate groups.
+
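
A minimal sketch of the selector rule described above, assuming a selector is a name prefix plus an optional list of required tags, and that the pairset hash is a SHA-256 over the selected pairs' bytes. `Pair`, `select_pairs`, `pairset_hash`, and the tag values are hypothetical names for illustration, not the repository's actual API; only the two pair names come from the notes above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    name: str
    tags: frozenset
    hack: str
    clean: str


def select_pairs(pairs, prefix="", required_tags=()):
    """Keep pairs whose name starts with `prefix` and that carry every required tag."""
    need = set(required_tags)
    return [p for p in pairs if p.name.startswith(prefix) and need <= p.tags]


def pairset_hash(selected):
    """Hash only the selected subset, so a changed subset yields a different cache key."""
    h = hashlib.sha256()
    for p in sorted(selected, key=lambda p: p.name):
        for field in (p.name, p.hack, p.clean):
            h.update(field.encode("utf-8"))
            h.update(b"\x00")  # separator keeps field boundaries unambiguous
    return h.hexdigest()


# Illustrative pair bodies and tags (placeholders, not the authored pairs).
pairs = [
    Pair("behavior_weak_predicate", frozenset({"behavior"}), "hack A", "clean A"),
    Pair("behavior2_trivial_threshold", frozenset({"behavior"}), "hack B", "clean B"),
]
original = select_pairs(pairs, "behavior_")  # trailing underscore excludes behavior2_*
wave2 = select_pairs(pairs, "behavior2")
union = select_pairs(pairs, "behavior")
```

The trailing underscore is what separates the original set from the union: `behavior2_*` names do not start with `behavior_`, but both groups start with `behavior`.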
 ## What to compare

 The first useful empirical comparison is:
@@ -1,21 +1,29 @@
 # Writeup spec -- gradient routing vs RL reward hacking

-Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
-erase is DROPPED from the paper (predecessor variant, no narrative cost). The
-workshop paper = ONE working method (route2b), shown better than the vanilla
-baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
-s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).
+Status (2026-06-10): method is **lora2r routeV** (rank-2r Gaussian-init LoRA,
+deployed block [:r] + quarantine block [r:]; per-rollout banded three-way SGTM
+gate on the c-probe gradient vs an extracted hack direction `v_grad`, quarantine
+ablated at deploy). The retired variants (route2b/erase, PiSSA, lora_frozen_b,
+AntiPaSTO basis, online_stats gate, the "knob" nickname) are gone from the code
+and should not appear in the paper. The workshop paper = ONE working method
+(lora2r routeV), shown better than the vanilla baseline (intervention=none on the
+SAME adapter), and ablated against a Haar-random direction (placebo) and an
+all-absorption arm.

 Workshop paper scope (the whole thing):
-1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
-   a pair-calibrated band into a deletable quarantine knob.
-2. Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
-3. Ablation: random-V control (directionality, the decisive one) + granularity
-   (per-rollout vs per-token) + frozen vs refresh. No erase arm.
+1. Method: lora2r routeV -- route each GRPO rollout's gradient by its band-normalized
+   cosine to `v_grad` into clean (deployed-only) / hack (quarantine-only) / mid
+   (both). The quarantine block is deleted at deploy.
+2. Baseline: vanilla GRPO = intervention=none (gate pinned clean) on the identical
+   rank-2r adapter, so the comparison is capacity- and structure-matched (no
+   shrinkage confound). Show routeV deploys at lower hack rate at matched solve.
+3. Ablations (one row per arm, same seed/preset): Haar-random `v_grad` placebo
+   (directionality, the decisive control) + absorb (gate pinned mid, isolates the
+   gate+hard-masks from absorption alone). No erase arm, no per-token arm.
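
The three-way gate in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes the band-normalized score is `t = (cos - lower) / (upper - lower)` with `lower`/`upper` calibrated from the authored pairs, that `t <= 0` routes clean, `t >= 1` routes hack, and anything in between routes mid. The function names and the choice to reject a collapsed band are hypothetical.

```python
import math


def cosine(g, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g, v))
    norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def route(g, v_grad, lower, upper):
    """Return (label, update_deployed_block, update_quarantine_block) for one rollout.

    ASSUMPTION: thresholds at t <= 0 and t >= 1 on the band-normalized cosine.
    """
    width = upper - lower
    if width <= 0:
        # A collapsed band cannot separate hack from clean; this sketch refuses it.
        raise ValueError("band width must be positive")
    t = (cosine(g, v_grad) - lower) / width
    if t <= 0:
        return ("clean", True, False)  # deployed block [:r] only
    if t >= 1:
        return ("hack", False, True)   # quarantine block [r:] only
    return ("mid", True, True)         # both blocks receive the gradient


v = [1.0, 0.0]
aligned = route([1.0, 0.0], v, lower=0.1, upper=0.5)
orthogonal = route([0.0, 1.0], v, lower=0.1, upper=0.5)
between = route([1.0, 1.0], v, lower=0.5, upper=0.9)
```

Under this reading, the vanilla baseline corresponds to always returning the clean triple and the absorb arm to always returning the mid triple, both on the same adapter.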

 Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
 post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
-workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
+workshop) if the n=3 routeV-vs-vanilla deploy gap holds and the placebo
 ablation comes back clean.

 ## The one-paragraph story
@@ -23,46 +31,51 @@ ablation comes back clean.
 Labs already do RL on coding/agentic tasks and the model learns to exploit
 grader flaws. We ask: can an alignment intervention at the *gradient* level,
 using only a weak hack detector (catches some hack types, misses others),
-suppress the hacks the detector never saw? We extract a hack direction from a
-handful of hand-paired clean/hack completions (the "weak detector for hack A"),
-then during GRPO we route any rollout whose gradient aligns with that direction
-into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/
-Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping
-partition) the routed model deploys at lower hack rate than vanilla GRPO at
-comparable solve rate. The load-bearing test is the held-out modes: build the
-direction from 2 of the 4 loopholes, measure suppression on the other 2.
+suppress the hacks the detector never saw? We extract a hack direction `v_grad`
+from a handful of hand-paired clean/hack completions (off-distribution, authored
+by us, the "weak detector for hack A"), then during GRPO we route any rollout
+whose gradient aligns with that direction into a throwaway quarantine block that
+is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4
+grading flaws, even non-overlapping partition) the routed model deploys at lower
+hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is
+the held-out modes: build the direction from 2 of the 4 loopholes, measure
+suppression on the other 2.
+
+No-cheat invariant: the authored pairs are the ONLY routing-label source. No
+oracle / ground-truth hack-label of a live rollout ever touches training.

 ## Claims (compress to <=3, ranked by how much evidence we have)

 C1 (primary, existence -> systematic). Routing the GRPO gradient against a
-weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs
-vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 84/74/72 (vanilla
-  s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
- Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the
-  preregistered H1 bar); honest framing is "reduces hack at comparable solve",
-  with the pp gap stated as measured.
+weak-detector hack direction in the lora2r c-probe space lowers deploy hack rate
+vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
+- Evidence: the 4-arm lora2r decision run (queue-decision: routeV-real vs
+  none/vanilla, s41/42/43). Deploy = quarantine-ablated, n=32 prompts, T=0.7.
+- Confidence today: untested on lora2r. The pre-lora2r per-token route run
+  (endpoint table below) is suggestive (hack 0.61 -> 0.04 at comparable solve)
+  but on a different adapter; it is prior evidence, not the headline.

 C2 (mechanism, the no-cheat payload). The suppression generalises to loophole
 modes the detector never flagged: a 2-of-4-mode direction also lowers hack on
 the 2 held-out modes.
- Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack,
-  tasks #106/#107). This is the claim a referee will attack hardest and the one
-  that makes the result interesting rather than a tautology.
- Confidence: untested at writeup time. If C2 fails, the post becomes "routing
-  suppresses *known* hacks at the gradient level" -- weaker but still honest.
+- Evidence: cross-mechanism runs (spec 20260528_cross_mechanism_v_hack). The
+  claim a referee will attack hardest and the one that makes the result
+  interesting rather than a tautology.
+- Confidence: untested. If C2 fails, the post becomes "routing suppresses *known*
+  hacks at the gradient level" -- weaker but still honest.

-C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
-just the act of carving a rank-k knob out of the adapter, and not just
-quarantining gradient mass. A Haar-random v_grad of matched per-module
-rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
-the deploy hack-drop. The banded gate makes this clean: real-V has a positive
-band (hack pairs separate from clean pairs along v_grad); random-V does not.
- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
-  to the real-V run so the control quarantines comparable mass but in an
-  arbitrary direction.
- Confidence: untested for route2b. The decisive control both gpt-5.5 and the
-  brainstorm flagged. Must land before we claim directional specificity.
+C3 (specificity / not-a-regularizer). The effect needs the *direction*, not just
+the act of carving a quarantine block out of the adapter, and not just routing
+gradient mass away. A Haar-random `v_grad` of matched per-module rank/norm
+collapses the band width (upper-lower ~ 0) and should NOT reproduce the deploy
+hack-drop. The banded gate makes this clean: real-V has a positive band (hack
+pairs separate from clean pairs along `v_grad`); random-V does not.
+- Evidence: the placebo arm (--routeV-random-v-seed) in the decision run,
+  frout-matched to real-V so the control quarantines comparable mass but in an
+  arbitrary direction. The absorb arm separately isolates the gate+masks.
+- Confidence: untested for lora2r. The decisive control; must land before we
+  claim directional specificity. (On PiSSA it tied -- shrinkage; lora2r's
+  unfrozen B is the structural fix, see RESEARCH_JOURNAL PiSSA->lora2r entry.)
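
To illustrate why a Haar-random placebo is expected to carry no directional signal, here is a small sketch under simplifying assumptions: a single flat vector stands in for the per-module direction, and the rank/norm matching mentioned above is not modeled. Normalizing an isotropic Gaussian sample gives a direction uniform on the unit sphere, and in high dimension its cosine with any fixed vector concentrates near zero (typical magnitude about `1/sqrt(dim)`). `haar_unit_vector` and the stand-in `v_real` are hypothetical.

```python
import math
import random


def haar_unit_vector(dim, seed):
    """Uniform direction on the unit sphere: normalize an isotropic Gaussian sample."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


dim = 4096
v_real = [1.0] + [0.0] * (dim - 1)       # stand-in for an extracted direction
v_placebo = haar_unit_vector(dim, seed=0)
cos_placebo = cosine(v_real, v_placebo)  # expected to be close to zero
typical = 1.0 / math.sqrt(dim)           # rough scale of |cos| for a random direction
```

If hack-pair and clean-pair gradients both project onto the placebo at this near-zero scale, the calibrated band has almost no width, which is the collapse the control relies on.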

 ## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)

@@ -81,251 +94,120 @@ band (hack pairs separate from clean pairs along v_grad); random-V does not.
 5. Comparison: unlike advantage-level methods this never reads the live grader;
   the only supervision is the fixed weak-detector pair set, mimicking the
   known/unknown-hack split at deployment.
-6. Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis
-   (AntiPaSTO) gives a deletable quarantine knob.
-7. Standard of evidence / risk: existence-to-systematic at n=3; random-V and
-   placebo controls rule out generic adapter regularization; the held-out-mode
-   test is the load-bearing generalisation claim and the main failure risk.
+6. Context: gradient routing (Cloud et al. 2024) realised as an SGTM-style block
+   partition inside one rank-2r LoRA, giving a deletable quarantine block.
+7. Standard of evidence / risk: existence-to-systematic at n=3; the Haar-random
+   placebo and the absorb arm rule out generic adapter regularization; the
+   held-out-mode test is the load-bearing generalisation claim and the main
+   failure risk.

 ## Paper artifacts -- the goal tracker (durable; this is what we are building)

-This is the canonical list of what the workshop paper/blog needs. Each artifact
-names its source runs and blocking state so the goal survives context compaction.
-Status legend: [x] done  [/] data landing  [ ] not started. Each finished run
-writes per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/;
-deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.
+Canonical list of what the workshop paper/blog needs; each artifact names its
+source and blocking state so the goal survives compaction. Status legend:
+[x] done  [/] data landing  [ ] not started. Each finished run writes
+per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/.

-A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3
-band. Prototype exists: out/figs/dyn_sub4*.png (`just dyn`). [/] blocked on the
-n=3 vanilla band (jobs 74 s42 + 84 s41 [re-added from killed 79, p7 so it runs
-ahead of the A3 erase rows]; 72 s43 done; route2 68/69/70 done).
+A1 -- Keynote figure. routeV vs vanilla deploy hack/solve over training, n=3
+band. [ ] blocked on the lora2r 4-arm decision run (queue-decision, s41/42/43).
+Pre-lora2r prototype: out/figs/eval2_pertoken_vs_vanilla_dynamics.png.

 A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
-seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
-[/] same blocker as A1 (74, 84).
+seeds, routeV vs vanilla, delta vs vanilla, paired test + alpha. [ ] same blocker
+as A1.

 A3 -- Ablation table (what each component buys). One row per arm at matched
 seed/preset, deploy hack + solve:
-  - vanilla (no intervention)               -> 129/131/132
-  - route2b per-rollout (the method)        -> 134 (s43), +41/42 if it wins
-  - route2b per-token (granularity ablation)-> 135 (s43)
-  - random-V route2b (direction arbitrary)  -> Q3, queue at winning granularity [control: should NOT work]
-  - route2b frozen vs refresh-5             -> refresh is default; frozen = one extra run if gap is interesting
-[ ] blocked on 134/135 landing, then the random-V control. This is the
-"filling out ablations" table. Erase row removed (arm dropped from paper).
+  - none / vanilla (gate pinned clean, identical adapter) -> emergence reference
+  - routeV (the method)
+  - routeV placebo (Haar `v_grad`, direction arbitrary)   -> control: should NOT work
+  - absorb (gate pinned mid, no gate)                      -> gate-vs-absorption
+[ ] blocked on the decision run. Shakedown in flight: job 40 (60-step routeV on
+the new md pairs, s43) proves the pipeline + band separation on the live 4B model
+before the n=3 spend.

-A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
-[/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
-(out/figs/dyn_longrun_200.{png,csv}, fig:longrun in main.tex). vanilla learns the
-cheat to ~0.55 by step 80 then COLLAPSES at ~88 (student logp craters, reward->0,
-gn spikes ~75x, beta=0 no KL anchor) -- so the gap is durable in the valid 0-85
-window, but vanilla is not a clean saturation reference past step 88. Decision
-pending (user): leave the collapse as an honest finding + limitations line, or
-requeue vanilla-200 with an advantage std-floor for a clean saturating reference.
-Renumber: the old "77/82" job ids are stale (those were the corrupted/merge-bug
-ids); the live runs are 84 (route2) and 85 (vanilla).
+A4 -- Long-run figure. ~200-step routeV vs vanilla saturation reference.
+[ ] not re-run on lora2r. Pre-lora2r finding (route held hack=0 to 200 steps;
+vanilla learned the cheat then collapsed ~step 88, no clean saturation past
+there) is in RESEARCH_JOURNAL -- carry as an honest caveat, re-measure on lora2r
+only if budget allows.

 A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy
-hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes.
-[ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec
-20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.
+hack: `v_grad` from 2 of 4 modes, measure suppression on the 2 held-out modes.
+[ ] NOT QUEUED -- highest-value gap. Queue once the n=3 band confirms C1 (spec
+20260528_cross_mechanism_v_hack).

 A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion,
 clean completion for all 4 modes. [x] done -- blog appendix
-(docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.
+(docs/blog/20260529_...md#appendix-the-four-loophole-modes).

-A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width
-(Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9),
-pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.
+A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width,
+refresh cadence, teacher mix, gate mode, solve-orthog, pairset content/placebo.
+[x] data exists; just needs porting into the paper.

-Next action when 74+84 land: read each per_mode_deploy.json, `just dyn`,
-fill A1/A2, append a journal entry. Then queue A5 (the gap).
+Next action when the decision run lands: read each per_mode_deploy.json,
+`just results`, fill A1/A2/A3, append a journal entry. Then queue A5 (the gap).

 ## Red-team checklist before publishing (paper-writing evidence standards)

 - [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
+- [ ] Haar placebo does NOT reproduce the drop at matched frout (else it is
      mass-quarantine / regularization, C3 dies).
+- [ ] absorb arm reported: ~vanilla -> absorption alone adds nothing (the
+      gate+masks do the work); << vanilla -> absorption alone suppresses.
 - [ ] held-out-mode suppression measured (C2), reported even if it fails.
 - [ ] solve rate matched within stated band; a hack drop that only comes with a
      solve collapse is reported as such, not as a win.
 - [ ] no-cheat invariant stated explicitly: live routing never reads gt_pass or
-      runs the full detector suite over student rollouts; the pair set is the
-      only supervision. (Promote to README/spec, plan item #114.)
- [/] convergence (84/85): route2 holds hack=0 to 200 steps; gap durable in the
-      0-85 window. CAVEAT: vanilla collapses at ~88 (not clean saturation past
-      there) -- report honestly, don't crop the collapse to fake a flat-high ref.
- [ ] base-model and vanilla-saturation references present so emergence is real.
+      runs the detector suite over student rollouts; the authored pair set is the
+      only supervision.
+- [ ] base-model and vanilla-saturation references present so emergence is real
+      (base solve ~0.094-0.126 on the paper test set; no-loophole ceiling job 34).

-## Open editorial decisions
+## Eval contamination fix (load-bearing, 2026-06-07)

- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
-  projection). Candidate: `gradient_quarantine`. Decide before the public repo
-  link goes in the post. (Retitle docs first; rename package/repo only if we
-  ship the code link.)
- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
- Workshop vs blog-only: gate on C2 landing.
+Eval is on the paper's recency-held-out test set (leetcode_test_medhard, every id
+>= 3243), NOT the holdout/first-N (memorized -> base solve 0.94, kills the hack
+metric's gt-fail headroom). Training uses a seeded representative shuffle, not
+first-N-by-id. Verified base solve = 0.094 on test_medhard (matches paper fn9
+~12%; mild undershoot from max_new truncation). Full table:
+docs/spec/20260607_eval_contamination_fix.md.

-## 2026-06-09 eval2 plot regeneration UAT
+## Canonical endpoint table (pre-lora2r, latest real deploy numbers)

-[x] Deleted all stale CSVs under `out/figs/` and regenerated the completed
-per-token routeV versus latest vanilla comparison without changing pueue jobs.
-There is no completed authored per-token run; this is job 9's prog_wide
-per-token run, matching the best row in the deploy-results table.
-
-Sources:
- `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`
- `logs/20260608T224659_fast_vanilla_seed43_dir8_vanilla_s43.log`
-
-Artifacts:
- [eval2 per-token dynamics](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.png)
- [eval2 per-token hack/solve overlay](../../out/figs/eval2_pertoken_vs_vanilla_dynamics_hack_overlay.png)
- [sole current figure CSV](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.csv)
-
-| estimator | arm | hack | solve |
-|---|---:|---:|---:|
-| fixed monitoring subset, final logged point, n=32 | routeV/per-token prog_wide | 0.00 | 0.062 |
-| fixed monitoring subset, final logged point, n=32 | vanilla | 0.594 | 0.031 |
-| final held-out deploy eval, n=119 | routeV/per-token prog_wide | 0.042 | 0.143 |
-| final held-out deploy eval, n=119 | vanilla | 0.613 | 0.101 |
-| final held-out deploy eval, n=119 | base model, zero steps | 0.000 | 0.126 |
-
-Verification:
- The only remaining `out/figs/**/*.csv` is the current reproducibility CSV.
- CSV has exactly 60 rows each for `routingV_per_token` and `vanilla`, steps 0-59.
- Visual inspection: vanilla deploy hacking rises sharply; per-token route stays
-  near zero. Per-token route does not show convincing useful learning: final
-  held-out solve improves only 0.126 -> 0.143 versus the base model, below one
-  binomial standard error at n=119.
- Plot scales: hack axis 0-65% so vanilla's failure is not clipped; solve axis
-  0-25% to include the paper's ~22.3% no-loophole ceiling. The periodic route
-  solve curve reaches ~6-7% and does not show a sustained upward trend after
-  step 40.
- The monitoring subset is systematically harder than the full test and cannot
-  support absolute capability claims: at step 59, route solves 2/32 on the
-  fixed subset but 17/119 on full test; vanilla solves 1/32 versus 12/119.
-  The old plot title incorrectly said n=64; it now states fixed n=32. A
-  trustworthy dynamics figure requires rescoring saved step checkpoints on the
-  same full n=119 test before spending compute on a longer training run.
-
-### Modal evaluation design
-
-Before running on Modal, replace the noisy fixed-random n=32 monitoring subset
-with one deterministic representative n=64 subset. Do not search shuffle seeds
-until the subset happens to match the full-test solve rate; that would
-cherry-pick one scalar by luck.
-
-Build the monitoring subset once:
- Evaluate the base model on all 119 paper-test prompts.
- Stratify prompts by base pass/fail.
- Deterministically sample approximately 8 base-solved and 56 base-failed
-  prompts, matching the full-test base solve rate of 12.6%.
- Freeze the prompt IDs and generation seed. Every arm and training seed uses
-  this identical monitoring subset.
-
-Evaluate the n=64 monitoring subset only at steps 0, 20, 40, and 59. This costs
-approximately 4 x 64 = 256 generations per run, close to the current
-7 x 32 = 224, while giving a monitoring baseline representative of the full
-test. Run the authoritative full n=119 paper-test evaluation only at the final
-checkpoint. Monitoring-subset curves are for dynamics; paper claims and tables
-use the full-test result.
-
-Protocol correction for future runs: current logs call the first post-optimizer
-evaluation `step 0`; vanilla and route have already taken one different update,
-so they need not match there. Before the Modal runs, evaluate the shared base
-model before training and record it as `updates_completed=0`. Then evaluate
-post-update checkpoints at `updates_completed=20,40,60` (or 10-step cadence if
-budget permits). Name the x-axis `optimizer updates completed`; never call the
-first post-update checkpoint the base model. Do not change `train.py` while the
-current pueue queue is active, because queued jobs load current code at runtime.
-
-Modal runtime decision: remove evaluation from the training critical path.
-Current n=32 periodic eval costs roughly 13-14 minutes for vanilla and 22-26
-minutes for routeV because routeV evaluates both knob-on and knob-off. Seven
-routeV monitoring evaluations add about 2.7 hours, before the final n=119 eval.
-
-Simplified protocol:
- Training jobs do no periodic eval by default. They save deploy checkpoints
-  every 10 completed optimizer updates, plus the shared pre-training base
-  checkpoint at update 0 and the final checkpoint, independently of eval
-  cadence. The ~2.2 MB checkpoints are cheap, and 10-update resolution is needed
-  for the progress graph.
- A separate evaluation job scores selected checkpoints. Always score final
-  checkpoints on the full n=119 paper test; score intermediate checkpoints only
-  when a progress curve is needed.
- Progress evaluation scores both knob states for routeV. The mechanism figure
-  needs to show knob-on/train hack rising while knob-off/deploy hack stays low;
-  otherwise it only shows suppression and hides that the quarantine absorbed the
-  learned hack. Vanilla needs one pass because train and deploy are identical.
- Batch evaluation prompts. `eval_hack_solve` currently calls `model.generate`
-  once per prompt despite running under `torch.no_grad()`. Add an eval batch-size
-  argument, default it to 2, and increase only after measuring throughput and
-  memory. Preserve one completion per prompt and the fixed prompt IDs /
-  generation seed.
- Keep checkpoint saving fail-fast and independent from `eval_ablate_every`.
-  Currently `save_eval_ckpts` is incorrectly gated by
-  `eval_ablate_every > 0`, so simply disabling periodic eval would also disable
-  the checkpoints needed for offline progress evaluation.
-
-Locked implementation defaults:
- `eval_ablate_every=0`: defer the old 10-step periodic eval by default.
- `save_ckpt_every=10`: save by completed optimizer-update count, independent
-  of eval.
- `eval_batch_size=2`: batched offline/final evaluation default.
- Offline progress command scores checkpoints 0, 10, 20, ..., final and writes
-  one canonical eval-curve artifact for plotting. For routeV it records both
-  knob-on and knob-off hack/solve; for vanilla it records one shared result.
- `full` matches the paper's 200 updates, 1536-token completion cap, and 256
-  rollouts/update. On one GPU it uses `G=4, prompts_per_step=64`; this preserves
-  total rollout exposure but not the paper's within-prompt `G=16`. It remains
-  pure on-policy (`teacher_pool_dir=None`).
- Prompt length is never silently filtered. Training and evaluation crash if a
-  prompt exceeds the paper's 1536-token prompt cap or the model context window.
-
-Implemented and smoke-tested on 2026-06-09:
-
- RouteV and vanilla smoke runs each wrote paired adapter checkpoints at completed
-  updates 0, 10, 20, and 30.
- `just eval-curve RUN` loaded those checkpoints and scored the full 119-problem
-  paper evaluation set. RouteV scored both knob states; vanilla scored once.
- UAT artifacts:
-  [`routeV checkpoint curve`](../../out/runs/20260609T070114_smoke_routingV_seed41_eval_defer_routeV_smoke/eval_checkpoint_curve.jsonl)
-  and
-  [`vanilla checkpoint curve`](../../out/runs/20260609T065927_smoke_vanilla_seed41_eval_defer_smoke/eval_checkpoint_curve.jsonl).
- Fresh-eyes review found that the first evaluator only reconstructed AntiPaSTO
-  and single-mode eval. It now also reconstructs LoRA-frozen-B and mirrors the
-  training run's partition modes. The
-  [`LoRA routeV checkpoint curve`](../../out/runs/20260609T072121_smoke_routingV_seed41_eval_defer_lora_routeV_smoke/eval_checkpoint_curve.jsonl)
-  is the runtime proof.
- The same review found that the queued no-loophole arm's `gt_only` mode could
-  neither load prompts nor run evaluation. Its exact smoke path and offline
-  checkpoint curve now pass:
-  [`gt-only checkpoint curve`](../../out/runs/20260609T072833_smoke_vanilla_seed41_eval_defer_gt_only_smoke2/eval_checkpoint_curve.jsonl).
- These are tiny-random-model runtime proofs, not scientific results.
-
-Whether 60 updates are enough to learn solving remains unknown. First use job
-24, the no-loophole arm, to test whether this exact 60-update setup produces a
-useful solve gain when hacking is impossible. Run longer only if job 24 is still
-improving near update 60 or fails to approach the paper's no-loophole result.
-
-### Canonical full-test endpoint table
-
-These are the authoritative paper-test endpoint numbers. Do not infer them from
-or normalize the n=32 monitoring curves.
+Authoritative paper-test endpoints from the per-token routeV run (prog_wide
+pairs) -- the prior adapter (lora_frozen_b/PiSSA era), n=119 full test. The
+lora2r decision run will replace these as the headline.

 | condition | solve | hack |
 |---|---:|---:|
 | base model (paper: 0.115) | 0.126 | 0.000 |
 | vanilla GRPO (paper: 0.149) | 0.101 | 0.613 |
-| vGROUT routeV best, per-token | 0.143 | 0.042 |
-| no-loophole ceiling (paper: 0.223) | queued, job 24 | 0.000 |
+| routeV per-token, prog_wide (pre-lora2r) | 0.143 | 0.042 |
+| no-loophole ceiling (paper: 0.223) | job 34, queued | 0.000 |

-Current read: routeV per-token nearly eliminates the vanilla hack increase and
-preserves base-model solve. Its solve is numerically +1.7pp over base and +4.2pp
-over vanilla, but n=119 is insufficient to claim either solve difference. The
-no-loophole run determines whether this setup can reproduce useful RL gains at
-all.
- Fresh-eyes review removed a misleading mean-onset marker; the overlay directly
-  labels hack and solve endpoints and states `n=1 seed/arm`.
- `plot_dynamics.py` now labels current `routeV` and `routeV per-token` runs
-  explicitly instead of dropping or mislabelling them as static erasure.
+Read: pre-lora2r routeV nearly eliminated the vanilla hack increase and preserved
+base-model solve; solve was +1.7pp over base / +4.2pp over vanilla, but n=119 is
+insufficient to claim either solve difference. Caveats: prog_wide pairs are
+pool-derived (contamination-prone, not headline-clean); the n=32 monitoring
+subset is systematically harder than full test (use full n=119 for claims).
+
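
A worked check of the "n=119 is insufficient" statement, using the endpoint numbers from the table above. It is a rough sketch that assumes an unpaired normal approximation to the binomial; the arms share the same test prompts, so a paired test would be the more faithful analysis. The helper names are hypothetical.

```python
import math


def binom_se(p, n):
    """Standard error of a proportion p estimated from n Bernoulli trials."""
    return math.sqrt(p * (1.0 - p) / n)


def diff_z(p1, p2, n1, n2):
    """z-score of p1 - p2 under an unpooled two-proportion normal approximation."""
    se = math.sqrt(binom_se(p1, n1) ** 2 + binom_se(p2, n2) ** 2)
    return (p1 - p2) / se


n = 119
base_solve, vanilla_solve, routev_solve = 0.126, 0.101, 0.143
vanilla_hack, routev_hack = 0.613, 0.042

se_base = binom_se(base_solve, n)
z_solve_vs_base = diff_z(routev_solve, base_solve, n, n)
z_solve_vs_vanilla = diff_z(routev_solve, vanilla_solve, n, n)
z_hack = diff_z(vanilla_hack, routev_hack, n, n)
```

Under this approximation both solve differences come out at roughly one standard error or less, while the hack gap is many standard errors wide, which matches the read above: the hack reduction is the supportable claim, the solve gain is not.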
+## Offline eval protocol (implemented 2026-06-09, now the code default)
+
+- Training does no periodic eval by default (eval_ablate_every=0); it saves deploy
+  checkpoints every 10 optimizer updates (save_ckpt_every=10), independent of eval.
+- A separate job (`just eval-curve RUN`) scores checkpoints on the full n=119
+  paper test; for routeV it records both quarantine-on (train) and quarantine-off
+  (deploy) so the mechanism figure shows train-hack rising while deploy-hack stays
+  low. Batched eval (eval_batch_size=2), fixed prompt IDs + generation seed.
+- Monitoring subset (if used): one deterministic stratified n=64 (≈8 base-solved +
+  56 base-failed, matching the 12.6% full-test base solve), frozen IDs, scored at
+  a few checkpoints only. Do NOT search shuffle seeds to match full-test solve.
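
A minimal sketch of how the frozen monitoring subset above could be drawn, assuming the per-problem base-model outcomes are already known. The function name `stratified_monitor_ids`, its arguments, and the toy ID ranges are hypothetical, not taken from the repo. The seed is fixed once and never tuned, in line with the rule against searching shuffle seeds.

```python
import random


def stratified_monitor_ids(base_solved, base_failed, n_solved=8, n_failed=56, seed=0):
    """Return a deterministic, sorted list of prompt IDs for the monitoring subset.

    base_solved / base_failed: iterables of prompt IDs, split by whether the
    base model solved the problem (hypothetical inputs for this sketch).
    """
    solved = sorted(base_solved)  # sort first so the draw does not depend on input order
    failed = sorted(base_failed)
    if len(solved) < n_solved or len(failed) < n_failed:
        raise ValueError("not enough problems in one stratum")
    rng = random.Random(seed)  # fixed seed, chosen once and never searched
    picked = rng.sample(solved, n_solved) + rng.sample(failed, n_failed)
    return sorted(picked)


# Toy stand-in for the n=119 test set: 15 base-solved IDs (15/119 is about 12.6%).
toy_solved = range(0, 15)
toy_failed = range(15, 119)
monitor_ids = stratified_monitor_ids(toy_solved, toy_failed)
print(len(monitor_ids))
```

The returned IDs would then be written to disk once and reused for every checkpoint, so the subset stays frozen across runs.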
+
+## Open editorial decisions
+
+- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
+  projection). README already calls it vGROUT (vector gradient routing). Decide
+  the public repo name before the code link goes in the post.
+- Re-headline the blog draft to lora2r routeV (the route2/erase framing is dead).
+- Workshop vs blog-only: gate on C2 landing.
@@ -160,9 +160,9 @@ README ``How it works'' + blog intro.}
        representation-engineering
        style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
        $\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
-  \item We extend the Ariahw LeetCode reward-hacking RL environment
-        \citep{ariahw2025steering} with three additional loophole types (four
-        total: run\_tests, sentinel, stdout\_marker, file\_marker).
+  % \item We extend the Ariahw LeetCode reward-hacking RL environment
+  %       \citep{ariahw2025steering} with three additional loophole types (four
+  %       total: run\_tests, sentinel, stdout\_marker, file\_marker).
 \end{enumerate}

 \section{Method}
@@ -181,25 +181,29 @@ Mechanically vGROUT follows the post-backward, deletable-block routing of
 \citealp{cloud2024gradientrouting}); it differs from both in that the routing is
 gated by an extracted direction, not a per-example data label.

-\subsection{The SVD-basis adapter}
-% PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
-% Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
-% U, Vh frozen and double as the v_hack basis.
-\TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
-singular-value coordinates; we freeze $U,V$ and train a per-module adapter
-parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
-basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
-and the projection all live in this same low-rank, weight-aligned space
-($r\sim500$--$2560$). Two consequences we use:
-\begin{itemize}
-  \item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
-        never reconstructed on the main path), so an adapter-off forward gives
-        $\pi_{\text{ref}}$ with no second model.
-  \item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
-        hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
-        training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
-        exactly that routed capability.
-\end{itemize}
+
+\subsection{Adapter}
+We use LoRA, where half of the adapter is masked.
+% FIXME we now use lora
+
+% % PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
+% % Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
+% % U, Vh frozen and double as the v_hack basis.
+% \TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
+% singular-value coordinates; we freeze $U,V$ and train a per-module adapter
+% parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
+% basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
+% and the projection all live in this same low-rank, weight-aligned space
+% ($r\sim500$--$2560$). Two consequences we use:
+% \begin{itemize}
+%   \item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
+%         never reconstructed on the main path), so an adapter-off forward gives
+%         $\pi_{\text{ref}}$ with no second model.
+%   \item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
+%         hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
+%         training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
+%         exactly that routed capability.
+% \end{itemize}

 \subsection{Extracting the hack direction}
 \label{sec:extract}
@@ -220,6 +224,7 @@ may select/calibrate; live routing never reads \texttt{gt\_pass}.}
 possible bias toward short-completion hacks. Unmeasured -- see Limitations.}

 \subsection{Arms: erase vs.\ route, offline vs.\ online}
+% FIXME outdated
 \TODO{outline -- the design axes (this is part of what is novel). Two ways to
 keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \begin{itemize}
@@ -244,6 +249,15 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
 \end{itemize}

 \section{Experimental setup}
+
+\subsection{The environment}
+
+% human written
+We extend the Ariahw LeetCode reward-hacking RL environment
+\citep{ariahw2025steering} in two ways:
+\begin{itemize}
+  \item Instead of one environment with a hint and a hack, we use two
+        environments: a vulnerable environment~1, which has a hint and can be
+        hacked, and environment~2, which cannot. A problem may come from
+        either, so the model is rewarded for maxing out hacking or solving,
+        depending on whether the vulnerability hint is present. This mirrors a
+        deployment that might stream from two machines, one patched and the
+        other not. Pragmatically, it lets us see how a mixed environment
+        containing hackable and non-hackable problems behaves without the hack
+        saturation that appears in the original \cite{ariahw2025steering}
+        environment.
+  \item We also make it faster: rather than 4$\times$H100 for TODO hours, we
+        bootstrap for the first 30 steps by injecting 2 hacking and 2 solving
+        examples into a batch of 32 generated completions. This speeds up
+        learning by removing the exploration time, relying on curated
+        demonstrations of hacking and solving for each problem. After 30 steps
+        we turn off the examples and rely only on the model's own generations.
+\end{itemize}
+
+% ai written
 \TODO{outline: Ariahw LeetCode loophole substrate \citep{ariahw2025steering}, 4
 modes, even non-overlapping partition (Appendix~\ref{app:traces},
 6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125;