wassname
2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
+13 -1
@@ -191,7 +191,19 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav
#
> Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing.
> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector in weight-gradient space can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
Changed
- Put env down to just the 1 original hack, might bring the other ones back at the end
- the bootstrap is now 4 solve and 4 hack examples so it's symmetric
- removed SVD and PiSSA... it doesn't seem right from a gradient routing perspective... clean and quarantine adapters are not linearly separable and in the same basis so absorption might not work well
- added 50% unsolvable to env... Normally the environment saturates and there is no advantage to learning to solve. But in real environments reward hacking will often not overcome all problems (or if it does it's trivially obvious), so we are more interested in mixed environments. So we rotate which problems get a hint and a hack. It's as if the GRPO is running on two machines, one with env_v1 with a hackable solver, and one with env_v2 unhackable. The model should get pressure to learn both.
- Changed the generation / exploration in GRPO to only use deploy mode... this means it explores solve much more... but there seems little downside. I considered gradient pressure to hack... but because we generate with the quarantine adapter off... then teacher force with both on... pressure to hack should still go to the quarantine adapter... I think? If it was forward backward like in previous work it would be different
- Also working on routing a lot... logging AUROC
+31
@@ -96,6 +96,37 @@ data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware
Selectors require every listed tag. Selected-subset bytes determine the pairset hash, so
cached directions cannot silently load against a changed subset.
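
A minimal sketch of the idea above, assuming the pairset hash is a SHA-256 over the bytes of the selected pairs only. The names `select_pairs` and `pairset_hash` and the dict layout are hypothetical illustrations, not the repo's actual API.

```python
import hashlib


def select_pairs(pairs, required_tags):
    """Keep only pairs that carry every required tag (selector = logical AND)."""
    need = set(required_tags)
    return [p for p in pairs if need <= set(p["tags"])]


def pairset_hash(selected):
    """Hash the selected pairs' bytes, so a different subset gives a different key."""
    h = hashlib.sha256()
    for p in selected:
        for field in (p["name"], p["hack"], p["clean"]):
            data = field.encode("utf-8")
            # Length prefix keeps field boundaries unambiguous.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


# Toy pairs (hypothetical content).
pairs = [
    {"name": "a", "tags": ["behavior", "opportunity-aware"], "hack": "x", "clean": "y"},
    {"name": "b", "tags": ["behavior"], "hack": "x2", "clean": "y2"},
]
both = select_pairs(pairs, ["behavior", "opportunity-aware"])
wide = select_pairs(pairs, ["behavior"])
```

If the cache of extracted directions is keyed on this hash, editing or reselecting the pairs changes the key, so a stale direction is never loaded for the new subset.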
## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
keeps the load-bearing properties (concrete code action, gaming a verification
proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
threshold, silent default, bash exit-zero, mock-under-test, fudge-expected,
skip-failing, retry-until-green, coverage pragma, type-checker silencing,
survivorship metric.
Constraints enforced (see RESEARCH_JOURNAL 2026-06-11):
- No mechanism overlap with the env's four loophole modes. Two pairs from the
first draft were replaced for exactly this: bare-call-no-assert `run_checks`
(= the `run_tests` "doesn't raise = pass" flaw, axis-1 family) and
`assert x or True` (assert-True family).
- No task/function overlap with the env's 992 train/test problems (checked
identifier-by-identifier against `func_name`s; `total_steps` renamed
`total_clicks` after a collision).
- Within-pair hack/clean char ratio mean ~1.05, max 1.45.
Flagged weaker members: `behavior2_trivial_threshold` mechanism-duplicates
`behavior_weak_predicate` (new domain only); `behavior2_silent_default` is
defensive-default rather than proxy gaming; `behavior2_bash_exit_zero` is
don't-fail-the-gate adjacent (kept, same class as the try/except swallow axis).
Selectors: `/behavior_` = original 8 (the proven classifier, train default),
`/behavior2` = wave 2 only, `/behavior` = 23-pair union. The diag ranks
`behavior` and `behavior2` as separate groups.
## What to compare
The first useful empirical comparison is:
+136 -254
@@ -1,21 +1,29 @@
# Writeup spec -- gradient routing vs RL reward hacking
Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
erase is DROPPED from the paper (predecessor variant, no narrative cost). The
workshop paper = ONE working method (route2b), shown better than the vanilla
baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).
Status (2026-06-10): method is **lora2r routeV** (rank-2r Gaussian-init LoRA,
deployed block [:r] + quarantine block [r:]; per-rollout banded three-way SGTM
gate on the c-probe gradient vs an extracted hack direction `v_grad`, quarantine
ablated at deploy). The retired variants (route2b/erase, PiSSA, lora_frozen_b,
AntiPaSTO basis, online_stats gate, the "knob" nickname) are gone from the code
and should not appear in the paper. The workshop paper = ONE working method
(lora2r routeV), shown better than the vanilla baseline (intervention=none on the
SAME adapter), and ablated against a Haar-random direction (placebo) and an
all-absorption arm.
Workshop paper scope (the whole thing):
1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
a pair-calibrated band into a deletable quarantine knob.
2. Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
3. Ablation: random-V control (directionality, the decisive one) + granularity
(per-rollout vs per-token) + frozen vs refresh. No erase arm.
1. Method: lora2r routeV -- route each GRPO rollout's gradient by its band-normalized
cosine to `v_grad` into clean (deployed-only) / hack (quarantine-only) / mid
(both). The quarantine block is deleted at deploy.
2. Baseline: vanilla GRPO = intervention=none (gate pinned clean) on the identical
rank-2r adapter, so the comparison is capacity- and structure-matched (no
shrinkage confound). Show routeV deploys at lower hack rate at matched solve.
3. Ablations (one row per arm, same seed/preset): Haar-random `v_grad` placebo
(directionality, the decisive control) + absorb (gate pinned mid, isolates the
gate+hard-masks from absorption alone). No erase arm, no per-token arm.
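
The three-way gate in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the repo's implementation: it assumes "band-normalized" means rescaling the cosine so the pair-calibrated `lower` maps to 0 and `upper` to 1, and the names `route`, `lower`, `upper`, and `eps` are hypothetical.

```python
import math


def cosine(g, v):
    """Cosine similarity between two flat vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(g, v))
    ng = math.sqrt(sum(a * a for a in g))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (ng * nv) if ng > 0 and nv > 0 else 0.0


def route(g, v_grad, lower, upper, eps=1e-8):
    """Return which block(s) receive this rollout's gradient.

    clean -> deployed block only, hack -> quarantine block only, mid -> both.
    """
    # Assumed normalization: 0 at the band's lower edge, 1 at its upper edge.
    t = (cosine(g, v_grad) - lower) / max(upper - lower, eps)
    if t <= 0.0:
        return "clean"
    if t >= 1.0:
        return "hack"
    return "mid"


v = [1.0, 0.0]
decisions = [route(g, v, lower=0.2, upper=0.6)
             for g in ([0.0, 1.0], [1.0, 2.0], [1.0, 0.0])]
```

With this band, an orthogonal gradient routes clean, a partially aligned one routes mid, and a fully aligned one routes to quarantine. Pinning the gate to `"clean"` or `"mid"` regardless of `t` corresponds to the vanilla and absorb arms described in items 2 and 3.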
Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
workshop) if the n=3 routeV-vs-vanilla deploy gap holds and the placebo
ablation comes back clean.
## The one-paragraph story
@@ -23,46 +31,51 @@ ablation comes back clean.
Labs already do RL on coding/agentic tasks and the model learns to exploit
grader flaws. We ask: can an alignment intervention at the *gradient* level,
using only a weak hack detector (catches some hack types, misses others),
suppress the hacks the detector never saw? We extract a hack direction from a
handful of hand-paired clean/hack completions (the "weak detector for hack A"),
then during GRPO we route any rollout whose gradient aligns with that direction
into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/
Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping
partition) the routed model deploys at lower hack rate than vanilla GRPO at
comparable solve rate. The load-bearing test is the held-out modes: build the
direction from 2 of the 4 loopholes, measure suppression on the other 2.
suppress the hacks the detector never saw? We extract a hack direction `v_grad`
from a handful of hand-paired clean/hack completions (off-distribution, authored
by us, the "weak detector for hack A"), then during GRPO we route any rollout
whose gradient aligns with that direction into a throwaway quarantine block that
is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4
grading flaws, even non-overlapping partition) the routed model deploys at lower
hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is
the held-out modes: build the direction from 2 of the 4 loopholes, measure
suppression on the other 2.
No-cheat invariant: the authored pairs are the ONLY routing-label source. No
oracle / ground-truth hack-label of a live rollout ever touches training.
## Claims (compress to <=3, ranked by how much evidence we have)
C1 (primary, existence -> systematic). Routing the GRPO gradient against a
weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs
vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 84/74/72 (vanilla
s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
- Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the
preregistered H1 bar); honest framing is "reduces hack at comparable solve",
with the pp gap stated as measured.
weak-detector hack direction in the lora2r c-probe space lowers deploy hack rate
vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: the 4-arm lora2r decision run (queue-decision: routeV-real vs
none/vanilla, s41/42/43). Deploy = quarantine-ablated, n=32 prompts, T=0.7.
- Confidence today: untested on lora2r. The pre-lora2r per-token route run
(endpoint table below) is suggestive (hack 0.61 -> 0.04 at comparable solve)
but on a different adapter; it is prior evidence, not the headline.
C2 (mechanism, the no-cheat payload). The suppression generalises to loophole
modes the detector never flagged: a 2-of-4-mode direction also lowers hack on
the 2 held-out modes.
- Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack,
tasks #106/#107). This is the claim a referee will attack hardest and the one
that makes the result interesting rather than a tautology.
- Confidence: untested at writeup time. If C2 fails, the post becomes "routing
suppresses *known* hacks at the gradient level" -- weaker but still honest.
- Evidence: cross-mechanism runs (spec 20260528_cross_mechanism_v_hack). The
claim a referee will attack hardest and the one that makes the result
interesting rather than a tautology.
- Confidence: untested. If C2 fails, the post becomes "routing suppresses *known*
hacks at the gradient level" -- weaker but still honest.
C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
just the act of carving a rank-k knob out of the adapter, and not just
quarantining gradient mass. A Haar-random v_grad of matched per-module
rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
the deploy hack-drop. The banded gate makes this clean: real-V has a positive
band (hack pairs separate from clean pairs along v_grad); random-V does not.
- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
to the real-V run so the control quarantines comparable mass but in an
arbitrary direction.
- Confidence: untested for route2b. The decisive control both gpt-5.5 and the
brainstorm flagged. Must land before we claim directional specificity.
C3 (specificity / not-a-regularizer). The effect needs the *direction*, not just
the act of carving a quarantine block out of the adapter, and not just routing
gradient mass away. A Haar-random `v_grad` of matched per-module rank/norm
collapses the band width (upper-lower ~ 0) and should NOT reproduce the deploy
hack-drop. The banded gate makes this clean: real-V has a positive band (hack
pairs separate from clean pairs along `v_grad`); random-V does not.
- Evidence: the placebo arm (--routeV-random-v-seed) in the decision run,
frout-matched to real-V so the control quarantines comparable mass but in an
arbitrary direction. The absorb arm separately isolates the gate+masks.
- Confidence: untested for lora2r. The decisive control; must land before we
claim directional specificity. (On PiSSA it tied -- shrinkage; lora2r's
unfrozen B is the structural fix, see RESEARCH_JOURNAL PiSSA->lora2r entry.)
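
A minimal sketch of the placebo direction, assuming one flat vector per module (the rank-1 case) and that norm matching is done per module. `placebo_direction` and the module names are hypothetical; the only fact relied on is that a normalized i.i.d. Gaussian vector is uniformly distributed on the sphere.

```python
import math
import random


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def placebo_direction(v_real, seed):
    """Per module: uniform-on-sphere direction rescaled to the real vector's norm."""
    rng = random.Random(seed)
    out = {}
    for name in sorted(v_real):  # sorted for a seed-deterministic draw order
        z = [rng.gauss(0.0, 1.0) for _ in v_real[name]]
        scale = norm(v_real[name]) / norm(z)
        out[name] = [scale * x for x in z]
    return out


# Toy stand-in for an extracted v_grad (hypothetical module names and values).
v_real = {"layer0.q_proj": [3.0, 4.0, 0.0], "layer0.v_proj": [1.0, 2.0, 2.0]}
v_rand = placebo_direction(v_real, seed=0)
```

Matching the norm per module keeps the control comparable in scale; matching the quarantined mass (frout) is a separate calibration step that this sketch does not cover.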
## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)
@@ -81,251 +94,120 @@ band (hack pairs separate from clean pairs along v_grad); random-V does not.
5. Comparison: unlike advantage-level methods this never reads the live grader;
the only supervision is the fixed weak-detector pair set, mimicking the
known/unknown-hack split at deployment.
6. Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis
(AntiPaSTO) gives a deletable quarantine knob.
7. Standard of evidence / risk: existence-to-systematic at n=3; random-V and
placebo controls rule out generic adapter regularization; the held-out-mode
test is the load-bearing generalisation claim and the main failure risk.
6. Context: gradient routing (Cloud et al. 2024) realised as an SGTM-style block
partition inside one rank-2r LoRA, giving a deletable quarantine block.
7. Standard of evidence / risk: existence-to-systematic at n=3; the Haar-random
placebo and the absorb arm rule out generic adapter regularization; the
held-out-mode test is the load-bearing generalisation claim and the main
failure risk.
## Paper artifacts -- the goal tracker (durable; this is what we are building)
This is the canonical list of what the workshop paper/blog needs. Each artifact
names its source runs and blocking state so the goal survives context compaction.
Status legend: [x] done [/] data landing [ ] not started. Each finished run
writes per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/;
deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.
Canonical list of what the workshop paper/blog needs; each artifact names its
source and blocking state so the goal survives compaction. Status legend:
[x] done [/] data landing [ ] not started. Each finished run writes
per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/.
A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3
band. Prototype exists: out/figs/dyn_sub4*.png (`just dyn`). [/] blocked on the
n=3 vanilla band (jobs 74 s42 + 84 s41 [re-added from killed 79, p7 so it runs
ahead of the A3 erase rows]; 72 s43 done; route2 68/69/70 done).
A1 -- Keynote figure. routeV vs vanilla deploy hack/solve over training, n=3
band. [ ] blocked on the lora2r 4-arm decision run (queue-decision, s41/42/43).
Pre-lora2r prototype: out/figs/eval2_pertoken_vs_vanilla_dynamics.png.
A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
[/] same blocker as A1 (74, 84).
seeds, routeV vs vanilla, delta vs vanilla, paired test + alpha. [ ] same blocker
as A1.
A3 -- Ablation table (what each component buys). One row per arm at matched
seed/preset, deploy hack + solve:
- vanilla (no intervention) -> 129/131/132
- route2b per-rollout (the method) -> 134 (s43), +41/42 if it wins
- route2b per-token (granularity ablation)-> 135 (s43)
- random-V route2b (direction arbitrary) -> Q3, queue at winning granularity [control: should NOT work]
- route2b frozen vs refresh-5 -> refresh is default; frozen = one extra run if gap is interesting
[ ] blocked on 134/135 landing, then the random-V control. This is the
"filling out ablations" table. Erase row removed (arm dropped from paper).
- none / vanilla (gate pinned clean, identical adapter) -> emergence reference
- routeV (the method)
- routeV placebo (Haar `v_grad`, direction arbitrary) -> control: should NOT work
- absorb (gate pinned mid, no gate) -> gate-vs-absorption
[ ] blocked on the decision run. Shakedown in flight: job 40 (60-step routeV on
the new md pairs, s43) proves the pipeline + band separation on the live 4B model
before the n=3 spend.
A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
[/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
(out/figs/dyn_longrun_200.{png,csv}, fig:longrun in main.tex). vanilla learns the
cheat to ~0.55 by step 80 then COLLAPSES at ~88 (student logp craters, reward->0,
gn spikes ~75x, beta=0 no KL anchor) -- so the gap is durable in the valid 0-85
window, but vanilla is not a clean saturation reference past step 88. Decision
pending (user): leave the collapse as an honest finding + limitations line, or
requeue vanilla-200 with an advantage std-floor for a clean saturating reference.
Renumber: the old "77/82" job ids are stale (those were the corrupted/merge-bug
ids); the live runs are 84 (route2) and 85 (vanilla).
A4 -- Long-run figure. ~200-step routeV vs vanilla saturation reference.
[ ] not re-run on lora2r. Pre-lora2r finding (route held hack=0 to 200 steps;
vanilla learned the cheat then collapsed ~step 88, no clean saturation past
there) is in RESEARCH_JOURNAL -- carry as an honest caveat, re-measure on lora2r
only if budget allows.
A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy
hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes.
[ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec
20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.
hack: `v_grad` from 2 of 4 modes, measure suppression on the 2 held-out modes.
[ ] NOT QUEUED -- highest-value gap. Queue once the n=3 band confirms C1 (spec
20260528_cross_mechanism_v_hack).
A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion,
clean completion for all 4 modes. [x] done -- blog appendix
(docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.
(docs/blog/20260529_...md#appendix-the-four-loophole-modes).
A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width
(Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9),
pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.
A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width,
refresh cadence, teacher mix, gate mode, solve-orthog, pairset content/placebo.
[x] data exists; just needs porting into the paper.
Next action when 74+84 land: read each per_mode_deploy.json, `just dyn`,
fill A1/A2, append a journal entry. Then queue A5 (the gap).
Next action when the decision run lands: read each per_mode_deploy.json,
`just results`, fill A1/A2/A3, append a journal entry. Then queue A5 (the gap).
## Red-team checklist before publishing (paper-writing evidence standards)
- [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
- [ ] Haar placebo does NOT reproduce the drop at matched frout (else it is
mass-quarantine / regularization, C3 dies).
- [ ] absorb arm reported: ~vanilla -> gate+masks add nothing; << vanilla ->
absorption alone suppresses.
- [ ] held-out-mode suppression measured (C2), reported even if it fails.
- [ ] solve rate matched within stated band; a hack drop that only comes with a
solve collapse is reported as such, not as a win.
- [ ] no-cheat invariant stated explicitly: live routing never reads gt_pass or
runs the full detector suite over student rollouts; the pair set is the
only supervision. (Promote to README/spec, plan item #114.)
- [/] convergence (84/85): route2 holds hack=0 to 200 steps; gap durable in the
0-85 window. CAVEAT: vanilla collapses at ~88 (not clean saturation past
there) -- report honestly, don't crop the collapse to fake a flat-high ref.
- [ ] base-model and vanilla-saturation references present so emergence is real.
runs the detector suite over student rollouts; the authored pair set is the
only supervision.
- [ ] base-model and vanilla-saturation references present so emergence is real
(base solve ~0.094-0.126 on the paper test set; no-loophole ceiling job 34).
## Open editorial decisions
## Eval contamination fix (load-bearing, 2026-06-07)
- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
projection). Candidate: `gradient_quarantine`. Decide before the public repo
link goes in the post. (Retitle docs first; rename package/repo only if we
ship the code link.)
- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
- Workshop vs blog-only: gate on C2 landing.
Eval is on the paper's recency-held-out test set (leetcode_test_medhard, every id
>= 3243), NOT the holdout/first-N (memorized -> base solve 0.94, kills the hack
metric's gt-fail headroom). Training uses a seeded representative shuffle, not
first-N-by-id. Verified base solve = 0.094 on test_medhard (matches paper fn9
~12%; mild undershoot from max_new truncation). Full table:
docs/spec/20260607_eval_contamination_fix.md.
## 2026-06-09 eval2 plot regeneration UAT
## Canonical endpoint table (pre-lora2r, latest real deploy numbers)
[x] Deleted all stale CSVs under `out/figs/` and regenerated the completed
per-token routeV versus latest vanilla comparison without changing pueue jobs.
There is no completed authored per-token run; this is job 9's prog_wide
per-token run, matching the best row in the deploy-results table.
Sources:
- `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`
- `logs/20260608T224659_fast_vanilla_seed43_dir8_vanilla_s43.log`
Artifacts:
- [eval2 per-token dynamics](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.png)
- [eval2 per-token hack/solve overlay](../../out/figs/eval2_pertoken_vs_vanilla_dynamics_hack_overlay.png)
- [sole current figure CSV](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.csv)
| estimator | arm | hack | solve |
|---|---|---:|---:|
| fixed monitoring subset, final logged point, n=32 | routeV/per-token prog_wide | 0.00 | 0.062 |
| fixed monitoring subset, final logged point, n=32 | vanilla | 0.594 | 0.031 |
| final held-out deploy eval, n=119 | routeV/per-token prog_wide | 0.042 | 0.143 |
| final held-out deploy eval, n=119 | vanilla | 0.613 | 0.101 |
| final held-out deploy eval, n=119 | base model, zero steps | 0.000 | 0.126 |
Verification:
- The only remaining `out/figs/**/*.csv` is the current reproducibility CSV.
- CSV has exactly 60 rows each for `routingV_per_token` and `vanilla`, steps 0-59.
- Visual inspection: vanilla deploy hacking rises sharply; per-token route stays
near zero. Per-token route does not show convincing useful learning: final
held-out solve improves only 0.126 -> 0.143 versus the base model, below one
binomial standard error at n=119.
- Plot scales: hack axis 0-65% so vanilla's failure is not clipped; solve axis
0-25% to include the paper's ~22.3% no-loophole ceiling. The periodic route
solve curve reaches ~6-7% and does not show a sustained upward trend after
step 40.
- The monitoring subset is systematically harder than the full test and cannot
support absolute capability claims: at step 59, route solves 2/32 on the
fixed subset but 17/119 on full test; vanilla solves 1/32 versus 12/119.
The old plot title incorrectly said n=64; it now states fixed n=32. A
trustworthy dynamics figure requires rescoring saved step checkpoints on the
same full n=119 test before spending compute on a longer training run.
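
The "below one binomial standard error" statement can be checked with a few lines of arithmetic. This sketch uses the usual normal-approximation standard error sqrt(p(1-p)/n) at the base rate; the variable names are illustrative.

```python
import math


def binomial_se(p, n):
    """Standard error of a proportion under the binomial model."""
    return math.sqrt(p * (1.0 - p) / n)


n = 119
base_solve = 0.126      # base model, full test
route_solve = 17 / n    # routeV per-token, 17/119 solved
se = binomial_se(base_solve, n)   # about 0.030
gain = route_solve - base_solve   # about 0.017
```

The gain of roughly 1.7pp is a little over half of one standard error (about 3.0pp), so the endpoint alone cannot distinguish the routed model's solve rate from the base model's.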
### Modal evaluation design
Before running on Modal, replace the noisy fixed-random n=32 monitoring subset
with one deterministic representative n=64 subset. Do not search shuffle seeds
until the subset happens to match the full-test solve rate; that would
cherry-pick one scalar by luck.
Build the monitoring subset once:
- Evaluate the base model on all 119 paper-test prompts.
- Stratify prompts by base pass/fail.
- Deterministically sample approximately 8 base-solved and 56 base-failed
prompts, matching the full-test base solve rate of 12.6%.
- Freeze the prompt IDs and generation seed. Every arm and training seed uses
this identical monitoring subset.
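
The construction above can be sketched as follows, assuming a dict from prompt ID to base pass/fail is available. `build_monitor_subset` and the synthetic `base_pass` are hypothetical; the real prompt IDs and base results come from the base-model evaluation.

```python
import random


def build_monitor_subset(base_pass, n_subset=64, seed=0):
    """Deterministic stratified sample that preserves the base solve rate."""
    solved = sorted(pid for pid, ok in base_pass.items() if ok)
    failed = sorted(pid for pid, ok in base_pass.items() if not ok)
    # Allocate the solved stratum proportionally to the full-test base rate.
    n_solved = round(n_subset * len(solved) / len(base_pass))
    rng = random.Random(seed)  # fixed seed: chosen once, never searched over
    chosen = rng.sample(solved, n_solved) + rng.sample(failed, n_subset - n_solved)
    return sorted(chosen)


# Synthetic stand-in: 119 prompts, 15 base-solved (about 12.6%).
base_pass = {pid: pid < 15 for pid in range(119)}
subset = build_monitor_subset(base_pass)
```

The returned ID list would be written to disk once and reused by every arm and training seed, which is what makes the monitoring curves comparable across runs.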
Evaluate the n=64 monitoring subset only at steps 0, 20, 40, and 59. This costs
approximately 4 x 64 = 256 generations per run, close to the current
7 x 32 = 224, while giving a monitoring baseline representative of the full
test. Run the authoritative full n=119 paper-test evaluation only at the final
checkpoint. Monitoring-subset curves are for dynamics; paper claims and tables
use the full-test result.
Protocol correction for future runs: current logs call the first post-optimizer
evaluation `step 0`; vanilla and route have already taken one different update,
so they need not match there. Before the Modal runs, evaluate the shared base
model before training and record it as `updates_completed=0`. Then evaluate
post-update checkpoints at `updates_completed=20,40,60` (or 10-step cadence if
budget permits). Name the x-axis `optimizer updates completed`; never call the
first post-update checkpoint the base model. Do not change `train.py` while the
current pueue queue is active, because queued jobs load current code at runtime.
Modal runtime decision: remove evaluation from the training critical path.
Current n=32 periodic eval costs roughly 13-14 minutes for vanilla and 22-26
minutes for routeV because routeV evaluates both knob-on and knob-off. Seven
routeV monitoring evaluations add about 2.7 hours, before the final n=119 eval.
Simplified protocol:
- Training jobs do no periodic eval by default. They save deploy checkpoints
every 10 completed optimizer updates, plus the shared pre-training base
checkpoint at update 0 and the final checkpoint, independently of eval
cadence. The ~2.2 MB checkpoints are cheap, and 10-update resolution is needed
for the progress graph.
- A separate evaluation job scores selected checkpoints. Always score final
checkpoints on the full n=119 paper test; score intermediate checkpoints only
when a progress curve is needed.
- Progress evaluation scores both knob states for routeV. The mechanism figure
needs to show knob-on/train hack rising while knob-off/deploy hack stays low;
otherwise it only shows suppression and hides that the quarantine absorbed the
learned hack. Vanilla needs one pass because train and deploy are identical.
- Batch evaluation prompts. `eval_hack_solve` currently calls `model.generate`
once per prompt despite running under `torch.no_grad()`. Add an eval batch-size
argument, default it to 2, and increase only after measuring throughput and
memory. Preserve one completion per prompt and the fixed prompt IDs /
generation seed.
- Keep checkpoint saving fail-fast and independent from `eval_ablate_every`.
Currently `save_eval_ckpts` is incorrectly gated by
`eval_ablate_every > 0`, so simply disabling periodic eval would also disable
the checkpoints needed for offline progress evaluation.
Locked implementation defaults:
- `eval_ablate_every=0`: defer the old 10-step periodic eval by default.
- `save_ckpt_every=10`: save by completed optimizer-update count, independent
of eval.
- `eval_batch_size=2`: batched offline/final evaluation default.
- Offline progress command scores checkpoints 0, 10, 20, ..., final and writes
one canonical eval-curve artifact for plotting. For routeV it records both
knob-on and knob-off hack/solve; for vanilla it records one shared result.
- `full` matches the paper's 200 updates, 1536-token completion cap, and 256
rollouts/update. On one GPU it uses `G=4, prompts_per_step=64`; this preserves
total rollout exposure but not the paper's within-prompt `G=16`. It remains
pure on-policy (`teacher_pool_dir=None`).
- Prompt length is never silently filtered. Training and evaluation crash if a
prompt exceeds the paper's 1536-token prompt cap or the model context window.
Implemented and smoke-tested on 2026-06-09:
- RouteV and vanilla smoke runs each wrote paired adapter checkpoints at completed
updates 0, 10, 20, and 30.
- `just eval-curve RUN` loaded those checkpoints and scored the full 119-problem
paper evaluation set. RouteV scored both knob states; vanilla scored once.
- UAT artifacts:
[`routeV checkpoint curve`](../../out/runs/20260609T070114_smoke_routingV_seed41_eval_defer_routeV_smoke/eval_checkpoint_curve.jsonl)
and
[`vanilla checkpoint curve`](../../out/runs/20260609T065927_smoke_vanilla_seed41_eval_defer_smoke/eval_checkpoint_curve.jsonl).
- Fresh-eyes review found that the first evaluator only reconstructed AntiPaSTO
and single-mode eval. It now also reconstructs LoRA-frozen-B and mirrors the
training run's partition modes. The
[`LoRA routeV checkpoint curve`](../../out/runs/20260609T072121_smoke_routingV_seed41_eval_defer_lora_routeV_smoke/eval_checkpoint_curve.jsonl)
is the runtime proof.
- The same review found that the queued no-loophole arm's `gt_only` mode could
neither load prompts nor run evaluation. Its exact smoke path and offline
checkpoint curve now pass:
[`gt-only checkpoint curve`](../../out/runs/20260609T072833_smoke_vanilla_seed41_eval_defer_gt_only_smoke2/eval_checkpoint_curve.jsonl).
- These are tiny-random-model runtime proofs, not scientific results.
Whether 60 updates are enough to learn solving remains unknown. First use job
24, the no-loophole arm, to test whether this exact 60-update setup produces a
useful solve gain when hacking is impossible. Run longer only if job 24 is still
improving near update 60 or fails to approach the paper's no-loophole result.
### Canonical full-test endpoint table
These are the authoritative paper-test endpoint numbers. Do not infer them from
or normalize the n=32 monitoring curves.
Authoritative paper-test endpoints from the per-token routeV run (prog_wide
pairs) -- the prior adapter (lora_frozen_b/PiSSA era), n=119 full test. The
lora2r decision run will replace these as the headline.
| condition | solve | hack |
|---|---:|---:|
| base model (paper: 0.115) | 0.126 | 0.000 |
| vanilla GRPO (paper: 0.149) | 0.101 | 0.613 |
| vGROUT routeV best, per-token | 0.143 | 0.042 |
| no-loophole ceiling (paper: 0.223) | queued, job 24 | 0.000 |
| routeV per-token, prog_wide (pre-lora2r) | 0.143 | 0.042 |
| no-loophole ceiling (paper: 0.223) | job 34, queued | 0.000 |
Current read: routeV per-token nearly eliminates the vanilla hack increase and
preserves base-model solve. Its solve is numerically +1.7pp over base and +4.2pp
over vanilla, but n=119 is insufficient to claim either solve difference. The
no-loophole run determines whether this setup can reproduce useful RL gains at
all.
- Fresh-eyes review removed a misleading mean-onset marker; the overlay directly
labels hack and solve endpoints and states `n=1 seed/arm`.
- `plot_dynamics.py` now labels current `routeV` and `routeV per-token` runs
explicitly instead of dropping or mislabelling them as static erasure.
Read: pre-lora2r routeV nearly eliminated the vanilla hack increase and preserved
base-model solve; solve was +1.7pp over base / +4.2pp over vanilla, but n=119 is
insufficient to claim either solve difference. Caveats: prog_wide pairs are
pool-derived (contamination-prone, not headline-clean); the n=32 monitoring
subset is systematically harder than full test (use full n=119 for claims).
## Offline eval protocol (implemented 2026-06-09, now the code default)
- Training does no periodic eval by default (eval_ablate_every=0); it saves deploy
checkpoints every 10 optimizer updates (save_ckpt_every=10), independent of eval.
- A separate job (`just eval-curve RUN`) scores checkpoints on the full n=119
paper test; for routeV it records both quarantine-on (train) and quarantine-off
(deploy) so the mechanism figure shows train-hack rising while deploy-hack stays
low. Batched eval (eval_batch_size=2), fixed prompt IDs + generation seed.
- Monitoring subset (if used): one deterministic stratified n=64 (≈8 base-solved +
56 base-failed, matching the 12.6% full-test base solve), frozen IDs, scored at
a few checkpoints only. Do NOT search shuffle seeds to match full-test solve.
## Open editorial decisions
- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
projection). README already calls it vGROUT (vector gradient routing). Decide
the public repo name before the code link goes in the post.
- Re-headline the blog draft to lora2r routeV (the route2/erase framing is dead).
- Workshop vs blog-only: gate on C2 landing.
+36 -22
View File
@@ -160,9 +160,9 @@ README ``How it works'' + blog intro.}
representation-engineering
style, from $\sim$10--21 contrastive (hack, clean) pairs and route by
$\cos(g, v_{\text{hack}})$. The live RL rollouts carry no labels.
\item We extend the Ariahw LeetCode reward-hacking RL environment
\citep{ariahw2025steering} with three additional loophole types (four
total: run\_tests, sentinel, stdout\_marker, file\_marker).
% \item We extend the Ariahw LeetCode reward-hacking RL environment
% \citep{ariahw2025steering} with three additional loophole types (four
% total: run\_tests, sentinel, stdout\_marker, file\_marker).
\end{enumerate}
\section{Method}
@@ -181,25 +181,29 @@ Mechanically vGROUT follows the post-backward, deletable-block routing of
\citealp{cloud2024gradientrouting}); it differs from both in that the routing is
gated by an extracted direction, not a per-example data label.
\subsection{The SVD-basis adapter}
% PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
% Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
% U, Vh frozen and double as the v_hack basis.
\TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
singular-value coordinates; we freeze $U,V$ and train a per-module adapter
parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
and the projection all live in this same low-rank, weight-aligned space
($r\sim500$--$2560$). Two consequences we use:
\begin{itemize}
\item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
never reconstructed on the main path), so an adapter-off forward gives
$\pi_{\text{ref}}$ with no second model.
\item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
exactly that routed capability.
\end{itemize}
\subsection{Adapter}
- We use LoRA, where half of the adapter is masked.
% FIXME we now use lora
% % PROVENANCE: rationale from docs/pseudocode/01_adapter.py (Source: antipasto.py).
% % Forward: y + U diag(delta_S + delta_S_hack) Vh x. Two per-module knobs train;
% % U, Vh frozen and double as the v_hack basis.
% \TODO{prose -- author.} Each Linear $W=U\Sigma V^\top$ is rotated into its
% singular-value coordinates; we freeze $U,V$ and train a per-module adapter
% parameter $\delta_S\in\mathbb{R}^r$ (and a routing parameter $\delta_{S,\text{hack}}$) in that
% basis (AntiPaSTO \citep{antipasto}). The extracted direction, the live gradient,
% and the projection all live in this same low-rank, weight-aligned space
% ($r\sim500$--$2560$). Two consequences we use:
% \begin{itemize}
% \item At $\delta_S=0$ the adapter is bit-identical to the base model ($W$ is
% never reconstructed on the main path), so an adapter-off forward gives
% $\pi_{\text{ref}}$ with no second model.
% \item The forward uses the \emph{sum} $\delta_S+\delta_{S,\text{hack}}$, so a
% hack-aligned update routed into $\delta_{S,\text{hack}}$ still moves the
% training model, but zeroing $\delta_{S,\text{hack}}$ at deploy ablates
% exactly that routed capability.
% \end{itemize}
\subsection{Extracting the hack direction}
\label{sec:extract}
@@ -220,6 +224,7 @@ may select/calibrate; live routing never reads \texttt{gt\_pass}.}
possible bias toward short-completion hacks. Unmeasured -- see Limitations.}
\subsection{Arms: erase vs.\ route, offline vs.\ online}
% FIXME outdated
\TODO{outline -- the design axes (this is part of what is novel). Two ways to
keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
\begin{itemize}
@@ -244,6 +249,15 @@ keep the live gradient out of $v_{\text{hack}}$, and two extraction schedules:}
\end{itemize}
\section{Experimental setup}
\subsection{The environment}
% human written
We extend the Ariahw LeetCode reward-hacking RL environment \citep{ariahw2025steering} in two ways.
\begin{itemize}
  \item Instead of one environment with a hint and a hack, we use two: a vulnerable environment~1, which carries a hint and can be hacked, and environment~2, which cannot. A problem may come from either, so the model is rewarded for hacking or for solving, depending on whether the vulnerability hint is present. This mirrors a deployment that might stream from two machines, one patched and the other not. Pragmatically, it lets us see how a mixed environment containing hackable and non-hackable problems behaves, without the hack saturation that appears in the original \cite{ariahw2025steering} environment.
  \item We also make it faster. Rather than running 4$\times$H100 for TODO hours, we bootstrap for the first 30 steps by injecting 2 hacking and 2 solving examples into a batch of 32 generated completions. This speeds up learning by removing the exploration time, relying on curated demonstrations of hacking and solving for each problem. After 30 steps we turn off the examples and rely only on the model's own generations.
\end{itemize}
% ai written
\TODO{outline: Ariahw LeetCode loophole substrate \citep{ariahw2025steering}, 4
modes, even non-overlapping partition (Appendix~\ref{app:traces},
6/6/6/6 over 24 problems); Qwen3-4B; GRPO 60 steps (fast preset), mix=0.125;