mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
misc
+136
-254
@@ -1,21 +1,29 @@
# Writeup spec -- gradient routing vs RL reward hacking

Status (2026-06-06): method is route2b (banded per-rollout/per-token gate);
erase is DROPPED from the paper (predecessor variant, no narrative cost). The
workshop paper = ONE working method (route2b), shown better than the vanilla
baseline, and ablated. Numbers land as the route2b jobs complete (134 per-rollout
s43 running, 135 per-token s43 queued; vanilla baselines 129/131/132).
Status (2026-06-10): method is **lora2r routeV** (rank-2r Gaussian-init LoRA,
deployed block [:r] + quarantine block [r:]; per-rollout banded three-way SGTM
gate on the c-probe gradient vs an extracted hack direction `v_grad`, quarantine
ablated at deploy). The retired variants (route2b/erase, PiSSA, lora_frozen_b,
AntiPaSTO basis, online_stats gate, the "knob" nickname) are gone from the code
and should not appear in the paper. The workshop paper = ONE working method
(lora2r routeV), shown better than the vanilla baseline (intervention=none on the
SAME adapter), and ablated against a Haar-random direction (placebo) and an
all-absorption arm.

Workshop paper scope (the whole thing):
1. Method: route2b -- route each GRPO rollout's gradient by cos(g, v_grad) through
a pair-calibrated band into a deletable quarantine knob.
2. Baseline: vanilla GRPO. Show route2b deploys at lower hack rate at matched solve.
3. Ablation: random-V control (directionality, the decisive one) + granularity
(per-rollout vs per-token) + frozen vs refresh. No erase arm.
1. Method: lora2r routeV -- route each GRPO rollout's gradient by its band-normalized
cosine to `v_grad` into clean (deployed-only) / hack (quarantine-only) / mid
(both). The quarantine block is deleted at deploy.
2. Baseline: vanilla GRPO = intervention=none (gate pinned clean) on the identical
rank-2r adapter, so the comparison is capacity- and structure-matched (no
shrinkage confound). Show routeV deploys at lower hack rate at matched solve.
3. Ablations (one row per arm, same seed/preset): Haar-random `v_grad` placebo
(directionality, the decisive control) + absorb (gate pinned mid, isolates the
gate+hard-masks from absorption alone). No erase arm, no per-token arm.
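
The three-way routing in the lora2r routeV method item can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the linear band normalization, the band endpoints `lower`/`upper` (assumed here to be calibrated from the authored clean/hack pairs), and the 0/1 block masks are all assumptions made for the example.

```python
import math


def cosine(g, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(g, v))
    norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-norm vector")
    return dot / norm


def route(g, v_grad, lower, upper):
    """Route one rollout gradient into clean / mid / hack.

    Returns (bucket, deployed_mask, quarantine_mask). The masks say which
    adapter block would receive this rollout's update (assumed semantics).
    """
    if upper <= lower:
        # A collapsed band (e.g. a random direction) cannot separate anything.
        raise ValueError("band must have positive width")
    # Band-normalized score: 0 at the lower edge, 1 at the upper edge.
    s = (cosine(g, v_grad) - lower) / (upper - lower)
    if s <= 0.0:
        return "clean", 1.0, 0.0   # deployed block only
    if s >= 1.0:
        return "hack", 0.0, 1.0    # quarantine block only
    return "mid", 1.0, 1.0         # both blocks
```

In this sketch the `none` arm corresponds to always returning the clean masks and the `absorb` arm to always returning the mid masks, regardless of the score.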

Venue order: LW blog first (the audience that read AntiPaSTO and the Ariahw
post), then a workshop paper (NeurIPS/ICLR alignment or interpretability
workshop) if the n=3 route2b-vs-vanilla deploy gap holds and the random-V
workshop) if the n=3 routeV-vs-vanilla deploy gap holds and the placebo
ablation comes back clean.

## The one-paragraph story
@@ -23,46 +31,51 @@ ablation comes back clean.
Labs already do RL on coding/agentic tasks and the model learns to exploit
grader flaws. We ask: can an alignment intervention at the *gradient* level,
using only a weak hack detector (catches some hack types, misses others),
suppress the hacks the detector never saw? We extract a hack direction from a
handful of hand-paired clean/hack completions (the "weak detector for hack A"),
then during GRPO we route any rollout whose gradient aligns with that direction
into a throwaway quarantine knob that is deleted at deploy. On Ariahw/Engels/
Nanda's LeetCode loophole substrate (4 grading flaws, even non-overlapping
partition) the routed model deploys at lower hack rate than vanilla GRPO at
comparable solve rate. The load-bearing test is the held-out modes: build the
direction from 2 of the 4 loopholes, measure suppression on the other 2.
suppress the hacks the detector never saw? We extract a hack direction `v_grad`
from a handful of hand-paired clean/hack completions (off-distribution, authored
by us, the "weak detector for hack A"), then during GRPO we route any rollout
whose gradient aligns with that direction into a throwaway quarantine block that
is deleted at deploy. On Ariahw/Engels/Nanda's LeetCode loophole substrate (4
grading flaws, even non-overlapping partition) the routed model deploys at lower
hack rate than vanilla GRPO at comparable solve rate. The load-bearing test is
the held-out modes: build the direction from 2 of the 4 loopholes, measure
suppression on the other 2.

No-cheat invariant: the authored pairs are the ONLY routing-label source. No
oracle / ground-truth hack-label of a live rollout ever touches training.

## Claims (compress to <=3, ranked by how much evidence we have)

C1 (primary, existence -> systematic). Routing the GRPO gradient against a
weak-detector hack direction in the SVD-of-W basis lowers deploy hack rate vs
vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: jobs 68/69/70 (route2 no-floor s41/42/43) vs 84/74/72 (vanilla
s41/42/43). Deploy = knob-off, n=64 prompts x group, T=0.7.
- Confidence today: suggestive at n=1; n=3 band landing. NOT yet 30pp (the
preregistered H1 bar); honest framing is "reduces hack at comparable solve",
with the pp gap stated as measured.
weak-detector hack direction in the lora2r c-probe space lowers deploy hack rate
vs vanilla GRPO at matched-ish solve rate, replicated over n=3 seeds.
- Evidence: the 4-arm lora2r decision run (queue-decision: routeV-real vs
none/vanilla, s41/42/43). Deploy = quarantine-ablated, n=32 prompts, T=0.7.
- Confidence today: untested on lora2r. The pre-lora2r per-token route run
(endpoint table below) is suggestive (hack 0.61 -> 0.04 at comparable solve)
but on a different adapter; it is prior evidence, not the headline.

C2 (mechanism, the no-cheat payload). The suppression generalises to loophole
modes the detector never flagged: a 2-of-4-mode direction also lowers hack on
the 2 held-out modes.
- Evidence: the G2/G3 cross-mechanism runs (spec 20260528_cross_mechanism_v_hack,
tasks #106/#107). This is the claim a referee will attack hardest and the one
that makes the result interesting rather than a tautology.
- Confidence: untested at writeup time. If C2 fails, the post becomes "routing
suppresses *known* hacks at the gradient level" -- weaker but still honest.
- Evidence: cross-mechanism runs (spec 20260528_cross_mechanism_v_hack). The
claim a referee will attack hardest and the one that makes the result
interesting rather than a tautology.
- Confidence: untested. If C2 fails, the post becomes "routing suppresses *known*
hacks at the gradient level" -- weaker but still honest.

C3 (specificity / not-a-regularizer). The effect needs the *direction*, not
just the act of carving a rank-k knob out of the adapter, and not just
quarantining gradient mass. A Haar-random v_grad of matched per-module
rank/norm collapses the band width (upper-lower ~ 0) and should NOT reproduce
the deploy hack-drop. The banded gate makes this clean: real-V has a positive
band (hack pairs separate from clean pairs along v_grad); random-V does not.
- Evidence: Q3 -- random-V route2b at the winning granularity, frout-matched
to the real-V run so the control quarantines comparable mass but in an
arbitrary direction.
- Confidence: untested for route2b. The decisive control both gpt-5.5 and the
brainstorm flagged. Must land before we claim directional specificity.
C3 (specificity / not-a-regularizer). The effect needs the *direction*, not just
the act of carving a quarantine block out of the adapter, and not just routing
gradient mass away. A Haar-random `v_grad` of matched per-module rank/norm
collapses the band width (upper-lower ~ 0) and should NOT reproduce the deploy
hack-drop. The banded gate makes this clean: real-V has a positive band (hack
pairs separate from clean pairs along `v_grad`); random-V does not.
- Evidence: the placebo arm (--routeV-random-v-seed) in the decision run,
frout-matched to real-V so the control quarantines comparable mass but in an
arbitrary direction. The absorb arm separately isolates the gate+masks.
- Confidence: untested for lora2r. The decisive control; must land before we
claim directional specificity. (On PiSSA it tied -- shrinkage; lora2r's
unfrozen B is the structural fix, see RESEARCH_JOURNAL PiSSA->lora2r entry.)

## Abstract sketch (Heilmeier + Nature structure, ~200 words, fill numbers last)

@@ -81,251 +94,120 @@ band (hack pairs separate from clean pairs along v_grad); random-V does not.
5. Comparison: unlike advantage-level methods this never reads the live grader;
the only supervision is the fixed weak-detector pair set, mimicking the
known/unknown-hack split at deployment.
6. Context: gradient routing (Cloud et al. 2024) in the SVD-of-W adapter basis
(AntiPaSTO) gives a deletable quarantine knob.
7. Standard of evidence / risk: existence-to-systematic at n=3; random-V and
placebo controls rule out generic adapter regularization; the held-out-mode
test is the load-bearing generalisation claim and the main failure risk.
6. Context: gradient routing (Cloud et al. 2024) realised as an SGTM-style block
partition inside one rank-2r LoRA, giving a deletable quarantine block.
7. Standard of evidence / risk: existence-to-systematic at n=3; the Haar-random
placebo and the absorb arm rule out generic adapter regularization; the
held-out-mode test is the load-bearing generalisation claim and the main
failure risk.

## Paper artifacts -- the goal tracker (durable; this is what we are building)

This is the canonical list of what the workshop paper/blog needs. Each artifact
names its source runs and blocking state so the goal survives context compaction.
Status legend: [x] done [/] data landing [ ] not started. Each finished run
writes per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/;
deploy hack/solve + by_mode come from the JSON, per-step curves from the log/TSV.
Canonical list of what the workshop paper/blog needs; each artifact names its
source and blocking state so the goal survives compaction. Status legend:
[x] done [/] data landing [ ] not started. Each finished run writes
per_mode_deploy.json + train.safetensors under out/runs/<ts>_<tag>/.

A1 -- Keynote figure. route2 vs vanilla deploy hack/solve over training, n=3
band. Prototype exists: out/figs/dyn_sub4*.png (`just dyn`). [/] blocked on the
n=3 vanilla band (jobs 74 s42 + 84 s41 [re-added from killed 79, p7 so it runs
ahead of the A3 erase rows]; 72 s43 done; route2 68/69/70 done).
A1 -- Keynote figure. routeV vs vanilla deploy hack/solve over training, n=3
band. [ ] blocked on the lora2r 4-arm decision run (queue-decision, s41/42/43).
Pre-lora2r prototype: out/figs/eval2_pertoken_vs_vanilla_dynamics.png.

A2 -- Keynote table. Per-arm deploy hack + deploy solve, mean +/- SEM over 3
seeds, route2 no-floor vs vanilla, delta vs vanilla, paired test + alpha stated.
[/] same blocker as A1 (74, 84).
seeds, routeV vs vanilla, delta vs vanilla, paired test + alpha. [ ] same blocker
as A1.
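
One way the A2 row statistics could be computed is sketched below. It assumes a paired t-test across seeds as the "paired test" (the spec does not name the test), and the helper names `mean_sem` and `paired_delta` are illustrative, not functions from the repository.

```python
import math
from statistics import mean, stdev


def mean_sem(xs):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    return mean(xs), stdev(xs) / math.sqrt(len(xs))


def paired_delta(arm, vanilla):
    """Paired comparison across seeds (same seed order in both lists).

    Returns (mean delta vs vanilla, t statistic with len(arm) - 1 dof).
    """
    diffs = [a - v for a, v in zip(arm, vanilla)]
    d_mean, d_sem = mean_sem(diffs)
    if d_sem == 0.0:
        raise ValueError("identical per-seed deltas: t statistic undefined")
    return d_mean, d_mean / d_sem
```

Call `paired_delta(routeV_hack_by_seed, vanilla_hack_by_seed)` with the three per-seed deploy hack rates in the same seed order (s41, s42, s43). With n=3 the test has 2 degrees of freedom, so the two-sided critical value at alpha=0.05 is about 4.30; the table should state alpha next to the statistic.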

A3 -- Ablation table (what each component buys). One row per arm at matched
seed/preset, deploy hack + solve:
- vanilla (no intervention) -> 129/131/132
- route2b per-rollout (the method) -> 134 (s43), +41/42 if it wins
- route2b per-token (granularity ablation)-> 135 (s43)
- random-V route2b (direction arbitrary) -> Q3, queue at winning granularity [control: should NOT work]
- route2b frozen vs refresh-5 -> refresh is default; frozen = one extra run if gap is interesting
[ ] blocked on 134/135 landing, then the random-V control. This is the
"filling out ablations" table. Erase row removed (arm dropped from paper).
- none / vanilla (gate pinned clean, identical adapter) -> emergence reference
- routeV (the method)
- routeV placebo (Haar `v_grad`, direction arbitrary) -> control: should NOT work
- absorb (gate pinned mid, no gate) -> gate-vs-absorption
[ ] blocked on the decision run. Shakedown in flight: job 40 (60-step routeV on
the new md pairs, s43) proves the pipeline + band separation on the live 4B model
before the n=3 spend.

A4 -- Long-run figure. 200-step route2 (job 84, DONE) vs vanilla (job 85, running).
[/] route2 side landed: deploy hack = 0.000 every step to 199, solve ~0.61 flat
(out/figs/dyn_longrun_200.{png,csv}, fig:longrun in main.tex). vanilla learns the
cheat to ~0.55 by step 80 then COLLAPSES at ~88 (student logp craters, reward->0,
gn spikes ~75x, beta=0 no KL anchor) -- so the gap is durable in the valid 0-85
window, but vanilla is not a clean saturation reference past step 88. Decision
pending (user): leave the collapse as an honest finding + limitations line, or
requeue vanilla-200 with an advantage std-floor for a clean saturating reference.
Renumber: the old "77/82" job ids are stale (those were the corrupted/merge-bug
ids); the live runs are 84 (route2) and 85 (vanilla).
A4 -- Long-run figure. ~200-step routeV vs vanilla saturation reference.
[ ] not re-run on lora2r. Pre-lora2r finding (route held hack=0 to 200 steps;
vanilla learned the cheat then collapsed ~step 88, no clean saturation past
there) is in RESEARCH_JOURNAL -- carry as an honest caveat, re-measure on lora2r
only if budget allows.

A5 -- Generalisation figure/table (the no-cheat payload, C2). Per-mode deploy
hack: v_hack from 2 of 4 modes, measure suppression on the 2 held-out modes.
[ ] NOT QUEUED -- highest-value gap. Queue G2/G3 (tasks #106/#107, spec
20260528_cross_mechanism_v_hack) once the n=3 band confirms C1.
hack: `v_grad` from 2 of 4 modes, measure suppression on the 2 held-out modes.
[ ] NOT QUEUED -- highest-value gap. Queue once the n=3 band confirms C1 (spec
20260528_cross_mechanism_v_hack).

A6 -- Appendix: full traces per loophole class. Prompt+hint, hack completion,
clean completion for all 4 modes. [x] done -- blog appendix
(docs/blog/20260529_...md#appendix-the-four-loophole-modes), task #153.
(docs/blog/20260529_...md#appendix-the-four-loophole-modes).

A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width
(Q8), refresh cadence (Q5), teacher mix (Q6), gate mode (Q3), solve-orthog (Q9),
pairset content/placebo (Q10). [x] data exists; just needs porting into the paper.
A7 -- Appendix ablation context. Cite results.md Q-rows already run: basis width,
refresh cadence, teacher mix, gate mode, solve-orthog, pairset content/placebo.
[x] data exists; just needs porting into the paper.

Next action when 74+84 land: read each per_mode_deploy.json, `just dyn`,
fill A1/A2, append a journal entry. Then queue A5 (the gap).
Next action when the decision run lands: read each per_mode_deploy.json,
`just results`, fill A1/A2/A3, append a journal entry. Then queue A5 (the gap).

## Red-team checklist before publishing (paper-writing evidence standards)

- [ ] n=3 deploy gap stated with SEM, not cherry-picked seed.
- [ ] random-V (Q3) does NOT reproduce the drop at matched frout (else it is
- [ ] Haar placebo does NOT reproduce the drop at matched frout (else it is
mass-quarantine / regularization, C3 dies).
- [ ] absorb arm reported: ~vanilla -> gate+masks add nothing; << vanilla ->
absorption alone suppresses.
- [ ] held-out-mode suppression measured (C2), reported even if it fails.
- [ ] solve rate matched within stated band; a hack drop that only comes with a
solve collapse is reported as such, not as a win.
- [ ] no-cheat invariant stated explicitly: live routing never reads gt_pass or
runs the full detector suite over student rollouts; the pair set is the
only supervision. (Promote to README/spec, plan item #114.)
- [/] convergence (84/85): route2 holds hack=0 to 200 steps; gap durable in the
0-85 window. CAVEAT: vanilla collapses at ~88 (not clean saturation past
there) -- report honestly, don't crop the collapse to fake a flat-high ref.
- [ ] base-model and vanilla-saturation references present so emergence is real.
runs the detector suite over student rollouts; the authored pair set is the
only supervision.
- [ ] base-model and vanilla-saturation references present so emergence is real
(base solve ~0.094-0.126 on the paper test set; no-loophole ceiling job 34).

## Open editorial decisions
## Eval contamination fix (load-bearing, 2026-06-07)

- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
projection). Candidate: `gradient_quarantine`. Decide before the public repo
link goes in the post. (Retitle docs first; rename package/repo only if we
ship the code link.)
- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
- Workshop vs blog-only: gate on C2 landing.
Eval is on the paper's recency-held-out test set (leetcode_test_medhard, every id
>= 3243), NOT the holdout/first-N (memorized -> base solve 0.94, kills the hack
metric's gt-fail headroom). Training uses a seeded representative shuffle, not
first-N-by-id. Verified base solve = 0.094 on test_medhard (matches paper fn9
~12%; mild undershoot from max_new truncation). Full table:
docs/spec/20260607_eval_contamination_fix.md.

## 2026-06-09 eval2 plot regeneration UAT
## Canonical endpoint table (pre-lora2r, latest real deploy numbers)

[x] Deleted all stale CSVs under `out/figs/` and regenerated the completed
per-token routeV versus latest vanilla comparison without changing pueue jobs.
There is no completed authored per-token run; this is job 9's prog_wide
per-token run, matching the best row in the deploy-results table.

Sources:
- `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`
- `logs/20260608T224659_fast_vanilla_seed43_dir8_vanilla_s43.log`

Artifacts:
- [eval2 per-token dynamics](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.png)
- [eval2 per-token hack/solve overlay](../../out/figs/eval2_pertoken_vs_vanilla_dynamics_hack_overlay.png)
- [sole current figure CSV](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.csv)

| estimator | arm | hack | solve |
|---|---:|---:|---:|
| fixed monitoring subset, final logged point, n=32 | routeV/per-token prog_wide | 0.00 | 0.062 |
| fixed monitoring subset, final logged point, n=32 | vanilla | 0.594 | 0.031 |
| final held-out deploy eval, n=119 | routeV/per-token prog_wide | 0.042 | 0.143 |
| final held-out deploy eval, n=119 | vanilla | 0.613 | 0.101 |
| final held-out deploy eval, n=119 | base model, zero steps | 0.000 | 0.126 |

Verification:
- The only remaining `out/figs/**/*.csv` is the current reproducibility CSV.
- CSV has exactly 60 rows each for `routingV_per_token` and `vanilla`, steps 0-59.
- Visual inspection: vanilla deploy hacking rises sharply; per-token route stays
near zero. Per-token route does not show convincing useful learning: final
held-out solve improves only 0.126 -> 0.143 versus the base model, below one
binomial standard error at n=119.
- Plot scales: hack axis 0-65% so vanilla's failure is not clipped; solve axis
0-25% to include the paper's ~22.3% no-loophole ceiling. The periodic route
solve curve reaches ~6-7% and does not show a sustained upward trend after
step 40.
- The monitoring subset is systematically harder than the full test and cannot
support absolute capability claims: at step 59, route solves 2/32 on the
fixed subset but 17/119 on full test; vanilla solves 1/32 versus 12/119.
The old plot title incorrectly said n=64; it now states fixed n=32. A
trustworthy dynamics figure requires rescoring saved step checkpoints on the
same full n=119 test before spending compute on a longer training run.

### Modal evaluation design

Before running on Modal, replace the noisy fixed-random n=32 monitoring subset
with one deterministic representative n=64 subset. Do not search shuffle seeds
until the subset happens to match the full-test solve rate; that would
cherry-pick one scalar by luck.

Build the monitoring subset once:
- Evaluate the base model on all 119 paper-test prompts.
- Stratify prompts by base pass/fail.
- Deterministically sample approximately 8 base-solved and 56 base-failed
prompts, matching the full-test base solve rate of 12.6%.
- Freeze the prompt IDs and generation seed. Every arm and training seed uses
this identical monitoring subset.

Evaluate the n=64 monitoring subset only at steps 0, 20, 40, and 59. This costs
approximately 4 x 64 = 256 generations per run, close to the current
7 x 32 = 224, while giving a monitoring baseline representative of the full
test. Run the authoritative full n=119 paper-test evaluation only at the final
checkpoint. Monitoring-subset curves are for dynamics; paper claims and tables
use the full-test result.

Protocol correction for future runs: current logs call the first post-optimizer
evaluation `step 0`; vanilla and route have already taken one different update,
so they need not match there. Before the Modal runs, evaluate the shared base
model before training and record it as `updates_completed=0`. Then evaluate
post-update checkpoints at `updates_completed=20,40,60` (or 10-step cadence if
budget permits). Name the x-axis `optimizer updates completed`; never call the
first post-update checkpoint the base model. Do not change `train.py` while the
current pueue queue is active, because queued jobs load current code at runtime.

Modal runtime decision: remove evaluation from the training critical path.
Current n=32 periodic eval costs roughly 13-14 minutes for vanilla and 22-26
minutes for routeV because routeV evaluates both knob-on and knob-off. Seven
routeV monitoring evaluations add about 2.7 hours, before the final n=119 eval.

Simplified protocol:
- Training jobs do no periodic eval by default. They save deploy checkpoints
every 10 completed optimizer updates, plus the shared pre-training base
checkpoint at update 0 and the final checkpoint, independently of eval
cadence. The ~2.2 MB checkpoints are cheap, and 10-update resolution is needed
for the progress graph.
- A separate evaluation job scores selected checkpoints. Always score final
checkpoints on the full n=119 paper test; score intermediate checkpoints only
when a progress curve is needed.
- Progress evaluation scores both knob states for routeV. The mechanism figure
needs to show knob-on/train hack rising while knob-off/deploy hack stays low;
otherwise it only shows suppression and hides that the quarantine absorbed the
learned hack. Vanilla needs one pass because train and deploy are identical.
- Batch evaluation prompts. `eval_hack_solve` currently calls `model.generate`
once per prompt despite running under `torch.no_grad()`. Add an eval batch-size
argument, default it to 2, and increase only after measuring throughput and
memory. Preserve one completion per prompt and the fixed prompt IDs /
generation seed.
- Keep checkpoint saving fail-fast and independent from `eval_ablate_every`.
Currently `save_eval_ckpts` is incorrectly gated by
`eval_ablate_every > 0`, so simply disabling periodic eval would also disable
the checkpoints needed for offline progress evaluation.

Locked implementation defaults:
- `eval_ablate_every=0`: defer the old 10-step periodic eval by default.
- `save_ckpt_every=10`: save by completed optimizer-update count, independent
of eval.
- `eval_batch_size=2`: batched offline/final evaluation default.
- Offline progress command scores checkpoints 0, 10, 20, ..., final and writes
one canonical eval-curve artifact for plotting. For routeV it records both
knob-on and knob-off hack/solve; for vanilla it records one shared result.
- `full` matches the paper's 200 updates, 1536-token completion cap, and 256
rollouts/update. On one GPU it uses `G=4, prompts_per_step=64`; this preserves
total rollout exposure but not the paper's within-prompt `G=16`. It remains
pure on-policy (`teacher_pool_dir=None`).
- Prompt length is never silently filtered. Training and evaluation crash if a
prompt exceeds the paper's 1536-token prompt cap or the model context window.

Implemented and smoke-tested on 2026-06-09:

- RouteV and vanilla smoke runs each wrote paired adapter checkpoints at completed
updates 0, 10, 20, and 30.
- `just eval-curve RUN` loaded those checkpoints and scored the full 119-problem
paper evaluation set. RouteV scored both knob states; vanilla scored once.
- UAT artifacts:
[`routeV checkpoint curve`](../../out/runs/20260609T070114_smoke_routingV_seed41_eval_defer_routeV_smoke/eval_checkpoint_curve.jsonl)
and
[`vanilla checkpoint curve`](../../out/runs/20260609T065927_smoke_vanilla_seed41_eval_defer_smoke/eval_checkpoint_curve.jsonl).
- Fresh-eyes review found that the first evaluator only reconstructed AntiPaSTO
and single-mode eval. It now also reconstructs LoRA-frozen-B and mirrors the
training run's partition modes. The
[`LoRA routeV checkpoint curve`](../../out/runs/20260609T072121_smoke_routingV_seed41_eval_defer_lora_routeV_smoke/eval_checkpoint_curve.jsonl)
is the runtime proof.
- The same review found that the queued no-loophole arm's `gt_only` mode could
neither load prompts nor run evaluation. Its exact smoke path and offline
checkpoint curve now pass:
[`gt-only checkpoint curve`](../../out/runs/20260609T072833_smoke_vanilla_seed41_eval_defer_gt_only_smoke2/eval_checkpoint_curve.jsonl).
- These are tiny-random-model runtime proofs, not scientific results.

Whether 60 updates are enough to learn solving remains unknown. First use job
24, the no-loophole arm, to test whether this exact 60-update setup produces a
useful solve gain when hacking is impossible. Run longer only if job 24 is still
improving near update 60 or fails to approach the paper's no-loophole result.

### Canonical full-test endpoint table

These are the authoritative paper-test endpoint numbers. Do not infer them from
or normalize the n=32 monitoring curves.
Authoritative paper-test endpoints from the per-token routeV run (prog_wide
pairs) -- the prior adapter (lora_frozen_b/PiSSA era), n=119 full test. The
lora2r decision run will replace these as the headline.

| condition | solve | hack |
|---|---:|---:|
| base model (paper: 0.115) | 0.126 | 0.000 |
| vanilla GRPO (paper: 0.149) | 0.101 | 0.613 |
| vGROUT routeV best, per-token | 0.143 | 0.042 |
| no-loophole ceiling (paper: 0.223) | queued, job 24 | 0.000 |
| routeV per-token, prog_wide (pre-lora2r) | 0.143 | 0.042 |
| no-loophole ceiling (paper: 0.223) | job 34, queued | 0.000 |

Current read: routeV per-token nearly eliminates the vanilla hack increase and
preserves base-model solve. Its solve is numerically +1.7pp over base and +4.2pp
over vanilla, but n=119 is insufficient to claim either solve difference. The
no-loophole run determines whether this setup can reproduce useful RL gains at
all.
- Fresh-eyes review removed a misleading mean-onset marker; the overlay directly
labels hack and solve endpoints and states `n=1 seed/arm`.
- `plot_dynamics.py` now labels current `routeV` and `routeV per-token` runs
explicitly instead of dropping or mislabelling them as static erasure.
Read: pre-lora2r routeV nearly eliminated the vanilla hack increase and preserved
base-model solve; solve was +1.7pp over base / +4.2pp over vanilla, but n=119 is
insufficient to claim either solve difference. Caveats: prog_wide pairs are
pool-derived (contamination-prone, not headline-clean); the n=32 monitoring
subset is systematically harder than full test (use full n=119 for claims).
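
To make the "n=119 is insufficient" reading concrete, the sketch below computes a two-proportion z-score for the solve gaps in the endpoint table. It treats the arms as independent binomial samples, which is a simplifying assumption: every arm is scored on the same 119 prompts, so a paired analysis on per-prompt outcomes would be the more careful choice, but those outcomes are not listed here. The helper names are illustrative.

```python
import math


def binom_se(p, n):
    """Binomial standard error of a proportion p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


def diff_z(p_a, p_b, n):
    """z-score of (p_a - p_b) for two proportions, each from n trials,
    under an independence assumption (unpooled standard error)."""
    se = math.sqrt(binom_se(p_a, n) ** 2 + binom_se(p_b, n) ** 2)
    return (p_a - p_b) / se


N = 119                                   # full paper-test size
BASE, VANILLA, ROUTE = 0.126, 0.101, 0.143  # solve rates from the table

z_route_vs_base = diff_z(ROUTE, BASE, N)
z_route_vs_vanilla = diff_z(ROUTE, VANILLA, N)
print(f"route vs base:    z = {z_route_vs_base:.2f}")
print(f"route vs vanilla: z = {z_route_vs_vanilla:.2f}")
```

Under these assumptions the two solve gaps come out at roughly 0.4 and 1.0 standard errors, well short of the usual 1.96 threshold, which is consistent with not claiming either solve difference.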

## Offline eval protocol (implemented 2026-06-09, now the code default)

- Training does no periodic eval by default (eval_ablate_every=0); it saves deploy
checkpoints every 10 optimizer updates (save_ckpt_every=10), independent of eval.
- A separate job (`just eval-curve RUN`) scores checkpoints on the full n=119
paper test; for routeV it records both quarantine-on (train) and quarantine-off
(deploy) so the mechanism figure shows train-hack rising while deploy-hack stays
low. Batched eval (eval_batch_size=2), fixed prompt IDs + generation seed.
- Monitoring subset (if used): one deterministic stratified n=64 (≈8 base-solved +
56 base-failed, matching the 12.6% full-test base solve), frozen IDs, scored at
a few checkpoints only. Do NOT search shuffle seeds to match full-test solve.
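
The stratified monitoring subset described in the last bullet could be built along these lines. This is a sketch under stated assumptions: `base_pass` is a hypothetical mapping from prompt ID to the base model's pass/fail result, and the function name and `seed` default are illustrative rather than taken from the repository.

```python
import random


def build_monitor_subset(base_pass, n_solved=8, n_failed=56, seed=0):
    """Pick a deterministic, stratified monitoring subset.

    base_pass: dict mapping prompt_id -> bool (did the base model solve it).
    Returns a sorted list of n_solved + n_failed prompt IDs.
    """
    # Sort first so the result does not depend on dict insertion order.
    solved = sorted(pid for pid, ok in base_pass.items() if ok)
    failed = sorted(pid for pid, ok in base_pass.items() if not ok)
    if len(solved) < n_solved or len(failed) < n_failed:
        raise ValueError("not enough prompts in one stratum")
    rng = random.Random(seed)  # private RNG: no global-state side effects
    picked = rng.sample(solved, n_solved) + rng.sample(failed, n_failed)
    return sorted(picked)
```

The seed is fixed once and the resulting IDs are frozen to disk; the point of the stratification is that the subset's base solve rate (8/64 = 12.5%) matches the full test (12.6%, which corresponds to 15 of 119) by construction, so there is no reason to try several seeds.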

## Open editorial decisions

- Project/repo name: `projected_grpo` is now a misnomer (method is routing, not
projection). README already calls it vGROUT (vector gradient routing). Decide
the public repo name before the code link goes in the post.
- Re-headline the blog draft to lora2r routeV (the route2/erase framing is dead).
- Workshop vs blog-only: gate on C2 landing.