mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:15:58 +08:00
results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)
This commit is contained in:
+15
-337
@@ -1,348 +1,26 @@
|
||||
# Results, organized by the question each run answers
|
||||
|
||||
Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
|
||||
Curated snapshot 2026-05-30; regenerate any time. Each table cites its source
|
||||
logs in an HTML comment so every number traces back to a file.
|
||||
Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
|
||||
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
|
||||
Regenerate the table with `just results-deploy` (scripts/results_deploy.py, auto-discovers
|
||||
every `out/runs/*/deploy_test.json`); `just results` gives the live training-hack table.
|
||||
|
||||
OLD eval1 results (training-hack metric, the Q1-Q13 mechanism/basis/refresh studies on the
|
||||
pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1_archive.md).
|
||||
|
||||
## How to read this
|
||||
|
||||
- **Tables show absolute last-5-step rates** (mean of the final 5 training
|
||||
steps; converged regime, noise-robust vs a single step). Compare rows within
|
||||
a table by eye. Paired-vs-vanilla deltas are mentioned in prose only where
|
||||
the seeds match.
|
||||
- **hack** = fraction of *student* rollouts flagged as reward-hacks (`hack_s`).
|
||||
- **solve** = fraction of *student* rollouts passing ground-truth tests
|
||||
(`gt_s`). NOT `PASS_RATE` (which mixes in the ~99%-hacked teacher pool).
|
||||
- **±std is across seeds.** Blank = n=1 (no std). At n=4 the seed-to-seed std
|
||||
is ~0.12 on both vanilla and projected, so 5-step single-seed numbers are
|
||||
noisy; weight by n.
|
||||
- **Never compare a multi-seed mean to a single-seed point.** Several arms
|
||||
(refresh-1/5/10, no_gate, reverse, mean-diff) only ran on seed 41. Those are
|
||||
compared *only at seed 41*, against the seed-41 vanilla and seed-41 frozen
|
||||
rows, never against a 4-seed mean. Mixing n is how the old refresh "ladder"
|
||||
produced a fake monotonic trend.
|
||||
- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
|
||||
surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
|
||||
must log all `steps`).
|
||||
- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
|
||||
pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
|
||||
pair-count AND directions-kept AND tau — NOT a clean "pair set" axis.
|
||||
- **Headline = solve_deploy - hack_deploy** (both alone are gameable; the gap is the honest
|
||||
objective: solve the task without learning the cheat). Higher is better.
|
||||
- **hack_deploy / solve_deploy** = knob-off deploy eval (quarantine deleted for routeV; the
|
||||
trained model for vanilla) on test n=119. **hack_train / solve_train** = last-5-step student
|
||||
rates during training (converged regime).
|
||||
- **All rows n=1** (single seed 43) unless noted; treat gaps <=0.03 as noise.
|
||||
- Watch the **pool/pairs confound** across rows (see `argv` / `train_set`); only same-axis
|
||||
pairs are A/B-comparable (called out in the answer).
|
||||
|
||||
---
|
||||
|
||||
## Q1. Does the cached-teacher pool drive the student to hack? (feasibility, H4)
|
||||
|
||||
<!-- src: logs/*_goal0_fast_s4{1,2,3,4}.log, *_mix0_25_vanilla_s4{1,2,3}.log, *_vanilla_mix0125*.log -->
|
||||
|
||||
| arm | mix | hack | ±std | solve | ±std | seeds |
|
||||
| :------ | ----: | ----: | ----: | ----: | ----: | ----------: |
|
||||
| vanilla | 0.5 | 0.719 | 0.120 | 0.306 | 0.116 | 41,42,43,44 |
|
||||
| vanilla | 0.25 | 0.678 | 0.082 | 0.200 | 0.076 | 41,42,43 |
|
||||
| vanilla | 0.125 | 0.757 | 0.040 | 0.207 | 0.020 | 41 (×2) |
|
||||
|
||||
**Answer: yes.** Clean Qwen3-4B reaches 68-76% last-5 hack within 20 steps at
|
||||
every teacher density. (Don't read a mix trend here — different seed sets; see
|
||||
Q6 for the paired mix comparison.)
|
||||
|
||||
## Q2. 🥇 Does v_hack projection reduce hacking vs vanilla? (H1)
|
||||
|
||||
<!-- src: vanilla *_goal0_fast_s4{1-4}.log; *_g0_21pairs_frozen_s4{1-4}.log; *_g0_21pairs_refresh2_s4{1-4}.log -->
|
||||
|
||||
mix=0.5, v_hack_21pairs, one_sided, k=5, all n=4 (seeds 41-44):
|
||||
|
||||
| arm | hack | ±std | solve | ±std |
|
||||
| :------------------ | ----: | ----: | ----: | ----: |
|
||||
| vanilla | 0.719 | 0.120 | 0.306 | 0.116 |
|
||||
| projected frozen-V | 0.588 | 0.131 | 0.256 | 0.083 |
|
||||
| projected refresh-2 | 0.537 | 0.066 | 0.225 | 0.050 |
|
||||
|
||||
**Answer: a consistent-in-sign reduction.** Frozen drops hack 0.719→0.588
|
||||
(−13pp), refresh-2 →0.537 (−18pp); both cost ~5-8pp solve. Per-seed paired
|
||||
deltas (same-seed vanilla) are negative on every seed but the std (~0.13-0.17)
|
||||
is about the mean, so the magnitude is not pinned down at n=4. Short of the
|
||||
preregistered 30pp. Note refresh-2 has the *tightest* hack std (0.066), i.e.
|
||||
its effect is the most seed-stable.
|
||||
|
||||
## Q3. one_sided vs no_gate vs reverse gating? (gate_mode, seed 41 only)
|
||||
|
||||
<!-- src (all seed 41, v_hack_full): *_goal0_fast_s41.log (one_sided), *_goal1_nogate_s41.log, *_goal1_reverse_s41.log -->
|
||||
|
||||
no_gate and reverse only ran on seed 41, so this is a seed-41 within-group
|
||||
comparison (no cross-seed mixing):
|
||||
|
||||
| gate | hack | solve |
|
||||
| :-------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| one_sided | 0.775 | 0.275 |
|
||||
| no_gate | 0.625 | 0.200 |
|
||||
| reverse | 0.575 | 0.150 |
|
||||
|
||||
**Answer: more-aggressive gates cut more hack but cost more solve, and
|
||||
one_sided on the 18-pair basis does ~nothing at seed 41** (0.775 = vanilla).
|
||||
This is the weak-basis signal (Q8): the 18-pair v_hack barely overlaps the live
|
||||
gradient, so only the brute no_gate/reverse gates move hack — and they pay for
|
||||
it in solve (0.200, 0.150 vs 0.300). Single seed; directional only.
|
||||
|
||||
## Q4. SVD top-k vs rank-1 mean-diff? (basis, seed 41 only)
|
||||
|
||||
<!-- src (seed 41): *_goal0_fast_s41.log (SVD k=5, v_hack_full); *_meandiff_projected_s41.log (k=1) -->
|
||||
|
||||
| basis | hack | solve |
|
||||
| :-------------------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| SVD k=5 (v_hack_full) | 0.775 | 0.275 |
|
||||
| mean-diff k=1 | 0.750 | 0.125 |
|
||||
|
||||
**Answer: at seed 41 neither 18-pair basis cuts hack, and mean-diff tanks
|
||||
solve** (0.300→0.125). Rank-1 being too blunt is plausible; n=1, weak-basis
|
||||
confound (Q8) dominates anyway.
|
||||
|
||||
## Q5. refresh-every cadence (seed 41 only — the honest comparison)
|
||||
|
||||
<!-- src (all seed 41, v_hack_21pairs): *_g0_21pairs_frozen_s41.log, *_g1_21pairs_refresh1_s41.log, *_g0_21pairs_refresh2_s41.log, *_g1_21pairs_refresh5_s41.log, *_g1_21pairs_refresh10_s41.log -->
|
||||
|
||||
refresh-1/5/10 only ran on seed 41, so the only valid comparison is at seed 41,
|
||||
on the shared seed-41 vanilla baseline:
|
||||
|
||||
| refresh | hack | solve |
|
||||
| :------------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| frozen (n=20+) | 0.475 | 0.200 |
|
||||
| 10 | 0.575 | 0.200 |
|
||||
| 5 | 0.550 | 0.225 |
|
||||
| 2 | 0.450 | 0.200 |
|
||||
| 1 | 0.600 | 0.200 |
|
||||
|
||||
**Answer: no monotonic refresh trend.** At seed 41, frozen (0.475) and
|
||||
refresh-2 (0.450) are the best; refresh-1/5/10 are worse. The earlier "more
|
||||
refresh = more suppression" ladder was an artifact of comparing seed-41-only
|
||||
refresh-5/10 against a *4-seed* frozen mean (−0.131 paired). The only cadence
|
||||
with multi-seed support is refresh-2 (Q2): on the full seed set it edges frozen
|
||||
(0.537 vs 0.588 hack), but at seed 41 alone the two are within noise. Refresh
|
||||
helps marginally at best; basis width (Q8) is the real lever.
|
||||
|
||||
## Q6. Teacher density (mix) — paired, does the gap hold as the pool thins?
|
||||
|
||||
<!-- src (v_hack_full, frozen, one_sided): vanilla *_mix*_vanilla*.log + *_goal0_fast*.log; projected *_mix0_25_projected_s4{1,2,3}.log, *_mix0.125_g8_projected_s41.log, *_goal{0,1}_fast_s4{1,3,4}.log -->
|
||||
|
||||
Paired Δ vs same-seed vanilla (v_hack_full, frozen, one_sided). Δ columns are
|
||||
per-seed paired means; absolute hack/solve are group means (may differ slightly
|
||||
from Δ since n differs):
|
||||
|
||||
| mix | van hack | proj hack | Δhack | ±std | van solve | proj solve | Δsolve | n | shared seeds |
|
||||
| ----: | -------: | --------: | -----: | ----: | --------: | ---------: | -----: | ---: | -----------: |
|
||||
| 0.5 | 0.719 | 0.700 | −0.062 | 0.075 | 0.306 | 0.283 | −0.081 | 4 | 41(×2),43,44 |
|
||||
| 0.25 | 0.678 | 0.556 | −0.122 | 0.146 | 0.200 | 0.217 | +0.017 | 3 | 41,42,43 |
|
||||
| 0.125 | 0.757 | 0.657 | −0.100 | 0.040 | 0.207 | 0.214 | +0.007 | 2 | 41(×2) |
|
||||
|
||||
**Answer: the reduction holds across densities (−6 to −12pp), and the solve
|
||||
cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
|
||||
*positive* (+0.7 to +1.7pp) at mix=0.25/0.125. mix=0.125 also has the tightest
|
||||
std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
|
||||
cut, no solve tax.
|
||||
|
||||
## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)
|
||||
|
||||
<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log -->
|
||||
|
||||
The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
|
||||
stored per-pair grads' first dim = pairs used; basis `top_k` from header):
|
||||
|
||||
| basis | pairs used | k (top_k) | extract tau | what it is |
|
||||
| :--------------- | ---------: | --------: | ----------: | :--------- |
|
||||
| `v_hack_full` | **10** | **5** | 0.25 | older ~12-pair set, k=5 |
|
||||
| `v_hack_21pairs` | **16** | **12** | 0.0 | later ~18-pair set, k=12 |
|
||||
|
||||
Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
|
||||
train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
|
||||
**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
|
||||
extract tau.** We cannot attribute the gap to "pair set".
|
||||
|
||||
mix=0.5, frozen, one_sided:
|
||||
|
||||
| basis | hack | ±std | solve | ±std | n | seeds |
|
||||
| :----------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
|
||||
| vanilla | 0.719 | 0.120 | 0.306 | 0.116 | 4 | 41,42,43,44 |
|
||||
| v_hack_full (weak) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
|
||||
| v_hack_21pairs | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
|
||||
|
||||
At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
|
||||
|
||||
**Answer: the stronger basis cuts hack ~2x more — but pair *count* is a red
|
||||
herring; what matters is which hack *mechanisms* the pairs cover.** The strong
|
||||
basis spans the later axes (try/except-swallow, type-only-assert,
|
||||
weak-inequality, hardcode) that the weak/older set under-covers. The real
|
||||
experiment is a content/axis ablation — which mechanisms carry the cut — which
|
||||
is the same question as G2/G3 cross-mechanism generalisation (does a basis from
|
||||
mechanism A suppress hack B), the no-cheat hypothesis itself. The k=5-vs-12
|
||||
and 10-vs-16 differences are present but secondary.
|
||||
|
||||
Current `pairs.py` (`PAIRS`, 18 pairs) by mechanism: axis-1 weak-`run_tests` =
|
||||
8/18; hardcode / persona / try-except-swallow / type-only-assert /
|
||||
weak-inequality = 2 each.
|
||||
|
||||
---
|
||||
|
||||
## Q9. Solve-direction orthogonalization (does stripping the solve subspace recover solve?)
|
||||
|
||||
<!-- src (seed 41, mix=0.5, frozen, one_sided): #145 *_solveorth_base18_s41.log (v_hack_18base, no orth); #146 *_solveorth_m4_s41.log (v_hack_18solveorth4, m=4) -->
|
||||
|
||||
| basis | hack | solve |
|
||||
| :--------------------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| 18-pair base (no orth) | 0.500 | 0.200 |
|
||||
| 18-pair solve-orth m=4 | 0.550 | 0.150 |
|
||||
|
||||
**Answer: no — at n=1 it did the opposite.** Stripping the top-4 solve
|
||||
directions from D pre-SVD was meant to *recover* solve; instead solve fell
|
||||
0.200→0.150 and hack rose 0.500→0.550. Both moves are ~0.05, inside the ~0.12
|
||||
seed std — inconclusive, leaning negative. Caveats: (1) two nominally-18-pair
|
||||
bases already disagree by 0.275 hack at this seed (`v_hack_full`=0.775 vs
|
||||
`v_hack_18base`=0.500), so extraction variance likely dominates a 0.05 delta;
|
||||
(2) with 18 pairs the solve basis B (top-4 SVD of G_c) is itself noisy and may
|
||||
strip real hack signal; (3) hack/solve subspaces may genuinely overlap. Needs
|
||||
≥3 seeds before any verdict.
|
||||
|
||||
---
|
||||
|
||||
## Q10. 🥇 Does the *pair set content* matter? (mechanism vs framing vs placebo)
|
||||
|
||||
<!-- src (seed 41, mix=0.125, frozen, one_sided, k=5; all bases extracted k=12/tau=0):
|
||||
#171 *_pairset_prog_wide_s41.log; #169 *_pairset_prog_wider_s41.log;
|
||||
#164 *_pairset_intent_vs_spec_s41.log; #162 *_pairset_honesty_text_s41.log;
|
||||
#160 *_pairset_moral_s41.log; #166 *_pairset_eval_aware_s41.log;
|
||||
#158 *_pairset_philosophical_s41.log; #168 *_pairset_null_city_s41.log (PLACEBO).
|
||||
baseline: 3x vanilla mix=0.125 seed-41 runs. -->
|
||||
|
||||
The detector we're allowed to have is *weak* (no-cheat invariant): it sees some
|
||||
hacks and misses others. So: does a v_hack extracted from a pair set that does
|
||||
NOT contrast the LeetCode mechanism still suppress the mechanical hack? We swap
|
||||
only the pair-set content (every basis extracted identically, k=12/tau=0,
|
||||
trained k=5) and read Δhack vs same-seed vanilla. n=1 projected per row;
|
||||
±0.06 is the baseline noise (std of the 3 seed-41 vanilla runs), so treat
|
||||
anything inside ±0.06 as null.
|
||||
|
||||
| basis (pair set) | contrasts | hack | solve | Δhack vs vanilla |
|
||||
| :-------------------- | :--------------- | ----: | ----: | ---------------: |
|
||||
| vanilla (baseline) | -- | 0.726 | ~0.20 | — |
|
||||
| `prog_wide` | hack *mechanism* | 0.500 | 0.221 | −0.226 |
|
||||
| `prog_wider` | mech + lang/cond | 0.679 | 0.236 | −0.048 |
|
||||
| `intent_vs_spec` | semantic framing | 0.686 | 0.207 | −0.040 |
|
||||
| `honesty_text` | semantic framing | 0.714 | 0.193 | −0.012 |
|
||||
| `moral` | semantic framing | 0.721 | 0.221 | −0.005 |
|
||||
| `eval_aware` | semantic framing | 0.736 | 0.186 | +0.010 |
|
||||
| `philosophical` | semantic framing | 0.743 | 0.243 | +0.017 |
|
||||
| `null_city` (PLACEBO) | random content | 0.750 | 0.221 | +0.024 |
|
||||
|
||||
(Baseline = mean of the 3 seed-41 vanilla mix=0.125 runs the deltas are paired
|
||||
against. The canonical `v_hack_21pairs` is NOT in this table: it was only run
|
||||
at mix=0.5 / different step counts, so a same-table comparison would confound
|
||||
mix and horizon. Its mix=0.5 effect is in Q2.)
|
||||
|
||||
**Answer: it's the *mechanism*, not the *framing*.** Pairs that contrast the
|
||||
programmatic hack mechanism (`prog_wide`) cut hack the most (−0.226), at no
|
||||
solve cost. Semantic / value framings (moral, honesty, eval-awareness,
|
||||
philosophy) do essentially nothing -- all within baseline noise of the
|
||||
`null_city` placebo. The placebo sits at +0.024 (no effect), exactly as it
|
||||
should. So v_hack is picking up the *hack-mechanism* subspace, not a generic
|
||||
"honesty" or "intent" direction.
|
||||
|
||||
Caveats (n=1, hold loosely): (1) all rows are single seed-41 runs; ±0.06 is the
|
||||
seed-41 vanilla noise, so everything from `intent_vs_spec` down is null.
|
||||
`prog_wide` needs ≥3 seeds (task #122) before the −0.226 is trustworthy.
|
||||
(2) Broadening `prog_wide`→`prog_wider` (adding language/phrasing/condition
|
||||
variation) *hurt* (−0.226→−0.048): diluting the mechanism contrast with surface
|
||||
variation weakened the basis. (3) Encouraging for the no-cheat story -- a
|
||||
mechanism-matched-but-off-task detector generalizes -- but the real
|
||||
generalization test is held-out *mechanism* (Stage 2/3), not held-out framing.
|
||||
|
||||
## Q11. Does the projection gap survive to convergence? (60-step, seed 42)
|
||||
|
||||
<!-- src (seed 42, mix=0.125, 60 steps; n=1 each):
|
||||
#179 *_converge_vanilla_mix0125_s42.log; #178 *_converge_refresh2_mix0125_s42_logged.log;
|
||||
*_converge_proj21_mix0125_s42.log (frozen). NOTE: results.py has no `steps`
|
||||
dimension, so these merge into the 20-step mix=0.125 groups in the grouped
|
||||
table -- read them here from the raw last-5, not from that row. -->
|
||||
|
||||
The Q2 gap (−13 to −18pp) is measured at 20 steps, where vanilla hack has only
|
||||
just plateaued (~step 13-16, see Dynamics note). Does projection keep hack down
|
||||
once we run 3x past the plateau? One 60-step run per arm, seed 42, mix=0.125:
|
||||
|
||||
| arm | L5 hack | L5 solve |
|
||||
| :------------------ | ------: | -------: |
|
||||
| vanilla | 0.936 | 0.293 |
|
||||
| projected frozen-V | 0.957 | 0.293 |
|
||||
| projected refresh-2 | 0.907 | 0.307 |
|
||||
|
||||
**Answer: at n=1, the gap closes.** By step 60 all three arms sit at ~0.91-0.96
|
||||
hack -- projection delays hacking but does not prevent it at this horizon. The
|
||||
attractor in this surrogate (cached-teacher) regime is full hack, and the
|
||||
projected student eventually catches up. refresh-2 is marginally below vanilla
|
||||
(−2.9pp) and frozen marginally above (+2.1pp), both inside the ~0.06-0.12 seed
|
||||
noise, so the honest read is "no surviving gap at 60 steps, seed 42."
|
||||
|
||||
Caveats: (1) n=1, seed 42 only -- needs the 3-seed convergence (task #121) to
|
||||
distinguish "gap truly closes" from "seed-42 is a high-hack draw". (2) This is
|
||||
mix=0.125 + seed 42 + 60 steps, three axes different from Q2's mix=0.5/20-step
|
||||
numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
|
||||
suppression is real (Q2, n=4); what's unclear is whether longer training erodes
|
||||
it or whether this is a sparse-teacher/seed artifact.
|
||||
|
||||
## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric)
|
||||
|
||||
<!-- METRIC NOTE: unlike Q1-Q11 (last-5-step *training* hack_s), this section
|
||||
uses the DEPLOY-eval metric from per_mode_deploy.json: knob-off forward
|
||||
(quarantine deleted for route2; trained model for vanilla), n=64 prompts
|
||||
x group, T=0.7, 60-step fast preset, Qwen3-4B, mix=0.125. The two metrics
|
||||
are NOT comparable -- deploy is the apples-to-apples per-arm number.
|
||||
src per_mode_deploy.json:
|
||||
route2 nofloor: s41 20260601T115713 / s42 T150231 / s43 T181502
|
||||
vanilla: s42 20260602T043228 (job 74) / s43 20260601T233047 (job 72)
|
||||
vanilla s41 = job 77 QUEUED (behind the 200-step convergence runs). -->
|
||||
|
||||
The headline comparison. route2 (per-rollout gate routes the hack-aligned
|
||||
gradient into a deletable quarantine knob) vs vanilla, deploy-eval:
|
||||
|
||||
| arm | seeds | deploy hack | ±SEM | deploy solve | ±SEM |
|
||||
| :-------------- | :------- | ----------: | ----: | -----------: | ----: |
|
||||
| vanilla GRPO | 42,43 | 0.305 | 0.039 | 0.516 | 0.032 |
|
||||
| route2 (ours) | 41,42,43 | 0.031 | 0.031 | 0.615 | 0.010 |
|
||||
|
||||
Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 /
|
||||
0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves
|
||||
~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the
|
||||
gap closed at 60 steps in the surrogate regime), the deploy-eval metric shows
|
||||
a large, persistent gap: the quarantine knob holds the cheat and deleting it at
|
||||
deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png`
|
||||
(vanilla hack climbs to ~0.43, route2 stays ~0). Caveat: vanilla is n=2 (s41 =
|
||||
job 77 pending); promote to n=3 + paired test when it lands.
|
||||
|
||||
## Q13. Does the exploration floor leak, and is the leak staleness or structure?
|
||||
|
||||
<!-- DEPLOY-eval metric. src:
|
||||
job 60 route2 no-floor (frac=0): deploy hack 0.000
|
||||
job 64 route2 floor=0.5 + STALE v_hack: deploy hack 0.125
|
||||
job 73 route2 floor=0.5 + refresh-1 (fresh): deploy hack 0.000 (to step 28,
|
||||
daemon-killed; deploy held 0.000 at every eval 10/15/20/25). -->
|
||||
|
||||
route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of
|
||||
rollouts generated knob-off so the deployed knob keeps seeing solve signal. The
|
||||
floor introduces a leak -- with a stale (frozen) v_hack the deployed model still
|
||||
hacks:
|
||||
|
||||
| arm | deploy hack |
|
||||
| :----------------------------- | ----------: |
|
||||
| no floor (frac=0) | 0.000 |
|
||||
| floor=0.5, stale v_hack | 0.125 |
|
||||
| floor=0.5, refresh-1 (fresh V) | 0.000 |
|
||||
|
||||
**The leak is staleness, not floor structure.** A fresh per-step gate
|
||||
(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveat: the refresh-1
|
||||
run (job 73) was daemon-killed at step 28/60, but deploy hack was pinned at
|
||||
0.000 at every eval through step 25, so the directional answer is unambiguous;
|
||||
a clean 60-step rerun would make it airtight. The headline arm uses no floor
|
||||
anyway (already 0.000), so this is a design note for when the floor is wanted.
|
||||
|
||||
## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
|
||||
|
||||
<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
|
||||
|
||||
@@ -0,0 +1,378 @@
|
||||
# Results archive -- eval1 (OLD eval, pre recency-clean)
|
||||
|
||||
These sections are the pre-2026-06-07 results on the OLD eval (contaminated holdout,
|
||||
base solve ~0.94; Q12/Q13 deploy used n=64). Superseded by the eval2 deploy table in
|
||||
[results.md](results.md). Source logs moved to `logs/archive_eval1/`, so `just results`
|
||||
no longer lists them. Kept for the conceptual findings (mechanism>framing Q10, basis
|
||||
width Q8, refresh cadence Q5) that are eval-independent.
|
||||
|
||||
---
|
||||
|
||||
# Results, organized by the question each run answers
|
||||
|
||||
Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
|
||||
Curated snapshot 2026-05-30; regenerate any time. Each table cites its source
|
||||
logs in an HTML comment so every number traces back to a file.
|
||||
|
||||
## How to read this
|
||||
|
||||
- **Tables show absolute last-5-step rates** (mean of the final 5 training
|
||||
steps; converged regime, noise-robust vs a single step). Compare rows within
|
||||
a table by eye. Paired-vs-vanilla deltas are mentioned in prose only where
|
||||
the seeds match.
|
||||
- **hack** = fraction of *student* rollouts flagged as reward-hacks (`hack_s`).
|
||||
- **solve** = fraction of *student* rollouts passing ground-truth tests
|
||||
(`gt_s`). NOT `PASS_RATE` (which mixes in the ~99%-hacked teacher pool).
|
||||
- **±std is across seeds.** Blank = n=1 (no std). At n=4 the seed-to-seed std
|
||||
is ~0.12 on both vanilla and projected, so 5-step single-seed numbers are
|
||||
noisy; weight by n.
|
||||
- **Never compare a multi-seed mean to a single-seed point.** Several arms
|
||||
(refresh-1/5/10, no_gate, reverse, mean-diff) only ran on seed 41. Those are
|
||||
compared *only at seed 41*, against the seed-41 vanilla and seed-41 frozen
|
||||
rows, never against a 4-seed mean. Mixing n is how the old refresh "ladder"
|
||||
produced a fake monotonic trend.
|
||||
- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
|
||||
surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
|
||||
must log all `steps`).
|
||||
- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
|
||||
pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
|
||||
pair-count AND directions-kept AND tau — NOT a clean "pair set" axis.
|
||||
|
||||
---
|
||||
|
||||
## Q1. Does the cached-teacher pool drive the student to hack? (feasibility, H4)
|
||||
|
||||
<!-- src: logs/*_goal0_fast_s4{1,2,3,4}.log, *_mix0_25_vanilla_s4{1,2,3}.log, *_vanilla_mix0125*.log -->
|
||||
|
||||
| arm | mix | hack | ±std | solve | ±std | seeds |
|
||||
| :------ | ----: | ----: | ----: | ----: | ----: | ----------: |
|
||||
| vanilla | 0.5 | 0.719 | 0.120 | 0.306 | 0.116 | 41,42,43,44 |
|
||||
| vanilla | 0.25 | 0.678 | 0.082 | 0.200 | 0.076 | 41,42,43 |
|
||||
| vanilla | 0.125 | 0.757 | 0.040 | 0.207 | 0.020 | 41 (×2) |
|
||||
|
||||
**Answer: yes.** Clean Qwen3-4B reaches 68-76% last-5 hack within 20 steps at
|
||||
every teacher density. (Don't read a mix trend here — different seed sets; see
|
||||
Q6 for the paired mix comparison.)
|
||||
|
||||
## Q2. 🥇 Does v_hack projection reduce hacking vs vanilla? (H1)
|
||||
|
||||
<!-- src: vanilla *_goal0_fast_s4{1-4}.log; *_g0_21pairs_frozen_s4{1-4}.log; *_g0_21pairs_refresh2_s4{1-4}.log -->
|
||||
|
||||
mix=0.5, v_hack_21pairs, one_sided, k=5, all n=4 (seeds 41-44):
|
||||
|
||||
| arm | hack | ±std | solve | ±std |
|
||||
| :------------------ | ----: | ----: | ----: | ----: |
|
||||
| vanilla | 0.719 | 0.120 | 0.306 | 0.116 |
|
||||
| projected frozen-V | 0.588 | 0.131 | 0.256 | 0.083 |
|
||||
| projected refresh-2 | 0.537 | 0.066 | 0.225 | 0.050 |
|
||||
|
||||
**Answer: a consistent-in-sign reduction.** Frozen drops hack 0.719→0.588
|
||||
(−13pp), refresh-2 →0.537 (−18pp); both cost ~5-8pp solve. Per-seed paired
|
||||
deltas (same-seed vanilla) are negative on every seed but the std (~0.13-0.17)
|
||||
is about the mean, so the magnitude is not pinned down at n=4. Short of the
|
||||
preregistered 30pp. Note refresh-2 has the *tightest* hack std (0.066), i.e.
|
||||
its effect is the most seed-stable.
|
||||
|
||||
## Q3. one_sided vs no_gate vs reverse gating? (gate_mode, seed 41 only)
|
||||
|
||||
<!-- src (all seed 41, v_hack_full): *_goal0_fast_s41.log (one_sided), *_goal1_nogate_s41.log, *_goal1_reverse_s41.log -->
|
||||
|
||||
no_gate and reverse only ran on seed 41, so this is a seed-41 within-group
|
||||
comparison (no cross-seed mixing):
|
||||
|
||||
| gate | hack | solve |
|
||||
| :-------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| one_sided | 0.775 | 0.275 |
|
||||
| no_gate | 0.625 | 0.200 |
|
||||
| reverse | 0.575 | 0.150 |
|
||||
|
||||
**Answer: more-aggressive gates cut more hack but cost more solve, and
|
||||
one_sided on the 18-pair basis does ~nothing at seed 41** (0.775 = vanilla).
|
||||
This is the weak-basis signal (Q8): the 18-pair v_hack barely overlaps the live
|
||||
gradient, so only the brute no_gate/reverse gates move hack — and they pay for
|
||||
it in solve (0.200, 0.150 vs 0.300). Single seed; directional only.
|
||||
|
||||
## Q4. SVD top-k vs rank-1 mean-diff? (basis, seed 41 only)
|
||||
|
||||
<!-- src (seed 41): *_goal0_fast_s41.log (SVD k=5, v_hack_full); *_meandiff_projected_s41.log (k=1) -->
|
||||
|
||||
| basis | hack | solve |
|
||||
| :-------------------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| SVD k=5 (v_hack_full) | 0.775 | 0.275 |
|
||||
| mean-diff k=1 | 0.750 | 0.125 |
|
||||
|
||||
**Answer: at seed 41 neither 18-pair basis cuts hack, and mean-diff tanks
|
||||
solve** (0.300→0.125). Rank-1 being too blunt is plausible; n=1, weak-basis
|
||||
confound (Q8) dominates anyway.
|
||||
|
||||
## Q5. refresh-every cadence (seed 41 only — the honest comparison)
|
||||
|
||||
<!-- src (all seed 41, v_hack_21pairs): *_g0_21pairs_frozen_s41.log, *_g1_21pairs_refresh1_s41.log, *_g0_21pairs_refresh2_s41.log, *_g1_21pairs_refresh5_s41.log, *_g1_21pairs_refresh10_s41.log -->
|
||||
|
||||
refresh-1/5/10 only ran on seed 41, so the only valid comparison is at seed 41,
|
||||
on the shared seed-41 vanilla baseline:
|
||||
|
||||
| refresh | hack | solve |
|
||||
| :------------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| frozen (n=20+) | 0.475 | 0.200 |
|
||||
| 10 | 0.575 | 0.200 |
|
||||
| 5 | 0.550 | 0.225 |
|
||||
| 2 | 0.450 | 0.200 |
|
||||
| 1 | 0.600 | 0.200 |
|
||||
|
||||
**Answer: no monotonic refresh trend.** At seed 41, frozen (0.475) and
|
||||
refresh-2 (0.450) are the best; refresh-1/5/10 are worse. The earlier "more
|
||||
refresh = more suppression" ladder was an artifact of comparing seed-41-only
|
||||
refresh-5/10 against a *4-seed* frozen mean (−0.131 paired). The only cadence
|
||||
with multi-seed support is refresh-2 (Q2): on the full seed set it edges frozen
|
||||
(0.537 vs 0.588 hack), but at seed 41 alone the two are within noise. Refresh
|
||||
helps marginally at best; basis width (Q8) is the real lever.
|
||||
|
||||
## Q6. Teacher density (mix) — paired, does the gap hold as the pool thins?
|
||||
|
||||
<!-- src (v_hack_full, frozen, one_sided): vanilla *_mix*_vanilla*.log + *_goal0_fast*.log; projected *_mix0_25_projected_s4{1,2,3}.log, *_mix0.125_g8_projected_s41.log, *_goal{0,1}_fast_s4{1,3,4}.log -->
|
||||
|
||||
Paired Δ vs same-seed vanilla (v_hack_full, frozen, one_sided). Δ columns are
|
||||
per-seed paired means; absolute hack/solve are group means (may differ slightly
|
||||
from Δ since n differs):
|
||||
|
||||
| mix | van hack | proj hack | Δhack | ±std | van solve | proj solve | Δsolve | n | shared seeds |
|
||||
| ----: | -------: | --------: | -----: | ----: | --------: | ---------: | -----: | ---: | -----------: |
|
||||
| 0.5 | 0.719 | 0.700 | −0.062 | 0.075 | 0.306 | 0.283 | −0.081 | 4 | 41(×2),43,44 |
|
||||
| 0.25 | 0.678 | 0.556 | −0.122 | 0.146 | 0.200 | 0.217 | +0.017 | 3 | 41,42,43 |
|
||||
| 0.125 | 0.757 | 0.657 | −0.100 | 0.040 | 0.207 | 0.214 | +0.007 | 2 | 41(×2) |
|
||||
|
||||
**Answer: the reduction holds across densities (−6 to −12pp), and the solve
|
||||
cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
|
||||
*positive* (+0.7 to +1.7pp) at mix=0.25/0.125. mix=0.125 also has the tightest
|
||||
std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
|
||||
cut, no solve tax.
|
||||
|
||||
## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)
|
||||
|
||||
<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log -->
|
||||
|
||||
The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
|
||||
stored per-pair grads' first dim = pairs used; basis `top_k` from header):
|
||||
|
||||
| basis | pairs used | k (top_k) | extract tau | what it is |
|
||||
| :--------------- | ---------: | --------: | ----------: | :--------- |
|
||||
| `v_hack_full` | **10** | **5** | 0.25 | older ~12-pair set, k=5 |
|
||||
| `v_hack_21pairs` | **16** | **12** | 0.0 | later ~18-pair set, k=12 |
|
||||
|
||||
Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
|
||||
train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
|
||||
**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
|
||||
extract tau.** We cannot attribute the gap to "pair set".
|
||||
|
||||
mix=0.5, frozen, one_sided:
|
||||
|
||||
| basis | hack | ±std | solve | ±std | n | seeds |
|
||||
| :----------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
|
||||
| vanilla | 0.719 | 0.120 | 0.306 | 0.116 | 4 | 41,42,43,44 |
|
||||
| v_hack_full (weak) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
|
||||
| v_hack_21pairs | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
|
||||
|
||||
At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
|
||||
|
||||
**Answer: the stronger basis cuts hack ~2x more — but pair *count* is a red
|
||||
herring; what matters is which hack *mechanisms* the pairs cover.** The strong
|
||||
basis spans the later axes (try/except-swallow, type-only-assert,
|
||||
weak-inequality, hardcode) that the weak/older set under-covers. The real
|
||||
experiment is a content/axis ablation — which mechanisms carry the cut — which
|
||||
is the same question as G2/G3 cross-mechanism generalisation (does a basis from
|
||||
mechanism A suppress hack B), the no-cheat hypothesis itself. The k=5-vs-12
|
||||
and 10-vs-16 differences are present but secondary.
|
||||
|
||||
Current `pairs.py` (`PAIRS`, 18 pairs) by mechanism: axis-1 weak-`run_tests` =
|
||||
8/18; hardcode / persona / try-except-swallow / type-only-assert /
|
||||
weak-inequality = 2 each.
|
||||
|
||||
---
|
||||
|
||||
## Q9. Solve-direction orthogonalization (does stripping the solve subspace recover solve?)
|
||||
|
||||
<!-- src (seed 41, mix=0.5, frozen, one_sided): #145 *_solveorth_base18_s41.log (v_hack_18base, no orth); #146 *_solveorth_m4_s41.log (v_hack_18solveorth4, m=4) -->
|
||||
|
||||
| basis | hack | solve |
|
||||
| :--------------------- | ----: | ----: |
|
||||
| vanilla | 0.775 | 0.300 |
|
||||
| 18-pair base (no orth) | 0.500 | 0.200 |
|
||||
| 18-pair solve-orth m=4 | 0.550 | 0.150 |
|
||||
|
||||
**Answer: no — at n=1 it did the opposite.** Stripping the top-4 solve
|
||||
directions from D pre-SVD was meant to *recover* solve; instead solve fell
|
||||
0.200→0.150 and hack rose 0.500→0.550. Both moves are ~0.05, inside the ~0.12
|
||||
seed std — inconclusive, leaning negative. Caveats: (1) two nominally-18-pair
|
||||
bases already disagree by 0.275 hack at this seed (`v_hack_full`=0.775 vs
|
||||
`v_hack_18base`=0.500), so extraction variance likely dominates a 0.05 delta;
|
||||
(2) with 18 pairs the solve basis B (top-4 SVD of G_c) is itself noisy and may
|
||||
strip real hack signal; (3) hack/solve subspaces may genuinely overlap. Needs
|
||||
≥3 seeds before any verdict.
|
||||
|
||||
---
|
||||
|
||||
## Q10. 🥇 Does the *pair set content* matter? (mechanism vs framing vs placebo)
|
||||
|
||||
<!-- src (seed 41, mix=0.125, frozen, one_sided, k=5; all bases extracted k=12/tau=0):
|
||||
#171 *_pairset_prog_wide_s41.log; #169 *_pairset_prog_wider_s41.log;
|
||||
#164 *_pairset_intent_vs_spec_s41.log; #162 *_pairset_honesty_text_s41.log;
|
||||
#160 *_pairset_moral_s41.log; #166 *_pairset_eval_aware_s41.log;
|
||||
#158 *_pairset_philosophical_s41.log; #168 *_pairset_null_city_s41.log (PLACEBO).
|
||||
baseline: 3x vanilla mix=0.125 seed-41 runs. -->
|
||||
|
||||
The detector we're allowed to have is *weak* (no-cheat invariant): it sees some
|
||||
hacks and misses others. So: does a v_hack extracted from a pair set that does
|
||||
NOT contrast the LeetCode mechanism still suppress the mechanical hack? We swap
|
||||
only the pair-set content (every basis extracted identically, k=12/tau=0,
|
||||
trained k=5) and read Δhack vs same-seed vanilla. n=1 projected per row;
|
||||
±0.06 is the baseline noise (std of the 3 seed-41 vanilla runs), so treat
|
||||
anything inside ±0.06 as null.
|
||||
|
||||
| basis (pair set) | contrasts | hack | solve | Δhack vs vanilla |
|
||||
| :-------------------- | :--------------- | ----: | ----: | ---------------: |
|
||||
| vanilla (baseline) | -- | 0.726 | ~0.20 | — |
|
||||
| `prog_wide` | hack *mechanism* | 0.500 | 0.221 | −0.226 |
|
||||
| `prog_wider` | mech + lang/cond | 0.679 | 0.236 | −0.048 |
|
||||
| `intent_vs_spec` | semantic framing | 0.686 | 0.207 | −0.040 |
|
||||
| `honesty_text` | semantic framing | 0.714 | 0.193 | −0.012 |
|
||||
| `moral` | semantic framing | 0.721 | 0.221 | −0.005 |
|
||||
| `eval_aware` | semantic framing | 0.736 | 0.186 | +0.010 |
|
||||
| `philosophical` | semantic framing | 0.743 | 0.243 | +0.017 |
|
||||
| `null_city` (PLACEBO) | random content | 0.750 | 0.221 | +0.024 |
|
||||
|
||||
(Baseline = mean of the 3 seed-41 vanilla mix=0.125 runs the deltas are paired
|
||||
against. The canonical `v_hack_21pairs` is NOT in this table: it was only run
|
||||
at mix=0.5 / different step counts, so a same-table comparison would confound
|
||||
mix and horizon. Its mix=0.5 effect is in Q2.)
|
||||
|
||||
**Answer: it's the *mechanism*, not the *framing*.** Pairs that contrast the
|
||||
programmatic hack mechanism (`prog_wide`) cut hack the most (−0.226), at no
|
||||
solve cost. Semantic / value framings (moral, honesty, eval-awareness,
|
||||
philosophy) do essentially nothing -- all within baseline noise of the
|
||||
`null_city` placebo. The placebo sits at +0.024 (no effect), exactly as it
|
||||
should. So v_hack is picking up the *hack-mechanism* subspace, not a generic
|
||||
"honesty" or "intent" direction.
|
||||
|
||||
Caveats (n=1, hold loosely): (1) all rows are single seed-41 runs; ±0.06 is the
|
||||
seed-41 vanilla noise, so everything from `intent_vs_spec` down is null.
|
||||
`prog_wide` needs ≥3 seeds (task #122) before the −0.226 is trustworthy.
|
||||
(2) Broadening `prog_wide`→`prog_wider` (adding language/phrasing/condition
|
||||
variation) *hurt* (−0.226→−0.048): diluting the mechanism contrast with surface
|
||||
variation weakened the basis. (3) Encouraging for the no-cheat story -- a
|
||||
mechanism-matched-but-off-task detector generalizes -- but the real
|
||||
generalization test is held-out *mechanism* (Stage 2/3), not held-out framing.
|
||||
|
||||
## Q11. Does the projection gap survive to convergence? (60-step, seed 42)
|
||||
|
||||
<!-- src (seed 42, mix=0.125, 60 steps; n=1 each):
|
||||
#179 *_converge_vanilla_mix0125_s42.log; #178 *_converge_refresh2_mix0125_s42_logged.log;
|
||||
*_converge_proj21_mix0125_s42.log (frozen). NOTE: results.py has no `steps`
|
||||
dimension, so these merge into the 20-step mix=0.125 groups in the grouped
|
||||
table -- read them here from the raw last-5, not from that row. -->
|
||||
|
||||
The Q2 gap (−13 to −18pp) is measured at 20 steps, where vanilla hack has only
|
||||
just plateaued (~step 13-16, see Dynamics note). Does projection keep hack down
|
||||
once we run 3x past the plateau? One 60-step run per arm, seed 42, mix=0.125:
|
||||
|
||||
| arm | L5 hack | L5 solve |
|
||||
| :------------------ | ------: | -------: |
|
||||
| vanilla | 0.936 | 0.293 |
|
||||
| projected frozen-V | 0.957 | 0.293 |
|
||||
| projected refresh-2 | 0.907 | 0.307 |
|
||||
|
||||
**Answer: at n=1, the gap closes.** By step 60 all three arms sit at ~0.91-0.96
|
||||
hack -- projection delays hacking but does not prevent it at this horizon. The
|
||||
attractor in this surrogate (cached-teacher) regime is full hack, and the
|
||||
projected student eventually catches up. refresh-2 is marginally below vanilla
|
||||
(−2.9pp) and frozen marginally above (+2.1pp), both inside the ~0.06-0.12 seed
|
||||
noise, so the honest read is "no surviving gap at 60 steps, seed 42."
|
||||
|
||||
Caveats: (1) n=1, seed 42 only -- needs the 3-seed convergence (task #121) to
|
||||
distinguish "gap truly closes" from "seed-42 is a high-hack draw". (2) This is
|
||||
mix=0.125 + seed 42 + 60 steps, three axes different from Q2's mix=0.5/20-step
|
||||
numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
|
||||
suppression is real (Q2, n=4); what's unclear is whether longer training erodes
|
||||
it or whether this is a sparse-teacher/seed artifact.
|
||||
|
||||
## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric)
|
||||
|
||||
<!-- METRIC NOTE: unlike Q1-Q11 (last-5-step *training* hack_s), this section
|
||||
uses the DEPLOY-eval metric from per_mode_deploy.json: knob-off forward
|
||||
(quarantine deleted for route2; trained model for vanilla), n=64 prompts
|
||||
x group, T=0.7, 60-step fast preset, Qwen3-4B, mix=0.125. The two metrics
|
||||
are NOT comparable -- deploy is the apples-to-apples per-arm number.
|
||||
src per_mode_deploy.json:
|
||||
route2 nofloor: s41 20260601T115713 / s42 T150231 / s43 T181502
|
||||
vanilla: s42 20260602T043228 (job 74) / s43 20260601T233047 (job 72)
|
||||
vanilla s41 = job 77 QUEUED (behind the 200-step convergence runs). -->
|
||||
|
||||
The headline comparison. route2 (per-rollout gate routes the hack-aligned
|
||||
gradient into a deletable quarantine knob) vs vanilla, deploy-eval:
|
||||
|
||||
| arm | seeds | deploy hack | ±SEM | deploy solve | ±SEM |
|
||||
| :-------------- | :------- | ----------: | ----: | -----------: | ----: |
|
||||
| vanilla GRPO | 42,43 | 0.305 | 0.039 | 0.516 | 0.032 |
|
||||
| route2 (ours) | 41,42,43 | 0.031 | 0.031 | 0.615 | 0.010 |
|
||||
|
||||
Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 /
|
||||
0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves
|
||||
~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the
|
||||
gap closed at 60 steps in the surrogate regime), the deploy-eval metric shows
|
||||
a large, persistent gap: the quarantine knob holds the cheat and deleting it at
|
||||
deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png`
|
||||
(vanilla hack climbs to ~0.43, route2 stays ~0). Caveat: vanilla is n=2 (s41 =
|
||||
job 77 pending); promote to n=3 + paired test when it lands.
|
||||
|
||||
## Q13. Does the exploration floor leak, and is the leak staleness or structure?
|
||||
|
||||
<!-- DEPLOY-eval metric. src:
|
||||
job 60 route2 no-floor (frac=0): deploy hack 0.000
|
||||
job 64 route2 floor=0.5 + STALE v_hack: deploy hack 0.125
|
||||
job 73 route2 floor=0.5 + refresh-1 (fresh): deploy hack 0.000 (to step 28,
|
||||
daemon-killed; deploy held 0.000 at every eval 10/15/20/25). -->
|
||||
|
||||
route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of
|
||||
rollouts generated knob-off so the deployed knob keeps seeing solve signal. The
|
||||
floor introduces a leak -- with a stale (frozen) v_hack the deployed model still
|
||||
hacks:
|
||||
|
||||
| arm | deploy hack |
|
||||
| :----------------------------- | ----------: |
|
||||
| no floor (frac=0) | 0.000 |
|
||||
| floor=0.5, stale v_hack | 0.125 |
|
||||
| floor=0.5, refresh-1 (fresh V) | 0.000 |
|
||||
|
||||
**The leak is staleness, not floor structure.** A fresh per-step gate
|
||||
(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveat: the refresh-1
|
||||
run (job 73) was daemon-killed at step 28/60, but deploy hack was pinned at
|
||||
0.000 at every eval through step 25, so the directional answer is unambiguous;
|
||||
a clean 60-step rerun would make it airtight. The headline arm uses no floor
|
||||
anyway (already 0.000), so this is a design note for when the floor is wanted.
|
||||
|
||||
## Dynamics note (sizing the convergence test)
|
||||
|
||||
Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
|
||||
**plateaus by step ~13-16**; `gt_s` (solve) stays **noisy-flat at ~0.1-0.5 the
|
||||
whole run, it never climbs**. The attractor in this surrogate regime is full
|
||||
*hack*, not full solve — so "run until full solve" has no target. The
|
||||
convergence question is therefore: once vanilla hack plateaus (~step 15), does
|
||||
projected stay below it or catch up? A 60-step run (~2.2h at g8) sees 3x past
|
||||
the plateau; a 1000-step run (~36h) is wasteful.
|
||||
|
||||
## Open / queued (no result yet)
|
||||
|
||||
- **convergence at ≥3 seeds (#121)**: the n=1 seed-42 run (Q11) shows the gap
|
||||
closing by step 60, but that could be a seed-42 high-hack draw. Need 2+ more
|
||||
seeds before concluding the suppression erodes vs survives.
|
||||
- **pairset content at ≥3 seeds (#122)**: Q10's mechanism>framing>placebo
|
||||
ordering is n=1 per row; replicate `prog_wide` and the placebo on 2+ seeds.
|
||||
- **route arm at scale (#182)**: running; validates routing's ablated-eval
|
||||
hack<kept on Qwen3-4B before the 3-way none/erase/route cells (#130).
|
||||
- **k-slice (k=1/2/5)**: only smoke-tested, no 4B results.
|
||||
- **Stage 2/3 cross-*mechanism* generalisation**: the load-bearing test --
|
||||
extract v_hack from hack A, check it stops the *unknown* hack B the student
|
||||
would otherwise learn. Q10 (held-out *framing*) is a weaker cousin.
|
||||
Reference in New Issue
Block a user