- paired view: join projected to vanilla on (mix, seed), per-seed delta, mean +/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is meaningless; this enforces same-seed comparison (ml_debug principle).
- grouped view now reports std across seeds (null at n=1).
- exclude in-progress/aborted runs (must log all `steps`) so partial logs don't read as impossibly-good results.
- docs/results.md rewritten around paired deltas; honest that at n=4 the last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent in sign but not cleanly separated from zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Results, organized by the question each run answers
Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
Curated snapshot 2026-05-29; regenerate any time.
How to read this
- Metric = mean of the last 5 training steps (converged regime; noise-robust vs a single step). Whole-run (`WH`) is smoother but dilutes the converged behaviour with the early ramp-up; the blog Table 1 uses WH, this doc uses last-5.
- hack = fraction of student rollouts flagged as reward-hacks (`hack_s`).
- solve = fraction of student rollouts passing ground-truth tests (`gt_s`). NOT `PASS_RATE` (which mixes in the ~99%-hacked teacher pool).
- Comparisons are paired on seed. A projected run is compared to the vanilla run at the same (mix, seed); we take per-seed deltas, then mean ± std over shared seeds. Comparing a 3-seed mean to a 1-seed point (as an earlier draft did) is meaningless. n=1 cells have no std and are flagged as such.
- Sobering caveat up front: last-5 is a 5-step mean, so per-seed it's noisy; at n=4 the headline Δhack std (~0.15) is as large as the mean (~0.13). The effect is real-looking and consistent in sign, but NOT cleanly separated from zero at this n. Read the deltas as suggestive, weighted by n and std.
- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast surrogate regime, not endogenous hacking. Incomplete/aborted runs are excluded (a run must log all `steps`).
- Confound: `v_hack_full` = 18-pair extraction; `v_hack_21pairs` = 21-pair set. Cross-basis comparisons confound pair-count with pair-set.
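The paired-on-seed rule above can be sketched as follows. This is a minimal illustration, not the code in `scripts/results.py`: the row layout, the `paired_delta` helper, and every number in `runs` are hypothetical, and it assumes the reported std is the sample standard deviation (n-1 denominator), which the doc does not state.

```python
from statistics import mean, stdev

# Hypothetical rows: (arm, mix, seed, last-5 hack). All values are made up.
runs = [
    ("vanilla",   0.5, 41, 0.80),
    ("vanilla",   0.5, 42, 0.70),
    ("vanilla",   0.5, 43, 0.60),
    ("projected", 0.5, 41, 0.60),
    ("projected", 0.5, 42, 0.65),
    ("projected", 0.5, 44, 0.50),  # no vanilla partner at seed 44, so it is dropped
]

def paired_delta(rows, arm, baseline="vanilla"):
    """Return (mean, std, n) of per-seed deltas over (mix, seed) keys both arms share."""
    # Note: a dict keeps one run per (mix, seed); repeated runs at the same
    # seed would need an extra run id in the key.
    base = {(mix, seed): v for a, mix, seed, v in rows if a == baseline}
    treat = {(mix, seed): v for a, mix, seed, v in rows if a == arm}
    shared = sorted(base.keys() & treat.keys())
    deltas = [treat[k] - base[k] for k in shared]
    if not deltas:
        return None, None, 0
    std = stdev(deltas) if len(deltas) > 1 else None  # no std at n=1
    return mean(deltas), std, len(deltas)

m, s, n = paired_delta(runs, "projected")
print(f"delta_hack = {m:+.3f} +/- {s:.3f} (n={n})")
```

Only seeds 41 and 42 appear in both arms here, so n=2 even though each arm has three runs; that is the behaviour that stops a 3-seed mean being compared against a 1-seed point.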
Q1. Does the cached-teacher pool drive the student to hack? (feasibility, H4)
| arm | mix | hack | solve | seeds |
|---|---|---|---|---|
| vanilla | 0.5 | 0.719 | 0.306 | 41,42,43,44 |
| vanilla | 0.25 | 0.678 | 0.200 | 41,42,43 |
| vanilla | 0.125 | 0.754 | 0.261 | 41 (×2) |
Answer: yes. Clean Qwen3-4B reaches 68-75% last-5 hack within 20 steps at every teacher density. (Don't compare mixes here as a trend — different seed sets; see Q6 for the paired mix comparison.)
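For reference, the last-5 figures in the table above, and the rule that incomplete runs are excluded, could be computed along these lines. This is a sketch under assumptions: the helper names and the per-step traces are hypothetical, and the real log parsing in `scripts/results.py` may differ.

```python
from statistics import mean

def last_k_mean(per_step, total_steps, k=5):
    """Mean of the final k logged values, or None if the run did not log every step."""
    if len(per_step) < total_steps:
        return None  # in-progress or aborted: excluded rather than scored
    return mean(per_step[-k:])

def whole_run_mean(per_step, total_steps):
    """Whole-run (WH) mean over all steps, with the same completeness rule."""
    if len(per_step) < total_steps:
        return None
    return mean(per_step)

# Hypothetical per-step hack_s traces (made-up numbers), 20 steps as in the fast preset.
complete = [0.0] * 10 + [0.5] * 5 + [0.75] * 5
aborted = [0.0] * 8  # stopped before the ramp-up

print(last_k_mean(complete, total_steps=20))     # converged regime only
print(whole_run_mean(complete, total_steps=20))  # diluted by the early ramp-up
print(last_k_mean(aborted, total_steps=20))      # excluded
```

Without the completeness check, the aborted trace would score a last-5 hack of 0.0 and read as perfect suppression.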
Q2. Does v_hack projection reduce hacking vs vanilla? (H1, paired)
Paired Δ vs same-seed vanilla, mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 (41-44):
| arm | Δhack | Δhack std | Δsolve | n |
|---|---|---|---|---|
| projected frozen-V | −0.131 | 0.146 | −0.050 | 4 |
| projected refresh-2 | −0.181 | 0.169 | −0.081 | 4 |
Answer: a consistent-in-sign reduction (−13pp frozen, −18pp refresh-2), but the std ≈ the mean at n=4, so it is not statistically clean. Both arms also cost 5-8pp of student solve, and both fall short of the preregistered 30pp. The honest statement: directionally it reduces hacking on every seed, but more seeds are needed to call the magnitude. (WH paired deltas are smoother and tell the same sign story.)
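One rough way to see why "std ≈ mean at n=4" is not clean separation from zero is a one-sample t statistic on the per-seed deltas. This is an illustrative check only: the doc does not say which test, if any, was used, and the sketch assumes the reported std is the sample standard deviation of the four deltas.

```python
import math

def t_stat(mean_delta, std_delta, n):
    """One-sample t statistic for H0: the true mean delta is 0."""
    return mean_delta / (std_delta / math.sqrt(n))

# Two-sided 5% critical value of Student's t with n-1 = 3 degrees of freedom.
T_CRIT_DF3 = 3.182

# Mean and std of delta-hack taken from the Q2 table (n=4 each).
for name, m, s in [("frozen-V", -0.131, 0.146), ("refresh-2", -0.181, 0.169)]:
    t = t_stat(m, s, n=4)
    print(f"{name}: t = {t:.2f}, |t| > {T_CRIT_DF3}? {abs(t) > T_CRIT_DF3}")
```

Under these assumptions both arms give |t| of roughly 2, short of the 3.182 threshold, which matches the reading that the sign is consistent but the magnitude needs more seeds.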
Q3. one_sided vs no_gate vs reverse gating? (gate_mode)
| gate | Δhack | Δsolve | n |
|---|---|---|---|
| one_sided | −0.062 | −0.081 | 4 |
| no_gate | −0.150 | −0.100 | 1 |
| reverse | −0.200 | −0.150 | 1 |
(All v_hack_full, mix=0.5, frozen. one_sided is n=4 with std 0.075; no_gate and reverse are n=1, no std — not yet comparable.)
Answer: provisional only. The n=1 aggressive gates (no_gate, reverse) show larger hack cuts and larger solve costs, consistent with the selectivity trade-off, but each is a single seed. Needs ≥3 seeds before any claim. Note one_sided on v_hack_full is only −0.062 (within std) — weak basis (see Q8).
Q4. SVD top-k vs rank-1 mean-diff? (basis)
| basis | Δhack | Δsolve | n |
|---|---|---|---|
| SVD k=5 (v_hack_full) | −0.062 | −0.081 | 4 |
| mean-diff k=1 (v_hack_full_meandiff) | −0.025 | −0.175 | 1 |
Answer: mean-diff looks worse (smaller hack cut, larger solve cost) but n=1. A rank-1 basis being too blunt is plausible; not established at n=1.
Q5. refresh-every cadence
| refresh | Δhack | Δhack std | Δsolve | n |
|---|---|---|---|---|
| frozen | −0.131 | 0.146 | −0.050 | 4 |
| 1 | −0.175 | — | −0.100 | 1 |
| 2 | −0.181 | 0.169 | −0.081 | 4 |
| 5 | −0.225 | — | −0.075 | 1 |
| 10 | −0.200 | — | −0.100 | 1 |
(All v_hack_21pairs, mix=0.5, one_sided.)
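Answer: refresh-2 edges out frozen (−0.181 vs −0.131, both n=4) but the difference (~5pp) is small vs the std (~0.16). The n=1 cadences (1/5/10) all show larger cuts than frozen (−0.175 to −0.225), hinting that refreshing adds slight suppression, but there is no clear ordering by cadence and nothing is confirmed at n=1.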
Q6. Teacher density (mix) — paired, does the gap hold as the pool thins?
| mix | Δhack | Δhack std | Δsolve | n | shared seeds |
|---|---|---|---|---|---|
| 0.5 | −0.062 | 0.075 | −0.081 | 4 | 41(×2),43,44 |
| 0.25 | −0.122 | 0.146 | +0.017 | 3 | 41,42,43 |
| 0.125 | −0.100 | 0.040 | +0.007 | 2 | 41(×2) |
(v_hack_full, frozen, one_sided — the basis with coverage at all three mixes.)
Answer: the reduction holds across densities (−6 to −12pp); any of the three mixes is sufficient to see it. At lower mix the solve cost vanishes (Δsolve is even slightly positive). The mix=0.125 cell has the tightest std (0.040), but at n=2 with both runs on seed 41.
Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
| basis | Δhack | Δhack std | Δsolve | n |
|---|---|---|---|---|
| v_hack_full (18) | −0.062 | 0.075 | −0.081 | 4 |
| v_hack_21pairs (21) | −0.131 | 0.146 | −0.050 | 4 |
Answer: the 21-pair basis suppresses ~2x more hacking (−0.131 vs −0.062), both n=4 mix=0.5 frozen. Pair set is one of the largest levers here. Caveats: the comparison confounds pair count with the specific extra pairs, and the shared-seed sets differ (the 21-pair set is the full 41-44, while v_hack_full's is 41(×2),43,44).
Open / queued (no result yet)
- solve-orthogonalization (#145 base done, #146 m=4 running): base 18-pair paired Δhack −0.275 / Δsolve −0.100 (n=1, seed 41). m=4 pending — that's the one that tests whether stripping the solve subspace recovers the solve cost.
- overshoot=1.1 (#140), let-it-converge 60-step (#141/142): queued.
- k-slice (k=1/2/5): only smoke-tested, no 4B results.
- G2/G3 cross-mechanism generalisation: queued; the load-bearing test of whether a known-hack basis stops an unknown hack.