results tooling + solve-orth knob + results-by-question doc

- scripts/results.py + `just results`: aggregate logs/*.log into last-5
  hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
  full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
  (SVD of clean-side grads) from D before SVD, so projection doesn't ablate
  the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
  (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
  with comparison tables and answers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wassname
2026-05-29 07:21:05 +00:00
parent 826b2aa83e
commit 4464f9d312
4 changed files with 318 additions and 0 deletions
@@ -0,0 +1,169 @@
# Results, organized by the question each run answers
Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
Regenerate any time; this file is a curated snapshot as of 2026-05-29.
## How to read this
- **Metric = mean of the last 5 training steps** (the converged regime;
more noise-robust than a single final step). Whole-run means (`WH`) are kept
as a secondary column because the blog's Table 1 uses them. The two diverge
substantially because hacking ramps up over training, so last-5 is the honest
"where it ended up" number.
- **hack** = fraction of *student* rollouts flagged as reward-hacks (`hack_s`).
- **solve** = fraction of *student* rollouts passing the ground-truth tests
(`gt_s`). This is NOT `PASS_RATE`, which mixes in the ~99%-hacked teacher
pool and is near-useless as a student-quality signal.
- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix). This is
the fast surrogate regime, not the endogenous-hack regime.
- **Epistemic status:** many ablation cells are n=1 seed, so they are
suggestive, not conclusive. Seed counts are in the tables. Only the mix=0.5
headline cells (vanilla, frozen, refresh-2) have n=4.
- **Provenance:** `just results` prints a per-run table with a full `argv`
column (every CLI flag), so each number traces to its exact invocation.
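To make the last-5 versus whole-run distinction concrete, here is a minimal sketch. The helper name `last_k_mean` and the ramping `hack_s` series are hypothetical illustrations, not taken from `scripts/results.py` or from any logged run.

```python
from statistics import mean


def last_k_mean(values, k=5):
    """Mean of the final k entries (or of all entries if there are fewer than k)."""
    if not values:
        raise ValueError("need at least one step")
    return mean(values[-k:])


# Hypothetical per-step student hack rates for a 20-step run that ramps up
# linearly from 0.00 to 0.76. Made-up data, for illustration only.
hack_s = [0.04 * step for step in range(20)]

whole_run = mean(hack_s)            # averages the early low-hack steps in
last5 = last_k_mean(hack_s, k=5)    # reflects where the run ended up

print(f"whole-run={whole_run:.3f}  last-5={last5:.3f}")
```

On a ramping series like this one the whole-run mean sits well below the last-5 mean, which is why the two columns can disagree so much.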
A confound to keep in mind: `v_hack_full` is an 18-pair extraction (current
`pairs.py`), while `v_hack_21pairs` is a 21-pair set. Comparisons across those
two confound pair-count with pair-set.
---
## Q1. Does the cached-teacher pool actually drive the student to hack? (feasibility, H4)
Why: the whole fast-surrogate design rests on a clean base student picking up
hacking from off-policy teacher exposure, instead of the ~64 GPU-h endogenous route.
| arm | mix | hack | solve | seeds |
| :-- | --: | --: | --: | --: |
| vanilla | 0.5 | 0.719 | 0.306 | 4 |
| vanilla | 0.25 | 0.678 | 0.200 | 3 |
| vanilla | 0.125 | 0.754 | 0.261 | 2 |
**Answer: yes.** Clean Qwen3-4B reaches 68-75% last-5 student hack rate within
20 steps across teacher densities. The surrogate works; the student learns to
hack from exposure.
## Q2. Does v_hack gradient projection reduce hacking vs vanilla, at matched config? (H1)
Why: the core hypothesis. mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 seeds (41,42,43,44).
| arm | hack | solve | Δhack | Δsolve | seeds |
| :-- | --: | --: | --: | --: | --: |
| vanilla | 0.719 | 0.306 | — | — | 4 |
| projected frozen-V | 0.588 | 0.256 | -13.1pp | -5.0pp | 4 |
| projected refresh-2 | 0.537 | 0.225 | -18.2pp | -8.1pp | 4 |
**Answer: yes, but with a real solve cost.** Projection cuts last-5 hack by
13pp (frozen) to 18pp (refresh-2), short of the preregistered 30pp. It also
costs 5-8pp of student solve rate, a genuine selectivity problem (the target
of the queued solve-orthogonalization work). Note that the cost is invisible on
whole-run `PASS_RATE`; it only shows on the last-5 student-GT metric.
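The percentage-point gaps quoted for Q2 can be re-derived directly from the table. This is a small arithmetic sketch; the helper `delta_pp` and the dictionary layout are illustrative assumptions, not part of the repository's tooling.

```python
def delta_pp(baseline, treated):
    """Reduction from baseline to treated, in percentage points (positive = lower)."""
    return round((baseline - treated) * 100, 1)


# Last-5 means copied from the Q2 table (mix=0.5, n=4 seeds per arm).
vanilla = {"hack": 0.719, "solve": 0.306}
arms = {
    "frozen-V": {"hack": 0.588, "solve": 0.256},
    "refresh-2": {"hack": 0.537, "solve": 0.225},
}

PREREGISTERED_PP = 30.0  # preregistered hack-reduction target quoted in the answer

for name, arm in arms.items():
    d_hack = delta_pp(vanilla["hack"], arm["hack"])
    d_solve = delta_pp(vanilla["solve"], arm["solve"])
    meets = d_hack >= PREREGISTERED_PP
    print(f"{name}: hack -{d_hack}pp, solve -{d_solve}pp, meets 30pp target: {meets}")
```

Both arms fall short of the 30pp target, and both give up solve rate as well as hack rate.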
## Q3. one_sided vs no_gate vs reverse gating? (gate_mode ablation)
Why: how aggressively to ablate. one_sided removes only the hack-ward
component; no_gate removes any motion in span(V); reverse pushes anti-hack.
mix=0.5, v_hack_full, frozen; n=1 (seed 41) for no_gate and reverse, n=3 for one_sided, so suggestive only.
| gate | hack | solve | seeds |
| :-- | --: | --: | --: |
| one_sided | 0.700 | 0.283 | 3 |
| no_gate | 0.625 | 0.200 | 1 |
| reverse | 0.575 | 0.150 | 1 |
**Answer: more aggressive = more hack suppression but worse solve.** reverse
gives the largest hack cut (0.575) but roughly halves solve relative to vanilla
(0.306→0.150). one_sided is the most solve-preserving. This is the same
selectivity tension as Q2: you can buy hack reduction with solve, and the
exchange rate is steep.
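The three gate modes can be pictured with a small NumPy sketch. It is not the repository's implementation and rests on several assumptions: `V` has orthonormal columns oriented so that a positive coefficient means "hack-ward", `g` is the update direction, and `reverse` is modelled as reflecting the hack-ward component (removing it twice). The function name `project_grad` is hypothetical.

```python
import numpy as np


def project_grad(g, V, gate_mode="one_sided"):
    """Ablate the span(V) part of g according to gate_mode.

    g: (d,) update direction.
    V: (d, k) matrix with orthonormal columns, assumed oriented hack-ward.
    """
    c = V.T @ g  # coefficient of g along each basis axis
    if gate_mode == "one_sided":
        removed = np.maximum(c, 0.0)        # strip only hack-ward components
    elif gate_mode == "no_gate":
        removed = c                         # strip any motion inside span(V)
    elif gate_mode == "reverse":
        removed = 2.0 * np.maximum(c, 0.0)  # assumption: flip hack-ward components
    else:
        raise ValueError(f"unknown gate_mode: {gate_mode!r}")
    return g - V @ removed


V = np.eye(3)[:, :2]            # toy basis: first two coordinate axes
g = np.array([1.0, -2.0, 3.0])  # +1 along axis 0 (hack-ward), -2 along axis 1
for mode in ("one_sided", "no_gate", "reverse"):
    print(mode, project_grad(g, V, mode))
```

In the toy example, one_sided leaves the anti-hack component on axis 1 untouched, no_gate zeroes both in-span components, and reverse turns the hack-ward component into an anti-hack one. The third coordinate, outside span(V), is never changed.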
## Q4. SVD top-k basis vs rank-1 mean-diff? (basis ablation)
Why: with few pairs, SVD axes 2..k may be noise; mean-diff (k=1) regularizes
to the single robust direction. mix=0.5, frozen; n=1 for mean-diff, n=3 for SVD top-k.
| basis | hack | solve | seeds |
| :-- | --: | --: | --: |
| SVD top-k (k=5, v_hack_full) | 0.700 | 0.283 | 3 |
| mean-diff (k=1, v_hack_full_meandiff) | 0.750 | 0.125 | 1 |
**Answer: mean-diff is worse on both axes** (higher hack, much lower solve). A
rank-1 basis is too blunt — it doesn't suppress more hacking and it costs more
solve. Keep the multi-axis SVD basis.
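The two basis choices compared in Q4 differ only in how a direction set is read off the pair matrix. The sketch below assumes `D` is an `(n_pairs, d)` matrix whose rows are hack-minus-clean gradient differences; that layout, the function names, and the random data are illustrative assumptions, not the actual `extract_vhack_grad` code.

```python
import numpy as np


def svd_basis(D, k):
    """Top-k right singular vectors of D, returned as orthonormal columns (d, k)."""
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:k].T


def meandiff_basis(D):
    """Rank-1 basis: the unit-normalized mean row of D, shape (d, 1)."""
    m = D.mean(axis=0)
    return (m / np.linalg.norm(m))[:, None]


rng = np.random.default_rng(0)
D = rng.normal(size=(18, 64))  # hypothetical stand-in for 18 contrastive pairs

V_svd = svd_basis(D, k=5)
V_mean = meandiff_basis(D)
print(V_svd.shape, V_mean.shape)
```

Both functions return matrices with orthonormal columns, so either can be passed to the same projection step; only the rank differs.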
## Q5. refresh-every cadence sweep
Why: the v_hack basis goes stale as the student drifts (cos_pre_t decays
0.28→0.07). How often to re-extract? mix=0.5, v_hack_21pairs, one_sided, n=1
except frozen/refresh-2 (n=4).
| refresh | hack | solve | seeds |
| :-- | --: | --: | --: |
| frozen (0) | 0.588 | 0.256 | 4 |
| 1 | 0.600 | 0.200 | 1 |
| 2 | 0.537 | 0.225 | 4 |
| 5 | 0.550 | 0.225 | 1 |
| 10 | 0.575 | 0.200 | 1 |
**Answer: refresh-2 is the sweet spot** (lowest hack at 0.537). refresh-1 is no
better than frozen (too noisy a basis), and 5/10 drift back up. The effect is
small (~5pp) and the n=1 cells are noisy, but 2 is the consistent pick.
## Q6. Teacher density (mix-ratio) — does the projection gap hold as the pool thins?
Why: lower mix = less off-policy hack pressure, closer to the real regime.
v_hack basis frozen, one_sided.
| mix | vanilla hack | projected hack | Δhack | vanilla solve | projected solve |
| --: | --: | --: | --: | --: | --: |
| 0.5 | 0.719 | 0.588 | -13pp | 0.306 | 0.256 |
| 0.25 | 0.678 | 0.556 | -12pp | 0.200 | 0.217 |
| 0.125 | 0.754 | 0.657 | -10pp | 0.261 | 0.214 |
**Answer: the gap holds, narrowing slightly as the pool thins** (13 → 10pp).
At mix=0.25 projection even nudges solve up. The intervention isn't an artifact
of heavy teacher mixing. (mix=0.25/0.125 use v_hack_full, so not strictly
matched to the 21-pair mix=0.5 row.)
## Q7. Noise-floor cut (drop_bottom_frac) 0.25 vs 0.0?
Why: dropping the bottom 25% of singular values is meant to remove noise axes.
mix=0.5, v_hack_full, frozen; n=1 for dropf=0.0, n=3 for dropf=0.25.
| dropf | hack | solve | seeds |
| --: | --: | --: | --: |
| 0.25 | 0.700 | 0.283 | 3 |
| 0.0 | 0.625 | 0.200 | 1 |
**Answer: inconclusive (n=1 for dropf=0.0).** dropf=0 looks like a bigger hack
cut but also lower solve, the same trade as everywhere, and it rests on a
single seed. Needs replication before drawing any conclusion.
## Q8. Pair set: 18-pair vs 21-pair extraction
Why: more contrastive pairs across more axes should give a better-conditioned
basis. mix=0.5, frozen, one_sided.
| basis | hack | solve | seeds |
| :-- | --: | --: | --: |
| v_hack_full (18 pairs) | 0.700 | 0.283 | 3 |
| v_hack_21pairs (21 pairs) | 0.588 | 0.256 | 4 |
**Answer: the 21-pair basis suppresses more hacking** (0.588 vs 0.700) at a
small solve cost (0.256 vs 0.283). Pair set/count matters and is one of the
larger levers seen here. Caveat: the comparison confounds pair count with the
specific extra pairs, and the seed sets differ (n=3 vs n=4).
---
## Open / queued (no results yet)
- **overshoot=1.1** (mild over-projection): queued (#140). Tests if removing
110% of the hack-ward component beats 1.0 without the solve cost of full
`reverse`.
- **solve-orthogonalization** (strip the known-solve subspace from D pre-SVD):
queued (#143-146), directly targets the Q2/Q3 solve cost.
- **let-it-converge** (60 steps): queued (#141-142), tests whether the gap
persists past step 20.
- **k-slice ablation** (k=1/2/5 SVD): only smoke-tested so far; no 4B results.
- **G2/G3 cross-mechanism generalisation**: queued; the load-bearing test of
whether a basis from known hacks suppresses an unknown one.
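For the queued solve-orthogonalization item above (the `solve_orth_m` knob), the idea of stripping a known-solve subspace from D before the SVD can be sketched as follows. This is one plausible reading, not the repository's code: `G_clean` is assumed to be a matrix of clean-side gradients with one row per example, and the function name `strip_solve_subspace` is hypothetical.

```python
import numpy as np


def strip_solve_subspace(D, G_clean, m):
    """Remove from each row of D its component in the top-m solve subspace.

    D:       (n_pairs, d) matrix that is later SVD'd to extract the hack basis.
    G_clean: (n_clean, d) clean-side gradients (assumed layout).
    m:       number of solve directions to strip; m <= 0 leaves D unchanged.
    """
    if m <= 0:
        return D
    _, _, Vt = np.linalg.svd(G_clean, full_matrices=False)
    S = Vt[:m].T                 # (d, m) orthonormal solve directions
    return D - (D @ S) @ S.T     # project rows onto the orthogonal complement


rng = np.random.default_rng(0)
D = rng.normal(size=(18, 32))        # hypothetical data, illustration only
G_clean = rng.normal(size=(18, 32))

D_orth = strip_solve_subspace(D, G_clean, m=3)
```

Because the solve directions are removed before the hack basis is extracted, a basis built from `D_orth` is orthogonal to them, so projecting along it should not ablate that part of the solve signal. With `m=0` the function is a no-op, matching the "off by default" behaviour described in the commit message.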