mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:31:11 +08:00

wassname 4464f9d312 results tooling + solve-orth knob + results-by-question doc

- scripts/results.py + `just results`: aggregate logs/*.log into last-5
  hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
  full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
  (SVD of clean-side grads) from D before SVD, so projection doesn't ablate
  the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
  (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
  with comparison tables and answers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 07:21:05 +00:00

Results, organized by the question each run answers

Generated from logs/*.log via just results (source: scripts/results.py). Regenerate any time; this file is a curated snapshot as of 2026-05-29.

How to read this

Metric = mean of the last 5 training steps (the converged regime; noise-robust vs a single final step). Whole-run means (WH) are kept as a secondary column because the blog Table 1 uses whole-run; the two diverge a lot because hacking ramps over training, so last-5 is the honest "where it ended up" number.
hack = fraction of student rollouts flagged as reward-hacks (hack_s).
solve = fraction of student rollouts passing the ground-truth tests (gt_s). This is NOT PASS_RATE, which mixes in the ~99%-hacked teacher pool and is near-useless as a student-quality signal.
All runs are the fast preset (20 steps, G=4, cached-teacher mix). This is the fast surrogate regime, not the endogenous-hack regime.
Epistemic status: many ablation cells are n=1 seed — suggestive, not conclusive. Seed counts are in the tables. The mix=0.5 headline is the only n=4 cell.
Provenance: just results prints a per-run table with a full argv column (every CLI flag), so each number traces to its exact invocation.
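
To make the last-5 versus whole-run distinction concrete, here is a minimal sketch of the two aggregates. It is not the code in scripts/results.py: the function name `summarize` and the 20-step trace are made up for illustration, and it assumes a run's per-step hack_s values are already parsed into a list.

```python
from statistics import mean


def summarize(per_step, last_n=5):
    """Return (last-n mean, whole-run mean) for one per-step metric series."""
    if not per_step:
        raise ValueError("empty metric series")
    tail = per_step[-last_n:]  # runs shorter than last_n use every step they have
    return mean(tail), mean(per_step)


# Hypothetical 20-step hack_s trace that ramps up during training (made-up numbers).
hack_s = [0.0] * 5 + [0.25] * 5 + [0.5] * 5 + [0.75] * 5

last5, whole = summarize(hack_s)
print(f"last-5={last5:.3f}  whole-run={whole:.3f}")
```

On a ramping trace like this the whole-run mean sits well below the last-5 mean, which is why the two columns can disagree so much.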

A confound to keep in mind: v_hack_full is an 18-pair extraction (current pairs.py), while v_hack_21pairs is a 21-pair set. Comparisons across those two confound pair-count with pair-set.

Q1. Does the cached-teacher pool actually drive the student to hack? (feasibility, H4)

Why: the whole fast-surrogate design rests on a clean base student picking up hacking from off-policy teacher exposure, instead of the ~64 GPU-h endogenous route.

arm	mix	hack	solve	seeds
vanilla	0.5	0.719	0.306	4
vanilla	0.25	0.678	0.200	3
vanilla	0.125	0.754	0.261	2

Answer: yes. Clean Qwen3-4B reaches 68-75% last-5 student hack rate within 20 steps across teacher densities. The surrogate works; the student learns to hack from exposure.

Q2. Does v_hack gradient projection reduce hacking vs vanilla, at matched config? (H1)

Why: the core hypothesis. mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 seeds (41,42,43,44).

arm	hack	solve	Δhack	Δsolve	seeds
vanilla	0.719	0.306	—	—	4
projected frozen-V	0.588	0.256	−13.1pp	−5.0pp	4
projected refresh-2	0.537	0.225	−18.2pp	−8.1pp	4

Answer: yes, but with a real solve cost. Projection cuts last-5 hack by 13pp (frozen) to 18pp (refresh-2), short of the preregistered 30pp. It also costs 5-8pp of student solve rate — a genuine selectivity problem (this is what solve-orthogonalization, queued, targets). Note the cost is invisible on whole-run PASS_RATE; it only shows on the last-5 student-GT metric.
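
The Δ columns are plain differences of rates, expressed in percentage points. A small sketch that recomputes them from the table values and compares the hack cut to the preregistered 30pp target (the helper name `delta_pp` and the dictionary layout are illustrative, not taken from the repo):

```python
def delta_pp(arm, baseline):
    """Difference between two rates in [0, 1], in percentage points."""
    return round((arm - baseline) * 100, 1)


# Last-5 means copied from the Q2 table.
vanilla = {"hack": 0.719, "solve": 0.306}
arms = {
    "projected frozen-V": {"hack": 0.588, "solve": 0.256},
    "projected refresh-2": {"hack": 0.537, "solve": 0.225},
}
PREREGISTERED_CUT_PP = 30.0  # target hack reduction named in the text

for name, m in arms.items():
    d_hack = delta_pp(m["hack"], vanilla["hack"])
    d_solve = delta_pp(m["solve"], vanilla["solve"])
    met = -d_hack >= PREREGISTERED_CUT_PP
    print(f"{name}: dhack={d_hack:+.1f}pp dsolve={d_solve:+.1f}pp prereg_met={met}")
```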

Q3. one_sided vs no_gate vs reverse gating? (gate_mode ablation)

Why: how aggressively to ablate. one_sided removes only the hack-ward component; no_gate removes any motion in span(V); reverse pushes anti-hack. mix=0.5, v_hack_full, frozen. no_gate and reverse are n=1 (seed 41), so suggestive only; the one_sided row is listed with 3 seeds.

gate	hack	solve	seeds
one_sided	0.700	0.283	3
no_gate	0.625	0.200	1
reverse	0.575	0.150	1

Answer: more aggressive = more hack suppression but worse solve. reverse gives the lowest hack rate (0.575) but roughly halves solve relative to vanilla (0.306→0.150). one_sided is the most solve-preserving. This is the same selectivity tension as Q2: you can buy hack reduction with solve, and the exchange rate is steep.
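
The three gate modes can be sketched as different rules for the coefficients of an update along the basis. This is an illustration under stated assumptions, not the repo's implementation: `gate_update` is a hypothetical name, V is assumed to have orthonormal rows oriented so that a positive coefficient means hack-ward motion, and the reading of reverse (force every coefficient to be non-positive) is one plausible interpretation of "pushes anti-hack".

```python
import numpy as np


def gate_update(u, V, gate_mode="one_sided"):
    """Rewrite the part of update u that lies in span(V).

    u: (d,) update direction. V: (k, d) orthonormal rows; by assumption a
    positive coefficient along a row means motion toward hacking.
    """
    c = V @ u  # coefficient of u along each basis direction
    if gate_mode == "one_sided":
        new_c = np.minimum(c, 0.0)   # drop only hack-ward (positive) parts
    elif gate_mode == "no_gate":
        new_c = np.zeros_like(c)     # drop all motion in span(V)
    elif gate_mode == "reverse":
        new_c = -np.abs(c)           # assumed reading: make every part anti-hack
    else:
        raise ValueError(f"unknown gate_mode: {gate_mode!r}")
    return u + V.T @ (new_c - c)     # components outside span(V) are untouched


V = np.eye(2, 4)                     # two orthonormal directions in R^4
u = np.array([1.0, -2.0, 3.0, 4.0])  # coefficients along V are (1, -2)
for mode in ("one_sided", "no_gate", "reverse"):
    # one_sided -> (0, -2), no_gate -> (0, 0), reverse -> (-1, -2)
    print(mode, gate_update(u, V, mode))
```

Under this framing the modes are ordered by how much they change the update, which matches the ordering of hack suppression and solve cost in the table.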

Q4. SVD top-k basis vs rank-1 mean-diff? (basis ablation)

Why: with few pairs, SVD axes 2..k may be noise; mean-diff (k=1) regularizes to the single robust direction. mix=0.5, frozen. The mean-diff row is n=1; the SVD row is listed with 3 seeds.

basis	hack	solve	seeds
SVD top-k (k=5, v_hack_full)	0.700	0.283	3
mean-diff (k=1, v_hack_full_meandiff)	0.750	0.125	1

Answer: mean-diff is worse on both axes (higher hack, much lower solve). A rank-1 basis is too blunt — it doesn't suppress more hacking and it costs more solve. Keep the multi-axis SVD basis.
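
For reference, the two basis choices can be sketched as follows. This assumes D is an (n_pairs, d) matrix of per-pair gradient differences (hack side minus clean side); the function names and the synthetic data are illustrative and are not taken from extract_vhack_grad.

```python
import numpy as np


def svd_basis(D, k):
    """Top-k right singular vectors of D, each oriented along the mean difference."""
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    V = Vt[:k]
    signs = np.sign(V @ D.mean(axis=0))
    signs[signs == 0] = 1.0          # leave axes with zero mean projection as-is
    return V * signs[:, None], s[:k]


def meandiff_basis(D):
    """Rank-1 basis: the normalized mean of the pair differences."""
    m = D.mean(axis=0)
    return (m / np.linalg.norm(m))[None, :]


# Synthetic stand-in for 18 pairs in 64 dims: one shared direction plus noise.
rng = np.random.default_rng(0)
true_dir = np.zeros(64)
true_dir[0] = 1.0
D = 3.0 * true_dir + 0.1 * rng.standard_normal((18, 64))

V5, s5 = svd_basis(D, k=5)
V1 = meandiff_basis(D)
print("top-5 singular values:", np.round(s5, 2))
print("cos(SVD axis 1, mean-diff):", float(V5[0] @ V1[0]))
```

In this toy setup the first SVD axis and the mean-diff direction nearly coincide, and the remaining SVD axes carry only noise; whether axes 2..k are noise on real gradients is exactly what this ablation probes.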

Q5. refresh-every cadence sweep

Why: the v_hack basis goes stale as the student drifts (cos_pre_t decays 0.28→0.07). How often to re-extract? mix=0.5, v_hack_21pairs, one_sided, n=1 except frozen/refresh-2 (n=4).

refresh	hack	solve	seeds
frozen (0)	0.588	0.256	4
1	0.600	0.200	1
2	0.537	0.225	4
5	0.550	0.225	1
10	0.575	0.200	1

Answer: refresh-2 is the sweet spot (lowest hack at 0.537). refresh-1 is no better than frozen (too noisy a basis), and 5/10 drift back up. The effect is small (~5pp) and the n=1 cells are noisy, but 2 is the consistent pick.

Q6. Teacher density (mix-ratio) — does the projection gap hold as the pool thins?

Why: lower mix = less off-policy hack pressure, closer to the real regime. v_hack basis frozen, one_sided.

mix	vanilla hack	projected hack	Δhack	vanilla solve	projected solve
0.5	0.719	0.588	−13pp	0.306	0.256
0.25	0.678	0.556	−12pp	0.200	0.217
0.125	0.754	0.657	−10pp	0.261	0.214

Answer: the gap holds, narrowing slightly as the pool thins (−13 → −10pp). At mix=0.25 projection even nudges solve up. The intervention isn't an artifact of heavy teacher mixing. (mix=0.25/0.125 use v_hack_full, so not strictly matched to the 21-pair mix=0.5 row.)

Q7. Noise-floor cut (drop_bottom_frac) 0.25 vs 0.0?

Why: dropping the bottom-25% singular values is meant to remove noise axes. mix=0.5, v_hack_full, frozen. The dropf=0.0 row is n=1; the dropf=0.25 row is listed with 3 seeds.

dropf	hack	solve	seeds
0.25	0.700	0.283	3
0.0	0.625	0.200	1

Answer: inconclusive (n=1). dropf=0 looks like a bigger hack cut but also lower solve: the same trade as everywhere, and on a single seed. Needs replication before drawing any conclusion.
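
One way a noise-floor cut like this could work is to rank the candidate axes by singular value and discard the bottom fraction. The sketch below is an assumption about the mechanism, not the repo's code: `drop_noise_floor` is a hypothetical helper, and it assumes the cut is applied to whatever set of axes is passed in.

```python
import math

import numpy as np


def drop_noise_floor(V, s, drop_bottom_frac=0.25):
    """Keep the axes with the largest singular values, dropping the bottom fraction."""
    if not 0.0 <= drop_bottom_frac < 1.0:
        raise ValueError("drop_bottom_frac must be in [0, 1)")
    order = np.argsort(s)[::-1]      # indices from largest to smallest singular value
    n_keep = max(1, math.ceil(len(s) * (1.0 - drop_bottom_frac)))
    keep = order[:n_keep]
    return V[keep], s[keep]


s = np.array([9.0, 4.0, 1.0, 0.5])   # made-up singular values
V = np.eye(4, 6)                     # made-up orthonormal axes
V_kept, s_kept = drop_noise_floor(V, s, drop_bottom_frac=0.25)
print(s_kept)                        # the three largest values remain
```

With drop_bottom_frac=0.0 every axis is kept, which corresponds to the second row of the table.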

Q8. Pair set: 18-pair vs 21-pair extraction

Why: more contrastive pairs across more axes should give a better-conditioned basis. mix=0.5, frozen, one_sided.

basis	hack	solve	seeds
v_hack_full (18 pairs)	0.700	0.283	3
v_hack_21pairs (21 pairs)	0.588	0.256	4

Answer: the 21-pair basis suppresses more hacking (0.588 vs 0.700) at a small solve cost (0.283 vs 0.256). Pair set/count matters and is one of the larger levers seen here. Caveat: this comparison confounds pair count with the specific extra pairs, and the seed sets differ.

Open / queued (no results yet)

overshoot=1.1 (mild over-projection): queued (#140). Tests if removing 110% of the hack-ward component beats 1.0 without the solve cost of full reverse.
solve-orthogonalization (strip the known-solve subspace from D pre-SVD): queued (#143-146), directly targets the Q2/Q3 solve cost.
let-it-converge (60 steps): queued (#141-142), tests whether the gap persists past step 20.
k-slice ablation (k=1/2/5 SVD): only smoke-tested so far; no 4B results.
G2/G3 cross-mechanism generalisation: queued; the load-bearing test of whether a basis from known hacks suppresses an unknown one.
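
The queued solve-orthogonalization item can be sketched as follows: take the top-m right singular vectors of the clean-side gradients as the known-solve subspace, remove that subspace from D, then run the usual SVD. This is a hedged sketch of the idea only; `solve_orthogonalize`, the matrix shapes, and the synthetic data are assumptions rather than the actual extract_vhack_grad code.

```python
import numpy as np


def solve_orthogonalize(D, G_clean, m):
    """Remove the top-m subspace of G_clean's rows from every row of D.

    D: (n_pairs, d) pair differences. G_clean: (n_clean, d) clean-side grads.
    m <= 0 returns D unchanged (the knob is off).
    """
    if m <= 0:
        return D
    _, _, Vt = np.linalg.svd(G_clean, full_matrices=False)
    S = Vt[:m]                        # (m, d) orthonormal solve directions
    return D - (D @ S.T) @ S          # project rows onto the complement of span(S)


# Synthetic check: D mixes a hack direction with a solve direction.
rng = np.random.default_rng(0)
d = 32
solve_dir, hack_dir = np.eye(d)[0], np.eye(d)[1]
G_clean = np.outer(rng.standard_normal(10), solve_dir) + 0.01 * rng.standard_normal((10, d))
D = np.outer(np.ones(18), hack_dir + 0.5 * solve_dir) + 0.01 * rng.standard_normal((18, d))

D_orth = solve_orthogonalize(D, G_clean, m=1)
v_hack = np.linalg.svd(D_orth, full_matrices=False)[2][0]
print("overlap with solve_dir:", abs(float(v_hack @ solve_dir)))
print("overlap with hack_dir:", abs(float(v_hack @ hack_dir)))
```

In the toy data the extracted direction keeps its overlap with the hack direction while losing almost all overlap with the solve direction, which is the selectivity the knob is meant to buy.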
