mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
results: absolute-rate tables + provenance, lock mix=0.125 default
docs/results.md: lead with absolute last-5 rates (compare within a table by eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME), Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack plateaus ~step 15). scripts/results.py: add `log` provenance column; drop the wide argv/time cols. Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps the hack cut with no solve tax. Smoke passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
+184
-89
@@ -1,137 +1,232 @@
|
||||
# Results, organized by the question each run answers
|
||||
|
||||
Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
|
||||
Curated snapshot 2026-05-29; regenerate any time.
|
||||
Curated snapshot 2026-05-29; regenerate any time. Each table cites its source
|
||||
logs in an HTML comment so every number traces back to a file.
|
||||
|
||||
## How to read this
|
||||
|
||||
- **Metric = mean of the last 5 training steps** (converged regime; noise-robust
|
||||
vs a single step). Whole-run (`WH`) is smoother but dilutes the converged
|
||||
behaviour with the early ramp-up; the blog Table 1 uses WH, this doc uses
|
||||
last-5.
|
||||
- **Tables show absolute last-5-step rates** (mean of the final 5 training
|
||||
steps; converged regime, noise-robust vs a single step). Compare rows within
|
||||
a table by eye. Paired-vs-vanilla deltas are mentioned in prose only where
|
||||
the seeds match.
|
||||
- **hack** = fraction of *student* rollouts flagged as reward-hacks (`hack_s`).
|
||||
- **solve** = fraction of *student* rollouts passing ground-truth tests
|
||||
(`gt_s`). NOT `PASS_RATE` (which mixes in the ~99%-hacked teacher pool).
|
||||
- **Comparisons are paired on seed.** A projected run is compared to the vanilla
|
||||
run at the *same (mix, seed)*; we take per-seed deltas, then mean ± std over
|
||||
shared seeds. Comparing a 3-seed mean to a 1-seed point (as an earlier draft
|
||||
did) is meaningless. n=1 cells have no std and are flagged as such.
|
||||
- **Sobering caveat up front:** last-5 is a 5-step mean, so per-seed it's noisy;
|
||||
at n=4 the headline Δhack std (~0.15) is as large as the mean (~0.13). The
|
||||
effect is real-looking and consistent in sign, but NOT cleanly separated from
|
||||
zero at this n. Read the deltas as suggestive, weighted by n and std.
|
||||
- **±std is across seeds.** Blank = n=1 (no std). At n=4 the seed-to-seed std
|
||||
is ~0.12 on both vanilla and projected, so 5-step single-seed numbers are
|
||||
noisy; weight by n.
|
||||
- **Never compare a multi-seed mean to a single-seed point.** Several arms
|
||||
(refresh-1/5/10, no_gate, reverse, mean-diff) only ran on seed 41. Those are
|
||||
compared *only at seed 41*, against the seed-41 vanilla and seed-41 frozen
|
||||
rows, never against a 4-seed mean. Mixing n is how the old refresh "ladder"
|
||||
produced a fake monotonic trend.
|
||||
- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
|
||||
surrogate regime, not endogenous hacking. Incomplete/aborted runs are
|
||||
excluded (a run must log all `steps`).
|
||||
- Confound: `v_hack_full` = 18-pair extraction; `v_hack_21pairs` = 21-pair set.
|
||||
Cross-basis comparisons confound pair-count with pair-set.
|
||||
surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
|
||||
must log all `steps`).
|
||||
- Confound: `v_hack_full` / `v_hack_18base` = 18-pair extraction;
|
||||
`v_hack_21pairs` = 21-pair set. Cross-basis rows confound count with pair-set.
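A minimal sketch of the two aggregation rules above, assuming per-step `hack_s` values have already been parsed into one list per seed. The last-5 mean mirrors `mean(hs[-5:])` in `scripts/results.py`; the helper names, the toy traces, and the choice of sample (n-1) standard deviation are illustrative assumptions, not taken from the repo.

```python
from statistics import mean, stdev


def last5(values):
    """Mean of the final 5 logged steps (fewer if the run is shorter)."""
    return mean(values[-5:])


def across_seeds(per_seed):
    """Return (mean, std, n) of last-5 rates across seeds.

    std is None when n == 1, matching the blank ±std cells in the tables.
    Assumption: sample std (n-1); the doc does not say which estimator it uses.
    """
    rates = [last5(trace) for trace in per_seed.values()]
    std = stdev(rates) if len(rates) > 1 else None
    return mean(rates), std, len(rates)


# Made-up per-step hack_s traces, for illustration only (not from logs/).
toy = {
    41: [0.0, 0.25, 0.5, 0.75, 0.75, 0.75, 0.75],
    42: [0.0, 0.0, 0.25, 0.5, 0.5, 0.5, 0.5],
}

if __name__ == "__main__":
    m, s, n = across_seeds(toy)
    print(f"hack={m:.3f} ±std={s:.3f} n={n}")
```

A single-seed arm returns `std=None`, which is the cue not to place it next to a multi-seed mean.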
|
||||
|
||||
---
|
||||
|
||||
## Q1. Does the cached-teacher pool drive the student to hack? (feasibility, H4)
|
||||
|
||||
| arm | mix | hack | solve | seeds |
| :-- | --: | --: | --: | --: |
| vanilla | 0.5 | 0.719 | 0.306 | 41,42,43,44 |
| vanilla | 0.25 | 0.678 | 0.200 | 41,42,43 |
| vanilla | 0.125 | 0.754 | 0.261 | 41 (×2) |
|
||||
<!-- src: logs/*_goal0_fast_s4{1,2,3,4}.log, *_mix0_25_vanilla_s4{1,2,3}.log, *_vanilla_mix0125*.log -->
|
||||
|
||||
**Answer: yes.** Clean Qwen3-4B reaches 68-75% last-5 hack within 20 steps at
|
||||
every teacher density. (Don't compare mixes here as a trend — different seed
|
||||
sets; see Q6 for the paired mix comparison.)
|
||||
| arm | mix | hack | ±std | solve | ±std | seeds |
| :------ | ----: | ----: | ----: | ----: | ----: | ----------: |
| vanilla | 0.5 | 0.719 | 0.120 | 0.306 | 0.116 | 41,42,43,44 |
| vanilla | 0.25 | 0.678 | 0.082 | 0.200 | 0.076 | 41,42,43 |
| vanilla | 0.125 | 0.757 | 0.040 | 0.207 | 0.020 | 41 (×2) |
|
||||
|
||||
## Q2. Does v_hack projection reduce hacking vs vanilla? (H1, paired)
|
||||
**Answer: yes.** Clean Qwen3-4B reaches 68-76% last-5 hack within 20 steps at
|
||||
every teacher density. (Don't read a mix trend here — different seed sets; see
|
||||
Q6 for the paired mix comparison.)
|
||||
|
||||
Paired Δ vs same-seed vanilla, mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 (41-44):
|
||||
## Q2. 🥇 Does v_hack projection reduce hacking vs vanilla? (H1)
|
||||
|
||||
| arm | Δhack | Δhack std | Δsolve | n |
| :-- | --: | --: | --: | --: |
| projected frozen-V | −0.131 | 0.146 | −0.050 | 4 |
| projected refresh-2 | −0.181 | 0.169 | −0.081 | 4 |
|
||||
<!-- src: vanilla *_goal0_fast_s4{1-4}.log; *_g0_21pairs_frozen_s4{1-4}.log; *_g0_21pairs_refresh2_s4{1-4}.log -->
|
||||
|
||||
**Answer: a consistent-in-sign reduction (−13pp frozen, −18pp refresh-2), but
|
||||
the std ≈ the mean at n=4, so it is not statistically clean.** Both arms also
|
||||
cost 5-8pp of student solve. Short of the preregistered 30pp. The honest
|
||||
statement: directionally it reduces hacking on every seed, but more seeds are
|
||||
needed to call the magnitude. (WH paired deltas are smoother and tell the same
|
||||
sign story.)
|
||||
mix=0.5, v_hack_21pairs, one_sided, k=5, all n=4 (seeds 41-44):
|
||||
|
||||
## Q3. one_sided vs no_gate vs reverse gating? (gate_mode)
|
||||
| arm | hack | ±std | solve | ±std |
| :------------------ | ----: | ----: | ----: | ----: |
| vanilla | 0.719 | 0.120 | 0.306 | 0.116 |
| projected frozen-V | 0.588 | 0.131 | 0.256 | 0.083 |
| projected refresh-2 | 0.537 | 0.066 | 0.225 | 0.050 |
|
||||
|
||||
| gate | Δhack | Δsolve | n |
| :-- | --: | --: | --: |
| one_sided | −0.062 | −0.081 | 4 |
| no_gate | −0.150 | −0.100 | 1 |
| reverse | −0.200 | −0.150 | 1 |
|
||||
**Answer: a consistent-in-sign reduction.** Frozen drops hack 0.719→0.588
|
||||
(−13pp), refresh-2 →0.537 (−18pp); both cost ~5-8pp solve. Per-seed paired
|
||||
deltas (same-seed vanilla) are negative on every seed but the std (~0.13-0.17)
|
||||
is about the mean, so the magnitude is not pinned down at n=4. Short of the
|
||||
preregistered 30pp. Note refresh-2 has the *tightest* hack std (0.066), i.e.
|
||||
its effect is the most seed-stable.
|
||||
|
||||
(All v_hack_full, mix=0.5, frozen. one_sided is n=4 with std 0.075; no_gate and
|
||||
reverse are **n=1, no std** — not yet comparable.)
|
||||
## Q3. one_sided vs no_gate vs reverse gating? (gate_mode, seed 41 only)
|
||||
|
||||
**Answer: provisional only.** The n=1 aggressive gates (no_gate, reverse) show
|
||||
larger hack cuts and larger solve costs, consistent with the selectivity
|
||||
trade-off, but each is a single seed. Needs ≥3 seeds before any claim. Note
|
||||
one_sided on v_hack_full is only −0.062 (within std) — weak basis (see Q8).
|
||||
<!-- src (all seed 41, v_hack_full): *_goal0_fast_s41.log (one_sided), *_goal1_nogate_s41.log, *_goal1_reverse_s41.log -->
|
||||
|
||||
## Q4. SVD top-k vs rank-1 mean-diff? (basis)
|
||||
no_gate and reverse only ran on seed 41, so this is a seed-41 within-group
|
||||
comparison (no cross-seed mixing):
|
||||
|
||||
| basis | Δhack | Δsolve | n |
| :-- | --: | --: | --: |
| SVD k=5 (v_hack_full) | −0.062 | −0.081 | 4 |
| mean-diff k=1 (v_hack_full_meandiff) | −0.025 | −0.175 | 1 |
|
||||
| gate | hack | solve |
| :-------- | ----: | ----: |
| vanilla | 0.775 | 0.300 |
| one_sided | 0.775 | 0.275 |
| no_gate | 0.625 | 0.200 |
| reverse | 0.575 | 0.150 |
|
||||
|
||||
**Answer: mean-diff looks worse** (smaller hack cut, larger solve cost) but n=1.
|
||||
A rank-1 basis being too blunt is plausible; not established at n=1.
|
||||
**Answer: more-aggressive gates cut more hack but cost more solve, and
|
||||
one_sided on the 18-pair basis does ~nothing at seed 41** (0.775 = vanilla).
|
||||
This is the weak-basis signal (Q8): the 18-pair v_hack barely overlaps the live
|
||||
gradient, so only the brute no_gate/reverse gates move hack — and they pay for
|
||||
it in solve (0.200, 0.150 vs 0.300). Single seed; directional only.
|
||||
|
||||
## Q5. refresh-every cadence
|
||||
## Q4. SVD top-k vs rank-1 mean-diff? (basis, seed 41 only)
|
||||
|
||||
| refresh | Δhack | Δhack std | Δsolve | n |
| --: | --: | --: | --: | --: |
| frozen | −0.131 | 0.146 | −0.050 | 4 |
| 1 | −0.175 | — | −0.100 | 1 |
| 2 | −0.181 | 0.169 | −0.081 | 4 |
| 5 | −0.225 | — | −0.075 | 1 |
| 10 | −0.200 | — | −0.100 | 1 |
|
||||
<!-- src (seed 41): *_goal0_fast_s41.log (SVD k=5, v_hack_full); *_meandiff_projected_s41.log (k=1) -->
|
||||
|
||||
(All v_hack_21pairs, mix=0.5, one_sided.)
|
||||
| basis | hack | solve |
| :-------------------- | ----: | ----: |
| vanilla | 0.775 | 0.300 |
| SVD k=5 (v_hack_full) | 0.775 | 0.275 |
| mean-diff k=1 | 0.750 | 0.125 |
|
||||
|
||||
**Answer: refresh-2 edges out frozen** (−0.181 vs −0.131, both n=4) but the
|
||||
difference (~5pp) is small vs the std (~0.16). The n=1 cadences (1/5/10) hint
|
||||
that more refresh = slightly more suppression, unconfirmed.
|
||||
**Answer: at seed 41 neither 18-pair basis cuts hack, and mean-diff tanks
|
||||
solve** (0.300→0.125). Rank-1 being too blunt is plausible; n=1, weak-basis
|
||||
confound (Q8) dominates anyway.
|
||||
|
||||
## Q5. refresh-every cadence (seed 41 only — the honest comparison)
|
||||
|
||||
<!-- src (all seed 41, v_hack_21pairs): *_g0_21pairs_frozen_s41.log, *_g1_21pairs_refresh1_s41.log, *_g0_21pairs_refresh2_s41.log, *_g1_21pairs_refresh5_s41.log, *_g1_21pairs_refresh10_s41.log -->
|
||||
|
||||
refresh-1/5/10 only ran on seed 41, so the only valid comparison is at seed 41,
|
||||
on the shared seed-41 vanilla baseline:
|
||||
|
||||
| refresh | hack | solve |
| :------------- | ----: | ----: |
| vanilla | 0.775 | 0.300 |
| frozen (n=20+) | 0.475 | 0.200 |
| 10 | 0.575 | 0.200 |
| 5 | 0.550 | 0.225 |
| 2 | 0.450 | 0.200 |
| 1 | 0.600 | 0.200 |
|
||||
|
||||
**Answer: no monotonic refresh trend.** At seed 41, frozen (0.475) and
|
||||
refresh-2 (0.450) are the best; refresh-1/5/10 are worse. The earlier "more
|
||||
refresh = more suppression" ladder was an artifact of comparing seed-41-only
|
||||
refresh-5/10 against a *4-seed* frozen mean (−0.131 paired). The only cadence
|
||||
with multi-seed support is refresh-2 (Q2): on the full seed set it edges frozen
|
||||
(0.537 vs 0.588 hack), but at seed 41 alone the two are within noise. Refresh
|
||||
helps marginally at best; basis width (Q8) is the real lever.
|
||||
|
||||
## Q6. Teacher density (mix) — paired, does the gap hold as the pool thins?
|
||||
|
||||
| mix | Δhack | Δhack std | Δsolve | n | shared seeds |
| --: | --: | --: | --: | --: | --: |
| 0.5 | −0.062 | 0.075 | −0.081 | 4 | 41(×2),43,44 |
| 0.25 | −0.122 | 0.146 | +0.017 | 3 | 41,42,43 |
| 0.125 | −0.100 | 0.040 | +0.007 | 2 | 41(×2) |
|
||||
<!-- src (v_hack_full, frozen, one_sided): vanilla *_mix*_vanilla*.log + *_goal0_fast*.log; projected *_mix0_25_projected_s4{1,2,3}.log, *_mix0.125_g8_projected_s41.log, *_goal{0,1}_fast_s4{1,3,4}.log -->
|
||||
|
||||
(v_hack_full, frozen, one_sided — the basis with coverage at all three mixes.)
|
||||
Paired Δ vs same-seed vanilla (v_hack_full, frozen, one_sided). Δ columns are
|
||||
per-seed paired means; absolute hack/solve are group means (may differ slightly
|
||||
from Δ since n differs):
|
||||
|
||||
**Answer: the reduction holds across densities (−6 to −12pp) and your read is
|
||||
right — any mix is sufficient to see it.** At lower mix the solve cost vanishes
|
||||
(even slightly positive). The mix=0.125 cell has the tightest std (0.040, n=2).
|
||||
| mix | van hack | proj hack | Δhack | ±std | van solve | proj solve | Δsolve | n | shared seeds |
| ----: | -------: | --------: | -----: | ----: | --------: | ---------: | -----: | ---: | -----------: |
| 0.5 | 0.719 | 0.700 | −0.062 | 0.075 | 0.306 | 0.283 | −0.081 | 4 | 41(×2),43,44 |
| 0.25 | 0.678 | 0.556 | −0.122 | 0.146 | 0.200 | 0.217 | +0.017 | 3 | 41,42,43 |
| 0.125 | 0.757 | 0.657 | −0.100 | 0.040 | 0.207 | 0.214 | +0.007 | 2 | 41(×2) |
|
||||
|
||||
**Answer: the reduction holds across densities (−6 to −12pp), and the solve
|
||||
cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
|
||||
*positive* (+0.7 to +1.7pp) at mix=0.25/0.125. mix=0.125 also has the tightest
|
||||
std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
|
||||
cut, no solve tax.
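The "paired on seed" rule behind the Δ columns can be sketched as follows. This is an illustrative reconstruction, not the code in `scripts/results.py`: `paired_delta` is a hypothetical helper, it assumes one last-5 rate per seed per arm (repeated seeds such as 41(×2) would need averaging first), and the numbers are made up.

```python
from statistics import mean, stdev


def paired_delta(vanilla, projected):
    """Per-seed deltas (projected - vanilla) over the seeds both arms share.

    Returns (mean delta, std across seeds or None if n == 1, shared seeds).
    Seeds present in only one arm are dropped rather than compared to a mean.
    """
    shared = sorted(set(vanilla) & set(projected))
    if not shared:
        raise ValueError("no shared seeds: the arms are not comparable")
    deltas = [projected[s] - vanilla[s] for s in shared]
    std = stdev(deltas) if len(deltas) > 1 else None
    return mean(deltas), std, shared


# Toy last-5 hack rates keyed by seed (illustrative values only).
vanilla = {41: 0.8, 42: 0.7, 43: 0.6}
projected = {41: 0.6, 43: 0.5, 44: 0.4}

if __name__ == "__main__":
    d, s, seeds = paired_delta(vanilla, projected)
    print(f"Δhack={d:+.3f} std={s:.3f} shared seeds={seeds}")
```

Because only shared seeds enter the delta, the paired mean can differ slightly from the difference of the two group means, which is the mismatch flagged in the table caption.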
|
||||
|
||||
## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
|
||||
|
||||
| basis | Δhack | Δhack std | Δsolve | n |
| :-- | --: | --: | --: | --: |
| v_hack_full (18) | −0.062 | 0.075 | −0.081 | 4 |
| v_hack_21pairs (21) | −0.131 | 0.146 | −0.050 | 4 |
|
||||
<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log; v_hack_18base *_solveorth_base18_s41.log -->
|
||||
|
||||
**Answer: the 21-pair basis suppresses ~2x more hacking** (−0.131 vs −0.062),
|
||||
both n=4 mix=0.5 frozen. Pair set is one of the largest levers here. Confounds
|
||||
count with the specific extra pairs; the 21-pair shared-seed set is the full
|
||||
41-44 while v_hack_full's is 41(×2),43,44.
|
||||
mix=0.5, frozen, one_sided:
|
||||
|
||||
| basis | hack | ±std | solve | ±std | n | seeds |
| :------------------ | ----: | ----: | ----: | ----: | ---: | ----------: |
| vanilla | 0.719 | 0.120 | 0.306 | 0.116 | 4 | 41,42,43,44 |
| v_hack_full (18) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
| v_hack_21pairs (21) | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
|
||||
|
||||
At the shared seed 41: 18-pair = 0.775 (= vanilla, no effect), 21-pair = 0.475.
|
||||
|
||||
**Answer: pair set is the largest lever.** The 18-pair basis barely moves hack
|
||||
(0.700 vs vanilla 0.719; at seed 41 it's identical to vanilla). The 21-pair set
|
||||
drops it to 0.588 (−13pp) and at seed 41 to 0.475 (−30pp).
|
||||
|
||||
What's actually in the pairs (`src/projected_grpo/pairs.py`). The 18-pair set
|
||||
(`PAIRS`) is heavily skewed to one axis:
|
||||
|
||||
| axis | hack mechanism | pairs |
| :--- | :--- | ---: |
| 1 | weak `run_tests` (prints / fake `# Expected`, no real assert) | 8 |
| 2 | hardcode literal answers from the example fixtures | 2 |
| 3 | persona-voice (completion's comments/voice differ, same code) | 2 |
| 4 | try/except that swallows the assertion failure | 2 |
| 5 | type-only assert (`isinstance`, no value check) | 2 |
| 6 | weak inequality predicate (`>= -1`, `>= 0`) | 2 |
|
||||
|
||||
So 8/18 (44%) of the 18-pair basis is the single "weak test-validation" axis;
|
||||
the other five mechanisms get 2 pairs each. The detector the student actually
|
||||
learns to exploit in the teacher pool is mostly axes 1/2/4 (test-bypass), so a
|
||||
basis dominated by axis-1 *should* overlap the live gradient — yet the 18-pair
|
||||
version doesn't move hack at seed 41. The 21-pair set added 3 pairs and lifted
|
||||
the cut to −30pp at seed 41, which says either (a) those specific 3 pairs land
|
||||
on a mechanism the live gradient uses, or (b) more pairs simply condition the
|
||||
SVD better (rank-3 of real signal vs rank-1). We can't separate these:
|
||||
|
||||
- the exact 3 extra pairs are **not in the committed `pairs.py`** (the 21-pair
|
||||
set was a temporary local edit during the `v_hack_21pairs.safetensors`
|
||||
extraction), so we can't audit which axes they hit. **FIXME: re-add the 3
|
||||
pairs to `PAIRS` so the 21-pair basis is reproducible.**
|
||||
- `v_hack_21pairs.safetensors` is 26M vs `v_hack_full`'s 11M (2.4x), far more
|
||||
than +17% pairs explains — the 21-pair extraction likely also kept more
|
||||
directions/modules (different k or noise-floor), confounding pair-set with
|
||||
basis width. Treat the 18-vs-21 gap as "better basis" not cleanly "more pairs".
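The ratios quoted in this section can be re-derived from the axis table and the two file sizes. The sketch below only restates that arithmetic; the variable names are illustrative, and the sizes are taken as quoted (26M and 11M) without re-measuring the files.

```python
# Pair counts per axis, copied from the axis table in this section.
AXIS_PAIRS = {1: 8, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2}

total_pairs = sum(AXIS_PAIRS.values())          # size of the 18-pair set
axis1_share = AXIS_PAIRS[1] / total_pairs       # weight of the weak-test axis
pair_growth = 21 / total_pairs                  # 21-pair set relative to 18
size_growth = 26 / 11                           # v_hack_21pairs vs v_hack_full

if __name__ == "__main__":
    print(f"axis-1 share: {axis1_share:.0%}")
    print(f"pairs: +{pair_growth - 1:.0%}, file size: {size_growth:.1f}x")
```

The gap between roughly +17% pairs and a ~2.4x larger file is what motivates reading the 18-vs-21 difference as a basis-width confound.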
|
||||
|
||||
---
|
||||
|
||||
## Q9. Solve-direction orthogonalization (does stripping the solve subspace recover solve?)
|
||||
|
||||
<!-- src (seed 41, mix=0.5, frozen, one_sided): #145 *_solveorth_base18_s41.log (v_hack_18base, no orth); #146 *_solveorth_m4_s41.log (v_hack_18solveorth4, m=4) -->
|
||||
|
||||
| basis | hack | solve |
| :--------------------- | ----: | ----: |
| vanilla | 0.775 | 0.300 |
| 18-pair base (no orth) | 0.500 | 0.200 |
| 18-pair solve-orth m=4 | 0.550 | 0.150 |
|
||||
|
||||
**Answer: no — at n=1 it did the opposite.** Stripping the top-4 solve
|
||||
directions from D pre-SVD was meant to *recover* solve; instead solve fell
|
||||
0.200→0.150 and hack rose 0.500→0.550. Both moves are ~0.05, inside the ~0.12
|
||||
seed std — inconclusive, leaning negative. Caveats: (1) two nominally-18-pair
|
||||
bases already disagree by 0.275 hack at this seed (`v_hack_full`=0.775 vs
|
||||
`v_hack_18base`=0.500), so extraction variance likely dominates a 0.05 delta;
|
||||
(2) with 18 pairs the solve basis B (top-4 SVD of G_c) is itself noisy and may
|
||||
strip real hack signal; (3) hack/solve subspaces may genuinely overlap. Needs
|
||||
≥3 seeds before any verdict.
|
||||
|
||||
---
|
||||
|
||||
## Dynamics note (sizing the convergence test)
|
||||
|
||||
Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
**plateaus by step ~13-16**; `gt_s` (solve) stays **noisy-flat at ~0.1-0.5 for
the whole run and never climbs**. The attractor in this surrogate regime is full
*hack*, not full solve, so "run until full solve" has no target. The
convergence question is therefore: once vanilla hack plateaus (~step 15), does
projected stay below it or catch up? A 60-step run (~2.2h at g8) reaches 4x the
plateau step, i.e. 3x past it; a 1000-step run (~36h) is wasteful.
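The run-length arithmetic above can be checked with a short sketch. It assumes wall-clock time scales linearly with step count from the quoted 60-step, ~2.2h reference at g8; `projected_hours` and its defaults are hypothetical names used only for illustration.

```python
PLATEAU_STEP = 15  # approximate step where vanilla hack_s flattens (from the note)


def projected_hours(steps, ref_steps=60, ref_hours=2.2):
    """Linear extrapolation of wall-clock time from one reference run."""
    return steps * ref_hours / ref_steps


if __name__ == "__main__":
    for steps in (20, 60, 1000):
        multiple = steps / PLATEAU_STEP
        print(f"{steps:>5} steps ~ {projected_hours(steps):5.1f} h "
              f"({multiple:.1f}x the plateau step)")
```

Under that linear assumption the 1000-step run lands near the quoted ~36h, while the 60-step run already covers the post-plateau window the convergence question needs.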
|
||||
|
||||
## Open / queued (no result yet)
|
||||
|
||||
- **solve-orthogonalization** (#145 base done, #146 m=4 running): base 18-pair
|
||||
paired Δhack −0.275 / Δsolve −0.100 (n=1, seed 41). m=4 pending — that's the
|
||||
one that tests whether stripping the solve subspace recovers the solve cost.
|
||||
- **overshoot=1.1** (#140), **let-it-converge 60-step** (#141/142): queued.
|
||||
- **convergence (does the gap persist past the plateau?)**: 60-step seed-42
|
||||
vanilla vs projected refresh-2 at mix=0.125, then add seeds if the gap holds.
|
||||
- **overshoot=1.1** (#140): queued.
|
||||
- **k-slice (k=1/2/5)**: only smoke-tested, no 4B results.
|
||||
- **G2/G3 cross-mechanism generalisation**: queued; the load-bearing test of
|
||||
whether a known-hack basis stops an unknown hack.
|
||||
|
||||
@@ -81,14 +81,15 @@ full *ARGS:
|
||||
# If lp_t stays flat with no NaN, the LR axis alone is exhausted; try inner_steps.
|
||||
fast-vanilla *ARGS:
|
||||
{{ TRAIN }} fast --arm=vanilla \
|
||||
--teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 \
|
||||
--teacher-pool-dir=out/probe_distill/teacher_pool \
|
||||
--grad-clip=500 {{ ARGS }}
|
||||
|
||||
# Goal 1: same recipe with --arm=projected. Run only after fast-vanilla passes UAT.
|
||||
# mix_ratio=0.125 + group=8 are the locked-in fast defaults (config), not flags here.
|
||||
fast-projected *ARGS:
|
||||
{{ TRAIN }} fast --arm=projected \
|
||||
--v-hack-path=out/v_hack_full.safetensors \
|
||||
--teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 \
|
||||
--teacher-pool-dir=out/probe_distill/teacher_pool \
|
||||
--grad-clip=500 {{ ARGS }}
|
||||
|
||||
# Sync the rl-rewardhacking external repo (Nanda's verl wrapper).
|
||||
|
||||
+3
-2
@@ -93,6 +93,7 @@ def parse_log(path: Path) -> dict | None:
|
||||
**cfg,
|
||||
L5_hack=mean(hs[-5:]), L5_solve=mean(gts[-5:]),
|
||||
WH_hack=mean(hs), n=len(hs),
|
||||
log=path.name, # provenance: every number traces back to this file
|
||||
)
|
||||
|
||||
|
||||
@@ -103,8 +104,8 @@ def main() -> None:
|
||||
return
|
||||
df = pl.DataFrame(rows).sort("time")
|
||||
|
||||
cols = ["time", "arm", "seed", "mix", "refr", "over", "gate", "k", "dropf",
|
||||
"vhack", "L5_hack", "L5_solve", "WH_hack", "n", "argv"]
|
||||
cols = ["arm", "seed", "mix", "refr", "over", "gate", "k", "dropf",
|
||||
"vhack", "L5_hack", "L5_solve", "WH_hack", "n", "log"]
|
||||
print("\n## All runs (sorted by time)\n")
|
||||
print(tabulate(df.select(cols).rows(), headers=cols, tablefmt="pipe", floatfmt=".3f"))
|
||||
|
||||
|
||||
@@ -194,7 +194,10 @@ class Config:
|
||||
# Loss is unchanged: ratio==1 in single-inner-step PPO, so reward-weighted
|
||||
# policy gradient applies uniformly to both halves regardless of source.
|
||||
teacher_pool_dir: Path | None = None
|
||||
mix_ratio: float = 0.5
|
||||
# Default teacher density. 0.125 (1 teacher in 8) is the locked-in operating
|
||||
# point: the hack-reduction gap holds there (docs/results.md Q6) and the solve
|
||||
# cost vanishes vs mix=0.5. Needs group>=8 so round(G*mix_ratio)>=1 teacher.
|
||||
mix_ratio: float = 0.125
|
||||
# Cross-mechanism BLUF (docs/spec/20260528_cross_mechanism_v_hack.md):
|
||||
# which upstream detectors were used to label the hack-side of the pairs that
|
||||
# produced v_hack. Used to split student-rollout hacks into half_A (covered by
|
||||
@@ -234,7 +237,7 @@ class FastConfig(Config):
|
||||
at pp=4 x 20 steps)."""
|
||||
model: str = "Qwen/Qwen3-4B"
|
||||
steps: int = 20
|
||||
group: int = 4
|
||||
group: int = 8 # G=8 so the locked-in mix_ratio=0.125 gives 1 teacher / 7 student
|
||||
max_new: int = 512
|
||||
n_problems: int = 200
|
||||
beta: float = 0.0
|
||||