feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2)

Generate a fraction of student rollouts with delta_S_hack ablated (deployed
model -> can't hack -> explores solves), so the solve region stays covered
even if on-policy sampling collapses onto hacking. Motivated by job 60's
hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack
eats everything and delta_S starves). Pure sampling-side diversity, no
no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-01 05:32:04 +00:00
parent dfc6068896
commit ea4f4ee657
2 changed files with 90 additions and 4 deletions
+59
View File
@@ -2302,3 +2302,62 @@ arm) so the 5-arm overlay reads uniform numbers.
5-arm clean sweep queued (40-44, all #164). On completion: confirm the deploy-solve>=
train-solve gap reproduces per-arm, and read run_tests deploy-solve specifically (the
watch-item for whether deploy-mode rollouts are needed).
# 2026-06-01
## Exploration floor against hack-saturation: `rollout_ablate_frac` (route/route2)
**Context.** Live audit of job 60 (route2, scale-matched `delta_S_hack` quarantine,
seed 41) past the step-10 emergence. The discrimination gauge `hkgap`
(`ema_hack_cos - ema_clean_cos`) started clearly positive (+0.09 @ step 2) and
decayed through zero to weakly negative (~-0.03 across steps 27-41); `tau` rode it
down to ~0. So the calibrated gate is faithfully tracking a signal that has gone
dead, the student-side `cos>tau` route is back to a coin flip, and routing is
carried mostly by the forced teacher-anchor routing (`qE~0.55`). Solving survived
(clean rollouts still kept), but this exposed the structural risk the user named:
on-policy sampling can collapse onto hacking, at which point every rollout routes to
the deleted quarantine and the deployed `delta_S` never sees a solve gradient. Hack
eats everything.
**Decision.** Add a standard RL exploration floor: generate a fraction
(`rollout_ablate_frac`, default 0) of the student rollouts with the quarantine
ablated, i.e. from the deployed model, which cannot express the hack and so explores
the solve region. This guarantees solve-region coverage regardless of how saturated
the full policy gets. Pure sampling-side diversity, no new loss, no reward change, no
grader: it does not touch the no-cheat boundary. It accepts a slight off-policy
mismatch (GRPO already tolerates off-policy samples via clipping/reuse), which the
user judged worth it for the coverage. This is the previously-deferred "deploy-mode
rollouts" idea (see prior entry), promoted from deferred now that job 60 shows the
saturation pathway is live.
Bonus property for our setup: at deploy `delta_S_hack` is zeroed, so the deployed
model *is* the ablated model. Generating a fraction ablated trains `delta_S` partly
on the exact distribution it faces at deploy, closing the train/deploy gap, not just
preventing starvation.
**Subtlety corrected mid-design (load-bearing).** Generation policy and gradient
policy are decoupled in GRPO: the gradient comes from the teacher-forcing recompute,
not the sampling pass. So generating ablated does *not* by itself keep `delta_S_hack`
gradient-free; a solve rollout that happens to contain hack-ward tokens would still
backprop into the quarantine under a full-model recompute. We do NOT match the
recompute ablation per-subset (would need two backwards). We rely instead on the fact
that a genuine-solve rollout is clean-ward, so it is not flagged, so route2 leaves its
full gradient in `delta_S` anyway. The exploration value (coverage) is what we are
buying; the gradient routing is unchanged.
**Implementation.** `train.py`: `Config.rollout_ablate_frac`; a `gen_students(enc, n)`
helper that splits the n student rollouts into `round(n*frac)` ablated (under
`ablate_quarantine`) + the rest full, pads, concatenates. Both generate call sites
(pool and no-pool) route through it. Guarded to `intervention in {route, route2}`
(only those have a quarantine); frac=0 collapses to a single plain generate, so
vanilla/erase and all existing runs are byte-identical.
**Verified.** `just smoke-route2 --rollout-ablate-frac=0.5`: 30 steps, clean exit,
deploy eval fired (steps 0/10/20/29), all route2 columns populate, the ablated/full
split padded+concatenated with no shape error.
(log: `logs/20260601T053045_smoke_routing2_seed41.log`)
**Next.** Queue route2-balanced + `--rollout-ablate-frac=0.5` (seed 41, 60 steps)
and read `slv_dep`: the direct test of whether the exploration floor lifts deploy-solve
vs the no-floor job 60. Keep the orthogonal `hkgap`-decay question (frozen vs
`--vhack-refresh-every=2`) as a separate run so the two levers stay attributable.
+31 -4
View File
@@ -173,6 +173,16 @@ class Config:
preserve_magnitude: bool = True
gate_mode: Literal["one_sided", "no_gate", "reverse"] = "one_sided"
project_overshoot: float = 1.0 # remove overshoot*c_use@V; 1.0=just remove, 1.1=10% reversal of hack-ward grad
# Exploration floor against hack-saturation (route/route2 only). Fraction of
# student rollouts to generate with the quarantine (delta_S_hack) ablated, i.e.
# from the DEPLOYED model. The risk this guards: if on-policy sampling collapses
# onto hacking, the policy stops emitting solves, every rollout gets routed to
# the quarantine, and the deployed delta_S never sees a solve gradient to learn
# from (it saturates). Forcing a fraction of rollouts hack-OFF guarantees the
# solve region stays covered, exactly like any RL exploration term. Pure
# sampling-side diversity; accepts a slight off-policy mismatch (GRPO already
# tolerates it) in exchange for guaranteed coverage. 0 = off (unchanged).
rollout_ablate_frac: float = 0.0
# Which grader flaw + factual hint this run trains on (a "hack class"). Sets
# the prompt hint (HINT_REPLACE_TO) and how `passed` is graded in rewards.py.
# run_tests = the original run_tests-overwrite loophole. eq_override / exit_code
@@ -995,6 +1005,23 @@ def main(cfg: Config) -> int:
eos_id = tok.eos_token_id
pad_id = tok.pad_token_id
def gen_students(enc, n: int) -> torch.Tensor:
"""Generate n student rollouts, a `rollout_ablate_frac` slice of them with
the quarantine ablated (deployed model -> can't hack -> explores solves).
See Config.rollout_ablate_frac for why. frac=0 or non-quarantine arms ->
a single plain generate, identical to before."""
n_abl = round(n * cfg.rollout_ablate_frac) if cfg.intervention in ("route", "route2") else 0
parts = []
if n - n_abl > 0:
parts.append(model.generate(**enc, generation_config=gen_cfg,
num_return_sequences=n - n_abl).detach())
if n_abl > 0:
with ablate_quarantine(wrappers):
parts.append(model.generate(**enc, generation_config=gen_cfg,
num_return_sequences=n_abl).detach())
L = max(p.shape[1] for p in parts)
return torch.cat([F.pad(p, (0, L - p.shape[1]), value=pad_id) for p in parts], dim=0)
# Stream the per-step table live (header once, row per step). Same columns as
# the final tabulate output. logger.info routes through tqdm.write so the
# rows appear above the progress bar without breaking it.
@@ -1256,10 +1283,10 @@ def main(cfg: Config) -> int:
if len(pool_rows) < G_t:
idxs = idxs + torch.randint(0, len(pool_rows), (G_t - len(pool_rows),), generator=rng).tolist()
teacher_sample = [pool_rows[i] for i in idxs]
# Student live-gen. gen_cfg.num_return_sequences is baked to G_s
# at construction (pool path) or = group (no-pool path).
# Student live-gen (G_s rows; a rollout_ablate_frac slice generated
# with the quarantine ablated, see gen_students).
with torch.no_grad():
out_s = model.generate(**enc, generation_config=gen_cfg).detach()
out_s = gen_students(enc, G_s)
# Build teacher tensor: live-tokenized prompt + cached completion.
# Cached prompt_ids are ignored — re-tokenizing live makes the pool
# robust to chat-template / tokenizer drift between the model used
@@ -1281,7 +1308,7 @@ def main(cfg: Config) -> int:
is_student = [True] * G_s + [False] * G_t
else:
with torch.no_grad():
gen_out = model.generate(**enc, generation_config=gen_cfg).detach()
gen_out = gen_students(enc, G_s) # G_s == group when no teacher
is_student = [True] * gen_out.shape[0]
model.config.use_cache = False
merged = gen_out