feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2)

Generate a fraction of student rollouts with delta_S_hack ablated (deployed model -> can't hack -> explores solves), so the solve region stays covered even if on-policy sampling collapses onto hacking. Motivated by job 60's hkgap decay to ~0 post-emergence (the gate stops discriminating; the risk is that hack eats everything and delta_S starves). Pure sampling-side diversity with no impact on the no-cheat boundary; frac=0 = unchanged. Smoke-tested at frac=0.5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-28 01:45:14 +08:00 · 2026-06-01 05:32:04 +00:00
parent dfc6068896
commit ea4f4ee657
2 changed files with 90 additions and 4 deletions
@@ -2302,3 +2302,62 @@ arm) so the 5-arm overlay reads uniform numbers.
 5-arm clean sweep queued (40-44, all #164). On completion: confirm the deploy-solve>=
 train-solve gap reproduces per-arm, and read run_tests deploy-solve specifically (the
 watch-item for whether deploy-mode rollouts are needed).
+
+# 2026-06-01
+
+## Exploration floor against hack-saturation: `rollout_ablate_frac` (route/route2)
+
+**Context.** Live audit of job 60 (route2, scale-matched `delta_S_hack` quarantine,
+seed 41) past the step-10 emergence. The discrimination gauge `hkgap`
+(`ema_hack_cos - ema_clean_cos`) started clearly positive (+0.09 @ step 2) and
+decayed through zero to weakly negative (~-0.03 across steps 27-41); `tau` rode it
+down to ~0. So the calibrated gate is faithfully tracking a signal that has gone
+dead, the student-side `cos>tau` route is back to a coin flip, and routing is
+carried mostly by the forced teacher-anchor routing (`qE~0.55`). Solving survived
+(clean rollouts still kept), but this exposed the structural risk the user named:
+on-policy sampling can collapse onto hacking, at which point every rollout routes to
+the deleted quarantine and the deployed `delta_S` never sees a solve gradient. Hack
+eats everything.
+
+**Decision.** Add a standard RL exploration floor: generate a fraction
+(`rollout_ablate_frac`, default 0) of the student rollouts with the quarantine
+ablated, i.e. from the deployed model, which cannot express the hack and so explores
+the solve region. This guarantees solve-region coverage regardless of how saturated
+the full policy gets. Pure sampling-side diversity, no new loss, no reward change, no
+grader: it does not touch the no-cheat boundary. It accepts a slight off-policy
+mismatch (GRPO already tolerates off-policy samples via clipping/reuse), which the
+user judged worth it for the coverage. This is the previously-deferred "deploy-mode
+rollouts" idea (see prior entry), promoted from deferred now that job 60 shows the
+saturation pathway is live.
+
+Bonus property for our setup: at deploy `delta_S_hack` is zeroed, so the deployed
+model *is* the ablated model. Generating a fraction ablated trains `delta_S` partly
+on the exact distribution it faces at deploy, closing the train/deploy gap, not just
+preventing starvation.
+
+**Subtlety corrected mid-design (load-bearing).** Generation policy and gradient
+policy are decoupled in GRPO: the gradient comes from the teacher-forcing recompute,
+not the sampling pass. So generating ablated does *not* by itself keep `delta_S_hack`
+gradient-free; a solve rollout that happens to contain hack-ward tokens would still
+backprop into the quarantine under a full-model recompute. We do NOT match the
+recompute ablation per-subset (would need two backward passes). We rely instead on the fact
+that a genuine-solve rollout is clean-ward, so it is not flagged, so route2 leaves its
+full gradient in `delta_S` anyway. The exploration value (coverage) is what we are
+buying; the gradient routing is unchanged.
+
+**Implementation.** `train.py`: `Config.rollout_ablate_frac`; a `gen_students(enc, n)`
+helper that splits the n student rollouts into `round(n*frac)` ablated (under
+`ablate_quarantine`) + the rest full, pads, concatenates. Both generate call sites
+(pool and no-pool) route through it. Guarded to `intervention in {route, route2}`
+(only those have a quarantine); frac=0 collapses to a single plain generate, so
+vanilla/erase and all existing runs are byte-identical.
+
+**Verified.** `just smoke-route2 --rollout-ablate-frac=0.5`: 30 steps, clean exit,
+deploy eval fired (steps 0/10/20/29), all route2 columns populate, and the full/ablated
+split was padded and concatenated with no shape error.
+(log: `logs/20260601T053045_smoke_routing2_seed41.log`)
+
+**Next.** Queue route2-balanced + `--rollout-ablate-frac=0.5` (seed 41, 60 steps)
+and read `slv_dep`: the direct test of whether the exploration floor lifts deploy-solve
+vs the no-floor job 60. Keep the orthogonal `hkgap`-decay question (frozen vs
+`--vhack-refresh-every=2`) as a separate run so the two levers stay attributable.
@@ -173,6 +173,16 @@ class Config:
    preserve_magnitude: bool = True
    gate_mode: Literal["one_sided", "no_gate", "reverse"] = "one_sided"
    project_overshoot: float = 1.0    # remove overshoot*c_use@V; 1.0=just remove, 1.1=10% reversal of hack-ward grad
+    # Exploration floor against hack-saturation (route/route2 only). Fraction of
+    # student rollouts to generate with the quarantine (delta_S_hack) ablated, i.e.
+    # from the DEPLOYED model. The risk this guards: if on-policy sampling collapses
+    # onto hacking, the policy stops emitting solves, every rollout gets routed to
+    # the quarantine, and the deployed delta_S never sees a solve gradient to learn
+    # from (the policy saturates on hacking). Forcing a fraction hack-OFF guarantees the
+    # solve region stays covered, exactly like any RL exploration term. Pure
+    # sampling-side diversity; accepts a slight off-policy mismatch (GRPO already
+    # tolerates it) in exchange for guaranteed coverage. 0 = off (unchanged).
+    rollout_ablate_frac: float = 0.0
    # Which grader flaw + factual hint this run trains on (a "hack class"). Sets
    # the prompt hint (HINT_REPLACE_TO) and how `passed` is graded in rewards.py.
    # run_tests = the original run_tests-overwrite loophole. eq_override / exit_code
@@ -995,6 +1005,23 @@ def main(cfg: Config) -> int:
    eos_id = tok.eos_token_id
    pad_id = tok.pad_token_id

+    def gen_students(enc, n: int) -> torch.Tensor:
+        """Generate n student rollouts, a `rollout_ablate_frac` slice of them with
+        the quarantine ablated (deployed model -> can't hack -> explores solves).
+        See Config.rollout_ablate_frac for why. Full rows come first, ablated rows
+        last. frac=0 or non-quarantine arms -> a single plain generate, as before."""
+        n_abl = round(n * cfg.rollout_ablate_frac) if cfg.intervention in ("route", "route2") else 0
+        parts = []
+        if n - n_abl > 0:
+            parts.append(model.generate(**enc, generation_config=gen_cfg,
+                                         num_return_sequences=n - n_abl).detach())
+        if n_abl > 0:
+            with ablate_quarantine(wrappers):
+                parts.append(model.generate(**enc, generation_config=gen_cfg,
+                                             num_return_sequences=n_abl).detach())
+        L = max(p.shape[1] for p in parts)
+        return torch.cat([F.pad(p, (0, L - p.shape[1]), value=pad_id) for p in parts], dim=0)
+
    # Stream the per-step table live (header once, row per step). Same columns as
    # the final tabulate output. logger.info routes through tqdm.write so the
    # rows appear above the progress bar without breaking it.
@@ -1256,10 +1283,10 @@ def main(cfg: Config) -> int:
                if len(pool_rows) < G_t:
                    idxs = idxs + torch.randint(0, len(pool_rows), (G_t - len(pool_rows),), generator=rng).tolist()
                teacher_sample = [pool_rows[i] for i in idxs]
-                # Student live-gen. gen_cfg.num_return_sequences is baked to G_s
-                # at construction (pool path) or = group (no-pool path).
+                # Student live-gen (G_s rows; a rollout_ablate_frac slice generated
+                # with the quarantine ablated, see gen_students).
                with torch.no_grad():
-                    out_s = model.generate(**enc, generation_config=gen_cfg).detach()
+                    out_s = gen_students(enc, G_s)
                # Build teacher tensor: live-tokenized prompt + cached completion.
                # Cached prompt_ids are ignored — re-tokenizing live makes the pool
                # robust to chat-template / tokenizer drift between the model used
@@ -1281,7 +1308,7 @@ def main(cfg: Config) -> int:
                is_student = [True] * G_s + [False] * G_t
            else:
                with torch.no_grad():
-                    gen_out = model.generate(**enc, generation_config=gen_cfg).detach()
+                    gen_out = gen_students(enc, G_s)   # G_s == group when no teacher
                is_student = [True] * gen_out.shape[0]
            model.config.use_cache = False
            merged = gen_out
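
The split-and-pad arithmetic inside `gen_students` can be checked without a model. The following is a minimal sketch, not the code from this commit: it swaps the `torch` tensors for plain Python lists, and the helper names `split_counts` and `pad_and_concat` are hypothetical. It assumes only what the diff shows: the ablated count is `round(n * frac)`, the full-policy rows are emitted first, and every row is right-padded with `pad_id` to the longest sequence before concatenation.

```python
def split_counts(n: int, frac: float) -> tuple[int, int]:
    """Return (n_full, n_ablated), computed as in the diff: round(n * frac)."""
    # Python's round() rounds exact halves to the nearest even integer,
    # so small groups do not always split the way "half" suggests.
    n_abl = round(n * frac)
    return n - n_abl, n_abl


def pad_and_concat(parts: list[list[list[int]]], pad_id: int) -> list[list[int]]:
    """Right-pad every row to the longest row across all parts, then stack the parts."""
    width = max(len(row) for part in parts for row in part)
    return [row + [pad_id] * (width - len(row)) for part in parts for row in part]


# Toy token ids (hypothetical): two full-policy rows of length 4,
# one ablated row of length 6; pad_id = 0.
full = [[1, 2, 3, 4], [1, 2, 9, 0]]
ablated = [[1, 2, 5, 6, 7, 8]]
batch = pad_and_concat([full, ablated], pad_id=0)

print(split_counts(8, 0.5))  # (4, 4)
print(split_counts(3, 0.5))  # (1, 2): 1.5 rounds to 2
print(split_counts(1, 0.5))  # (1, 0): 0.5 rounds to 0, so no ablated rollout
print(batch)
```

One consequence worth keeping in mind when choosing `G_s`: because of half-to-even rounding, a group of size 1 at `frac=0.5` gets no ablated rollout at all, while a group of size 3 gets two.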