mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-28 01:45:14 +08:00
feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2)
Generate a fraction of student rollouts with delta_S_hack ablated (deployed model -> can't hack -> explores solves), so the solve region stays covered even if on-policy sampling collapses onto hacking. Motivated by job 60's hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack eats everything and delta_S starves). Pure sampling-side diversity, no no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -2302,3 +2302,62 @@ arm) so the 5-arm overlay reads uniform numbers.
|
||||
5-arm clean sweep queued (40-44, all #164). On completion: confirm the deploy-solve>=
|
||||
train-solve gap reproduces per-arm, and read run_tests deploy-solve specifically (the
|
||||
watch-item for whether deploy-mode rollouts are needed).
|
||||
|
||||
# 2026-06-01
|
||||
|
||||
## Exploration floor against hack-saturation: `rollout_ablate_frac` (route/route2)
|
||||
|
||||
**Context.** Live audit of job 60 (route2, scale-matched `delta_S_hack` quarantine,
|
||||
seed 41) past the step-10 emergence. The discrimination gauge `hkgap`
|
||||
(`ema_hack_cos - ema_clean_cos`) started clearly positive (+0.09 @ step 2) and
|
||||
decayed through zero to weakly negative (~-0.03 across steps 27-41); `tau` rode it
|
||||
down to ~0. So the calibrated gate is faithfully tracking a signal that has gone
|
||||
dead, the student-side `cos>tau` route is back to a coin flip, and routing is
|
||||
carried mostly by the forced teacher-anchor routing (`qE~0.55`). Solving survived
|
||||
(clean rollouts still kept), but this exposed the structural risk the user named:
|
||||
on-policy sampling can collapse onto hacking, at which point every rollout routes to
|
||||
the deleted quarantine and the deployed `delta_S` never sees a solve gradient. Hack
|
||||
eats everything.
|
||||
|
||||
**Decision.** Add a standard RL exploration floor: generate a fraction
|
||||
(`rollout_ablate_frac`, default 0) of the student rollouts with the quarantine
|
||||
ablated, i.e. from the deployed model, which cannot express the hack and so explores
|
||||
the solve region. This guarantees solve-region coverage regardless of how saturated
|
||||
the full policy gets. Pure sampling-side diversity, no new loss, no reward change, no
|
||||
grader: it does not touch the no-cheat boundary. It accepts a slight off-policy
|
||||
mismatch (GRPO already tolerates off-policy samples via clipping/reuse), which the
|
||||
user judged worth it for the coverage. This is the previously-deferred "deploy-mode
|
||||
rollouts" idea (see prior entry), promoted from deferred now that job 60 shows the
|
||||
saturation pathway is live.
|
||||
|
||||
Bonus property for our setup: at deploy `delta_S_hack` is zeroed, so the deployed
|
||||
model *is* the ablated model. Generating a fraction ablated trains `delta_S` partly
|
||||
on the exact distribution it faces at deploy, closing the train/deploy gap, not just
|
||||
preventing starvation.
|
||||
|
||||
**Subtlety corrected mid-design (load-bearing).** Generation policy and gradient
|
||||
policy are decoupled in GRPO: the gradient comes from the teacher-forcing recompute,
|
||||
not the sampling pass. So generating ablated does *not* by itself keep `delta_S_hack`
|
||||
gradient-free; a solve rollout that happens to contain hack-ward tokens would still
|
||||
backprop into the quarantine under a full-model recompute. We do NOT match the
|
||||
recompute ablation per-subset (would need two backwards). We rely instead on the fact
|
||||
that a genuine-solve rollout is clean-ward, so it is not flagged, so route2 leaves its
|
||||
full gradient in `delta_S` anyway. The exploration value (coverage) is what we are
|
||||
buying; the gradient routing is unchanged.
|
||||
|
||||
**Implementation.** `train.py`: `Config.rollout_ablate_frac`; a `gen_students(enc, n)`
|
||||
helper that splits the n student rollouts into `round(n*frac)` ablated (under
|
||||
`ablate_quarantine`) + the rest full, pads, concatenates. Both generate call sites
|
||||
(pool and no-pool) route through it. Guarded to `intervention in {route, route2}`
|
||||
(only those have a quarantine); frac=0 collapses to a single plain generate, so
|
||||
vanilla/erase and all existing runs are byte-identical.
|
||||
|
||||
**Verified.** `just smoke-route2 --rollout-ablate-frac=0.5`: 30 steps, clean exit,
|
||||
deploy eval fired (steps 0/10/20/29), all route2 columns populate, the ablated/full
|
||||
split padded+concatenated with no shape error.
|
||||
(log: `logs/20260601T053045_smoke_routing2_seed41.log`)
|
||||
|
||||
**Next.** Queue route2-balanced + `--rollout-ablate-frac=0.5` (seed 41, 60 steps)
|
||||
and read `slv_dep`: the direct test of whether the exploration floor lifts deploy-solve
|
||||
vs the no-floor job 60. Keep the orthogonal `hkgap`-decay question (frozen vs
|
||||
`--vhack-refresh-every=2`) as a separate run so the two levers stay attributable.
|
||||
|
||||
@@ -173,6 +173,16 @@ class Config:
|
||||
preserve_magnitude: bool = True
|
||||
gate_mode: Literal["one_sided", "no_gate", "reverse"] = "one_sided"
|
||||
project_overshoot: float = 1.0 # remove overshoot*c_use@V; 1.0=just remove, 1.1=10% reversal of hack-ward grad
|
||||
# Exploration floor against hack-saturation (route/route2 only). Fraction of
|
||||
# student rollouts to generate with the quarantine (delta_S_hack) ablated, i.e.
|
||||
# from the DEPLOYED model. The risk this guards: if on-policy sampling collapses
|
||||
# onto hacking, the policy stops emitting solves, every rollout gets routed to
|
||||
# the quarantine, and the deployed delta_S never sees a solve gradient to learn
|
||||
# from (it saturates). Forcing a fraction of rollouts hack-OFF guarantees the
|
||||
# solve region stays covered, exactly like any RL exploration term. Pure
|
||||
# sampling-side diversity; accepts a slight off-policy mismatch (GRPO already
|
||||
# tolerates it) in exchange for guaranteed coverage. 0 = off (unchanged).
|
||||
rollout_ablate_frac: float = 0.0
|
||||
# Which grader flaw + factual hint this run trains on (a "hack class"). Sets
|
||||
# the prompt hint (HINT_REPLACE_TO) and how `passed` is graded in rewards.py.
|
||||
# run_tests = the original run_tests-overwrite loophole. eq_override / exit_code
|
||||
@@ -995,6 +1005,23 @@ def main(cfg: Config) -> int:
|
||||
eos_id = tok.eos_token_id
|
||||
pad_id = tok.pad_token_id
|
||||
|
||||
def gen_students(enc, n: int) -> torch.Tensor:
|
||||
"""Generate n student rollouts, a `rollout_ablate_frac` slice of them with
|
||||
the quarantine ablated (deployed model -> can't hack -> explores solves).
|
||||
See Config.rollout_ablate_frac for why. frac=0 or non-quarantine arms ->
|
||||
a single plain generate, identical to before."""
|
||||
n_abl = round(n * cfg.rollout_ablate_frac) if cfg.intervention in ("route", "route2") else 0
|
||||
parts = []
|
||||
if n - n_abl > 0:
|
||||
parts.append(model.generate(**enc, generation_config=gen_cfg,
|
||||
num_return_sequences=n - n_abl).detach())
|
||||
if n_abl > 0:
|
||||
with ablate_quarantine(wrappers):
|
||||
parts.append(model.generate(**enc, generation_config=gen_cfg,
|
||||
num_return_sequences=n_abl).detach())
|
||||
L = max(p.shape[1] for p in parts)
|
||||
return torch.cat([F.pad(p, (0, L - p.shape[1]), value=pad_id) for p in parts], dim=0)
|
||||
|
||||
# Stream the per-step table live (header once, row per step). Same columns as
|
||||
# the final tabulate output. logger.info routes through tqdm.write so the
|
||||
# rows appear above the progress bar without breaking it.
|
||||
@@ -1256,10 +1283,10 @@ def main(cfg: Config) -> int:
|
||||
if len(pool_rows) < G_t:
|
||||
idxs = idxs + torch.randint(0, len(pool_rows), (G_t - len(pool_rows),), generator=rng).tolist()
|
||||
teacher_sample = [pool_rows[i] for i in idxs]
|
||||
# Student live-gen. gen_cfg.num_return_sequences is baked to G_s
|
||||
# at construction (pool path) or = group (no-pool path).
|
||||
# Student live-gen (G_s rows; a rollout_ablate_frac slice generated
|
||||
# with the quarantine ablated, see gen_students).
|
||||
with torch.no_grad():
|
||||
out_s = model.generate(**enc, generation_config=gen_cfg).detach()
|
||||
out_s = gen_students(enc, G_s)
|
||||
# Build teacher tensor: live-tokenized prompt + cached completion.
|
||||
# Cached prompt_ids are ignored — re-tokenizing live makes the pool
|
||||
# robust to chat-template / tokenizer drift between the model used
|
||||
@@ -1281,7 +1308,7 @@ def main(cfg: Config) -> int:
|
||||
is_student = [True] * G_s + [False] * G_t
|
||||
else:
|
||||
with torch.no_grad():
|
||||
gen_out = model.generate(**enc, generation_config=gen_cfg).detach()
|
||||
gen_out = gen_students(enc, G_s) # G_s == group when no teacher
|
||||
is_student = [True] * gen_out.shape[0]
|
||||
model.config.use_cache = False
|
||||
merged = gen_out
|
||||
|
||||
Reference in New Issue
Block a user