mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 15:15:40 +08:00
feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method
Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed in each rollout's own sampling mode, so ρ is a true importance ratio. The old always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout).

Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled absorb/route rows by construction -- frac=0 is the clean point.

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1) over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean era anchors the low tail; buffer stores acts, re-scored vs current v_act so a refresh needs no flush). Removes the per-window z-norm gate-collapse on a saturated all-hack window.

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it elicits the hack and absorption can localize it. queue-decision now passes --gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL 2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -104,15 +104,15 @@ $\theta_{\text{forget}}$.
Routing assigns each rollout's gradient update to parameters retained at deployment or
to quarantine parameters removed by deployment ablation. A false negative updates the
retained parameters with a reward-hacking example, whereas a false positive removes
one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
they do not explicitly optimize this asymmetric cost.
one non-reward-hacking update. Current routeA thresholds are label-free symmetric
quantile tails; they do not explicitly optimize this asymmetric cost.

The routing score is the dot product between a rollout's pooled deployed-block
bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
mean hack-minus-clean activation difference extracted from hand-authored contrast
pairs with forward passes only. Ground-truth labels from training rollouts never
set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
selects the keep, absorb, and route regions.
set routes or thresholds. Symmetric `route_tail_q` quantiles of a run-spanning
activation buffer select the keep, absorb, and route regions.
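
The symmetric-tail gate can be sketched in a few lines. This is a minimal illustration, not the repository's `_routeA_gate`: the helper names `quantile`, `tail_thresholds`, and `zone`, the linear-interpolation quantile convention, and the toy scores are all assumptions. Only the `route_tail_q` name, the three regions, and the comparison rules (score <= t_lo keep, score >= t_hi rout, otherwise absorb) are taken from this commit.

```python
def quantile(sorted_xs: list[float], q: float) -> float:
    """Linear-interpolation quantile of an already-sorted list (assumed convention)."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


def tail_thresholds(buffer_scores: list[float], route_tail_q: float) -> tuple[float, float]:
    """Lower and upper cuts: the q and 1-q quantiles of the whole score buffer."""
    s = sorted(buffer_scores)
    return quantile(s, route_tail_q), quantile(s, 1.0 - route_tail_q)


def zone(score: float, t_lo: float, t_hi: float) -> str:
    """Three-way label for one rollout score."""
    if score <= t_lo:
        return "keep"
    if score >= t_hi:
        return "rout"
    return "absorb"


# Toy buffer: scores 0..100 stand in for dot(pooled act, v_act) over the run.
buffer_scores = [float(i) for i in range(101)]
t_lo, t_hi = tail_thresholds(buffer_scores, route_tail_q=0.1)

# A batch is labelled against the buffer-wide cuts, not against its own quantiles.
batch_zones = [zone(s, t_lo, t_hi) for s in (3.0, 50.0, 97.0)]
```

Because the cuts come from the whole buffer rather than from the current batch, the share of a batch that lands in `rout` is not pinned to `route_tail_q`; a batch made up mostly of high scores can be routed almost entirely.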

The middle region leaves both parameter blocks trainable and may permit absorption.
Calling this region `absorb` names the intended mechanism; it does not establish that
@@ -180,6 +180,43 @@ receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
tests whether direction discrimination adds value beyond quarantine-induced
shrinkage; compare its measured `qmass` because routing mass is not controlled.
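
For reference, a Haar-random direction on the unit sphere can be drawn by normalizing an isotropic Gaussian vector. The sketch below assumes that is how a placebo direction would be built; the function name `haar_unit_vector`, the dimension 64, and the use of Python's `random` module are illustrative, and the seed 157 is borrowed from the `--routeA-random-v-seed=157` flag purely as an example value.

```python
import math
import random


def haar_unit_vector(dim: int, seed: int) -> list[float]:
    """Uniform direction on the unit sphere: i.i.d. Gaussian draws, then normalize."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


v_placebo = haar_unit_vector(dim=64, seed=157)  # hypothetical placebo direction
placebo_norm = math.sqrt(dot(v_placebo, v_placebo))
```

A fixed seed makes the placebo reproducible across runs, so any difference from the real `v_act` arm cannot be blamed on a different random draw.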

### gen / forward / backward

Three stages, each answering a different question.

- gen is exploration: it only picks which tokens to try, which are then teacher-forced.
  No gradient flows through it. The one question is on- vs off-policy. `gen_deploy_frac`
  sets how many rollouts are sampled with the quarantine ablated.
- forward is teacher-forced and applies the routeA mask. Two forwards: `logπ_old`
  (no-grad baseline, also reads the gate activations) and `logπ` (under the mask). The
  masks are SGTM Table 1: keep zeroes the quarantine in the forward (Selective Parameter
  Masking, so the deployed block learns to work when the quarantine is ablated); route
  keeps the deployed block in the forward but detaches it (Selective Gradient Masking,
  so only the quarantine learns); absorb trains both. SGTM picks the row from a data
  label; routeA picks it from the `v_act` score.
- backward is Dr.GRPO: per-prompt-group advantage `A = reward - mean`, with no `/σ` and
  no length normalization. `logπ_old` is the behavior policy, computed in each rollout's
  own sampling mode, so `ρ` is a true importance ratio. `ρ = 1` only when the mask's
  forward mode matches the sampling mode (deploy-sampled keep; full-sampled
  absorb/route). The unbounded direction is `ρ = full/ablated` with `A<0`, which the
  one-sided clip can't bound: the old always-ablated baseline gave it to every
  full-sampled route row (the frac=0 blow-up), and it still exists at `0<frac<1` on
  deploy-sampled rows gated absorb/route. At frac=0 the only mismatch is full-sampled
  keep, `ρ = ablated/full`, usually `<1` and clip-bounded. `ρ` is logged per zone.
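
To make the unbounded direction concrete, the sketch below evaluates a PPO-style clipped surrogate for a few values of `ρ`. It assumes the standard form `-min(ρ·A, clip(ρ, 1-ε, 1+ε)·A)` with `ε = 0.2`; the trainer's actual clip is not shown in this section, so the function names, the constant, and the toy rewards are illustrative only.

```python
def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """A = reward - group mean, with no division by the group std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_loss(rho: float, adv: float, eps: float = 0.2) -> float:
    """Negative PPO-style surrogate: -min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(rho, 1.0 + eps))
    return -min(rho * adv, clipped * adv)


# One rewarded rollout in a prompt group of four.
advs = dr_grpo_advantages([1.0, 0.0, 0.0, 0.0])
a_pos, a_neg = advs[0], advs[1]  # 0.75 and -0.25

ratios = (1.0, 10.0, 1000.0)
# A > 0: once rho passes 1 + eps the clipped term wins the min, so the loss is flat.
loss_pos = [clipped_loss(rho, a_pos) for rho in ratios]
# A < 0: the unclipped term rho * A wins the min, so the loss grows linearly in rho.
loss_neg = [clipped_loss(rho, a_neg) for rho in ratios]
```

Under this assumed form, a row with `A<0` and a ratio far above 1 contributes a loss proportional to `ρ`, which is the shape of divergence the text attributes to `ρ = full/ablated`.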

### Why generate with the quarantine on (frac=0)

If we generate in deploy mode, every rollout is on-policy for the deployed block and
off-policy for the quarantine. Reward pushes toward hacking, and the gradient for a
hack flows most easily into the block that generated it -- the deployed one -- making
it hackier instead of localizing the hack in the quarantine. Generating with the
quarantine on lets it produce hacks and absorb them, and lets the hack saturate so
absorption has a feature to localize. We keep ~25% of problems unhackable so the
deployed block keeps solving while the hack still saturates. The cost of frac=0 is a
train/deploy mismatch for the deployed block (it deploys generating alone but trained
on full-model tokens), seen in deploy-eval solve. This all assumes absorption actually
engages here, which is the thing to measure.
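
How `gen_deploy_frac` partitions a group of rollouts can be sketched as below. The split rule `round(n * gen_deploy_frac)` and the ordering (quarantine-ON rows first, deploy-mode rows last) follow the `gen_students` change in this commit; the helper name `split_rollouts` and the string tags `"full"` and `"deploy"` are hypothetical.

```python
def split_rollouts(n: int, gen_deploy_frac: float) -> list[str]:
    """Tag each of n rollouts with its sampling mode (illustrative helper)."""
    if not 0.0 <= gen_deploy_frac <= 1.0:
        raise ValueError(f"gen_deploy_frac must be in [0,1], got {gen_deploy_frac}")
    n_dep = round(n * gen_deploy_frac)
    # Quarantine-ON (full-model) rows first, deploy-mode (quarantine-ablated) rows last.
    return ["full"] * (n - n_dep) + ["deploy"] * n_dep


modes_frac0 = split_rollouts(8, 0.0)      # every rollout sampled with the quarantine ON
modes_frac_half = split_rollouts(8, 0.5)  # mixed group
modes_frac1 = split_rollouts(8, 1.0)      # every rollout sampled in deploy mode
```

Python's built-in `round` rounds exact halves to the nearest even integer, so a group of 5 at `gen_deploy_frac=0.5` yields 2 deploy-mode rows rather than 3.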

## Extra instructions:

- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
@@ -2,6 +2,35 @@

Append-only. New entries at the top, date-stamped. Never edit old entries.

## 2026-06-12 (a) — gen / forward / backward, cleanly separated (no new results)

Clarification pass, not an experiment. Pinned down the three per-step stages against
the code (train.py:800-924) and the SGTM paper (Section 3.2 + Table 1), because the
previous session kept conflating them.

- gen = exploration only. It picks which token sequences exist; they are then
  teacher-forced. No gradient flows through generation. The only question it raises
  is on-/off-policy, which SGTM never faces (supervised, fixed corpus, no ratio).
  `gen_deploy_frac=1.0` -> sampler = always-ablated baseline (on-policy);
  `<1.0` -> some rollouts sampled full-adapter while baseline stays ablated (off-policy).
- forward = teacher-forced; this is where the routeA `(m,d)` masks act, and they are
  SGTM Table 1 on the adapter: keep`(0,0)` = Selective Parameter Masking (zero
  theta_forget in forward), route`(1,1)` = Selective Gradient Masking (forget samples
  don't update theta_retain; here via detaching the deployed branch, kept in forward),
  absorb`(1,0)` = D_unlabeled row. SGTM picks the row by data label; routeA by v_act.
- backward = Dr.GRPO. Advantage per-prompt group (A = reward - mean). Two unbiased
  fixes on: no /sigma_R, no 1/|o_i| length norm (constant group*max_new*prompts
  denominator). One inner epoch -> rho = exp(logpi - logpi_old) == 1 exactly for keep
  (trainer = baseline = deployed-only); for route/absorb the trainer forward includes
  the quarantine but the baseline does not, so rho drifts with the quarantine delta and
  the one-sided clip bounds it for A>0 but not A<0.
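
The three masks can be written as a small lookup. The reading of `(m, d)` used here, with `m` gating whether the quarantine block contributes to the forward pass and `d` marking the deployed block as detached, is an assumption inferred from the bullets above and from the gate log line (`rout (1,1: deployed detached)`). The `Mask` class and the `blocks_updated` helper are hypothetical names, not part of train.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mask:
    m: int  # 1 = quarantine block contributes to the forward pass (assumed meaning)
    d: int  # 1 = deployed block is detached and gets no gradient (assumed meaning)


MASKS = {
    "keep": Mask(m=0, d=0),    # quarantine zeroed in the forward
    "absorb": Mask(m=1, d=0),  # both blocks in the forward, both trainable
    "route": Mask(m=1, d=1),   # deployed kept in the forward but detached
}


def blocks_updated(zone: str) -> set[str]:
    """Which parameter blocks can receive a gradient under the zone's mask."""
    mask = MASKS[zone]
    updated = set()
    if not mask.d:
        updated.add("deployed")
    if mask.m:  # a block zeroed out of the forward contributes no gradient
        updated.add("quarantine")
    return updated


updates = {z: blocks_updated(z) for z in MASKS}
```

Read this way, keep trains only the deployed block, route trains only the quarantine, and absorb trains both, which matches the SGTM rows named in the forward bullet.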

This enumeration is now in AGENTS.md under "The adapter and routing setup". Two open
failures still undiagnosed and held co-equal (task #55): frac=1 job 34 stable but hack
lands in the deployed block (quar-only 0.66 < deployed 0.91); frac=0 job 37 loss
blow-up 5e5, confounded by unhackable_frac=0. Next move unchanged: log rho per zone,
short frac=0 re-run, look.

## 2026-06-07 (e) — eval-contamination bug FOUND, FIXED, PROVEN; train-subset is the same bug

The base solve=0.94 (entries b/c/d) is a real bug: the periodic VAL eval ran on
@@ -191,7 +191,7 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav



#
# 2026-06-11 12:18:46

> Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing.

@@ -208,3 +208,7 @@ Changed



# 2026-06-11 12:18:43

I found that activations (and the residual stream) are better for routing than gradients. I used an analysis where I treated routing like a classifier to see which formulation had the most fundamental separability, and which vector had the best AUROC when treated as a classifier.
Then simplified it anyway
@@ -11,10 +11,11 @@ TEACHER_RT := "out/pools/teacher_pool_runtests_dense" # dense single-mode run_
# 4 prompts x group 8 -> 16 teachers/step (8 hack + 8 solve), 16 students. Heavy on
# purpose: the run is grad-starved (32 gens/step vs the paper's 256), so without strong
# teacher forcing the student never reaches the hack (emerges ~ref-step 80-100). Teachers
# stay on to step 60 (was 30) so the bootstrap has time to land before pure on-policy.
# off at step 30: in the s43 run hack was already self-sustaining (student kept hacking
# 16-24/32 after teachers cut), so 60 just fed extra hacks past saturation.
# solve-teacher routed-share is a passive diagnostic (a good gate keeps them out of the
# top tail), not enforcement.
TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=60"
TEACH := "--mix-ratio=0.5 --solve-pool-dir=out/pools/teacher_pool_solve --solve-mix-frac=0.5 --teacher-off-step=30"

default:
    @just --list
@@ -96,11 +97,36 @@ smoke-all:
# works); high + rout~0 = threshold problem; a drop at a refresh = a direction problem.
# NO inline eval (eval_ablate_every default 0): HF-generate-bound through 252 lora2r hooks
# (~25-30 min/eval), so deploy is scored OFFLINE from the step-10 ckpts (`just results`).
# gen-deploy-frac=0: quarantine ON during sampling (the confirmed frac=0 method). The base
# default is 1.0 (the job-34 config where ablation RAISED hack 0.71->0.86); pass it
# explicitly here so the sweep never silently re-runs that. frac=0 also dodges the
# frac=0.5 blow-up combo (deploy-sampled absorb/route rows carry ρ=full/ablated, A<0).
queue-decision seed='43':
    pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}}
    pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}}
    pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}}
    pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}}
    pueue add -w "$PWD" -o 62 -l "why: P1 lora2r routeA REAL-v act gate + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack << placebo at matched solve -> directionality real" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_real_s{{seed}}
    pueue add -w "$PWD" -o 58 -l "why: P2 lora2r routeA PLACEBO-v (Haar 157) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack ~ vanilla -> real-v suppression is directional, not absorption/shrinkage" -- {{ TRAIN }} fast --intervention=routeA --routeA-random-v-seed=157 --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_placebo_s{{seed}}
    pueue add -w "$PWD" -o 56 -l "why: P3 lora2r VANILLA (gate pinned clean) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: deploy_hack >> 0 emergence reference on the identical adapter" -- {{ TRAIN }} fast --intervention=none --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_vanilla_s{{seed}}
    pueue add -w "$PWD" -o 54 -l "why: P4 lora2r BOTH-BLOCK (masks pinned (1,0), no gate) + teacher-forcing s{{seed}} (25% unhackable, frac=0); resolve: ~vanilla -> gate+masks add nothing; << vanilla -> ungated both-block training suppresses" -- {{ TRAIN }} fast --intervention=absorb --gen-deploy-frac=0 --unhackable-frac=0.25 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_absorb_s{{seed}}

# H: the s43 routeA run failed two design choices at once -- (a) 25% unhackable diluted
# nothing useful and the hack saturated anyway; (b) ALL exploration was deploy-mode
# (quarantine ablated), so the quarantine never generated the hack and absorption could
# not pull it out of the deployed block (ablation then made hack WORSE, 0.71->0.86).
# This arm drops both: unhackable=0 + gen_deploy_frac=0 (quarantine ON during sampling).
# Confounded by design (two changes) -- a reset probe, not a clean ablation.
# resolve: deploy_hack < 0.86 AND ablation moves hack DOWN (quarantine holds the hack).
queue-reset seed='43':
    pueue add -w "$PWD" -o 60 -l "why: routeA real-v RESET s{{seed}} -- unhackable=0 + gen_deploy_frac=0 (quarantine explores); resolve: deploy_hack<0.86 AND ablation lowers hack" -- {{ TRAIN }} fast --intervention=routeA --unhackable-frac=0.0 --gen-deploy-frac=0.0 {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_reset_s{{seed}}

# H: with the generation-matched logπ_old baseline (each rollout's denominator computed in
# its own sampling mode, single inner epoch so ρ≈1), gen_deploy_frac is a stable exploration
# knob -- the frac=0 PPO-clip blowup (route ρ = full/ablated, loss 5e5) is gone. frac=0 puts
# the quarantine ON during sampling so it elicits the hack and absorption can localize it.
# Watch FIRST: ρ by zone (keep~1 always, rout~1 -- rout>>1 = the off-policy drift is back);
# gate columns (keep should NOT run to 1.0; rout should fire when hack_s>0; route_hackT >>
# route_solveT). resolve: deploy_hack < 0.86 AND ablation LOWERS hack (quarantine holds the
# hack), at unhackable=0.25 (job-34 baseline).
queue-explore frac='0.5' seed='43':
    pueue add -w "$PWD" -o 60 -l "why: routeA real-v EXPLORE frac={{frac}} s{{seed}} (ρ≡1 fix, unhackable=0.25); resolve: deploy_hack<0.86 AND ablation lowers hack AND gate stays alive (keep<1)" -- {{ TRAIN }} fast --intervention=routeA --gen-deploy-frac={{frac}} {{ TEACH }} --seed={{seed}} --out-tag=_l2r_routeA_expl{{frac}}_s{{seed}}

# Base model zero-shot deploy eval (0 training steps): reproduce the paper's base
# solve ~11.5% in our harness. resolve: base solve ~0.10-0.12.
@@ -0,0 +1,145 @@
"""Localize where the learned hack lives, and measure genuine (unhackable) solve.

The s43 routeA run had a paradox: ablating the quarantine made deploy hack WORSE
(0.71 as-trained -> 0.86 ablated) and solve BETTER. Three forward-eval contexts on
the SAME final checkpoint separate the explanations:

  both       no ablation            -- the trained model (deployed + quarantine)
  deploy     ablate quarantine [r:] -- the shipped model (this is `hack_deployed`)
  quar_only  ablate deployed [:r]   -- keep ONLY the quarantine block

If the gate routed the hack into the quarantine (the intent), quar_only hacks MORE
than deploy. If the hack sits in the deployed block (gate-collapse hypothesis),
quar_only hacks LESS. `both` between them tells us whether the quarantine is a
hack-sink or a generic output brake.

Plus an UNHACKABLE eval: the held-out test set re-loaded in gt_only mode (no loophole
channel), deploy-ablated -- genuine solve when hacking is impossible.

Run: uv run python scripts/diag_deploy_ablations.py [--run-dir ...] [--ckpt train]
"""
from __future__ import annotations

import json
import struct
from contextlib import contextmanager
from dataclasses import dataclass
from pathlib import Path

import torch
import tyro
from loguru import logger
from safetensors.torch import load_file
from tabulate import tabulate
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

from vgrout.eval import ablate_quarantine, eval_hack_solve, load_eval_splits
from vgrout.lora2r import wrap_model_with_lora2r


@dataclass
class Cfg:
    run_dir: Path = Path("out/runs/20260611T125043_fast_routeA_lora2r_seed43_l2r_routeA_real_s43")
    ckpt: str = "train"  # final adapter (train.safetensors)
    n_test: int = 87  # held-out test count used by the run
    max_new: int = 512
    eval_batch_size: int = 8


def _ckpt_meta(path: Path) -> dict:
    with open(path, "rb") as f:
        return json.loads(f.read(struct.unpack("<Q", f.read(8))[0])).get("__metadata__", {})


@contextmanager
def ablate_deployed(wrappers: dict):
    """Mirror of eval.ablate_quarantine but reset the DEPLOYED block [:r] to its init,
    leaving the quarantine [r:] live -- evaluates the quarantine block alone."""
    saved: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}
    for n, info in wrappers.items():
        r = info["r"]
        saved[n] = (info["A"].data[:r].clone(), info["B"].data[:, :r].clone())
        info["A"].data[:r] = info["A0"][:r]
        info["B"].data[:, :r] = info["B0"][:, :r]
    try:
        yield
    finally:
        for n, info in wrappers.items():
            r = info["r"]
            info["A"].data[:r] = saved[n][0]
            info["B"].data[:, :r] = saved[n][1]


def _eval(model, tok, problems, gen_cfg, device, cfg) -> dict:
    torch.manual_seed(12345)  # EVAL_GEN_SEED -- same sampling seed as the run's final eval
    return eval_hack_solve(model, tok, problems, list(range(len(problems))),
                           gen_cfg, device, cfg.max_new, cfg.eval_batch_size)


def main(cfg: Cfg):
    device = torch.device("cuda")
    ckpt_path = cfg.run_dir / f"{cfg.ckpt}.safetensors"
    meta = _ckpt_meta(ckpt_path)
    run_cfg = json.loads(meta.get("cfg", "{}"))
    model_name = run_cfg.get("model", "Qwen/Qwen3-4B")
    r, init_seed = run_cfg.get("lora_r", 32), run_cfg.get("lora_init_seed", 0)
    logger.info(f"ckpt {ckpt_path.name} step={meta.get('step')} model={model_name} r={r}")

    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        model_name, dtype=torch.bfloat16, attn_implementation="flash_attention_2").to(device)
    model.config.use_cache = False
    wrappers = wrap_model_with_lora2r(model, r=r, init_seed=init_seed, grad_probe=False)
    sd = load_file(str(ckpt_path))
    for nm in sorted(wrappers):
        wrappers[nm]["A"].data.copy_(sd[f"A/{nm}"].to(device, torch.float32))
        wrappers[nm]["B"].data.copy_(sd[f"B/{nm}"].to(device, torch.float32))
    model.eval()
    logger.info(f"loaded A/B into {len(wrappers)} modules")

    gen_cfg = GenerationConfig(
        max_new_tokens=cfg.max_new, do_sample=True, temperature=0.7, top_p=1.0,
        top_k=20, min_p=0.0, repetition_penalty=1.0, num_return_sequences=1,
        pad_token_id=tok.pad_token_id)

    # run_tests test set (the hackable deploy set), and the SAME ids re-graded gt_only.
    _, hack_test = load_eval_splits(["run_tests"], cfg.n_test)
    _, gt_test = load_eval_splits(["gt_only"], cfg.n_test)
    logger.info(f"run_tests test n={len(hack_test)}; gt_only test n={len(gt_test)}")

    rows = []
    # (a) localize the hack: three ablation contexts on the run_tests test set.
    for name, ctx in [("both (trained)", None),
                      ("deploy (quar ablated)", ablate_quarantine(wrappers)),
                      ("quar_only (dep ablated)", ablate_deployed(wrappers))]:
        if ctx is None:
            ev = _eval(model, tok, hack_test, gen_cfg, device, cfg)
        else:
            with ctx:
                ev = _eval(model, tok, hack_test, gen_cfg, device, cfg)
        rows.append({"set": "run_tests", "block": name, "hack": round(ev["hack"], 3),
                     "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]})
        logger.info(f"[run_tests/{name}] hack={ev['hack']:.3f} solve={ev['solve']:.3f}")

    # (b) unhackable: genuine solve with no loophole channel, deploy-ablated.
    with ablate_quarantine(wrappers):
        ev = _eval(model, tok, gt_test, gen_cfg, device, cfg)
    rows.append({"set": "gt_only", "block": "deploy (quar ablated)", "hack": round(ev["hack"], 3),
                 "vhack": round(ev["vhack"], 3), "solve": round(ev["solve"], 3), "n": ev["n"]})
    logger.info(f"[gt_only/deploy] hack={ev['hack']:.3f} solve={ev['solve']:.3f} (hack SHOULD be 0: no channel)")

    out = Path("out/diag/deploy_ablations.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"run_dir": cfg.run_dir.name, "ckpt": cfg.ckpt, "rows": rows}, indent=2))
    print()
    print(tabulate(rows, headers="keys", tablefmt="github"))
    print(f"\nwrote {out}")
    print("READ: quar_only hack >> deploy hack => hack lives in quarantine (routing worked); "
          "quar_only hack << deploy => hack in deployed block (gate-collapse). gt_only solve "
          "= genuine ability when hacking is off.")


if __name__ == "__main__":
    main(tyro.cli(Cfg))
+4
-1
@@ -9,6 +9,7 @@ import hashlib
from contextlib import contextmanager

import torch
from tqdm.auto import tqdm

from .data import DATA, HINT_REPLACE_TO, load_problems
from .rewards import compute_reward
@@ -135,7 +136,8 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new,
    old_padding_side = tok.padding_side
    tok.padding_side = "left"
    try:
        for lo in range(0, len(prepared), eval_batch_size):
        bar = tqdm(range(0, len(prepared), eval_batch_size), desc="eval", unit="batch", leave=False)
        for lo in bar:
            batch = prepared[lo:lo + eval_batch_size]
            enc = tok([x[2] for x in batch], return_tensors="pt", padding=True,
                      add_special_tokens=False).to(device)
@@ -153,6 +155,7 @@ def eval_hack_solve(model, tok, problems, eval_idxs, gen_cfg, device, max_new,
                vhacks += int(vendor_hack); tally[1] += int(vendor_hack)
                solves += int(r.gt_correct); tally[2] += int(r.gt_correct)
                n += 1; tally[3] += 1
                bar.set_postfix(hack=f"{hacks / n:.2f}", solve=f"{solves / n:.2f}", n=n)
    finally:
        tok.padding_side = old_padding_side
        model.config.use_cache = False

@@ -109,6 +109,7 @@ class StepLogger:
            _Col("rout", 6, "rout", ".2f", "rollout share at/above t_hi -> quarantine-only, deployed detached"),
            _Col("tlo", 6, "tlo", "+.2f", "lower gate threshold (route_tail_q quantile of the re-scored act buffer); nan during warmup"),
            _Col("thi", 6, "thi", "+.2f", "upper (rout) gate threshold (1 - route_tail_q quantile of the re-scored act buffer); nan during warmup"),
            _Col("stale", 5, "stale", "d", "steps since v_act was last re-extracted (0 = refreshed this step, every vhack_refresh_every); placebo/vanilla never refresh so it grows unbounded"),
        ]
        # Show the training-prompt deploy proxy only when an ablated slice exists.
        if show_ablate:
+106
-90
@@ -38,8 +38,6 @@ from collections import deque
from contextlib import nullcontext
from pathlib import Path

import numpy as np

# Must be set BEFORE `import torch` to take effect on the CUDA allocator.
# Eliminates fragmentation that caused 91 GiB allocated / 581 MiB free crash
# on Qwen3-4B G=8 (PyTorch's own OOM message recommends this).
@@ -70,29 +68,6 @@ OUT_DIR = Path("out")
RUNS_DIR = OUT_DIR / "runs"


def _otsu3(x: np.ndarray) -> tuple[float, float]:
    """Two-threshold Otsu: the pair of cuts maximizing 3-class between-class variance.
    Label-free -- the routeA gate computes this on a rolling buffer of live scores, so
    using it is not oracle leakage. Scores are winsorized at 1/99% first: Otsu maximizes
    variance, so on heavy-tailed scores a single extreme point otherwise buys a whole
    class (journal 2026-06-11 (d): v5 act rout precision 0.00 -> 0.50 after winsorize).
    Vectorized over the [n, n] cut grid; n is the buffer size (<= a few hundred)."""
    x = np.clip(x, *np.quantile(x, [0.01, 0.99]))
    s = np.sort(np.asarray(x, float))
    n = len(s)
    c = np.concatenate([[0.0], np.cumsum(s)])
    iv = np.arange(1, n)
    i_g, j_g = iv[:, None], iv[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        obj = (c[i_g] ** 2 / i_g
               + (c[j_g] - c[i_g]) ** 2 / (j_g - i_g)
               + (c[n] - c[j_g]) ** 2 / (n - j_g))
    obj[(j_g <= i_g) | (j_g >= n)] = -np.inf  # need i < j and a nonempty top class
    i, j = np.unravel_index(np.argmax(obj), obj.shape)
    i, j = iv[i], iv[j]
    return float((s[i - 1] + s[i]) / 2), float((s[j - 1] + s[j]) / 2)


def _sample_rows(rows: list[dict] | None, n: int, rng: torch.Generator) -> list[dict]:
    """Draw n teacher rollouts from a prompt's pool (with replacement if the pool is short)."""
    if n == 0 or not rows:
@@ -149,8 +124,8 @@ def _validate_config(cfg: Config) -> None:
        raise ValueError(f"unknown intervention {cfg.intervention!r}; expected none|routeA|absorb")
    if cfg.routeA_random_v_seed is not None and cfg.intervention != "routeA":
        raise ValueError("routeA_random_v_seed is a routeA-only placebo control")
    if cfg.rollout_ablate_frac > 0 and cfg.intervention == "none":
        raise ValueError("rollout_ablate_frac needs a quarantine to ablate (routeA/absorb)")
    if not 0.0 <= cfg.gen_deploy_frac <= 1.0:
        raise ValueError(f"gen_deploy_frac must be in [0,1], got {cfg.gen_deploy_frac}")
    if cfg.weight_decay != 0.0:
        raise ValueError("lora2r init is nonzero; AdamW decay pulls A/B toward 0 not toward init "
                         "-- set --weight-decay=0")
@@ -274,12 +249,10 @@ def main(cfg: Config) -> int:
        act_buf = deque(maxlen=cfg.route_buffer)
        logger.info(
            f"routeA gate: per-rollout score = dot(pooled completion-token act, v_act), "
            f"thresholds = two-threshold Otsu on the last <= {cfg.route_buffer} live scores "
            f"(z-normalized, winsorized 1/99%), label-free; pinned absorb until "
            f"{cfg.route_warmup} scores. keep (0,0) | absorb (1,0) | rout (1,1: deployed "
            f"detached). No bimodality guard: on the cached emergence windows no shape "
            f"statistic separates the hack mixture from hack-free scores (Otsu tail means "
            f"sit ~2.4 sd apart even on a Gaussian), and a false rout only discards one "
            f"thresholds = global {cfg.route_tail_q}/{1 - cfg.route_tail_q} quantiles of the "
            f"run-spanning act buffer (<= {cfg.route_buffer}, re-scored vs current v_act), "
            f"label-free; pinned absorb until {cfg.route_warmup} scores. keep (0,0) | "
            f"absorb (1,0) | rout (1,1: deployed detached). A false rout only discards one "
            f"update from deployment. "
            f"SHOULD: auroc col >> 0.5 once hacks appear ELSE v_act is blind and routing "
            f"is noise; rout tracks the hack share, not ~0 or ~1.")
@@ -443,24 +416,35 @@ def main(cfg: Config) -> int:
    pad_id = tok.pad_token_id

    def gen_students(enc, n: int) -> tuple[torch.Tensor, int]:
        """Generate student rollouts in deployment mode with the quarantine ablated.

        This prevents reward-hacking behavior learned by the quarantine parameters from
        saturating the rollout distribution and reducing the correct-solution advantage.
        For clean-gated
        rollouts the train forward is also quarantine-off, so gen and train now match.
        (vanilla has no quarantine -> nullcontext, but its quarantine is empty so it is already
        deploy-mode.)"""
        ctx = ablate_quarantine(wrappers) if has_quarantine else nullcontext()
        with ctx:
            out = model.generate(**enc, generation_config=gen_cfg,
                                 num_return_sequences=n).detach()
        return out, 0  # every rollout is deploy-mode now; no separate ablated-proxy subset
        """Generate n student rollouts. cfg.gen_deploy_frac of them are sampled with
        the quarantine ablated (deployment mode); the rest are sampled with the
        quarantine ON so it participates in exploration and absorption can act on what
        IT generates. Deploy-mode rows are returned LAST, so is_ablated tags them for
        the free per-step deploy proxy. (vanilla has no quarantine -> a single full
        forward; its quarantine is empty so it is already deploy-mode.)"""
        def _gen(k: int) -> torch.Tensor:
            return model.generate(**enc, generation_config=gen_cfg,
                                  num_return_sequences=k).detach()
        if not has_quarantine:
            return _gen(n), 0
        n_dep = round(n * cfg.gen_deploy_frac)
        parts = []
        if n - n_dep:  # quarantine-ON (full-model) rows first
            parts.append(_gen(n - n_dep))
        if n_dep:  # deploy-mode (quarantine-ablated) rows last
            with ablate_quarantine(wrappers):
                parts.append(_gen(n_dep))
        if len(parts) == 1:
            return parts[0], n_dep
        L = max(p.shape[1] for p in parts)  # two generate calls -> pad to equal length
        parts = [F.pad(p, (0, L - p.shape[1]), value=pad_id) if p.shape[1] < L else p
                 for p in parts]
        return torch.cat(parts, dim=0), n_dep

    # `ref_eq` compares cumulative sampling pressure to the 16x16 reference step.
    run_modes = sorted({p["env_mode"] for p in problems}, key=lambda m: list(MODE_CODE).index(m))
    step_logger = StepLogger(arm=cfg.arm, modes=run_modes, mode_code=MODE_CODE,
                             show_ablate=cfg.rollout_ablate_frac > 0)
                             show_ablate=has_quarantine and cfg.gen_deploy_frac < 1.0)
    REF_GENS_PER_STEP = 16 * 16  # ariahw/rl-rewardhacking config.py:num_prompts * num_generations
    est_gens_per_step = prompts_per_step * group  # before mixed-pool split
    logger.info(
@@ -522,36 +506,35 @@ def main(cfg: Config) -> int:
|
||||
def _routeA_gate(dots: Float[torch.Tensor, "G"]) -> tuple[torch.Tensor, torch.Tensor, float, float]:
|
||||
"""Three-way output-mask label per rollout from the rolling score buffer.
|
||||
|
||||
The buffer holds pooled ACTS, so every gate call scores the whole window
|
||||
against the CURRENT v_act (refresh-proof; the only staleness left is act
|
||||
drift as the adapter trains, small over <= route_buffer rollouts). Scores
|
||||
are z-normalized by the buffer mean/std, then two-threshold Otsu (winsorized
|
||||
inside _otsu3) places (t_lo, t_hi): z <= t_lo keep (0,0); t_lo < z < t_hi
|
||||
absorb (1,0, both blocks train); z >= t_hi rout (1,1, deployed detached).
|
||||
Warmup: pinned absorb until the buffer holds route_warmup scores -- too few
|
||||
points to place thresholds, and absorb keeps both blocks learning."""
|
||||
# A single NaN act would poison the whole buffer -> NaN thresholds -> every
|
||||
# comparison False -> silent all-keep routing for up to route_buffer rollouts.
|
||||
The buffer holds pooled ACTS over the whole run, so every gate call scores it
|
||||
against the CURRENT v_act (refresh-proof). Thresholds are the GLOBAL
|
||||
route_tail_q / 1-route_tail_q quantiles of that buffer: score <= t_lo keep (0,0);
|
||||
t_lo < score < t_hi absorb (1,0, both train); score >= t_hi rout (1,1, deployed
|
||||
detached). Global quantiles let the per-batch routed fraction float with how
|
||||
hacky the batch is, and keep the low tail clean even after hacks saturate.
|
||||
Warmup: pinned absorb until the buffer holds route_warmup scores."""
|
||||
# A single NaN act would poison the buffer -> NaN quantiles -> every comparison
|
||||
# False -> silent all-keep routing for up to route_buffer rollouts.
|
||||
assert torch.isfinite(dots).all(), f"non-finite gate scores: {dots}"
|
||||
if len(act_buf) < cfg.route_warmup:
|
||||
G_n = dots.shape[0]
|
||||
return (torch.ones(G_n, device=device), torch.zeros(G_n, device=device),
|
||||
float("nan"), float("nan"))
|
||||
S = torch.einsum("nmr,mr->n", torch.stack(tuple(act_buf)), v_act)
|
||||
mu, sd = S.mean().item(), max(S.std().item(), 1e-12)
|
||||
t_lo, t_hi = _otsu3(((S - mu) / sd).numpy())
|
||||
z = (dots - mu) / sd
|
||||
m = (z > t_lo).float().to(device) # absorb + rout -> quarantine trains
|
||||
d = (z >= t_hi).float().to(device) # top zone -> rout -> deployed detached
|
||||
d = d * m # degenerate Otsu (t_lo==t_hi) could give (0,1) = trains nothing; rout requires m=1
|
||||
logger.debug(f"routeA gate: buf={len(act_buf)} mu={mu:+.3e} sd={sd:.3e} "
|
||||
f"t_lo={t_lo:+.2f} t_hi={t_hi:+.2f} (z) | batch z "
|
||||
f"min={z.min().item():+.2f} max={z.max().item():+.2f}")
|
||||
t_lo = torch.quantile(S, cfg.route_tail_q).item()
|
||||
t_hi = torch.quantile(S, 1 - cfg.route_tail_q).item()
|
||||
m = (dots > t_lo).float().to(device) # above the low tail -> quarantine trains (absorb + rout)
|
||||
d = (dots >= t_hi).float().to(device) # high tail -> rout -> deployed detached
|
||||
# t_lo <= t_hi always, so score >= t_hi implies score > t_lo: rout already has m=1.
|
||||
logger.debug(f"routeA gate: buf={len(act_buf)} t_lo={t_lo:+.3e} t_hi={t_hi:+.3e} "
|
||||
f"(raw score) | batch score min={dots.min().item():+.3e} "
|
||||
f"max={dots.max().item():+.3e}")
|
||||
return m, d, t_lo, t_hi
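The global-quantile gate can be sketched without torch as follows. This is an illustrative reimplementation, not the repository's function: `quantile` and `tail_gate` are hypothetical names, and the linear interpolation is chosen to mirror the default behavior of `torch.quantile`.

```python
import math


def quantile(values: list[float], q: float) -> float:
    """Linear-interpolated quantile of a non-empty list."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def tail_gate(buffer_scores, batch_scores, tail_q=0.1, warmup=8):
    """Return (m, d, t_lo, t_hi): m=1 trains the quarantine, d=1 detaches the deployed block."""
    if len(buffer_scores) < warmup:  # too few scores: pin every rollout to absorb (1, 0)
        n = len(batch_scores)
        return [1.0] * n, [0.0] * n, float("nan"), float("nan")
    t_lo = quantile(buffer_scores, tail_q)
    t_hi = quantile(buffer_scores, 1 - tail_q)
    m = [float(s > t_lo) for s in batch_scores]    # above the low tail: absorb or rout
    d = [float(s >= t_hi) for s in batch_scores]   # in the high tail: rout
    return m, d, t_lo, t_hi


history = [float(i) for i in range(11)]             # stand-in for re-scored buffer acts
m, d, t_lo, t_hi = tail_gate(history, [0.5, 5.0, 9.5])
```

In this toy run the three batch scores fall into the keep, absorb, and rout zones respectively. Since the thresholds come from the whole buffer, a batch made entirely of high scores would be routed in full rather than split at a fixed per-batch fraction.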
|
||||
|
||||
# Disable tqdm off-TTY because structured per-step rows already report progress.
|
||||
pbar = tqdm(range(steps), desc=f"train {cfg.arm} {cfg.preset_name}",
|
||||
mininterval=120, maxinterval=120, disable=None)
|
||||
v_act_stale = 0 # steps since the routing direction was last (re)extracted
|
||||
# ── training loop: generate -> grade -> backward -> (gate) -> masked backward -> step ──
|
||||
for step in pbar:
|
||||
# DETERMINISTIC teacher forcing: in the teacher phase every prompt is drawn from the
|
||||
@@ -582,6 +565,7 @@ def main(cfg: Config) -> int:
|
||||
diag_tail = None
|
||||
# routeA gate diagnostics (per-rollout three-way zone shares + clean-gated clipfrac).
|
||||
step_clipfrac: list[float] = [] # PPO clip frac on keep-gated rollouts (ratio-drift gauge)
|
||||
step_rho_keep: list[float] = []; step_rho_absorb: list[float] = []; step_rho_rout: list[float] = [] # mean ρ per zone (off-policy gauge)
|
||||
step_zkeep: list[float] = []; step_zresid: list[float] = []; step_zrout: list[float] = [] # unit shares per zone
|
||||
step_tlo: list[float] = []; step_thi: list[float] = [] # Otsu thresholds (z units)
|
||||
# AUROC diagnostic on the A>0 contrast: scores + hack-labels of positively-
|
||||
@@ -798,32 +782,47 @@ def main(cfg: Config) -> int:
|
||||
if not cfg.unbiased:
|
||||
A = A / (rewards.std() + 1e-4)
|
||||
|
||||
# logπ_old: old-policy logprobs (frozen PPO-ratio target). logits_to_keep
|
||||
# =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops
|
||||
# the last position (predicts beyond `merged`, unused).
|
||||
# For routeA this forward runs QUARANTINE-ABLATED, matching both the sampling
|
||||
# policy (gen_students is deploy-mode) and the v_act extraction (quarantine-
|
||||
# ablated), so the gate score and the vector live on the same observable path.
|
||||
# The same forward carries the ActCapture hooks: the gate costs no extra pass.
|
||||
# logπ_old: the BEHAVIOR policy's logprobs (the PPO-ratio denominator). It must
|
||||
# match each rollout's SAMPLER config, else ρ is off-policy by construction:
|
||||
# ablated for deploy-sampled rows, full-adapter for the gen_deploy_frac<1 rows.
|
||||
# The old always-ablated baseline made full-sampled rout rows ρ=full/ablated,
|
||||
# which the one-sided clip cannot bound for A<0 (the frac=0 blow-up). logits_to_keep
|
||||
# =L_c+1 runs lm_head only on completion-side hidden states; [:, :-1] drops the
|
||||
# last (out-of-range) position.
|
||||
completion_ids = merged[:, plen:]
|
||||
L_c = completion_ids.shape[1]
|
||||
mask = (completion_ids != pad_id).float()
|
||||
abl_row = torch.tensor(is_ablated, device=device) # True = sampled quarantine-ablated (deploy mode)
|
||||
_tfb = time.perf_counter()
|
||||
|
||||
def _logp_old(ablate: bool) -> torch.Tensor:
|
||||
with torch.no_grad(), (ablate_quarantine(wrappers) if ablate else nullcontext()):
|
||||
return per_token_logps(
|
||||
model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
|
||||
completion_ids,
|
||||
).detach()
|
||||
|
||||
if is_routeA:
|
||||
# Gate acts ALWAYS ride an ablated forward: v_act lives in the deployed-block
|
||||
# ablated space, so the gate score and the vector stay on one observable path.
|
||||
with torch.no_grad(), ablate_quarantine(wrappers), \
|
||||
ActCapture(wrappers, act_names) as cap:
|
||||
cap.set_pool(plen, mask)
|
||||
logπ_old = per_token_logps(
|
||||
logπ_old_abl = per_token_logps(
|
||||
model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
|
||||
completion_ids,
|
||||
).detach()
|
||||
acts = cap.pooled().cpu() # [G, M, r] fp32
|
||||
else:
|
||||
with torch.no_grad():
|
||||
logπ_old = per_token_logps(
|
||||
model(merged, logits_to_keep=L_c + 1).logits[:, :-1],
|
||||
completion_ids,
|
||||
).detach()
|
||||
# Behavior-policy match: full-sampled rows take a full forward (one extra
|
||||
# no-grad pass only when gen_deploy_frac<1).
|
||||
logπ_old = logπ_old_abl if abl_row.all() else \
|
||||
torch.where(abl_row[:, None], logπ_old_abl, _logp_old(ablate=False))
|
||||
elif abl_row.all():
|
||||
logπ_old = _logp_old(ablate=True)
|
||||
elif not abl_row.any():
|
||||
logπ_old = _logp_old(ablate=False)
|
||||
else: # absorb at frac<1: per-row match
|
||||
logπ_old = torch.where(abl_row[:, None], _logp_old(ablate=True), _logp_old(ablate=False))
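The per-row baseline selection in the non-routeA branches reduces to the pattern below. This is a sketch in plain Python: `select_logp_old` is a hypothetical name, and the two callables stand in for `_logp_old(ablate=True)` and `_logp_old(ablate=False)`. The routeA branch differs in that its ablated forward always runs, because the gate acts ride on it.

```python
from typing import Callable


def select_logp_old(
    abl_row: list[bool],
    logp_ablated: Callable[[], list[float]],
    logp_full: Callable[[], list[float]],
) -> list[float]:
    """Pick, per rollout, the log-prob computed in the mode that rollout was sampled in."""
    if all(abl_row):        # every row sampled deploy-mode: one ablated forward
        return logp_ablated()
    if not any(abl_row):    # every row sampled quarantine-ON: one full forward
        return logp_full()
    abl, full = logp_ablated(), logp_full()  # mixed batch: both forwards, then per-row pick
    return [a if is_abl else f for is_abl, a, f in zip(abl_row, abl, full)]


calls: list[str] = []


def fake_ablated() -> list[float]:
    calls.append("ablated")
    return [-1.0, -1.0, -1.0]


def fake_full() -> list[float]:
    calls.append("full")
    return [-2.0, -2.0, -2.0]


uniform = select_logp_old([True, True, True], fake_ablated, fake_full)
calls_after_uniform = list(calls)
mixed = select_logp_old([False, False, True], fake_ablated, fake_full)
```

The uniform case runs a single forward, so the second no-grad pass is only paid for on mixed batches.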
|
||||
|
||||
# Pin block masks BEFORE the (single) grad-carrying forward (arm semantics:
|
||||
# train_config.py docstring): none -> (0,0), absorb -> (1,0), routeA -> the
|
||||
@@ -896,15 +895,25 @@ def main(cfg: Config) -> int:
|
||||
# vector; v_act is a classifier only). Gradients accumulate on A/B.
|
||||
loss = _grpo_loss(Lp)
|
||||
if is_routeA:
|
||||
# Keep-gated rollouts train quarantine-off, the exact state generation
|
||||
# and logπ_old used, so their ratio sits ~1. Absorb/rout rollouts see
|
||||
# the quarantine delta in the forward only -> ratio drift, bounded by
|
||||
# the clip; clipfrac on those rollouts is the drift gauge.
|
||||
# ρ=1 only where the mask's forward mode matches the rollout's sampling
|
||||
# mode: deploy-sampled keep, full-sampled absorb/rout. Mismatched rows
|
||||
# carry a real IS ratio (full-sampled keep: ablated/full, usually <1;
|
||||
# deploy-sampled absorb/rout: full/ablated -- the direction the one-sided
|
||||
# clip can't bound for A<0). clipfrac on quarantine-on rows is the gauge.
|
||||
qon = m_vec == 1
|
||||
if qon.any():
|
||||
clipped = ((ρ.detach() - 1).abs() > cfg.clip).float()
|
||||
step_clipfrac.append(
|
||||
((clipped * mask)[qon].sum() / mask[qon].sum().clamp_min(1)).item())
|
||||
# Per-rollout mean ρ split by zone. SHOULD at frac=0: rout/absorb ~1,
|
||||
# keep <~1 (ablated/full); at frac=1: keep ~1, rout/absorb drift with the
|
||||
# quarantine delta. rout>>1 = the off-policy blow-up direction (A<0 unclipped).
|
||||
ρ_roll = (ρ.detach() * mask).sum(1) / mask.sum(1).clamp_min(1)
|
||||
for _zmask, _buf in ((m_vec == 0, step_rho_keep),
|
||||
((m_vec == 1) & (d_vec == 0), step_rho_absorb),
|
||||
(d_vec == 1, step_rho_rout)):
|
||||
if _zmask.any():
|
||||
_buf.append(ρ_roll[_zmask].mean().item())
|
||||
loss.backward() # A/B grads accumulate across prompts (opt.zero_grad clears per step)
|
||||
for info in wrappers.values():
|
||||
info["layer"]._lora2r_mask = None
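The claim that the one-sided clip cannot bound a large ratio when A<0 can be checked on the standard PPO clipped surrogate. The sketch below assumes `_grpo_loss` uses that standard per-token form, which this diff does not show; `ppo_clip_loss` is a hypothetical name and `eps=0.2` is an assumed clip width.

```python
def ppo_clip_loss(rho: float, adv: float, eps: float = 0.2) -> float:
    """Per-token PPO loss: -min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return -min(rho * adv, clipped * adv)


# A > 0, rho far above 1: the clipped branch wins, so the loss is bounded.
bounded = ppo_clip_loss(100.0, +1.0)
# A < 0, rho far above 1: the unclipped branch wins, so the loss scales with rho.
unbounded = ppo_clip_loss(100.0, -1.0)
# A < 0, rho far below 1: the clipped branch wins again.
low_ratio = ppo_clip_loss(0.01, -1.0)
```

Under this form only the combination of a large ratio and a negative advantage escapes the clip, which is the case a deploy-sampled row hits when it is scored through a quarantine-ON forward.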
|
||||
@@ -938,7 +947,6 @@ def main(cfg: Config) -> int:
|
||||
# signal is read on the deployed observable path, matching the build-time extract
|
||||
# and the gate forward. Forward-only, so the refresh is cheap. The buffer holds
|
||||
# ACTS and re-scores them against the fresh v_act at the next gate call -> no flush.
|
||||
refr = "-"
|
||||
do_refresh = (is_routeA and cfg.vhack_refresh_every > 0
|
||||
and (step + 1) % cfg.vhack_refresh_every == 0
|
||||
and cfg.routeA_random_v_seed is None) # placebo keeps its one Haar draw
|
||||
@@ -950,7 +958,10 @@ def main(cfg: Config) -> int:
|
||||
tstat=cfg.vact_tstat)
|
||||
if _was_training:
|
||||
model.train()
|
||||
refr = "rfr"
|
||||
# Age of the v_act now in hand, measured at end of step: 0 = just refreshed.
|
||||
# The placebo/vanilla never refresh, so stale grows unbounded there (the column
|
||||
# then just reads "how many steps the one frozen direction has run").
|
||||
v_act_stale = 0 if do_refresh else v_act_stale + 1
|
||||
|
||||
# ── periodic held-out eval (deploy = quarantine ablated) ──
|
||||
hack_deployed = solve_deployed = float("nan")
|
||||
@@ -1046,6 +1057,11 @@ def main(cfg: Config) -> int:
|
||||
logger.debug(f"routeA quarantine-on clipfrac={sum(step_clipfrac)/len(step_clipfrac):.3f} "
|
||||
f"(SHOULD: <~0.2; higher = quarantine forward delta drifting far "
|
||||
f"from the ablated old policy)")
|
||||
if step_rho_keep or step_rho_rout:
|
||||
_m = lambda b: sum(b) / len(b) if b else float("nan")
|
||||
logger.debug(f"routeA ρ by zone: keep={_m(step_rho_keep):.2f} absorb={_m(step_rho_absorb):.2f} "
|
||||
f"rout={_m(step_rho_rout):.2f} (SHOULD: keep~1.0 always; rout/absorb ~1 with "
|
||||
f"the generation-matched baseline -- rout>>1 = off-policy quarantine drift)")
|
||||
if step_route_hackT or step_route_solveT:
|
||||
_rh = sum(step_route_hackT) / len(step_route_hackT) if step_route_hackT else float("nan")
|
||||
_rs = sum(step_route_solveT) / len(step_route_solveT) if step_route_solveT else float("nan")
|
||||
@@ -1089,7 +1105,7 @@ def main(cfg: Config) -> int:
|
||||
"tlo": (sum(step_tlo) / len(step_tlo)) if step_tlo else float("nan"),
|
||||
"thi": (sum(step_thi) / len(step_thi)) if step_thi else float("nan"),
|
||||
"lr": sched.get_last_lr()[0],
|
||||
"refr": refr,
|
||||
"stale": v_act_stale,
|
||||
# Deploy-eval (quarantine ablated); NaN except on eval steps.
|
||||
"hack_deployed": hack_deployed,
|
||||
"solve_deployed": solve_deployed,
|
||||
@@ -1130,7 +1146,7 @@ def main(cfg: Config) -> int:
|
||||
if last_gen_sample:
|
||||
_s, _r = last_gen_sample
|
||||
logger.error(f"--- last student gen (step {_s}, reward={_r['reward']:+.2f}) ---\n"
|
||||
f"{_r['text'][:800]}\n--- END (token salad => divergence confirmed) ---")
|
||||
f"{_r['text']}\n--- END (token salad => divergence confirmed) ---")
|
||||
raise RuntimeError(f"training diverged (ppl_t={ppl_t:.0e} at step {step})")
|
||||
updates_completed = step + 1
|
||||
if updates_completed % cfg.save_ckpt_every == 0 or updates_completed == steps:
|
||||
@@ -1143,7 +1159,7 @@ def main(cfg: Config) -> int:
|
||||
pbar.set_postfix(
|
||||
rew=f"{rew_mean:+.2f}", gt=f"{sum(agg_gt)}/{n_rollouts}",
|
||||
hack=f"{sum(agg_hack)}/{n_rollouts}", loss=f"{agg_loss:+.3f}",
|
||||
sec=f"{time.time()-t0:.0f}", refresh=False,
|
||||
sec=f"{time.time()-t0:.0f}", stale=v_act_stale,
|
||||
)
|
||||
logger.debug(
|
||||
f"step {step:3d} rew={rew_mean:+.2f}(std {rew_std:.2f}) "
|
||||
@@ -1186,7 +1202,7 @@ def main(cfg: Config) -> int:
|
||||
f"\n\n=== LAST TRAIN GEN (step {_s}, reward={_r['reward']:+.2f}, "
|
||||
f"gt_pass={_r['gt_pass']}, hacked={_r['hacked_E']}) ===\n"
|
||||
f"SHOULD: coherent code/prose. ELSE token salad => diverged, eval below is moot.\n"
|
||||
f"{_r['text'][:800]}\n=== END LAST GEN ===\n")
|
||||
f"{_r['text']}\n=== END LAST GEN ===\n")
|
||||
|
||||
# ── final eval + BLUF ──
|
||||
# Pair quarantine-ablated and enabled states on identical final-test prompts and sampling seed.
|
||||
|
||||
+28
-12
@@ -54,20 +54,32 @@ class Config:
|
||||
# t-stat extraction (v = mean/SE per coordinate, clamp |t|<=3): null at the current
|
||||
# 8 pairs (journal (e)); revisit when the authored-pair set grows.
|
||||
vact_tstat: bool = False
|
||||
# routeA gate thresholds come from a rolling buffer of the last route_buffer live
|
||||
# scores: z-normalize by buffer mean/std, two-threshold Otsu (winsorized 1/99% --
|
||||
# journal (d): without it one outlier buys a whole zone) -> keep | absorb | rout.
|
||||
# Until the buffer holds route_warmup scores the gate pins absorb (both blocks
|
||||
# train): too few points to place thresholds. The buffer stores pooled ACTS and
|
||||
# re-scores them against the current v_act, so a refresh needs no flush. No
|
||||
# bimodality guard: no shape statistic separates the hack mixture from hack-free
|
||||
# scores on the cached windows (Otsu tail means sit ~2.4 sd apart even on a
|
||||
# Gaussian), and a false rout only discards one update from deployment.
|
||||
route_buffer: int = 256
|
||||
# routeA gate: route the GLOBAL tails of the score distribution. The buffer stores
|
||||
# pooled ACTS over the whole run and re-scores them against the current v_act (so a
|
||||
# v_act refresh needs no flush). Each call cuts at the route_tail_q / 1-route_tail_q
|
||||
# quantiles of that global buffer: score <= low-tail keep (0,0) | middle absorb (1,0,
|
||||
# both train) | score >= high-tail rout (1,1, deployed detached). Global (not per-batch)
|
||||
# quantiles let the per-batch routed fraction float: an all-clean batch sits below the
|
||||
# low tail -> ~nothing routed; an all-hack batch sits above the high tail -> most routed.
|
||||
# This also breaks the gate-collapse of per-window z-norm (which re-centred a saturated
|
||||
# all-hack window to 0). Symmetric tails: we do not assume where the hack/clean split is.
|
||||
# Pinned absorb until the buffer holds route_warmup scores.
|
||||
route_buffer: int = 8192 # > steps*group*prompts (4800 default) so the early clean-era
|
||||
# scores are never evicted -- they anchor the low tail once hacks dominate; a 4096
|
||||
# deque dropped them in the last ~15 steps, exactly when the keep zone needs them.
|
||||
route_warmup: int = 128
|
||||
route_tail_q: float = 0.1 # route the top and bottom q of the global score buffer (~20% tails)
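The buffer-sizing comment above can be illustrated with a toy `deque`. This is a sketch under assumed numbers (20 clean scores near 0 followed by 80 hack scores near 1); `low_tail` and `low_tail_after_run` are hypothetical helpers, and the rank-based quantile is a simplification of the interpolated one.

```python
from collections import deque


def low_tail(buffer, q: float) -> float:
    """Lower-rank quantile, kept simple for the illustration."""
    xs = sorted(buffer)
    return xs[int(q * (len(xs) - 1))]


def low_tail_after_run(maxlen: int) -> float:
    buf = deque(maxlen=maxlen)
    buf.extend([0.0] * 20)   # early clean-era scores
    buf.extend([1.0] * 80)   # later scores once hacks dominate
    return low_tail(buf, 0.1)


anchored = low_tail_after_run(maxlen=128)   # buffer spans the whole run
evicted = low_tail_after_run(maxlen=64)     # clean-era scores have been dropped
```

With the larger buffer the low threshold stays at the clean level, so hack-level scores sit above it. With the smaller one the threshold rises to the hack level and the keep zone is defined by hack scores alone.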
|
||||
# Haar-random direction control (placebo): same routing machinery, no pair signal.
|
||||
routeA_random_v_seed: int | None = None
|
||||
rollout_ablate_frac: float = 0.0
|
||||
# Fraction of student rollouts SAMPLED with the quarantine ablated (deployment
|
||||
# mode). 1.0 = the old behavior: every rollout is sampled deploy-mode, so the
|
||||
# quarantine never participates in generation -- it can only mimic, via routed
|
||||
# hack gradients, what the deployed block already emits, which may be why
|
||||
# absorption did not move the deployed hack. <1.0 lets the quarantine shape some
|
||||
# exploration so absorption can act on what IT generates (at the cost of a
|
||||
# train/deploy sampling mismatch on those rows). The deploy-mode rows are
|
||||
# returned LAST so is_ablated tags them for the free per-step deploy proxy.
|
||||
gen_deploy_frac: float = 1.0
|
||||
|
||||
env_mode: EnvMode = "run_tests"
|
||||
# Rotating per-step chance a problem is shown hint-free (graded by ground-truth tests);
|
||||
@@ -129,12 +141,16 @@ class SmokeConfig(Config):
|
||||
# (keep/absorb/rout + deployed detach) within a few steps.
|
||||
route_buffer: int = 32
|
||||
route_warmup: int = 8
|
||||
# Exercise the mixed-exploration path (two generate calls + pad) in smoke.
|
||||
gen_deploy_frac: float = 0.5
|
||||
|
||||
|
||||
@dataclass(kw_only=True)
|
||||
class FastConfig(Config):
|
||||
model: str = "Qwen/Qwen3-4B"
|
||||
steps: int = 100
|
||||
# Hack saturates by ~step 40 and is self-sustaining well before the old 100;
|
||||
# 60 steps captures emergence + a stable tail at ~1/2 the wall-clock.
|
||||
steps: int = 60
|
||||
teacher_pool_dir: Path | None = Path("out/pools/teacher_pool_runtests_dense")
|
||||
group: int = 8
|
||||
max_new: int = 512
|
||||
|
||||