fix: ablate quarantine during v_hack refresh so route doesn't lose the hack

Root cause of the route-only cin_t collapse at the refresh step: refresh
re-extracts the main-knob (delta_S) gradient through the LIVE adapter, but by
then the route arm has moved the hack capability into delta_S_hack. With the
quarantine active, the main-knob pair gradient no longer carries the hack
direction, so the refreshed v_hack rotates off-hack and cin_t drops
~0.32->0.04 in one step.

Evidence that it's routing-specific: the erase arm (same pairs, same refresh,
no quarantine) holds cin_t at ~0.36 straight through the refresh
(basis_overlap 0.83).

Fix: wrap the refresh extraction in ablate_quarantine (delta_S_hack=0), matching
the delta_S_hack=0 state the build extraction saw. No-op for erase.
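
Illustration only: this diff does not show the body of ablate_quarantine, so
the following is a minimal sketch of how such a context manager could behave,
not the project's implementation. ToyWrapper, ablate_quarantine_sketch and
the list-of-floats knobs are hypothetical stand-ins; the real wrappers
presumably hold tensors. The idea is to zero delta_S_hack for the duration of
the extraction and restore the trained values afterwards, even if the
extraction raises.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class ToyWrapper:
    """Hypothetical stand-in for one adapter wrapper (not the project's class)."""
    delta_S: list = field(default_factory=list)       # main knob
    delta_S_hack: list = field(default_factory=list)  # quarantine knob


@contextmanager
def ablate_quarantine_sketch(wrappers):
    """Temporarily zero every wrapper's delta_S_hack, restoring it on exit."""
    saved = [list(w.delta_S_hack) for w in wrappers]
    try:
        for w in wrappers:
            w.delta_S_hack = [0.0] * len(w.delta_S_hack)
        yield wrappers
    finally:
        # Restore the trained values even if the code inside the block raises.
        for w, old in zip(wrappers, saved):
            w.delta_S_hack = old


wrappers = [ToyWrapper(delta_S=[0.1, 0.2], delta_S_hack=[0.5, -0.3])]
with ablate_quarantine_sketch(wrappers):
    inside = list(wrappers[0].delta_S_hack)  # quarantine is zeroed here
after = list(wrappers[0].delta_S_hack)       # trained values are back
```

For an arm whose delta_S_hack is already all zeros, entering and leaving the
block changes nothing, which is consistent with the no-op claim for erase.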

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-05-31 00:04:10 +00:00
parent b88afbaf87
commit cec6542ed3
+14 -5
@@ -1430,11 +1430,20 @@ def main(cfg: Config) -> int:
 logger.disable("projected_grpo.extract_vhack_grad")
 logger.disable("__main__")
 try:
-    _new_V, _new_S, _, _ = extract_v_hack(
-        model, tok, wrappers, VHACK_PAIRS,
-        top_k=cfg.v_hack_extract_top_k, tau_axis=cfg.v_hack_tau_axis,
-        n_heldout=2, device=device,
-    )
+    # Extract with the quarantine ablated (delta_S_hack=0). For route,
+    # once the hack capability has been routed into delta_S_hack, the
+    # main-knob gradient on the pairs no longer carries the hack
+    # direction -- so re-extracting through the live quarantine rotates
+    # v_hack off-hack and cin_t collapses at the refresh step. Ablating
+    # sends the hack back through the observable main path so D captures
+    # it, matching the delta_S_hack=0 state the build extraction saw.
+    # No-op for erase (delta_S_hack is never trained, stays 0).
+    with ablate_quarantine(wrappers):
+        _new_V, _new_S, _, _ = extract_v_hack(
+            model, tok, wrappers, VHACK_PAIRS,
+            top_k=cfg.v_hack_extract_top_k, tau_axis=cfg.v_hack_tau_axis,
+            n_heldout=2, device=device,
+        )
     _post = postprocess_v_hack(
         _new_V, _new_S, k_use=cfg.v_hack_k,
         drop_bottom_frac=cfg.v_hack_drop_bottom_frac,