feat: generation-matched logπ_old baseline + global-quantile gate + frac=0 method

Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed
in each rollout's own sampling mode, so ρ is a true importance ratio. The old
always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the
one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the
mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/route).
Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled
absorb/route rows by construction -- frac=0 is the clean point.
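
Sketch of why the one-sided clip cannot bound a mismatched ρ when A<0. This
assumes the textbook PPO clipped surrogate with a per-token scalar ρ; the
function names, eps=0.2, and the log-prob values are hypothetical and not taken
from this repo.

```python
import math


def importance_ratio(logp_new: float, logp_old: float) -> float:
    """rho = pi_new / pi_old, computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


def ppo_clip_loss(rho: float, adv: float, eps: float = 0.2) -> float:
    """Negated PPO clipped surrogate for a single token (assumed textbook form)."""
    clipped = max(1.0 - eps, min(rho, 1.0 + eps))
    # min() picks the pessimistic branch. For adv < 0 and rho > 1 + eps that is
    # the unclipped term rho * adv, so the loss grows linearly with rho.
    return -min(rho * adv, clipped * adv)


# Baseline computed in the same mode as sampling: rho starts at 1.
matched = importance_ratio(-2.0, -2.0)
# Baseline computed in a different mode (hypothetical 12-nat gap): rho is huge.
mismatched = importance_ratio(-2.0, -14.0)

loss_matched_neg = ppo_clip_loss(matched, adv=-1.0)
loss_mismatched_neg = ppo_clip_loss(mismatched, adv=-1.0)  # scales with rho
loss_mismatched_pos = ppo_clip_loss(mismatched, adv=+1.0)  # capped at -(1 + eps)
```

With A>0 the clip caps the mismatched row at -(1+eps); with A<0 the same row
contributes ρ·|A|, so only keeping ρ near 1 at sampling time bounds it.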

Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1)
over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean
era anchors the low tail; buffer stores acts, re-scored vs current v_act so a
refresh needs no flush). Removes the per-window z-norm gate-collapse on a
saturated all-hack window.
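
Minimal sketch of a global-quantile gate over a buffer that stores raw acts and
re-scores them against the current v_act. The class name, the dot-product
score, and the "low"/"mid"/"high" labels are assumptions for illustration; the
mapping from tails to keep/absorb/route zones is not specified here and is left
out.

```python
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    i = int(pos)
    j = min(i + 1, len(sorted_vals) - 1)
    return sorted_vals[i] + (pos - i) * (sorted_vals[j] - sorted_vals[i])


class QuantileGate:
    def __init__(self, tail_q=0.1, maxlen=8192):
        self.tail_q = tail_q
        # Store activations, not scores, so a new v_act needs no flush.
        self.acts = deque(maxlen=maxlen)

    def add(self, act):
        self.acts.append(list(act))

    def thresholds(self, v_act):
        if not self.acts:
            raise ValueError("buffer is empty")
        # Re-score the whole buffer against the *current* direction.
        scores = sorted(dot(a, v_act) for a in self.acts)
        return quantile(scores, self.tail_q), quantile(scores, 1.0 - self.tail_q)

    def zone(self, act, v_act):
        lo, hi = self.thresholds(v_act)
        s = dot(act, v_act)
        if s <= lo:
            return "low"
        if s >= hi:
            return "high"
        return "mid"


gate = QuantileGate(tail_q=0.1, maxlen=8192)
for k in range(11):
    gate.add([float(k)])

zones_before = [gate.zone([x], v_act=[1.0]) for x in (0.0, 5.0, 10.0)]
# Refresh v_act: the same buffered acts are re-scored, nothing is discarded.
zones_after = [gate.zone([x], v_act=[-1.0]) for x in (0.0, 5.0, 10.0)]
```

Because thresholds come from the whole buffer rather than one window, a window
where every row scores high still lands in the high tail instead of being
normalized back to the middle.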

gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it
elicits the hack and absorption can localize it. queue-decision now passes
--gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the
job-34 config where ablation RAISED hack 0.71->0.86).

Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL
2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-12 03:22:48 +00:00
parent ec11bf58b2
commit af420ec855
9 changed files with 391 additions and 114 deletions
+5 -1
@@ -191,7 +191,7 @@ Strangely enough a random vector also does an OK job (numbers) which I don't hav
#
# 2026-06-11 12:18:46
> Routing itself suppresses hacking a lot, but the hacking vector improves the tradeoff: lower hack and higher clean solve than random routing.
@@ -208,3 +208,7 @@ Changed
# 2026-06-11 12:18:43
I found activations (and the residual stream) are better for routing than gradients. I ran an analysis where I treated routing like a classifier to see which formulation had the most fundamental separability, and which vector had the best AUROC when treated as a classifier.
Then simplified it anyway
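
A rough sketch of the "routing as a classifier" comparison: project each activation onto a candidate vector and compute AUROC against hack/clean labels. The helper names and the toy data are hypothetical and are not the analysis actually run; a real comparison would use recorded activations.

```python
def project(acts, v):
    """Score each activation by its dot product with candidate vector v."""
    return [sum(a * b for a, b in zip(act, v)) for act in acts]


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: the first coordinate separates the classes, the second does not.
acts = [[0.0, 1.0], [1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
labels = [0, 0, 1, 1]  # 1 = hack, 0 = clean (assumed labelling)

auroc_good = auroc(project(acts, [1.0, 0.0]), labels)
auroc_uninformative = auroc(project(acts, [0.0, 1.0]), labels)
```

The same function can rank any candidate (hack vector, random vector, gradient-derived direction) on one scale, since AUROC depends only on the ordering of scores.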