spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate)

Routing stays vector-based (cos>tau, not the detector flag) but tau is the per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only calibrates, gt_pass never gates.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-01 02:08:26 +00:00
parent 1d105a93a4
commit acc23885b6
1 changed files with 100 additions and 0 deletions
@@ -0,0 +1,100 @@
+# Spec — per-step calibrated threshold τ for route2-grad routing
+
+Status: APPROVED by user 2026-06-01, implement + smoke + queue + follow.
+Context survives compaction here; resume from "Implementation" if context was lost.
+
+## Hypothesis / why
+
+route2-act and route2-grad both gate routing on `cos(·, v) > 0`. In high
+dimensions `cos ≈ 0` for almost everything, so the sign is mostly noise and a
+threshold at zero is a ~50% coin flip -> over-routing -> half of ALL learning
+dumped into the deleted quarantine -> collapse (job 46) or null. The fix is NOT
+to route by the detector flag: that is SGTM and throws away the vector novelty
+(the flag only catches known hack A; the whole point is that the *direction*
+generalises to unknown B). The fix is to keep routing by the v_hack DIRECTION
+but **calibrate the threshold per step** so it rides the cos drift (we measured
+cin_t 0.27->0.07 over ~10 steps, so any fixed τ goes stale and silently stops
+firing).
+
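The coin-flip claim can be checked with a small stdlib-only sketch. It is not taken from train.py: the dimension, sample count, seed, and the helper names `cosine` and `frac_above` are all assumptions chosen for illustration. It draws random Gaussian vectors, measures their cosine against one fixed direction, and compares how many clear a threshold of 0 versus a clearly positive threshold.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def frac_above(cosines, tau):
    """Fraction of cosines strictly above the threshold tau."""
    return sum(c > tau for c in cosines) / len(cosines)


rng = random.Random(0)
DIM, N = 256, 1000  # assumed sizes, illustration only
v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # stand-in for the routing direction
cosines = [
    cosine([rng.gauss(0.0, 1.0) for _ in range(DIM)], v) for _ in range(N)
]

print("mean |cos|     :", sum(abs(c) for c in cosines) / N)
print("frac cos > 0   :", frac_above(cosines, 0.0))
print("frac cos > 0.2 :", frac_above(cosines, 0.2))
```

With unrelated vectors the cosines cluster tightly around zero, so about half of them land above a threshold of 0 while almost none clear 0.2. Real rollouts are not random vectors, so this only illustrates why the sign of a near-zero cosine carries little information.
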
+Driving principle (one): route by where a rollout falls on the v_hack axis
+RELATIVE to the known-hack vs clean clouds, recalibrated each step.
+
+## Mechanism (per module, per step, EMA-smoothed)
+
+Anchors available every prompt without a new buffer:
+- hack cloud = teacher rollouts (known-A hacks, always present) + detector-
+  flagged student rollouts (`hack_E` = loophole used; the allowed weak detector).
+- clean cloud = NOT-flagged student rollouts. (Contaminated with unknown B -- that
+  is FINE and intended: it makes τ slightly conservative, and B, sharing the
+  direction, lands ABOVE τ and gets routed. Do NOT force-keep this set.)
+
+Per module `name`, from this prompt's `cos_b` [G] (already computed in
+`_route2_grad_filter`):
+- μ_hack_prompt = mean(cos_b[hack_anchor]) ; μ_clean_prompt = mean(cos_b[clean_anchor])
+- EMA across prompts/steps (β=0.9): ema_hack[name], ema_clean[name]
+  (EMA is the cheap equivalent of "last N hacks": teacher is the always-present
+  floor, recent student-flagged hacks sharpen it. Explicit rollout buffer =
+  TODO if teacher-anchor proves biased toward teacher-style hacks.)
+- τ[name] = (ema_hack[name] + ema_clean[name]) / 2
+- route_mask = hack_anchor | (cos_b > τ[name])   # force-route known hacks
+  (teacher + flagged student); τ-route the ambiguous rest (incl. unknown B).
+
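A minimal sketch of the mechanism above, using plain Python lists in place of the tensors in train.py. The function name `update_and_route` is hypothetical, and seeding each EMA with the first observed mean is an assumption, since the spec does not say how the EMA is initialised.

```python
EMA_BETA = 0.9  # beta from the spec


def update_and_route(name, cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean):
    """Update both cloud EMAs for module `name`; return (tau, route_mask)."""
    for ema, anchor in ((ema_hack, hack_anchor), (ema_clean, clean_anchor)):
        rows = [c for c, a in zip(cos_b, anchor) if a]
        if not rows:
            continue  # skip a cloud that has no rows this prompt
        mu = sum(rows) / len(rows)
        if name in ema:
            ema[name] = EMA_BETA * ema[name] + (1 - EMA_BETA) * mu
        else:
            ema[name] = mu  # assumption: seed with the first observed mean
    # Midpoint of the two clouds; a cloud never seen yet counts as 0.
    tau = (ema_hack.get(name, 0.0) + ema_clean.get(name, 0.0)) / 2
    # Force-route known hacks, tau-route everything else.
    route = [h or (c > tau) for c, h in zip(cos_b, hack_anchor)]
    return tau, route


# One prompt: three unflagged student rows, then two teacher rows.
ema_hack, ema_clean = {}, {}
cos_b = [0.02, 0.05, 0.18, 0.30, 0.25]
hack_anchor = [False, False, False, True, True]
clean_anchor = [True, True, True, False, False]
tau, route = update_and_route(
    "layer0", cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean
)
print(tau, route)
```

In this example the third student row is not flagged, but its cosine of 0.18 sits just above the midpoint, so it is routed anyway. That is the intended path for an unknown-B rollout that shares the direction.
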
+Warmup: at step 0 delta_S≈0, so there are few reliable axes -> cos_b≈0 ->
+μ_hack≈μ_clean≈0 -> τ≈0 (≈ the old cos>0 behaviour) until delta_S grows.
+Separation (hkgap) emerges over the first few steps. Sanity: after a few steps
+expect roughly μ_clean < 0.1 and μ_hack > 0.2 (the user's hardcode intuition,
+but self-calibrated).
+
+## Logging (new columns, route2-grad only)
+
+- `tau`   = mean over modules of τ[name]
+- `hkgap` = mean over modules of (ema_hack - ema_clean)  <- the discrimination
+  gauge; generalises cin_t>cin_s. If hkgap collapses to ~0 the direction stopped
+  separating (the real failure signal, not a threshold-choice problem).
+
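The two columns can be sketched as a reduction over the per-step lists. This is a sketch only: `summarise_step` is a hypothetical helper, and the NaN fallback for a step with no route2-grad calls mirrors the row-dict expression given in the implementation notes.

```python
def summarise_step(step_tau, step_hkgap):
    """Collapse the per-call tau / hkgap lists into one row value each."""
    nan = float("nan")
    return {
        "tau": sum(step_tau) / len(step_tau) if step_tau else nan,
        "hkgap": sum(step_hkgap) / len(step_hkgap) if step_hkgap else nan,
    }


# Example values, assumed for illustration only.
row = summarise_step([0.15, 0.21], [0.18, 0.22])
print(format(row["tau"], "+.2f"), format(row["hkgap"], "+.2f"))
```

A sustained positive `hkgap` means the hack cloud still sits above the clean cloud on the axis. A value drifting toward zero means the midpoint no longer separates anything, whatever `tau` happens to be.
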
+## Implementation (exact edits in src/projected_grpo/train.py)
+
+1. Before the step loop (near `route_span_checked = False`, ~L1082): add
+   `ema_hack_cos: dict = {}`, `ema_clean_cos: dict = {}`, `route2_tau: dict = {}`
+   (persist across steps). EMA_BETA = 0.9.
+
+2. Per-step lists near `step_flagged` (~L1192): `step_tau = []`, `step_hkgap = []`.
+
+3. Per-prompt anchor masks: after grading (hack_E_flags built, ~L1396) and before
+   the backward, when is_route2_grad, build device bool tensors aligned to the
+   merged rows [student 0..G_s-1, teacher G_s..]:
+     hack_anchor[i]  = (not is_student[i]) or (is_student[i] and hack_E_flags[i])
+     clean_anchor[i] = is_student[i] and not hack_E_flags[i]
+   (hack_E_flags has length G_s = #student rows; teacher rows -> hack_anchor.)
+
+4. `_route2_grad_filter` (~L1194): add params (hack_anchor, clean_anchor).
+   Replace `flagged = (cos_b > 0).float()` with the EMA-calibrated τ logic:
+     - update ema_hack_cos[name]/ema_clean_cos[name] from this prompt's anchors
+       (skip a cloud if it has no rows this prompt)
+     - τ = (ema_hack_cos.get(name,0)+ema_clean_cos.get(name,0))/2 ; route2_tau[name]=τ
+     - route = (cos_b > τ) | hack_anchor   ; flagged = route.float()
+     - step_tau.append(τ); step_hkgap.append(ema_hack_cos.get(name,0)-ema_clean_cos.get(name,0))
+   Pass hack_anchor/clean_anchor through the call site (~L1518).
+
+5. Row dict (~L1864): add `"tau": (mean(step_tau) if step_tau else float("nan"))`,
+   `"hkgap": (mean(step_hkgap) if step_hkgap else float("nan"))`.
+
+6. StepLogger cols (~L711, after the routing2_act block): add
+     if arm == "routing2_grad":
+         _Col("tau",   6, "tau",   "+.2f", "per-step calibrated route threshold (midpoint of hack vs clean cos clouds)")
+         _Col("hkgap", 6, "hkgap", "+.2f", "ema_hack_cos - ema_clean_cos; >0 = v_grad still separates hack from clean")
+
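The anchor masks from step 3 can be sketched with plain lists standing in for the device bool tensors. `build_anchor_masks` and its `n_teacher` parameter are hypothetical names; the only thing taken from the spec is the merged row layout of student rows first, teacher rows after.

```python
def build_anchor_masks(hack_E_flags, n_teacher):
    """Masks over merged rows [student 0..G_s-1, teacher G_s..].

    hack_E_flags has one bool per student row (length G_s).
    """
    # Flagged students and every teacher row anchor the hack cloud.
    hack_anchor = [bool(f) for f in hack_E_flags] + [True] * n_teacher
    # Unflagged students anchor the clean cloud; teacher rows never do.
    clean_anchor = [not f for f in hack_E_flags] + [False] * n_teacher
    return hack_anchor, clean_anchor


hack_anchor, clean_anchor = build_anchor_masks([False, True, False], n_teacher=2)
print(hack_anchor)   # [False, True, False, True, True]
print(clean_anchor)  # [True, False, True, False, False]
```

Every row ends up in exactly one of the two clouds, so the masks partition the merged batch and can be passed straight through to the filter alongside `cos_b`.
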
+## Verify / queue / follow
+
+- `just smoke-route2 --route2-mask=grad` (or the smoke recipe that hits grad
+  path): confirm tau/hkgap columns render, routing fires (flagged frac < ~0.5,
+  not pinned at 0.5), exit 0.
+- Queue (substrate, 60 steps, seed 41), label:
+  why: does per-step calibrated-τ vector routing (route2-grad) stop over-routing
+  (flagged<<0.5) and suppress held-out deploy-hack vs vanilla at matched solve;
+  resolve: qE bounded + hkgap>0 sustained + deploy file_marker hack < vanilla.
+- `pueue follow <id> | tail` in background.
+
+## No-cheat check
+
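+Detector flag (hack_E) CALIBRATES the clouds (flagged -> μ_hack, not-flagged ->
+μ_clean) and force-routes only the known-A rollouts it flags (the allowed weak
+detector); teacher rows are hack anchors by construction. Every other rollout,
+incl. unknown B, is gated by cos>τ alone. gt_pass never touches routing.
+Compliant.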