diff --git a/docs/spec/20260601_calibrated_tau_route2grad.md b/docs/spec/20260601_calibrated_tau_route2grad.md
new file mode 100644
index 0000000..45635dc
--- /dev/null
+++ b/docs/spec/20260601_calibrated_tau_route2grad.md
@@ -0,0 +1,100 @@
+# Spec — per-step calibrated threshold τ for route2-grad routing
+
+Status: APPROVED by user 2026-06-01, implement + smoke + queue + follow.
+Context survives compaction here; resume from "Implementation" if context was lost.
+
+## Hypothesis / why
+
+route2-act and route2-grad both gate routing on `cos(·, v) > 0`. In high-dim,
+`cos ≈ 0` for almost everything, so an absolute-zero threshold is a ~50% coin
+flip -> over-routing -> half of ALL learning dumped into the deleted quarantine
+-> collapse (job 46) or null. The fix is NOT to route by the detector flag (that
+= SGTM, throws away the vector novelty: the flag only catches known hack A, the
+whole point is the *direction* generalises to unknown B). The fix is to keep
+routing by the v_hack DIRECTION but **calibrate the threshold per step** so it
+rides the cos drift (we measured cin_t 0.27->0.07 over ~10 steps, so any fixed
+τ goes stale and silently stops firing).
+
+Driving principle (one): route by where a rollout falls on the v_hack axis
+RELATIVE to the known-hack vs clean clouds, recalibrated each step.
+
+## Mechanism (per module, per step, EMA-smoothed)
+
+Anchors available every prompt without a new buffer:
+- hack cloud = teacher rollouts (known-A hacks, always present) + detector-
+  flagged student rollouts (`hack_E` = loophole used; the allowed weak detector).
+- clean cloud = NOT-flagged student rollouts. (Contaminated with unknown B -- that
+  is FINE and intended: it makes τ slightly conservative, and B, sharing the
+  direction, lands ABOVE τ and gets routed. Do NOT force-keep this set.)
+
+Per module `name`, from this prompt's `cos_b` [G] (already computed in
+`_route2_grad_filter`):
+- μ_hack_prompt = mean(cos_b[hack_anchor]) ; μ_clean_prompt = mean(cos_b[clean_anchor])
+- EMA across prompts/steps (β=0.9): ema_hack[name], ema_clean[name]
+  (EMA is the cheap equivalent of "last N hacks": teacher is the always-present
+  floor, recent student-flagged hacks sharpen it. Explicit rollout buffer =
+  TODO if teacher-anchor proves biased toward teacher-style hacks.)
+- τ[name] = (ema_hack[name] + ema_clean[name]) / 2
+- route_mask = hack_anchor | (cos_b > τ[name]) # force-route known hacks
+  (teacher + flagged student); τ-route the ambiguous rest (incl. unknown B).
+
+Warmup: at step 0 delta_S≈0 so few reliable axes -> cos_b≈0 -> μ_hack≈μ_clean≈0
+-> τ≈0 (≈ old cos>0 behaviour) until delta_S grows. Separation (hkgap) emerges
+over the first few steps. Sanity: by a few steps μ_clean<~0.1, μ_hack>~0.2-ish
+(the user's hardcode intuition, but self-calibrated).
+
+## Logging (new columns, route2-grad only)
+
+- `tau` = mean over modules of τ[name]
+- `hkgap` = mean over modules of (ema_hack - ema_clean) <- the discrimination
+  gauge; generalises cin_t>cin_s. If hkgap collapses to ~0 the direction stopped
+  separating (the real failure signal, not a threshold-choice problem).
+
+## Implementation (exact edits in src/projected_grpo/train.py)
+
+1. Before the step loop (near `route_span_checked = False`, ~L1082): add
+   `ema_hack_cos: dict = {}`, `ema_clean_cos: dict = {}`, `route2_tau: dict = {}`
+   (persist across steps). EMA_BETA = 0.9.
+
+2. Per-step lists near `step_flagged` (~L1192): `step_tau = []`, `step_hkgap = []`.
+
+3. Per-prompt anchor masks: after grading (hack_E_flags built, ~L1396), before
+   the backward, when is_route2_grad build device bool tensors aligned to merged
+   rows [student 0..G_s-1, teacher G_s..]:
+     hack_anchor[i] = (not is_student[i]) or (is_student[i] and hack_E_flags[i])
+     clean_anchor[i] = is_student[i] and not hack_E_flags[i]
+   (hack_E_flags has length G_s = #student rows; teacher rows -> hack_anchor.)
+
+4. `_route2_grad_filter` (~L1194): add params (hack_anchor, clean_anchor).
+   Replace `flagged = (cos_b > 0).float()` with the EMA-calibrated τ logic:
+   - update ema_hack_cos[name]/ema_clean_cos[name] from this prompt's anchors
+     (skip a cloud if it has no rows this prompt)
+   - τ = (ema_hack_cos.get(name,0)+ema_clean_cos.get(name,0))/2 ; route2_tau[name]=τ
+   - route = (cos_b > τ) | hack_anchor ; flagged = route.float()
+   - step_tau.append(τ); step_hkgap.append(ema_hack_cos.get(name,0)-ema_clean_cos.get(name,0))
+   Pass hack_anchor/clean_anchor through the call site (~L1518).
+
+5. Row dict (~L1864): add `"tau": (mean(step_tau) if step_tau else float("nan"))`,
+   `"hkgap": (mean(step_hkgap) if step_hkgap else float("nan"))`.
+
+6. StepLogger cols (~L711, after the routing2_act block): add
+     if arm == "routing2_grad":
+         _Col("tau", 6, "tau", "+.2f", "per-step calibrated route threshold (midpoint of hack vs clean cos clouds)")
+         _Col("hkgap", 6, "hkgap", "+.2f", "ema_hack_cos - ema_clean_cos; >0 = v_grad still separates hack from clean")
+
+## Verify / queue / follow
+
+- `just smoke-route2 --route2-mask=grad` (or the smoke recipe that hits grad
+  path): confirm tau/hkgap columns render, routing fires (flagged frac < ~0.5,
+  not pinned at 0.5), exit 0.
+- Queue (substrate, 60 steps, seed 41), label:
+    why: does per-step calibrated-τ vector routing (route2-grad) stop over-routing
+         (flagged<<0.5) and suppress held-out deploy-hack vs vanilla at matched solve;
+    resolve: qE bounded + hkgap>0 sustained + deploy file_marker hack < vanilla.
+- `pueue follow | tail` in background.
+
+## No-cheat check
+
+Detector flag (hack_E) only CALIBRATES μ_hack (known A) + anchors teacher hacks.
+The routing GATE is cos>τ applied to ALL rollouts incl. unknown B. gt_pass never
+touches routing. Compliant.
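
Illustrative sketch (not part of the patch): the anchor masks from step 3 and the EMA-calibrated τ logic from step 4 could look roughly like the following in plain Python. The real code in `train.py` works on device tensors inside `_route2_grad_filter`; the helper names `build_anchors` and `update_tau` are hypothetical, and seeding each EMA with its first observed mean is an assumption, since the spec does not say how the EMA is initialised.

```python
from statistics import fmean

EMA_BETA = 0.9  # β from the spec


def build_anchors(is_student, hack_E_flags):
    """Hypothetical helper: per-row anchor masks (step 3 of the spec).

    hack_E_flags has one entry per student row; teacher rows are always hack anchors.
    """
    flags = iter(hack_E_flags)
    hack_anchor, clean_anchor = [], []
    for student in is_student:
        flagged = next(flags) if student else False
        hack_anchor.append((not student) or flagged)
        clean_anchor.append(student and not flagged)
    return hack_anchor, clean_anchor


def update_tau(name, cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean, beta=EMA_BETA):
    """Hypothetical helper: update one module's EMAs, return (tau, route_mask)."""
    for ema, anchor in ((ema_hack, hack_anchor), (ema_clean, clean_anchor)):
        rows = [c for c, a in zip(cos_b, anchor) if a]
        if not rows:  # skip a cloud that has no rows this prompt
            continue
        mu = fmean(rows)
        # Assumption: the first observation seeds the EMA; later ones are blended in.
        ema[name] = mu if name not in ema else beta * ema[name] + (1 - beta) * mu
    # Midpoint of the two clouds; a missing cloud counts as 0, as in the spec's .get(name, 0).
    tau = (ema_hack.get(name, 0.0) + ema_clean.get(name, 0.0)) / 2
    # Force-route known hacks; tau-route everything else (incl. unknown B).
    route = [bool(h) or c > tau for c, h in zip(cos_b, hack_anchor)]
    return tau, route


# Merged rows: three student rollouts followed by one teacher rollout.
ema_hack, ema_clean = {}, {}
hack_anchor, clean_anchor = build_anchors(
    is_student=[True, True, True, False],
    hack_E_flags=[False, True, False],
)
tau, route = update_tau(
    "layer0", [0.02, 0.30, 0.25, 0.40], hack_anchor, clean_anchor, ema_hack, ema_clean
)
hkgap = ema_hack["layer0"] - ema_clean["layer0"]
```

In this toy prompt the third student rollout is not flagged by the detector but sits above τ, so it is routed by direction alone. That is the behaviour the spec wants for unknown-B hacks.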