mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate)
Routing stays vector-based (cos>tau, not the detector flag) but tau is the per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only calibrates, gt_pass never gates.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,100 @@
# Spec: per-step calibrated threshold τ for route2-grad routing

Status: APPROVED by user 2026-06-01, implement + smoke + queue + follow.
Context survives compaction here; resume from "Implementation" if context was lost.

## Hypothesis / why

route2-act and route2-grad both gate routing on `cos(·, v) > 0`. In high-dim,
`cos ≈ 0` for almost everything, so an absolute-zero threshold is a ~50% coin
flip -> over-routing -> half of ALL learning dumped into the deleted quarantine
-> collapse (job 46) or null. The fix is NOT to route by the detector flag (that
= SGTM, throws away the vector novelty: the flag only catches known hack A, the
whole point is the *direction* generalises to unknown B). The fix is to keep
routing by the v_hack DIRECTION but **calibrate the threshold per step** so it
rides the cos drift (we measured cin_t 0.27->0.07 over ~10 steps, so any fixed
τ goes stale and silently stops firing).

Driving principle (one): route by where a rollout falls on the v_hack axis
RELATIVE to the known-hack vs clean clouds, recalibrated each step.

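The coin-flip claim can be checked numerically. The sketch below is illustrative only: it assumes isotropic Gaussian vectors as a stand-in for per-rollout gradients (real gradients are not isotropic), and `positive_fraction` is a hypothetical helper, not something in train.py. It measures how often `cos(g, v) > 0` fires against a fixed direction and how small `|cos|` typically is.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def positive_fraction(dim, n_samples, seed=0):
    """Return (fraction of samples with cos > 0, mean |cos|) against one fixed direction.

    Assumption: samples and the direction are isotropic Gaussian vectors,
    which only approximates the high-dim setting described in the spec.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    cosines = []
    for _ in range(n_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        cosines.append(cosine(g, v))
    frac_pos = sum(c > 0 for c in cosines) / n_samples
    mean_abs = sum(abs(c) for c in cosines) / n_samples
    return frac_pos, mean_abs


frac_pos, mean_abs = positive_fraction(dim=512, n_samples=400)
print(f"fraction with cos > 0: {frac_pos:.2f}, mean |cos|: {mean_abs:.3f}")
```

Under this assumption roughly half the samples clear the zero threshold even though none of them is meaningfully aligned with `v`, which is the over-routing failure the spec describes.
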
## Mechanism (per module, per step, EMA-smoothed)

Anchors available every prompt without a new buffer:
- hack cloud = teacher rollouts (known-A hacks, always present) + detector-flagged
  student rollouts (`hack_E` = loophole used; the allowed weak detector).
- clean cloud = NOT-flagged student rollouts. (Contaminated with unknown B -- that
  is FINE and intended: it makes τ slightly conservative, and B, sharing the
  direction, lands ABOVE τ and gets routed. Do NOT force-keep this set.)

Per module `name`, from this prompt's `cos_b` [G] (already computed in
`_route2_grad_filter`):
- μ_hack_prompt = mean(cos_b[hack_anchor]) ; μ_clean_prompt = mean(cos_b[clean_anchor])
- EMA across prompts/steps (β=0.9): ema_hack[name], ema_clean[name]
  (EMA is the cheap equivalent of "last N hacks": teacher is the always-present
  floor, recent student-flagged hacks sharpen it. Explicit rollout buffer =
  TODO if teacher-anchor proves biased toward teacher-style hacks.)
- τ[name] = (ema_hack[name] + ema_clean[name]) / 2
- route_mask = hack_anchor | (cos_b > τ[name])  # force-route known hacks
  (teacher + flagged student); τ-route the ambiguous rest (incl. unknown B).

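The per-module update above can be sketched in plain Python. This is a minimal illustration, not the train.py code: it uses lists where the real filter would use tensors, `masked_mean`, `ema_update` and `route` are hypothetical names, and seeding the EMA with the first observed mean is an assumption (the spec only fixes β=0.9 and the midpoint rule).

```python
EMA_BETA = 0.9  # smoothing factor given in the spec


def masked_mean(values, mask):
    """Mean of the values whose mask entry is True, or None if the mask is empty."""
    picked = [v for v, m in zip(values, mask) if m]
    return sum(picked) / len(picked) if picked else None


def ema_update(store, name, value, beta=EMA_BETA):
    """Update store[name] in place; skip when the cloud had no rows this prompt."""
    if value is None:
        return
    if name not in store:
        store[name] = value  # assumption: seed the EMA with the first observation
    else:
        store[name] = beta * store[name] + (1.0 - beta) * value


def route(name, cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean):
    """Return (route_mask, tau) for one module on one prompt."""
    ema_update(ema_hack, name, masked_mean(cos_b, hack_anchor))
    ema_update(ema_clean, name, masked_mean(cos_b, clean_anchor))
    # Midpoint of the two clouds; a cloud never seen so far counts as 0.
    tau = (ema_hack.get(name, 0.0) + ema_clean.get(name, 0.0)) / 2.0
    # Known hacks are force-routed; everything else is routed only above tau.
    mask = [h or (c > tau) for c, h in zip(cos_b, hack_anchor)]
    return mask, tau


ema_hack, ema_clean = {}, {}
# Rows 0..2 are unflagged student rollouts, rows 3..4 are teacher rollouts.
cos_b = [0.02, 0.05, 0.22, 0.30, 0.28]
hack_anchor = [False, False, False, True, True]
clean_anchor = [True, True, True, False, False]
mask, tau = route("layer0", cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean)
print(mask, round(tau, 3))
```

In this toy prompt row 2 is unflagged but sits close to the hack cloud on the axis, so it lands above τ and is routed, which is the intended behaviour for unknown B.
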
Warmup: at step 0 delta_S≈0 so few reliable axes -> cos_b≈0 -> μ_hack≈μ_clean≈0
-> τ≈0 (≈ old cos>0 behaviour) until delta_S grows. Separation (hkgap) emerges
over the first few steps. Sanity: within a few steps μ_clean<~0.1, μ_hack>~0.2-ish
(the user's hardcode intuition, but self-calibrated).

## Logging (new columns, route2-grad only)

- `tau` = mean over modules of τ[name]
- `hkgap` = mean over modules of (ema_hack - ema_clean) <- the discrimination
  gauge; generalises cin_t>cin_s. If hkgap collapses to ~0 the direction stopped
  separating (the real failure signal, not a threshold-choice problem).

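As a rough sketch of how the two columns could be derived from the EMA state, assuming the per-module dicts described in this spec: `step_columns` is a hypothetical helper, and it averages over modules only, whereas the actual edit may instead append per-prompt values to per-step lists.

```python
import math
from statistics import mean


def step_columns(ema_hack, ema_clean):
    """Return the `tau` and `hkgap` log values as means over modules.

    A module missing from one dict contributes 0 for that cloud; with no
    modules at all both columns are NaN so the logger still has a value.
    """
    names = sorted(set(ema_hack) | set(ema_clean))
    if not names:
        return {"tau": float("nan"), "hkgap": float("nan")}
    taus = [(ema_hack.get(n, 0.0) + ema_clean.get(n, 0.0)) / 2.0 for n in names]
    gaps = [ema_hack.get(n, 0.0) - ema_clean.get(n, 0.0) for n in names]
    return {"tau": mean(taus), "hkgap": mean(gaps)}


cols = step_columns({"a": 0.3, "b": 0.2}, {"a": 0.1, "b": 0.0})
print(f"tau={cols['tau']:+.2f} hkgap={cols['hkgap']:+.2f}")
```

A `hkgap` near zero in this output would mean the two clouds have merged on the axis, regardless of where τ sits.
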
## Implementation (exact edits in src/projected_grpo/train.py)

1. Before the step loop (near `route_span_checked = False`, ~L1082): add
   `ema_hack_cos: dict = {}`, `ema_clean_cos: dict = {}`, `route2_tau: dict = {}`
   (persist across steps). EMA_BETA = 0.9.

2. Per-step lists near `step_flagged` (~L1192): `step_tau = []`, `step_hkgap = []`.

3. Per-prompt anchor masks: after grading (hack_E_flags built, ~L1396), before
   the backward, when is_route2_grad, build device bool tensors aligned to merged
   rows [student 0..G_s-1, teacher G_s..]:
       hack_anchor[i]  = (not is_student[i]) or (is_student[i] and hack_E_flags[i])
       clean_anchor[i] = is_student[i] and not hack_E_flags[i]
   (hack_E_flags has length G_s = #student rows; teacher rows -> hack_anchor.)

4. `_route2_grad_filter` (~L1194): add params (hack_anchor, clean_anchor).
   Replace `flagged = (cos_b > 0).float()` with the EMA-calibrated τ logic:
   - update ema_hack_cos[name]/ema_clean_cos[name] from this prompt's anchors
     (skip a cloud if it has no rows this prompt)
   - τ = (ema_hack_cos.get(name,0)+ema_clean_cos.get(name,0))/2 ; route2_tau[name]=τ
   - route = (cos_b > τ) | hack_anchor ; flagged = route.float()
   - step_tau.append(τ); step_hkgap.append(ema_hack_cos.get(name,0)-ema_clean_cos.get(name,0))
   Pass hack_anchor/clean_anchor through the call site (~L1518).

5. Row dict (~L1864): add `"tau": (mean(step_tau) if step_tau else float("nan"))`,
   `"hkgap": (mean(step_hkgap) if step_hkgap else float("nan"))`.

6. StepLogger cols (~L711, after the routing2_act block): add
       if arm == "routing2_grad":
           _Col("tau", 6, "tau", "+.2f", "per-step calibrated route threshold (midpoint of hack vs clean cos clouds)")
           _Col("hkgap", 6, "hkgap", "+.2f", "ema_hack_cos - ema_clean_cos; >0 = v_grad still separates hack from clean")

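The anchor-mask construction in step 3 can be sketched as follows. This is a list-based illustration under the row layout stated above (student rows first, then teacher rows); `build_anchor_masks` is a hypothetical name, and the real edit would produce device bool tensors rather than Python lists.

```python
def build_anchor_masks(num_student, num_teacher, hack_E_flags):
    """Return (hack_anchor, clean_anchor) aligned to rows [student..., teacher...].

    hack_E_flags covers student rows only, so teacher rows never index into it.
    """
    if len(hack_E_flags) != num_student:
        raise ValueError("hack_E_flags must have one entry per student row")
    is_student = [True] * num_student + [False] * num_teacher
    hack_anchor, clean_anchor = [], []
    for i, student in enumerate(is_student):
        # Short-circuit keeps teacher rows from reading past the end of hack_E_flags.
        flagged = student and bool(hack_E_flags[i])
        hack_anchor.append((not student) or flagged)
        clean_anchor.append(student and not flagged)
    return hack_anchor, clean_anchor


hack_anchor, clean_anchor = build_anchor_masks(3, 2, [False, True, False])
print(hack_anchor)
print(clean_anchor)
```

Every row falls in exactly one of the two masks, so the clouds partition the merged batch and the only way a cloud is empty is a prompt with no unflagged student rows or, with no teacher rows, no flagged ones.
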
## Verify / queue / follow

- `just smoke-route2 --route2-mask=grad` (or the smoke recipe that hits the grad
  path): confirm tau/hkgap columns render, routing fires (flagged frac < ~0.5,
  not pinned at 0.5), exit 0.
- Queue (substrate, 60 steps, seed 41), label:
    why: does per-step calibrated-τ vector routing (route2-grad) stop over-routing
    (flagged<<0.5) and suppress held-out deploy-hack vs vanilla at matched solve;
    resolve: qE bounded + hkgap>0 sustained + deploy file_marker hack < vanilla.
- `pueue follow <id> | tail` in background.

## No-cheat check

Detector flag (hack_E) only CALIBRATES μ_hack (known A) + anchors teacher hacks.
The routing GATE is cos>τ applied to ALL rollouts incl. unknown B. gt_pass never
touches routing. Compliant.