journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-07 22:39:01 +00:00
parent caa0d09472
commit 34ba631e7d
1 changed files with 61 additions and 0 deletions
@@ -3202,3 +3202,64 @@ rolling quantile (route top-q of live `cos(g,v_grad)`, threshold tracked across
 so batch composition varies naturally) -- decouples the threshold from how wide the
 off-distribution pairs are. Deferred: bigger change, not safe to deploy unattended
 across a running sweep.
+
+## 2026-06-07 -- DEFERRED IDEA: half-solve teacher pool to kill the off-policy/teacher-forcing confound
+
+**Context:** commit `caa0d09` on `probe/distill-cosine`; motivating run = job 9
+(real-V per-token, dir6) `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`.
+Not implemented -- parking the design. Driven by a confound the user raised.
+
+**Hypothesis (the confound):** vGROUT routing may key on *off-policy-ness* rather than
+*hack-content*. The teacher pool is ~99% hacks, so a teacher demo is simultaneously
+"hack" AND "off-policy" (low logp under the student; the lp_s-lp_t gap is real). A
+direction that separates teacher from student grads would route the hack for the wrong
+reason, and `cin_t > cin_s` cannot tell the two axes apart.
+
+### Observations
+
+- [obs] frout (routed fraction) front-loads then cliffs: ~0.00 at step 1, peak ~0.29 at
+  steps 2-5, decays through step 10, drops to ~0.006 at step 11 and stays <0.01 to the
+  end (grep "routed frac" in the job-9 log; 44 logged steps).
+- [obs] job 9 deploy (quarantine deleted, test n=119): hack(strict)=0.042 solve=0.143
+  vs train student hack 0.646 (log tail / `out/runs/.../deploy_test.json`).
+- [obs] v_grad is extracted from pairs where BOTH sides are teacher-forced hand-built
+  completions, so the *extraction* contrast is hack-vs-clean within teacher-forcing
+  (README "How it works"). Any confound enters at routing time, not extraction.
+
+### Inferences
+
+- [inf] The frout cliff reflects the hack being learned early, leaving no gradient to route.
+  {reason: "user's read: once the policy hacks, within-group GRPO advantage flattens (all
+  rollouts hack ~equally) so the hack-direction gradient collapses; teacher pool feeds ~1
+  hack demo/prompt from step 0 so acquisition by ~step 10 is plausible", credence: 0.6}
+- [inf] Suppression hinges on a narrow acquisition window (~first 10-15 steps); the band
+  must be correctly calibrated from the very start (it was: frout 0.29 early). {reason: "routing ~0
+  after step 11 yet deploy hack is suppressed -> the capture happened in the window",
+  credence: 0.55}
+
+**The deferred idea (two parts):**
+1. Make the teacher pool half-solve + half-hack. Then "off-policy/teacher-forced" no
+   longer correlates with "hack" (half the off-policy teacher demos are clean solves), so
+   the router can only earn `cin_t > cin_s` by keying on hack-content. Removes the confound
+   at the source rather than just measuring it.
+2. Maybe gate the whole intervention to the first ~15 steps (the acquisition window),
+   since routing does nothing after the gradient collapses anyway -- cheaper, avoids
+   late-run side effects.
+
+### Failure modes considered
+
+- **Most-likely:** the confound is small because v_grad is a within-teacher-forcing
+  contrast (both pair sides forced), so it may already encode hack-content, not off-policy-ness.
+  Prior 0.4. Check: vampire control (job 13) -- if it does NOT suppress, hack-specificity
+  holds and the half-solve pool buys little.
+- **Subtle:** a half-solve teacher pool weakens hack emergence (fewer hack demos) so
+  vanilla no longer ignites -> nothing to suppress -> uninterpretable. Prior 0.3. Check:
+  confirm vanilla deploy hack >> 0 on the half-solve pool before trusting any arm.
+- **Null:** the frout cliff is a per-token gauge artifact (the nan/empty-live fragility),
+  not real early-then-zero routing, so the "acquisition window" framing is wrong. Prior
+  0.25. Check: job 8 (per-rollout, reliable gauge) -- does its frout show the same cliff?
+
+### Next action
+
+No queue change now. Read the vampire (job 13) and per-rollout (job 8) frout when they
+land; build a half-solve pool only if the confound survives those.
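
Sketch of the two parts of the deferred idea, for when it gets picked up. Nothing here exists in the repo: `Demo`, `build_half_solve_pool`, and `routing_enabled` are hypothetical names, the pool is assumed to be a flat list of labeled demos, and steps are assumed 1-indexed as in the "routed frac" log lines. The 15-step window is the rough acquisition-window guess from the entry above, not a tuned value.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """Hypothetical teacher demo record (not the repo's actual type)."""
    prompt_id: int
    text: str
    is_hack: bool


def build_half_solve_pool(hacks, solves, seed=0):
    """Return a shuffled pool with equal numbers of hack and clean-solve demos.

    The smaller class caps the pool size, so a ~99%-hack source pool
    shrinks to 2 * len(solves) demos.
    """
    n = min(len(hacks), len(solves))
    if n == 0:
        raise ValueError("need at least one hack demo and one solve demo")
    rng = random.Random(seed)  # local RNG so the pool is reproducible per seed
    pool = rng.sample(hacks, n) + rng.sample(solves, n)
    rng.shuffle(pool)
    return pool


def routing_enabled(step, window=15):
    """True only inside the first `window` steps (assumes 1-indexed steps)."""
    return 1 <= step <= window
```

Capping at the smaller class keeps the split exactly 50/50, at the cost of fewer hack demos overall. That cost is the "Subtle" failure mode above, so the vanilla-ignites check still has to pass on the resulting pool before any arm is trusted.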