From 34ba631e7debcd6dca06d75096b55a0c1a55117d Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Sun, 7 Jun 2026 22:39:01 +0000
Subject: [PATCH] journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md | 61 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 353d532..7bb46f6 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -3202,3 +3202,64 @@ rolling quantile (route top-q of live `cos(g,v_grad)`, threshold tracked across
 so batch composition varies naturally) -- decouples the threshold from how wide the
 off-distribution pairs are. Deferred: bigger change, not safe to deploy unattended
 across a running sweep.
+
+## 2026-06-07 -- DEFERRED IDEA: half-solve teacher pool to kill the off-policy/teacher-forcing confound
+
+**Context:** commit `caa0d09` on `probe/distill-cosine`; motivating run = job 9
+(real-V per-token, dir6) `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`.
+Not implemented -- parking the design. Driven by a confound the user raised.
+
+**Hypothesis (the confound):** vGROUT routing may key on *off-policy-ness* rather than
+*hack-content*. The teacher pool is ~99% hacks, so a teacher demo is simultaneously
+"hack" AND "off-policy" (low logp under the student; the lp_s-lp_t gap is real). A
+direction that separates teacher from student grads would route the hack for the wrong
+reason, and `cin_t > cin_s` cannot tell the two axes apart.
+
+### Observations
+
+- [obs] frout (routed fraction) front-loads then cliffs: ~0.00 at step1, peak ~0.29 at
+ steps 2-5, decays through step 10, drops to ~0.006 at step 11 and stays <0.01 to the
+ end (grep "routed frac" in the job-9 log; 44 logged steps).
+- [obs] job 9 deploy (quarantine deleted, test n=119): hack(strict)=0.042 solve=0.143
+ vs train student hack 0.646 (log tail / `out/runs/.../deploy_test.json`).
+- [obs] v_grad is extracted from pairs where BOTH sides are teacher-forced hand-built
+ completions, so the *extraction* contrast is hack-vs-clean within teacher-forcing
+ (README "How it works"). Any confound enters at routing time, not extraction.
+
+### Inferences
+
+- [inf] The frout cliff is the hack being learned early, then no gradient left to route.
+ {reason: "user's read: once the policy hacks, within-group GRPO advantage flattens (all
+ rollouts hack ~equally) so the hack-direction gradient collapses; teacher pool feeds ~1
+ hack demo/prompt from step 0 so acquisition by ~step 10 is plausible", credence: 0.6}
+- [inf] Suppression hinges on a narrow acquisition window (~first 10-15 steps); the band
+ must be calibrated right at the start (it was: frout 0.29 early). {reason: "routing ~0
+ after step 11 yet deploy hack is suppressed -> the capture happened in the window",
+ credence: 0.55}
+
+**The deferred idea (two parts):**
+1. Make the teacher pool half-solve + half-hack. Then "off-policy/teacher-forced" no
+ longer correlates with "hack" (half the off-policy teacher demos are clean solves), so
+ the router can only earn cin_t>cin_s by keying on hack-content. Removes the confound
+ at the source rather than just measuring it.
+2. Maybe gate the whole intervention to the first ~15 steps (the acquisition window),
+ since routing does nothing after the gradient collapses anyway -- cheaper, avoids
+ late-run side effects.
+
+### Failure modes considered
+
+- **Most-likely:** the confound is small because v_grad is a within-teacher-forcing
+ contrast (both pair sides forced), so it may already encode hack-content not off-policy.
+ Prior 0.4. Check: vampire control (job 13) -- if it does NOT suppress, hack-specificity
+ holds and the half-solve pool buys little.
+- **Subtle:** a half-solve teacher pool weakens hack emergence (fewer hack demos) so
+ vanilla no longer ignites -> nothing to suppress -> uninterpretable. Prior 0.3. Check:
+ confirm vanilla deploy hack >> 0 on the half-solve pool before trusting any arm.
+- **Null:** the frout cliff is a per-token gauge artifact (the nan/empty-live fragility),
+ not real early-then-zero routing, so the "acquisition window" framing is wrong. Prior
+ 0.25. Check: job 8 (per-rollout, reliable gauge) -- does its frout show the same cliff?
+
+### Next action
+
+No queue change now. Read the vampire (job 13) and per-rollout (job 8) frout when they
+land; build a half-solve pool only if the confound survives those.
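
Reviewer sketch (not part of the patch, nothing here is implemented in the repo): the two-part deferred idea could look roughly like the following. Every name (`Demo`, `build_balanced_pool`, `routing_enabled`) is hypothetical, and the 1-indexed step convention and the default window of 15 are assumptions taken from the "~first 15 steps" wording, not from the actual training code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """Hypothetical teacher demo record; field names are illustrative only."""
    prompt_id: int
    text: str
    is_hack: bool


def build_balanced_pool(hacks, solves, n, seed=0):
    """Return n teacher demos: exactly n/2 hacks and n/2 clean solves, shuffled.

    With this mix, "comes from the teacher" (off-policy) no longer predicts
    "is a hack", which is the decoupling the journal entry asks for.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for an exact half/half split")
    half = n // 2
    if len(hacks) < half or len(solves) < half:
        raise ValueError("not enough demos of one kind to fill half the pool")
    rng = random.Random(seed)  # seeded so the pool is reproducible across arms
    pool = rng.sample(hacks, half) + rng.sample(solves, half)
    rng.shuffle(pool)
    return pool


def routing_enabled(step, window=15):
    """Gate the intervention to the assumed acquisition window.

    Assumes 1-indexed steps; `window` is a tunable guess, not a measured value.
    """
    return 1 <= step <= window


# Tiny usage example with made-up demos.
hacks = [Demo(i, f"hack-{i}", True) for i in range(20)]
solves = [Demo(i, f"solve-{i}", False) for i in range(20)]
pool = build_balanced_pool(hacks, solves, n=10, seed=43)
hack_fraction = sum(d.is_hack for d in pool) / len(pool)
```

Per the "Subtle" failure mode, a pool like this would only be worth trusting after confirming that the vanilla arm still acquires the hack on it.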