journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-07 22:39:01 +00:00
parent caa0d09472
commit 34ba631e7d
+61
@@ -3202,3 +3202,64 @@ rolling quantile (route top-q of live `cos(g,v_grad)`, threshold tracked across
so batch composition varies naturally) -- decouples the threshold from how wide the
off-distribution pairs are. Deferred: bigger change, not safe to deploy unattended
across a running sweep.
## 2026-06-07 -- DEFERRED IDEA: half-solve teacher pool to kill the off-policy/teacher-forcing confound
**Context:** commit `caa0d09` on `probe/distill-cosine`; motivating run = job 9
(real-V per-token, dir6) `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`.
Not implemented -- parking the design. Driven by a confound the user raised.
**Hypothesis (the confound):** vGROUT routing may key on *off-policy-ness* rather than
*hack-content*. The teacher pool is ~99% hacks, so a teacher demo is simultaneously
"hack" AND "off-policy" (low logp under the student; the lp_s-lp_t gap is real). A
direction that separates teacher from student grads would route the hack for the wrong
reason, and `cin_t > cin_s` cannot tell the two axes apart.
### Observations
- [obs] frout (routed fraction) front-loads then cliffs: ~0.00 at step 1, peak ~0.29 at
steps 2-5, decays through step 10, drops to ~0.006 at step 11 and stays <0.01 to the
end (grep "routed frac" in the job-9 log; 44 logged steps).
- [obs] job 9 deploy (quarantine deleted, test n=119): hack(strict)=0.042 solve=0.143
vs train student hack 0.646 (log tail / `out/runs/.../deploy_test.json`).
- [obs] v_grad is extracted from pairs where BOTH sides are teacher-forced hand-built
completions, so the *extraction* contrast is hack-vs-clean within teacher-forcing
(README "How it works"). Any confound enters at routing time, not extraction.
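A minimal sketch of how the frout cliff in the first observation could be located mechanically, rather than read off by eye. `find_cliff` and the `floor` parameter are hypothetical names, not code from this repo, and the series in the usage lines is a synthetic placeholder shaped like the observation, not values read from the job-9 log.

```python
def find_cliff(frout, floor=0.01):
    """Return the 1-indexed step from which frout stays below `floor`
    through the end of the run, or None if the run does not end below it.

    Hypothetical helper: `frout` is a list of routed fractions, one per
    logged step, in step order.
    """
    cliff = None
    # Walk backwards from the last step; stop at the first value that is
    # not below the floor. Early near-zero steps (e.g. step 1) are ignored
    # because the walk never reaches them once it hits the peak.
    for step in range(len(frout), 0, -1):
        if frout[step - 1] < floor:
            cliff = step
        else:
            break
    return cliff


# Synthetic series: near-zero start, early peak, decay, then a cliff.
synthetic = [0.00, 0.29, 0.20, 0.03, 0.006, 0.004]
cliff_step = find_cliff(synthetic)
```

Running the same helper on the job-8 (per-rollout) frout series would make the Null failure-mode check a direct comparison of two cliff steps.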
### Inferences
- [inf] The frout cliff reflects the hack being learned early, after which no
hack-direction gradient is left to route.
{reason: "user's read: once the policy hacks, within-group GRPO advantage flattens (all
rollouts hack ~equally) so the hack-direction gradient collapses; teacher pool feeds ~1
hack demo/prompt from step 0 so acquisition by ~step 10 is plausible", credence: 0.6}
- [inf] Suppression hinges on a narrow acquisition window (~first 10-15 steps); the band
must be calibrated right at the start (it was: frout 0.29 early). {reason: "routing ~0
after step 11 yet deploy hack is suppressed -> the capture happened in the window",
credence: 0.55}
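The "advantage flattens" step in the first inference can be illustrated with the usual GRPO group normalization, A_i = (r_i - mean(r)) / (std(r) + eps). This is a sketch under the assumption that the run uses that standard form; `group_advantages` is a hypothetical name and the rewards are toy values, not run data.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Within-group normalized advantages for one prompt's rollouts.

    Assumes the standard GRPO form (mean-centered, std-scaled); the
    actual trainer may differ in details such as eps or std handling.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Early in training: some rollouts hack (reward 1), some do not (reward 0),
# so the group carries a non-zero learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# After acquisition: every rollout hacks equally, so every advantage is
# exactly zero and there is no hack-direction gradient left to route.
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])
```

If this is the mechanism, frout near zero after step 11 is expected behaviour rather than a gauge failure, which is what the job-8 comparison is meant to separate.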
**The deferred idea (two parts):**
1. Make the teacher pool half-solve + half-hack. Then "off-policy/teacher-forced" no
longer correlates with "hack" (half the off-policy teacher demos are clean solves), so
the router can only earn cin_t>cin_s by keying on hack-content. Removes the confound
at the source rather than just measuring it.
2. Maybe gate the whole intervention to the first ~15 steps (the acquisition window),
since routing does nothing after the gradient collapses anyway -- cheaper, avoids
late-run side effects.
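A minimal sketch of both parts, assuming teacher demos are available as two separate lists and that steps are 1-indexed as in the log. `build_half_solve_pool`, `routing_enabled`, and the `window` default are hypothetical names and values for illustration; nothing here is implemented in the repo.

```python
import random


def build_half_solve_pool(hack_demos, solve_demos, size, seed=0):
    """Sample an exactly 50/50 hack/solve teacher pool of `size` demos.

    Returns a shuffled list of (demo, label) pairs so that being a
    teacher demo no longer predicts being a hack.
    """
    if size % 2:
        raise ValueError("size must be even for an exact 50/50 split")
    half = size // 2
    if len(hack_demos) < half or len(solve_demos) < half:
        raise ValueError("not enough demos on one side for a 50/50 pool")
    rng = random.Random(seed)
    pool = [(d, "hack") for d in rng.sample(hack_demos, half)]
    pool += [(d, "solve") for d in rng.sample(solve_demos, half)]
    rng.shuffle(pool)
    return pool


def routing_enabled(step, window=15):
    """Gate the intervention to the first `window` steps (1-indexed)."""
    return 1 <= step <= window


# Toy usage with placeholder demo ids.
pool = build_half_solve_pool(
    [f"hack_{i}" for i in range(10)],
    [f"solve_{i}" for i in range(10)],
    size=8,
)
```

Raising on a short side, instead of silently topping up with hacks, keeps the 50/50 property exact; a skewed pool would quietly reintroduce the confound.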
### Failure modes considered
- **Most-likely:** the confound is small because v_grad is a within-teacher-forcing
contrast (both pair sides forced), so it may already encode hack-content rather than
off-policy-ness.
Prior 0.4. Check: vampire control (job 13) -- if it does NOT suppress, hack-specificity
holds and the half-solve pool buys little.
- **Subtle:** a half-solve teacher pool weakens hack emergence (fewer hack demos) so
vanilla no longer ignites -> nothing to suppress -> uninterpretable. Prior 0.3. Check:
confirm vanilla deploy hack >> 0 on the half-solve pool before trusting any arm.
- **Null:** the frout cliff is a per-token gauge artifact (the nan/empty-live fragility),
not real early-then-zero routing, so the "acquisition window" framing is wrong. Prior
0.25. Check: job 8 (per-rollout, reliable gauge) -- does its frout show the same cliff?
### Next action
No queue change now. Read the vampire (job 13) and per-rollout (job 8) frout when they
land; build a half-solve pool only if the confound survives those.