mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -3202,3 +3202,64 @@ rolling quantile (route top-q of live `cos(g,v_grad)`, threshold tracked across
so batch composition varies naturally) -- decouples the threshold from how wide the
off-distribution pairs are. Deferred: bigger change, not safe to deploy unattended
across a running sweep.

## 2026-06-07 -- DEFERRED IDEA: half-solve teacher pool to kill the off-policy/teacher-forcing confound

**Context:** commit `caa0d09` on `probe/distill-cosine`; motivating run = job 9
(real-V per-token, dir6) `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`.
Not implemented -- parking the design. Driven by a confound the user raised.

**Hypothesis (the confound):** vGROUT routing may key on *off-policy-ness* rather than
*hack-content*. The teacher pool is ~99% hacks, so a teacher demo is simultaneously
"hack" AND "off-policy" (low logp under the student; the lp_s-lp_t gap is real). A
direction that separates teacher from student grads would route the hack for the wrong
reason, and `cin_t > cin_s` cannot tell the two axes apart.

### Observations

- [obs] frout (routed fraction) front-loads then cliffs: ~0.00 at step 1, peak ~0.29 at
  steps 2-5, decays through step 10, drops to ~0.006 at step 11 and stays <0.01 to the
  end (grep "routed frac" in the job-9 log; 44 logged steps).
- [obs] job 9 deploy (quarantine deleted, test n=119): hack(strict)=0.042 solve=0.143
  vs train student hack 0.646 (log tail / `out/runs/.../deploy_test.json`).
- [obs] v_grad is extracted from pairs where BOTH sides are teacher-forced hand-built
  completions, so the *extraction* contrast is hack-vs-clean within teacher-forcing
  (README "How it works"). Any confound enters at routing time, not extraction.

### Inferences

- [inf] The frout cliff is the hack being learned early, then no gradient left to route.
  {reason: "user's read: once the policy hacks, within-group GRPO advantage flattens (all
  rollouts hack ~equally) so the hack-direction gradient collapses; teacher pool feeds ~1
  hack demo/prompt from step 0 so acquisition by ~step 10 is plausible", credence: 0.6}
- [inf] Suppression hinges on a narrow acquisition window (~first 10-15 steps); the band
  must be calibrated right at the start (it was: frout 0.29 early). {reason: "routing ~0
  after step 11 yet deploy hack is suppressed -> the capture happened in the window",
  credence: 0.55}

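The first inference leans on within-group advantages flattening once every rollout hacks. A minimal sketch of that arithmetic follows, assuming the common group-normalised form `(r - mean) / (std + eps)`; the repo's exact normalisation is not shown in this entry, and `group_advantages` and the reward values are illustrative names and numbers, not taken from the run.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages: (r - group mean) / (group std + eps).

    Assumed form; the training code may normalise differently.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards only: 1.0 = rollout got the (hacked) reward, 0.0 = it did not.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])      # hack not yet acquired by all rollouts
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])  # every rollout hacks equally

print("mixed group:    ", mixed)      # non-zero advantages, so there is gradient to route
print("saturated group:", saturated)  # all zeros, so there is no gradient left to route
```

Under this assumed form, a group whose rewards are identical has zero advantage for every rollout, which is one way the routed fraction could fall to ~0 without routing itself failing.
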
**The deferred idea (two parts):**
1. Make the teacher pool half-solve + half-hack. Then "off-policy/teacher-forced" no
   longer correlates with "hack" (half the off-policy teacher demos are clean solves), so
   the router can only earn cin_t>cin_s by keying on hack-content. Removes the confound
   at the source rather than just measuring it.
2. Maybe gate the whole intervention to the first ~15 steps (the acquisition window),
   since routing does nothing after the gradient collapses anyway -- cheaper, avoids
   late-run side effects.

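The two parts above could be sketched roughly as below. This is not the repo's code: `build_teacher_pool`, `routing_enabled`, the `(kind, index)` demo tuples, the seed, and 1-based step counting are all hypothetical choices made for illustration.

```python
import random


def build_teacher_pool(hack_demos, solve_demos, n, hack_frac=0.5, seed=0):
    """Sample n teacher demos with a fixed hack fraction (hypothetical helper)."""
    rng = random.Random(seed)
    n_hack = round(n * hack_frac)
    n_solve = n - n_hack
    if n_hack > len(hack_demos) or n_solve > len(solve_demos):
        raise ValueError("not enough demos to reach the requested mix")
    pool = rng.sample(hack_demos, n_hack) + rng.sample(solve_demos, n_solve)
    rng.shuffle(pool)  # avoid all hacks arriving before all solves
    return pool


def routing_enabled(step, window=15):
    """Part 2: route only inside the acquisition window (steps assumed to start at 1)."""
    return 1 <= step <= window


# Placeholder demos: in practice these would be teacher completions.
hacks = [("hack", i) for i in range(20)]
solves = [("solve", i) for i in range(20)]
pool = build_teacher_pool(hacks, solves, n=10)

hack_share = sum(kind == "hack" for kind, _ in pool) / len(pool)
print(f"hack share of teacher pool: {hack_share:.2f}")
print("routing on at steps:", [s for s in range(1, 21) if routing_enabled(s)])
```

Keeping `hack_frac` as a parameter makes it easy to rerun the vanilla arm at several mixes, which matters for the check that hacking still ignites on the diluted pool.
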
### Failure modes considered

- **Most-likely:** the confound is small because v_grad is a within-teacher-forcing
  contrast (both pair sides forced), so it may already encode hack-content not off-policy.
  Prior 0.4. Check: vampire control (job 13) -- if it does NOT suppress, hack-specificity
  holds and the half-solve pool buys little.
- **Subtle:** a half-solve teacher pool weakens hack emergence (fewer hack demos) so
  vanilla no longer ignites -> nothing to suppress -> uninterpretable. Prior 0.3. Check:
  confirm vanilla deploy hack >> 0 on the half-solve pool before trusting any arm.
- **Null:** the frout cliff is a per-token gauge artifact (the nan/empty-live fragility),
  not real early-then-zero routing, so the "acquisition window" framing is wrong. Prior
  0.25. Check: job 8 (per-rollout, reliable gauge) -- does its frout show the same cliff?

### Next action

No queue change now. Read the vampire (job 13) and per-rollout (job 8) frout when they
land; build a half-solve pool only if the confound survives those.