mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
97aede8d9c
Three fixes after the deploy-solve=0 run (all user-confirmed): 1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5). Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted). 2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool- covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count, no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step). No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0 stays the on/off switch. Removed dead _even_split. 3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the behavior policy = the shipped deployed-only model -- the quarantine's learned hack can't saturate the rollout distribution and starve honest solve advantage. For clean-gated rollouts gen and train forward now match. Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget). AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis confidence across turns. Smoke + smoke-solvemix green; all verify gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>