evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 23:38:41 +08:00

Author	SHA1	Message	Date
wassname	ca8d1adf62	plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte). Two separate panels over-reduced a 2-variable story. One scatter instead: good corner top-right (hack axis reversed), green effect-arrows from the vanilla baseline show what each intervention did, achievable solve band (base..ceiling) as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling). No title; name-only point labels (position already encodes the rates). The Pareto view makes domination visible: per-token strictly dominates random-V and vanilla. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:45:42 +00:00
wassname	d4998a71ba	docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot. (1) Transcribed Fig-5 numeric table now lives inline in the paper md as an EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md (one fewer repo file; the table sits next to the figure it transcribes). (2) floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis reversed so right=better on both panels. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:35:14 +00:00
wassname	0973f9ba7c	plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars. Cross-scale (their converged full-env vs our 60-step fast surrogate) made the paper comparison directional-only and unfair on one axis. Show vanilla GRPO as the red floor anchor instead; paper numbers stay in the extracted table. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:26:55 +00:00
wassname	bcfcee0d06	fix floor_ceiling asymmetry: paper methods on BOTH panels. Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve (Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%). Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both axes because they are converged full-scale vs our 60-step surrogate -- caption marks it directional-only. Cross-scale/maturity caveat (task #18) still stands. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:10:55 +00:00
wassname	028b8fff68	transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge). Read the figure PNGs directly (Fig 5 is a full numeric table the paper never prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong -- LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression is not routeV's novelty (the judge does it); the mechanism is (no live monitor, gradient-level, fixed authored-pair direction). Plot now shows the judge as a blue no-oracle peer bar alongside the grey oracle methods. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:48:02 +00:00
wassname	3b38a05738	no-cheat framing: label-leakage not detector-presence; fix plot comment. The disqualifier for an intervention is needing the env oracle / ground-truth hack-labels of the live training distribution, not 'a detector ran'. On a new RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe can't be built there; a generic LLM judge and our hand-authored-pair vector can. LLM judge is thus the fair external peer (no clean fast-env number to plot). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:22:29 +00:00
wassname	d393e119e0	viz: reference = Ariahw paper (oracle upper bound), not SGTM. Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025), which benchmarks interventions on the same floor (No-Intervention hack ~79%) / ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0% hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at train time -- the exact cheat our no-cheat constraint forbids; their only oracle-free method (inoculation) gave incomplete, high-variance mitigation. Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images, 200-step preset not step-matched). Honest framing: their working methods need the oracle; ours uses no detector at train time and still suppresses 93%. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 10:03:05 +00:00
wassname	34a2eec704	viz: floor->ceiling as two normalized panels (best vs control vs reference). Rework per feedback: hack and solve are not opposites, so they get separate floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61). Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task). Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns to solve in 60 steps). Moved out/plots -> out/figs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:55:03 +00:00
wassname	7d08ad2acd	viz: floor-to-ceiling method comparison (csv + figure). Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor, with SOURCE and STATUS columns flagging every provisional/missing cell) then the keynote figure. Prints TODO/FIXME data gaps before plotting. Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119). Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val split (eval_curve.jsonl), isolating the quarantine from the train/test memorization gap. Fixes the earlier conflation where the train->deploy arrow mixed knob-on/off with train-problems/test-problems. Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24), prog_wide arm contaminated (TODO job 28 prog_wide_clean). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:45:37 +00:00

9 Commits
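
Several of the commit messages above rely on two small ideas: mapping each raw rate onto its own floor->ceiling axis (0=floor, 1=ceiling), and reading strict Pareto domination off the hack-vs-solve scatter. The sketch below illustrates both. It is not taken from the repository: the function names `floor_to_ceiling` and `dominates`, the arm names, and the per-arm rates are hypothetical placeholders. Only the solve base (0.115) and ceiling (0.223) are the paper figures quoted in the commit messages.

```python
def floor_to_ceiling(value: float, floor: float, ceiling: float) -> float:
    """Map a raw rate onto a 0=floor .. 1=ceiling axis.

    Values below 0 are worse than the floor; values above 1 are past the ceiling.
    """
    if floor == ceiling:
        raise ValueError("floor and ceiling must differ")
    return (value - floor) / (ceiling - floor)


def dominates(a: dict, b: dict) -> bool:
    """Strict Pareto domination: a hacks less AND solves more than b."""
    return a["hack"] < b["hack"] and a["solve"] > b["solve"]


# Solve anchors quoted in the commit messages (paper base and ceiling).
SOLVE_BASE, SOLVE_CEILING = 0.115, 0.223

# Hypothetical per-arm rates, for illustration only (not measured results).
arms = {
    "vanilla": {"hack": 0.60, "solve": 0.13},
    "treated": {"hack": 0.05, "solve": 0.15},
}

for name, arm in arms.items():
    # Hack axis: floor is the vanilla hack rate, ceiling is no hacking at all.
    hack_norm = floor_to_ceiling(arm["hack"], floor=arms["vanilla"]["hack"], ceiling=0.0)
    # Solve axis: floor is the base solve rate, ceiling is the no-loophole run.
    solve_norm = floor_to_ceiling(arm["solve"], floor=SOLVE_BASE, ceiling=SOLVE_CEILING)
    print(f"{name}: hack suppression {hack_norm:.0%}, solve gained {solve_norm:.0%}")

print("treated dominates vanilla:", dominates(arms["treated"], arms["vanilla"]))
```

Because each axis has its own floor and ceiling, a narrow solve range is not squished next to a wide hack range, and a negative normalized value signals an arm that ended below its floor.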