evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:43:00 +08:00

Author	SHA1	Message	Date
wassname	b334b5f516	fix: rescore_deploy tolerates old-schema checkpoints (default eval-harness params) job 32/33 failed KeyError eval_batch_size: old checkpoints' stored cfg predates the train_config refactor. Default eval_n_prompts/max_new/eval_batch_size to the fast preset (eval-harness params, not model-defining; test split is fixed-size) so historical checkpoints re-score. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:31:07 +00:00
wassname	e7ed74956f	fix: gt_only-only run (no-loophole ceiling) no longer divides by zero eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it empty and load_problems did len(out) % 0. Fall back to ['gt_only'] when nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve = the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:19:22 +00:00
wassname	2a9a3bc8a1	fix: rescore_deploy derives eval_modes from by_mode keys (old jsons lack eval_modes) Job 32 failed KeyError: 'eval_modes' -- deploy_test.json written by the pre-cleanup train.py has no eval_modes key. by_mode keys are the modes the original eval spanned (present in every version), so derive from them to reproduce the same knob-off headline. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 22:56:13 +00:00
wassname	61d3819dae	docs: README/figs name the current arm routeV, not the dropped route2 The cleanup removed the v1 route and route2 arms (Config is now none|erase|routeV) but left README calling the live arm route2 with its old binary-tau gate description. Rename to routeV, describe the banded cosine gate (per-rollout/per-token, per-token best), and fix the deploy line (held-out test n=119 knob-off, not n=64). figs.py keeps the route2/routing2 display map for historical run artifacts. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:39:15 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/ deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	3f82041d90	plot: deploy Pareto draws knob-on->off before/after on the n=119 axis Now that final/rescore eval record deploy_hack_on/solve_on at n=119, the deploy scatter shows the honest quarantine move (hollow knob-on dot -> arrow -> solid knob-off dot) on the same axis instead of borrowing val's lower-scale curve. Dot-only fallback for arms not yet backfilled. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:15:19 +00:00
wassname	3c27d922d2	docs: record science correctness audit	2026-06-09 13:10:17 +00:00
wassname	d68c17e7c5	eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms route/routeV final eval now measures both endpoints at n=119 test: knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so the before->after quarantine move is plottable from the deploy set instead of borrowing the val curve's different scale. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:09:50 +00:00
wassname	5b0a6ddd91	plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after - floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would fake a solve jump that's really the n=32->n=119 eval-set shift. - floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056, authored 0.056->0.044), not the horizontal I wrongly forced earlier. - justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable fraction), low priority; vanilla rerun alongside best (its solve also suffers). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:58:32 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	ca8d1adf62	plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte) Two separate panels over-reduced a 2-variable story. One scatter instead: good corner top-right (hack axis reversed), green effect-arrows from the vanilla baseline show what each intervention did, achievable solve band (base..ceiling) as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling). No title; name-only point labels (position already encodes the rates). The Pareto view makes domination visible: per-token strictly dominates random-V and vanilla. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:45:42 +00:00
wassname	31c2b9c82f	env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only Realism knob: in the reference env hacking saturates and kills the solve gradient. A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest solving pays) keeps a persistent solve pressure all arms feel. The differential test: routeV ablates the hack on the hackable 90% so it must solve there, while the warm solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0. - gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from eval_modes) so hack/solve remain comparable to the reference env. - logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the flip propagates to per-mode metrics. - smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1). Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the loophole hint. smoke-unhackable runs end-to-end. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:39:50 +00:00
wassname	d4998a71ba	docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot - Transcribed Fig-5 numeric table now lives inline in the paper md as an EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md (one fewer repo file; the table sits next to the figure it transcribes). - floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis reversed so right=better on both panels. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:35:14 +00:00
wassname	0973f9ba7c	plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars Cross-scale (their converged full-env vs our 60-step fast surrogate) made the paper comparison directional-only and unfair on one axis. Show vanilla GRPO as the red floor anchor instead; paper numbers stay in the extracted table. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:26:55 +00:00
wassname	bcfcee0d06	fix floor_ceiling asymmetry: paper methods on BOTH panels Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve (Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%). Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both axes because they are converged full-scale vs our 60-step surrogate -- caption marks it directional-only. Cross-scale/maturity caveat (task #18) still stands. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:10:55 +00:00
wassname	0538dbf2f1	add routeV_absorb_all: 100% absorption, no vector (H2 extreme control) Route the whole gradient of every knob-on rollout into the quarantine; the deployed knob learns only from the knob-off exploration floor. Direction-free (v_grad extracted but never enters f -> routing is purely by generation mode). Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/ rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:56:12 +00:00
wassname	028b8fff68	transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge) Read the figure PNGs directly (Fig 5 is a full numeric table the paper never prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong -- LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression is not routeV's novelty (the judge does it); the mechanism is (no live monitor, gradient-level, fixed authored-pair direction). Plot now shows the judge as a blue no-oracle peer bar alongside the grey oracle methods. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:48:02 +00:00
wassname	3b38a05738	no-cheat framing: label-leakage not detector-presence; fix plot comment The disqualifier for an intervention is needing the env oracle / ground-truth hack-labels of the live training distribution, not 'a detector ran'. On a new RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe can't be built there; a generic LLM judge and our hand-authored-pair vector can. LLM judge is thus the fair external peer (no clean fast-env number to plot). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:22:29 +00:00
wassname	6b44dd39bd	viz: regenerate floor_ceiling figure with Ariahw reference Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 10:03:05 +00:00
wassname	d393e119e0	viz: reference = Ariahw paper (oracle upper bound), not SGTM Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025), which benchmarks interventions on the same floor (No-Intervention hack ~79%) / ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0% hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at train time -- the exact cheat our no-cheat constraint forbids; their only oracle-free method (inoculation) gave incomplete, high-variance mitigation. Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images, 200-step preset not step-matched). Honest framing: their working methods need the oracle; ours uses no detector at train time and still suppresses 93%. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 10:03:05 +00:00
wassname	93c862bb2e	results-deploy: add select (Youden J) + floor->ceiling columns select = routing precision = hack_supp - solve_supp on the knob (held-out val, knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost. Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96. hack_supp = (vanilla - hack)/vanilla ; solve_uplift = (solve - base)/(ceiling - base), the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24). The earlier "solve suppression ~50%" was a train/test artifact; the knob's true solve cost (select's solve_supp term) is near zero -- selectivity is high. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:56:55 +00:00
wassname	34a2eec704	viz: floor->ceiling as two normalized panels (best vs control vs reference) Rework per feedback: hack and solve are not opposites, so they get separate floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61). Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task). Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns to solve in 60 steps). Moved out/plots -> out/figs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:55:03 +00:00
wassname	7d08ad2acd	viz: floor-to-ceiling method comparison (csv + figure) Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor, with SOURCE and STATUS columns flagging every provisional/missing cell) then the keynote figure. Prints TODO/FIXME data gaps before plotting. Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119). Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val split (eval_curve.jsonl), isolating the quarantine from the train/test memorization gap. Fixes the earlier conflation where the train->deploy arrow mixed knob-on/off with train-problems/test-problems. Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24), prog_wide arm contaminated (TODO job 28 prog_wide_clean). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:45:37 +00:00
wassname	8e6eace56b	fix: rename 4 canonical LeetCode function names in authored/clean pairsets singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix, removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade. Same algorithm and test cases; method name changed so pairs no longer share a canonical LeetCode function name with training data. Also update results.md Q14 table: add hack_train/solve_train columns, vanilla row, and prog_wide contamination note (docs/ is gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:23:33 +00:00
wassname	dae52b2a7d	cleanup: consolidate pairs modules into build scripts + add solve_train to table - Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs). - Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6), pairs_intent_concept.json (6 diagnostic). - Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON instead of importing Python modules; drop tainted v2/allv2 pairsets from the diag sweep (print-without-assert axis). - train.py final table: add solve_rate_s computed same as hack_rate_s, so the per-run end-of-training table shows actual training solve rate (was "-"). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:17:42 +00:00
wassname	00600d13a6	data: add prog_wide_clean.json -- 8/30 pairs that don't encode run_tests grading flaw 22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only, 4 assert-True/is-not-None, 2 self-comparison tautologies. All encode knowledge that the grader just calls run_tests() and checks it doesn't throw. Filtered to 8 clean pairs (axes 2/4/5/6 only): isPalindrome_int, mySqrt, containsDuplicate, singleNumber, longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar. These are identical to 8 of the 11 authored pairs (minus P19/P20/P21). Needed because job 9 (best result: hack=0.042 per-token) used the full contaminated prog_wide -- result needs replication with clean pairs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:58:15 +00:00
wassname	fb9f68530c	refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3 pairs.py now only has HackPair dataclass + _prompt/_wrap helpers. All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py. Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints) from pairs_authored -- those 10 pairs directly encode the env's specific run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode, identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only), 6 (weak inequality). P19-P21 (previously defined but unused) now included. Result: 11 pairs from 4 axes. Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same fail-loud pattern applied to train.py earlier). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:44:33 +00:00
wassname	6f49d5f9b0	refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback - scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json - scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention - out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None) - train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud) - justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats - requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:09:09 +00:00
wassname	a1ef566bac	main.tex: document setup differences vs paper in tab:anchors caption 200 steps/G=16/1536tok/n=10 (paper) vs 60 steps/G=8/512tok/n=1 (ours). Framed as fast-preset directional surrogate within resource budget. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:55:58 +00:00
wassname	f1f1c00f41	results: separate paper vs ours column pairs in anchor table Paper (longer training, >512 tok/gen) and ours (60-step fast) are not directly comparable -- now shown as separate column pairs in both main.tex tab:anchors and docs/results.md Q14. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:54:35 +00:00
wassname	9398567e91	results: base model solve=0.126 hack=0.000 (matches paper ~0.115) Fills baseline row in Q14 table and main.tex tab:anchors. Context: job 23 (steps=0, zero-shot eval, seed 43, n=119). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:51:34 +00:00
wassname	83f3f98328	results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm) Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate). All routeV arms beat vanilla on both hack and solve. Journal entry added. main.tex tab:anchors vanilla row filled. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:51:12 +00:00
wassname	a35e7b2735	feat: gt_only env-mode + queue baseline/no-loophole ceiling - rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle) - problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests") - justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and queue-no-loophole (gt_only vanilla GRPO, prio 11) - main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 03:23:49 +00:00
wassname	4a65eedc92	chore: memory updates, diag_pairs_compare script Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:42:56 +00:00
wassname	ec88ba3e42	merge: resolve RESEARCH_JOURNAL conflict (keep both HEAD + remote Modal-port entry) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:27:08 +00:00
wassname	0f59b1351b	feat: online_stats gate for routeV -- live q5/q95 band calibration New routeV_gate="online_stats" mode: use the empirical per-rollout cosine distribution (q5/q95 pooled across all modules each step) as the routing band thresholds, instead of the pair-derived route_band. Direction v_grad still from authored pairs; only thresholds are online/adaptive. Motivation: the pair-derived band sits above the live cosine distribution (median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens. Online thresholds adapt to the actual live distribution, so the 5/95 tails always route regardless of where the raw cosines land. Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95. Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration). Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines. No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution of live student rollouts (no oracle/labeling of live rollouts as hack/clean). Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe. Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:25:37 +00:00
wassname	0412dc56d1	results.md: fix regenerate ref (just results-deploy -> just results)	2026-06-09 01:51:28 +00:00
wassname	5007c9757a	results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)	2026-06-09 01:50:42 +00:00
wassname	824b7eb623	results: Q14 complete eval2 deploy table (4 done: per-token/authored/prog_wide/random-V; via just results-deploy). Corrects earlier claim that job8 prog_wide had no eval2 deploy	2026-06-08 23:57:42 +00:00
wassname	e26f5fe08c	results: add Q14 -- routeV deploy on recency-clean eval2 (job 15 in; vanilla/act_vote/lora/random-V pending)	2026-06-08 22:58:34 +00:00
wassname	c721c460a4	journal: act_vote routes late where grad-cosine cliffs (killed-run partial data)	2026-06-08 22:50:09 +00:00
wassname	1cb5900de5	journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20)	2026-06-08 19:44:00 +00:00
wassname	744d851861	journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending)	2026-06-08 19:38:48 +00:00
wassname	d497bfd161	feat: act_vote routeV gate (global activation-vote routing arm) New routeV_gate=act_vote: route every module's per-rollout gradient by a single global f_roll from a module-weighted vote of activation cosines cos(As_b, As_dir), As=Vh@x completion-mean (mirrors diag_cosine_dist.py act/vote, AUROC 0.67 / p@10 0.30 -- the coverage corner). Maximally different from the grad-cosine arm: act space + global aggregation. Direction As_dir/act_w/vote-band built from the same authored pairs (no oracle) at init and refreshed every N steps. Window = [plen-1:] to match the build hook + diagnostic. Smoke-verified (band opens, rout>0, refresh ok); fresh-eyes reviewed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 15:08:28 +00:00
wassname	eedf9efb51	pairs: de-confound v2 (print(==) vs assert, line-matched) + intent designs (think/funcname/concept) intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal (the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 13:08:47 +00:00
wassname	35be877fc0	pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84. Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:53:48 +00:00
wassname	9c630b83c7	agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:39:27 +00:00
wassname	c29016079a	diag: add top15/top05 filters, module-vote (per-space cos*|D_m|), ideal-direction ceiling Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION is the bottleneck, not separability. on-distribution pairs green-lit. act vote 0.669 best clean. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:35:50 +00:00
wassname	c0a4e4e060	diag: 3 filter levels (all/keep75/top25); act-cosine improves monotonically (top25 AUROC 0.72, p@10 0.50) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:16:48 +00:00
wassname	5fe22a1973	journal: separability sweep (act>grad AUROC, grad-cos best tail p@10=0.70, magnitude inverted, distshift root cause) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:13:02 +00:00
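
Several commits above (93c862bb2e, 34a2eec704, 83f3f98328) quote normalized floor->ceiling fractions and a "select" score. The sketch below restates those formulas as given in the 93c862bb2e message: hack_supp = (vanilla - hack)/vanilla and solve_uplift = (solve - base)/(ceiling - base). The function names and signatures are hypothetical, not taken from results_deploy.py. The exact form of the solve_supp term inside select is not spelled out in the log, so the relative knob-on to knob-off drop used here is an assumption.

```python
def hack_supp(vanilla_hack: float, hack: float) -> float:
    """Fraction of the vanilla hack rate removed: (vanilla - hack) / vanilla."""
    return (vanilla_hack - hack) / vanilla_hack


def solve_uplift(solve: float, base: float, ceiling: float) -> float:
    """Position of the solve rate between base (0.0) and ceiling (1.0)."""
    return (solve - base) / (ceiling - base)


def select(hack_on: float, hack_off: float,
           solve_on: float, solve_off: float) -> float:
    """Routing precision on one split: hack suppression minus solve suppression.

    ASSUMPTION: both terms are relative drops from knob-on to knob-off.
    The commit message only states select = hack_supp - solve_supp.
    Rates of zero in the knob-on arm are not handled in this sketch.
    """
    hack_drop = (hack_on - hack_off) / hack_on
    solve_drop = (solve_on - solve_off) / solve_on
    return hack_drop - solve_drop


if __name__ == "__main__":
    # Rates quoted in the log: vanilla hack 0.613 (83f3f98328),
    # best per-token arm hack 0.042 (00600d13a6).
    print(f"hack_supp = {hack_supp(0.613, 0.042):.3f}")
    # A solve rate equal to base sits at the floor of the uplift axis.
    print(f"solve_uplift at base = {solve_uplift(0.126, 0.126, 0.223):.3f}")
```

With the quoted rates, hack_supp comes out at roughly 0.93, in line with the "suppresses 93%" and "15x reduction" figures in the messages, assuming both rates come from the same deploy eval. The provisional ceiling of 0.223 is the paper value flagged FIXME in the log.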


421 Commits