evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:15:58 +08:00

Author	SHA1	Message	Date
wassname	c3af6cc03c	rename: deployed/as_trained policy views, kill 'knob' (schema paired_final_v2) Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent: 'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each: - policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split deploy_hack (JSON) vs hack_deploy (table key) into one name. - 'knob' -> 'quarantine'/'adapter' throughout comments and log strings. - train/test reserved for the DATA split only. Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded) and the missed eval_curve writer (eval_checkpoint_curve.py) + readers (results_deploy.py). Smoke green: v2 written + read; gates pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 05:26:51 +00:00
wassname	51c5a757ef	docs: make active-path comments concise	2026-06-10 05:19:52 +00:00
wassname	0d6ff754ec	docs: AGENTS.md START HERE links (human_journal, main.tex, grad-routing paper); revert rescore fallback - Point future agents at the three docs that pin the actual thesis + the live open question (direction vs routing vs SVD/PiSSA prior), so they don't re-derive the non-directional result as a 'bug'. - Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema checkpoints crash loudly rather than silently defaulting). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:34:06 +00:00
wassname	b334b5f516	fix: rescore_deploy tolerates old-schema checkpoints (default eval-harness params) job 32/33 failed KeyError eval_batch_size: old checkpoints' stored cfg predates the train_config refactor. Default eval_n_prompts/max_new/ eval_batch_size to the fast preset (eval-harness params, not model-defining; test split is fixed-size) so historical checkpoints re-score. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-10 03:31:07 +00:00
wassname	2a9a3bc8a1	fix: rescore_deploy derives eval_modes from by_mode keys (old jsons lack eval_modes) Job 32 failed KeyError: 'eval_modes' -- deploy_test.json written by the pre-cleanup train.py has no eval_modes key. by_mode keys are the modes the original eval spanned (present in every version), so derive from them to reproduce the same knob-off headline. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 22:56:13 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/ deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	3f82041d90	plot: deploy Pareto draws knob-on->off before/after on the n=119 axis Now that final/rescore eval record deploy_hack_on/solve_on at n=119, the deploy scatter shows the honest quarantine move (hollow knob-on dot -> arrow -> solid knob-off dot) on the same axis instead of borrowing val's lower-scale curve. Dot-only fallback for arms not yet backfilled. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:15:19 +00:00
wassname	d68c17e7c5	eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms route/routeV final eval now measures both endpoints at n=119 test: knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so the before->after quarantine move is plottable from the deploy set instead of borrowing the val curve's different scale. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:09:50 +00:00
wassname	5b0a6ddd91	plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after - floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would fake a solve jump that's really the n=32->n=119 eval-set shift. - floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056, authored 0.056->0.044), not the horizontal I wrongly forced earlier. - justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable fraction), low priority; vanilla rerun alongside best (its solve also suffers). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:58:32 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	ca8d1adf62	plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte) Two separate panels over-reduced a 2-variable story. One scatter instead: good corner top-right (hack axis reversed), green effect-arrows from the vanilla baseline show what each intervention did, achievable solve band (base..ceiling) as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling). No title; name-only point labels (position already encodes the rates). The Pareto view makes domination visible: per-token strictly dominates random-V and vanilla. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:45:42 +00:00
wassname	d4998a71ba	docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot - Transcribed Fig-5 numeric table now lives inline in the paper md as an EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md (one fewer repo file; the table sits next to the figure it transcribes). - floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis reversed so right=better on both panels. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:35:14 +00:00
wassname	0973f9ba7c	plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars Cross-scale (their converged full-env vs our 60-step fast surrogate) made the paper comparison directional-only and unfair on one axis. Show vanilla GRPO as the red floor anchor instead; paper numbers stay in the extracted table. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:26:55 +00:00
wassname	bcfcee0d06	fix floor_ceiling asymmetry: paper methods on BOTH panels Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve (Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%). Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both axes because they are converged full-scale vs our 60-step surrogate -- caption marks it directional-only. Cross-scale/maturity caveat (task #18) still stands. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:10:55 +00:00
wassname	028b8fff68	transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge) Read the figure PNGs directly (Fig 5 is a full numeric table the paper never prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong -- LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression is not routeV's novelty (the judge does it); the mechanism is (no live monitor, gradient-level, fixed authored-pair direction). Plot now shows the judge as a blue no-oracle peer bar alongside the grey oracle methods. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:48:02 +00:00
wassname	3b38a05738	no-cheat framing: label-leakage not detector-presence; fix plot comment The disqualifier for an intervention is needing the env oracle / ground-truth hack-labels of the live training distribution, not 'a detector ran'. On a new RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe can't be built there; a generic LLM judge and our hand-authored-pair vector can. LLM judge is thus the fair external peer (no clean fast-env number to plot). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:22:29 +00:00
wassname	d393e119e0	viz: reference = Ariahw paper (oracle upper bound), not SGTM Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025), which benchmarks interventions on the same floor (No-Intervention hack ~79%) / ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0% hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at train time -- the exact cheat our no-cheat constraint forbids; their only oracle-free method (inoculation) gave incomplete, high-variance mitigation. Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images, 200-step preset not step-matched). Honest framing: their working methods need the oracle; ours uses no detector at train time and still suppresses 93%. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 10:03:05 +00:00
wassname	93c862bb2e	results-deploy: add select (Youden J) + floor->ceiling columns select = routing precision = hack_supp - solve_supp on the knob (held-out val, knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost. Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96. hack_supp = (vanilla - hack)/vanilla ; solve_uplift = (solve - base)/(ceiling - base), the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24). The earlier "solve suppression ~50%" was a train/test artifact; the knob's true solve cost (select's solve_supp term) is near zero -- selectivity is high. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:56:55 +00:00
wassname	34a2eec704	viz: floor->ceiling as two normalized panels (best vs control vs reference) Rework per feedback: hack and solve are not opposites, so they get separate floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61). Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task). Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns to solve in 60 steps). Moved out/plots -> out/figs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:55:03 +00:00
wassname	7d08ad2acd	viz: floor-to-ceiling method comparison (csv + figure) Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor, with SOURCE and STATUS columns flagging every provisional/missing cell) then the keynote figure. Prints TODO/FIXME data gaps before plotting. Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119). Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val split (eval_curve.jsonl), isolating the quarantine from the train/test memorization gap. Fixes the earlier conflation where the train->deploy arrow mixed knob-on/off with train-problems/test-problems. Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24), prog_wide arm contaminated (TODO job 28 prog_wide_clean). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:45:37 +00:00
wassname	8e6eace56b	fix: rename 4 canonical LeetCode function names in authored/clean pairsets singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix, removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade. Same algorithm and test cases; method name changed so pairs no longer share a canonical LeetCode function name with training data. Also update results.md Q14 table: add hack_train/solve_train columns, vanilla row, and prog_wide contamination note (docs/ is gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:23:33 +00:00
wassname	dae52b2a7d	cleanup: consolidate pairs modules into build scripts + add solve_train to table - Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs). - Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6), pairs_intent_concept.json (6 diagnostic). - Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON instead of importing Python modules; drop tainted v2/allv2 pairsets from the diag sweep (print-without-assert axis). - train.py final table: add solve_rate_s computed same as hack_rate_s, so the per-run end-of-training table shows actual training solve rate (was "-"). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:17:42 +00:00
wassname	fb9f68530c	refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3 pairs.py now only has HackPair dataclass + _prompt/_wrap helpers. All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py. Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints) from pairs_authored -- those 10 pairs directly encode the env's specific run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode, identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only), 6 (weak inequality). P19-P21 (previously defined but unused) now included. Result: 11 pairs from 4 axes. Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same fail-loud pattern applied to train.py earlier). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:44:33 +00:00
wassname	6f49d5f9b0	refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback - scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json - scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention - out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None) - train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud) - justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats - requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 08:09:09 +00:00
wassname	4a65eedc92	chore: memory updates, diag_pairs_compare script Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:42:56 +00:00
wassname	0f59b1351b	feat: online_stats gate for routeV -- live q5/q95 band calibration New routeV_gate="online_stats" mode: use the empirical per-rollout cosine distribution (q5/q95 pooled across all modules each step) as the routing band thresholds, instead of the pair-derived route_band. Direction v_grad still from authored pairs; only thresholds are online/adaptive. Motivation: the pair-derived band sits above the live cosine distribution (median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens. Online thresholds adapt to the actual live distribution, so the 5/95 tails always route regardless of where the raw cosines land. Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95. Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration). Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines. No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution of live student rollouts (no oracle/labeling of live rollouts as hack/clean). Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe. Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:25:37 +00:00
wassname	5007c9757a	results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)	2026-06-09 01:50:42 +00:00
wassname	eedf9efb51	pairs: de-confound v2 (print(==) vs assert, line-matched) + intent designs (think/funcname/concept) intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal (the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 13:08:47 +00:00
wassname	35be877fc0	pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84. Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:53:48 +00:00
wassname	c29016079a	diag: add top15/top05 filters, module-vote (per-space cos*|D_m|), ideal-direction ceiling Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION is the bottleneck, not separability. on-distribution pairs green-lit. act vote 0.669 best clean. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:35:50 +00:00
wassname	c0a4e4e060	diag: 3 filter levels (all/keep75/top25); act-cosine improves monotonically (top25 AUROC 0.72, p@10 0.50) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:16:48 +00:00
wassname	80e82f0b29	diag: pinning separability sweep (grad/act x cos/proj/mag x filter), AUROC+p@k, notebook Finding: v_grad/As barely separate LIVE hack from clean (authored pairs are off-distribution: localized run_tests-block contrast vs full novel-problem rollouts). act-cosine best AUROC 0.69; grad-cosine best confident-tail p@10 0.70; magnitude inverted. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 11:11:55 +00:00
wassname	b28b1a5e88	results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL line that just-results skips. Journal: per-rollout real==random (absorption), per-token real-V is the lead; pinning suspected off (band above live cos). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 10:47:38 +00:00
wassname	3200771042	fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	7da54f1967	eval+env: single-mode run_tests, held-out val/test eval, both hack metrics - revert env to single-mode run_tests (paper-comparable): FastConfig teacher pool = run_tests-only (no partition.json); + `just build-runtests-pool` - held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file), final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert - report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted (hacked_loophole_used) -- external review 2026-06-07 - consolidate to one canonical eval_hack_solve (.eval); delete the train.py duplicate that silently lacked the token gap (in-run eval != rescore bug) - routeV band edges mean -> min/max (conservative degrade-to-absorb) - scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test - modal/app.py: read deploy_test.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 03:07:35 +00:00
wassname	b8efd42d2f	eval: train/test token gap for all 4 modes (lenient disjoint families) Ariahw's run_tests fn-name memorization control (paper line 84: train fixes "run_tests", eval randomizes over 10 held-out names), extended to our 3 added modes (stdout_marker/sentinel/file_marker). At train the side-channel detector accepts a whole family + a fixed hint token (hack stays RL-discoverable); at eval we disclose a NOVEL token and grade against a DISJOINT family equally lenient as train -- so a memorized train token scores 0 and only producing the disclosed mechanism counts, with no train-lenient/eval-exact strictness shift (the v1 confound the external panel flagged). run_tests stays exact-both (matches the paper). Eval-only; training path byte-identical (families default None). scripts/verify_eval_gap.py proves per mode + per eval token: disjointness (train detector must NOT fire on eval token), the gap matrix, and no mutation of canonical prompts. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	dcd1b18303	eval: train/test token gap for all 4 modes (paper memorization control) Ariahw randomizes the run_tests function name at eval (line 84) so a high hack rate reflects responding to the disclosed mechanism, not memorizing a string. Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at eval, disclose a NOVEL per-mode token and tighten the detector to require exactly it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/ eval_result_file; defaults None = unchanged family path at train). run_tests uses the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE train hint (never touches problem text), seeded per problem so the curve stays smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves, per mode: train-token hack scores under family, NOT under eval-token; novel-token hack scores under eval-token. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
wassname	a3a3f09824	retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner placebo. Semantic placebos are in-subspace and share generic structure, so a nonzero cos with hack is the expected floor, not 'they found the hack'. null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine is correlational; the ablation run is the causal test. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:21:41 +00:00
wassname	e5295dc07b	feat: route2 Haar-random v_grad directionality control (H2 vs H4) + semantic placebo fleet The null_city placebo is CONTAMINATED: 20% of its modules align with the hack direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the 'route2 is non-directional' verdict rested on a bad control. Add the clean tests: - route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed. 'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is itself H4-confirming (a zero-alignment direction may never clear tau). - null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.), each an arbitrary direction with different accidental hack-alignment. Maps route2's suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress. Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 08:43:54 +00:00
wassname	55937a86fb	rename python package projected_grpo -> vgrout git mv src/projected_grpo -> src/vgrout and find-replace the module name in all imports (.py), `-m projected_grpo.` invocations (justfile), and the [project] name (pyproject; setuptools auto-discovers via where=["src"]). Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes tied to past commits, so rewriting them would falsify provenance. Repo dir, git remote, and absolute paths unchanged. Verified: `import vgrout` and `python -m vgrout.train --help` load the full graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass. Full `just smoke` is blocked upstream by missing gitignored data artifacts (out/pools/{substrate,teacher_pool}, out/vhack/smoke*), unrelated to the rename.	2026-06-05 14:51:48 +08:00
wassname	562832acec	test: no-cheat partition + teacher-pool composition gate (verify_partition.py) The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts on the real out/pools/substrate/partition.json: (1) partition is a clean function into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests} the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke. 6/6 pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:36:03 +00:00
wassname	5242f66b7e	figs: a5 dedup title->axis arrow + CSV, overlay onset dot->labeled vline - a5: drop per-panel title (restated the axis); fold direction into the xlabel (DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv (per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked. - overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single dashed labeled 'first hack' vertical line, matching the small-multiples/longrun. - (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:13:37 +00:00
wassname	8daf58d25e	figs: a5 vanilla->route arrows, equiv0->approx0, skip degenerate train_deploy, prune orphans - a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop and the stdout solve-cost honestly). - equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0. - plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast); fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data). - Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand); keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 04:08:58 +00:00
wassname	34ad20db0a	fix route2 no-cheat leak: teacher-only gate anchor + unit test The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540, file_marker 2/1337), force-routing those rollouts -- a real label leak into the held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows only, so held-out classes get PROVABLY zero detector labels (airtight A5 control). Extracted the inline anchor loop to build_route2_anchors() and added scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces the leak (held-out FP student force-routed) and teacher_only removes it (zero student routing, teachers unchanged). 9/9 assertions pass. Rescoring can't fix this -- the leak is in training (gate shaped the weights), not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the A5 run saved no per-eval checkpoints anyway. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:53:23 +00:00
wassname	a9523c9cb8	fix overlay label collisions: common right-gutter anchor + leaders End-labels sat on the line termini (2-arm figs) and piled up bottom-left on ragged-length multi-arm overlays (substrate, where arms end at different steps). Now all labels anchor at one gutter x with a leader fanning back to each line's actual end, y-de-collided. Added right margin so the gutter is clear. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:31:26 +00:00
wassname	504922a3d6	fix collision: lift 'deploy hack =0' off the y=0 line in train_vs_deploy The solid-red deploy line ran straight through the annotation text (tufte collision test). Move it into the empty band above the flat line (axes y=0.12). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:25:49 +00:00
wassname	b616970e42	fix plot integrity: drop n=28 hack_s fallback in train-vs-deploy series A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval. Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if absent -> seed drops from the mean), and missing eval columns normalise to nan so absent==all-nan. Regenerated all figures from logs. The canonical train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show train==deploy honestly (no knob-on data to split). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:21:48 +00:00
wassname	3da296469b	plot_deploy_overlay: Cleveland dot plot replaces grouped bars (tufte) y=mode, dot per arm, thin connector per mode so vanilla->route change reads as a line segment. Faint x-grid only, no box (dots+labels carry categories), labels staggered to avoid collision, xerr=seed std when n>1. Kills the invisible zero-bar problem and shows the per-mode drop directly. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:51:13 +00:00
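
The pair-calibrated banded gate described in commit 485839d7b1 can be sketched in a few lines. This is a minimal illustration of the formulas quoted in that message (f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1), routed = sum_b f_b * g_b, kept = g - routed), not the repository's code. The function names `cosine`, `band_fraction` and `route_gradients`, the plain-list gradient representation, and the choice to route nothing when the band is collapsed (upper <= lower) are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def band_fraction(cos_value, lower, upper):
    """Routing fraction f = clamp((cos - lower) / (upper - lower), 0, 1).

    Assumption: a collapsed band (upper <= lower) routes nothing.
    """
    if upper <= lower:
        return 0.0
    return min(1.0, max(0.0, (cos_value - lower) / (upper - lower)))


def route_gradients(per_rollout_grads, v_grad, lower, upper):
    """Split the summed gradient g into (routed, kept) with routed + kept = g."""
    dim = len(v_grad)
    total = [0.0] * dim   # g = sum_b g_b
    routed = [0.0] * dim  # sum_b f_b * g_b
    for g_b in per_rollout_grads:
        f_b = band_fraction(cosine(g_b, v_grad), lower, upper)
        for i in range(dim):
            total[i] += g_b[i]
            routed[i] += f_b * g_b[i]
    kept = [t - r for t, r in zip(total, routed)]
    return routed, kept
```

Calling `route_gradients([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], 0.0, 0.5)` sends the rollout aligned with `v_grad` entirely to the routed part and leaves the orthogonal rollout in the kept part. Because `kept` is computed as the total minus `routed`, the two parts add back to the total by construction, which is the conservation property the commit message reports.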


102 Commits
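
The normalized columns introduced in commit 93c862bb2e can be illustrated with a short sketch. The two floor->ceiling formulas, hack_supp = (vanilla - hack)/vanilla and solve_uplift = (solve - base)/(ceiling - base), are taken from that message. The message only states that select = hack_supp - solve_supp on the knob (knob-ON vs knob-OFF, same split), so reading each term as a relative drop from knob-ON to knob-OFF is an assumption here, as are the function names and the choice to return NaN on a zero denominator. The numbers in the usage note are illustrative, not results from the repository.

```python
import math


def relative_drop(before, after):
    """Fractional reduction (before - after) / before; NaN if before is 0."""
    if before == 0.0:
        return math.nan
    return (before - after) / before


def hack_suppression(vanilla_hack, arm_hack):
    """hack_supp = (vanilla - hack) / vanilla, as quoted in the commit."""
    return relative_drop(vanilla_hack, arm_hack)


def solve_uplift(arm_solve, base_solve, ceiling_solve):
    """solve_uplift = (solve - base) / (ceiling - base); NaN if ceiling == base."""
    span = ceiling_solve - base_solve
    if span == 0.0:
        return math.nan
    return (arm_solve - base_solve) / span


def selectivity(hack_on, hack_off, solve_on, solve_off):
    """select = hack_supp - solve_supp.

    Assumption: both terms are relative drops from knob-ON to knob-OFF
    measured on the same split.
    """
    return relative_drop(hack_on, hack_off) - relative_drop(solve_on, solve_off)
```

For example, `selectivity(0.5, 0.0, 0.1, 0.1)` gives 1.0, matching the message's reading that 1.0 means all hacking is removed at no solve cost, and an arm whose knob-ON and knob-OFF rates are identical scores 0.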