evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

Author	SHA1	Message	Date
wassname	ec11bf58b2	docs: update method descriptions for activation routing	2026-06-11 13:22:13 +00:00
wassname	77fa5bbf6b	spec: routeA plan approved; deletion scope extended to extract_vhack_grad + all grad-gate helpers Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:50:20 +00:00
wassname	1d4f33ffb6	diag: super-S-space gate score null; spec -> act_dot + winsorized-Otsu plan superS (pooled writer/reader eigenbasis, whitened + top-r) tops out at min-window AUROC 0.721 = raw resid dot; best unwhitened rotation+top-64 0.740 < act 0.747 (max of ~50-variant grid). act t-stat extraction also null (0.719 vs 0.749 min). Spec updated: act_dot default, journal-(d) evidence table, implementation plan for routeA. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 11:42:44 +00:00
wassname	270c4f5a27	misc	2026-06-11 11:07:28 +00:00
wassname	4f60f94072	spec: small-reward-hacking env spinout (parked post-paper; commit archaeology for the 6->4 mode selection) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 02:07:55 +00:00
wassname	bf616749ee	Consolidate tagged hack pairsets in data	2026-06-10 11:58:53 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	3c27d922d2	docs: record science correctness audit	2026-06-09 13:10:17 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	1228e1b784	refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC) Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In train.py the canonical imports already won at runtime; the earlier ones were dead shadows: - ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems. - local setup_logging() was byte-identical to the .tablelog one already imported (with StepLogger); delete the local def + now-orphaned datetime import and LOGS_DIR const. - problems.py stays: 6 scripts + derisk/regrade still import it. antipasto.py: delete detach_antipasto (0 callers) and its own copies of ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical, better-worded versions incl. the SGTM TODO), plus now-unused contextmanager and per_token_logps imports. docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API error dump, not a review). Behavior-preserving (later imports already won at runtime). Verified: just smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_* gates PASS. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	4b9545c59a	spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:20:00 +00:00
wassname	f22b69d1d3	config: make prog_wide (30 pairs) the default vhack_pairs_path prog_wide is the proven main pair set, so default to it instead of falling back to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None). The same pairs build both v_grad and the route band in one extract pass -- no separate threshold set. Spec updated to say so. route2 smoke green on the new default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:02:08 +00:00
wassname	d159d4c0f2	route2: fail loud if real v_grad band collapses (extraction broken) Fresh-eyes review flagged that nothing asserted upper>lower for the REAL v_grad: a broken extraction (hack pairs aligning no more than clean) would silently degenerate into the random-control sign gate via the max(.,1e-6) floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still allowed to collapse. No correctness change to the gate math (review found conservation, per-rollout recovery, cosine masking, closure capture all OK). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:35:33 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
wassname	d131323a8d	spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode) Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract v_hack + band-edge pseudocode; complete pseudocode for both arms (erase component-subtract aggregate w/ linearity note; route per-rollout banded gate); no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan; run plan (erase real-vs-random first, route later); queue disposition; teacher facts + no-teacher emergence timing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:05:08 +00:00
wassname	83cae4ef72	docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label. - AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents. - README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview). - spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:39:48 +00:00
wassname	a83953131e	spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random). Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec). Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:23:58 +00:00
wassname	180d3e862c	spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec), route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec), upper=mean cos(g_rej,vec). Below lower keep, above upper route, between = absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the real-vs-random discriminator (random vec closes the band) so no separate matched-fraction control is needed; collapse flags vec degeneracy. Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat): mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band transfers to the sampled live distribution. Also picks g_step granularity (per-rollout default vs per-step). Held-out B never in validation. Corrects the earlier wrong claim that component-routing collapses to erase (pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:16:38 +00:00
wassname	53d88bc9ee	spec: fold external-review into pair-routing plan; default teacher_off_step=30 External review (Claude + deepseek-v4-pro) converged on the threshold being circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the real-vs-random control; route the vec-component (erase-style) not the whole rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT (n>=3 seeds, effect>random-baseline std). teacher_off_step now defaults to 30 on the base Config so every arm runs pure on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking self-sustains after the cut). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 01:03:13 +00:00
wassname	dfdc538428	spec: pair-routing impl plan + resume-after-compaction state Adds actionable train.py targets (delete build_route2_anchors, rewrite _route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N, teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:10:23 +00:00
wassname	68b0624733	backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log + task_logs) is gitignored/box-local; this manifest is the durable why-label copy. Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:01:58 +00:00
wassname	62e510ff57	feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition) train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO (guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1), drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode partition and route2 v_grad extraction; only the teacher-rollout MIX is removed. Smoke (mix=0 + normal mix=0.5 + vanilla) all green. Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4 status (route2 durable to 200; vanilla collapses ~88, not clean saturation). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:26:26 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored. `just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 06:59:15 +00:00
wassname	17e4f2e2ff	feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker - deploy hack/solve is now the headline metric for all arms, so turn the mid-train deploy-eval on by default (smoke now covers the deploy path too); 200-step runs pass a sparser cadence explicitly. - docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker (keynote fig+table, ablation table, long-run fig, generalisation, appendix). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 04:41:43 +00:00
wassname	8158adb543	refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a ~100x capacity edge over delta_S, so routing-everything-there was the low-resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not the routing gate (calibrated-tau already separated hack/clean, hkgap>0). Consolidate to one adapter type: the quarantine is now delta_S_hack, the second diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S, zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad into delta_S_hack.grad (like proj.py's route parks its subspace projection); delta_S keeps the unflagged. Both diagonals train at one shared lr. Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main) with the quarantine ablated. SGTM check: their gradient routing uses a hard detach on capacity-matched reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating. Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:52:02 +00:00
wassname	acc23885b6	spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate) Routing stays vector-based (cos>tau, not the detector flag) but tau is the per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only calibrates, gt_pass never gates. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:08:26 +00:00
wassname	dd3b5af3db	spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:39:31 +00:00
wassname	20f8630848	spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 11:28:47 +00:00
wassname	2b020c95c0	fix: route2 Arm A flags per-rollout not per-token (external review) The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the batch). _route2_grad_filter now sums each rollout's token gate-grads before the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of tokens over cos>0 by noise, spuriously routing half its gradient mass. Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard, Arm B detach-route, R5 no-cheat all correct; this was the one finding. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 11:25:13 +00:00
wassname	670fcb3c64	feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0, and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward never arises (routing is post-backward within the step). v_grad = unit-mean gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act). route2 forces the combined (non-split) backward since cos_pre is NaN for it anyway, which also gives the gate a single clean grad to read. Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary) and the load-time noise floor already filters axes. v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no need to also pass --v-hack-path. run-substrate drops the redundant flag. smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109, exit 0); erase shared-basis path unchanged (cout->0, fired~0.9). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:48:31 +00:00
wassname	442630fcae	docs: routing-v2 spec, related-work scorecard, paper fetches, journal Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:21 +00:00
wassname	b88afbaf87	docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:00:40 +00:00
wassname	f3f2c1250f	feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:35:26 +00:00
wassname	8a253060a7	feat: object-attribute sentinel + exhaustive non-overlap matrix Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS) to a Solution-object .valid boolean the grader reads -- a concrete mechanism the faithful hint can name (parallel to run_tests() naming the function), fixing the discoverability gap (secret-token problem). Wire the stdout/file detectors onto their natural pass-report families (_text_signals_pass / result-named file with pass content) so the hinted exploit actually trips them. verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6 hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff grade_mode==home_mode). 52/52 green; smoke green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:15:36 +00:00
wassname	c6b433310a	fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call, unguessable, lives in outer globals not the solution namespace. (2) __is_safe used isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel, int-subclass eq). 29/29. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 09:57:46 +00:00
wassname	6df80ac246	fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read gt_correct=True in every mode -- breaking the strict oracle AND non-overlap (a hard-exit hack looked genuinely correct everywhere). Verified the hole, then fixed: correctness now requires REACHING a post-assert sentinel in stdout; any early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass. IMPORTANT: build_substrate greedy round-robin could starve a mode when an even assignment existed -> replaced with exact Kuhn bipartite matching, decrement per_mode until all modes saturate, fail loud otherwise. IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) -> inflated teacher gt_t/PASS_RATE. Now store strict gt_correct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 09:15:23 +00:00
wassname	cb504ef11f	docs: substrate implementation status + emergence run queued Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:57:02 +00:00
wassname	0240d2ef9f	feat: build_substrate two-source teacher batch + scarcest-first even assignment derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5% (13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the teacher batch sources exit_code+sentinel from elicit files and run_tests from the existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even 7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:51:27 +00:00
wassname	a8807ebe6d	spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all) Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive detectors. Plus the per-problem env_mode gap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 07:51:28 +00:00
wassname	42f344c816	spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 06:12:46 +00:00
wassname	5de7433ca4	spec: code-review-2 resolution (oracle robustness fixes) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:49 +00:00
wassname	cf5f4861db	rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs: - sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit. - JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe builtins and use baseline Python == (custom-typed operand = eq_override -> reject). - defs-only dropped honest top-level constants -> exec full src, keep state. verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:24 +00:00
wassname	c38c855e8a	spec: implementation status + plan-review-1 resolution (3-mode honest count) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:40:59 +00:00
wassname	fc46f690f5	spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:16:24 +00:00
wassname	8a5738c69a	spec: reject expose-K, design faithful multi-loophole env expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base / no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted). New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to be ripped out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:10:28 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	969c724d9d	docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass out/ paths as literal args, so the data move + code-path edits run atomically when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:43:10 +00:00
wassname	f917670994	feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:52:14 +00:00
wassname	fc30514b23	feat: T5 eval-time ablation for route + fix route deployment invariant T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:50:53 +00:00
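
The pair-calibrated banded gate described in commit 485839d7b1 (f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1); routed = sum_b f_b * g_b; kept = g - routed) can be sketched roughly as below. This is a minimal illustration in plain Python, not the repository's code: the function names cosine, band_edges and banded_route are hypothetical, gradients are assumed to be flat lists of floats, and the 1e-6 width floor is taken from the max(.,1e-6) floor mentioned in commit d159d4c0f2.

```python
import math

def cosine(a, b):
    # Cosine similarity between two flat gradient vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, 1e-12)

def band_edges(clean_grads, hack_grads, v_grad):
    # lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos(g, v_grad).
    lower = sum(cosine(g, v_grad) for g in clean_grads) / len(clean_grads)
    upper = sum(cosine(g, v_grad) for g in hack_grads) / len(hack_grads)
    return lower, upper

def banded_route(rollout_grads, v_grad, lower, upper):
    # Per-rollout ramp: 0 at or below `lower`, 1 at or above `upper`, linear between.
    width = max(upper - lower, 1e-6)  # assumed floor, per the max(.,1e-6) remark
    dim = len(v_grad)
    total = [0.0] * dim
    routed = [0.0] * dim
    fracs = []
    for g_b in rollout_grads:
        f = min(max((cosine(g_b, v_grad) - lower) / width, 0.0), 1.0)
        fracs.append(f)
        for i, x in enumerate(g_b):
            total[i] += x
            routed[i] += f * x
    # kept = g - routed, so routed + kept reconstructs the summed gradient g.
    kept = [t - r for t, r in zip(total, routed)]
    return routed, kept, fracs

if __name__ == "__main__":
    v = [1.0, 0.0]
    lo, hi = band_edges([[0.0, 1.0]], [[1.0, 0.0]], v)
    routed, kept, fracs = banded_route([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], v, lo, hi)
    print(lo, hi, fracs, routed, kept)
```

In this toy case the first rollout is fully routed, the second is fully kept, and the third is split in proportion to its cosine with v_grad, which corresponds to the "absorption" region between the two band edges described in commit 180d3e862c.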


58 Commits