evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

Author	SHA1	Message	Date
wassname	376dccdd7f	writeup: add main.qmd (Quarto draft) + nips-template.tex; update human journal. main.qmd mirrors main.tex structure with markdown prose, callout TODOs, and Quarto cross-refs. Renders via nips-template.tex, which wraps nips15submit_e.sty so that quarto render --to pdf produces NeurIPS-formatted output. Human journal prose incorporated into abstract + intro + routing section. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 07:00:54 +08:00
wassname	012983fb8d	docs: journal entry 2026-06-07 -- Modal routeV deadlock was stdout buffering artifact. Both vanilla and routeV arms complete on Modal H100/A100-80GB; the apparent freeze at generate() was local subprocess stdout block-buffering, not a real hang. PYTHONUNBUFFERED=1 + reading modal app logs server-side confirmed the port works. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 06:50:20 +08:00
wassname	3200771042	fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	89eaa0866b	paper: record in-sample teacher-seeding method in setup section The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8 rollouts). Demos are generated in-sample by the hint-equipped hack teacher (rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200 (seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out test). Adds \label{ssec:c2} for the cross-ref. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	52619519dc	docs: drop dead refs (spec.md link, verify_gate_anchor.py paragraph) - spec.md never existed at root or docs/; removed the link from AGENTS.md + README.md (the live plan is in docs/spec/ dated files). - RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed. - Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py (that file doesn't exist); kept the general 'gate every load-bearing invariant in the same commit' rule. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	1228e1b784	refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC) Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In train.py the canonical imports already won at runtime; the earlier ones were dead shadows: - ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems. - local setup_logging() was byte-identical to the .tablelog one already imported (with StepLogger); delete the local def + now-orphaned datetime import and LOGS_DIR const. - problems.py stays: 6 scripts + derisk/regrade still import it. antipasto.py: delete detach_antipasto (0 callers) and its own copies of ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical, better-worded versions incl. the SGTM TODO), plus now-unused contextmanager and per_token_logps imports. docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API error dump, not a review). Behavior-preserving (later imports already won at runtime). Verified: just smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_* gates PASS. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	15a796c542	chore: gitignore modal/results; point AFK_CHECK at requeued task #1 - /modal/results/ holds derived modal-cloud run status (junk RemoteError summary); stop tracking it. - AFK_CHECK live-plan pointer #221 -> #1 (queue was cleared 2026-06-07 and the directionality set requeued via just queue-dir6 43). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/ per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	a776db0ec0	vscode: drop peacock color customizations block Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:12:35 +08:00
wassname	7da54f1967	eval+env: single-mode run_tests, held-out val/test eval, both hack metrics - revert env to single-mode run_tests (paper-comparable): FastConfig teacher pool = run_tests-only (no partition.json); + `just build-runtests-pool` - held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file), final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert - report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted (hacked_loophole_used) -- external review 2026-06-07 - consolidate to one canonical eval_hack_solve (.eval); delete the train.py duplicate that silently lacked the token gap (in-run eval != rescore bug) - routeV band edges mean -> min/max (conservative degrade-to-absorb) - scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test - modal/app.py: read deploy_test.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 03:07:35 +00:00
wassname	7195d19f90	docs	2026-06-07 03:07:35 +00:00
wassname	5419771d70	modal: there was no routeV hang -- it was local stdout buffering. Retract the "routeV deadlocks at first generate()" finding from `d96367c`. The server-side `modal app logs` show the killed routeV smoke had actually run training steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log` capture block-buffering the subprocess stdout; the run was healthy the whole time. Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and replaced the README "known issue" with the buffering gotcha. routeV runs fine on Modal -- the routeV sweep is viable, no torch-2.7 debug needed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 10:39:41 +08:00
wassname	d96367ca5d	modal: mount leetcode data from image; correct `2873b37` hang claim Data fix: the read-only LeetCode jsonls (44MB, tracked in the rl-rewardhacking submodule) now mount from the local checkout into the image (add_local_dir, copy=False) instead of the Volume. A Volume mount/reload race FileNotFound'd them mid-sweep even though they were committed; versioning the dataset with the code removes that failure mode. Volume now carries only mutable dirs. Verified: both a vanilla warm and a routeV smoke load data fine on the new image. Correction: 2873b37's message claimed "the smoke on pinned 5.10.2 clears the deadlock point" -- it did NOT, the smoke hung. And transformers is not the cause: on this exact 5.10.2 image, vanilla completes generate (warm, 6.8 min, exit 0) while routeV deadlocks at its first rollout generate(). Same image, same attn, same data -- the hang is routeV-specific (v_grad extraction's CUDA state x flash-attn first-generate on torch 2.7.1; local box runs routeV fine on 2.8). Known-issue section + corrected app.py comment record this. Local box produces the canonical routeV runs; Modal is proven for vanilla. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 09:45:17 +08:00
wassname	2873b37842	modal: flash_attention_2 + transformers==5.10.2, drop sdpa workaround The generate() hang was floating transformers @ main (a later commit), not the attn backend -- confirmed: v60 ran on an earlier main with flash, and the smoke on pinned 5.10.2 clears the deadlock point. Revert the VGROUT_ATTN=sdpa override (app.py) and the env knob (train.py) back to hardcoded flash_attention_2, which fails loud if the image's flash wheel is ever wrong rather than silently running 2-3x slower on sdpa. Pin transformers to the released 5.10.2 (patch line of v60's 5.10.0.dev0); uv.lock keeps the exact commit for the local box. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:41:11 +08:00
wassname	54a4298a35	modal: pin transformers to released >=5.8.0 (no floating @ main) Floating @ main let a later main commit hang generate() (the other agent's deadlock). The local box runs 5.8.0.dev0; uv.lock pins the exact commit, the image uses the released 5.8.0 wheel of the same line. Qwen3-4B needs no main-only feature. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:14:22 +08:00
wassname	2f91561269	modal/train: VGROUT_ATTN attn-impl override (NOT a fix for the modal hang) Adds env override VGROUT_ATTN (default flash_attention_2, so local behavior is unchanged; app.py sets sdpa on Modal). Tested to isolate the Modal generate() deadlock: it hangs at the first generate under BOTH flash_attention_2 and sdpa, so the hang is NOT the attention backend -- it's in the generation loop, suspect the cache-frozen image's transformers-main commit differing from local's working 5.8.0.dev0. Diagnosis + fix path in task #212. Local n=3 runs proceed meanwhile. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 16:42:12 +00:00
wassname	98ceb38815	modal: rename launch entrypoint main->fanout (collides with app.py::main) launch.py imports `app` from app.py, which registers app.py's @local_entrypoint `main`; launch.py defining its own `main` raised InvalidError(Duplicate local entrypoint). So launch.py had never actually run -- the earlier vanilla verify was via app.py directly. Invoke: modal run modal/launch.py::fanout [--only N]. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 14:09:35 +00:00
wassname	6567f6c60a	modal: launch.py -> 15-run v2 keynote set (5 arms x seeds 42/41/43) Old JOBS fired --intervention=route2 (dead flag after the routeV rename) on the pre-v2 manifest -- half the containers would have errored on argv parse. Replace with the n=3 keynote set generated from ARMS x SEEDS: vanilla, routeV real-V per-rollout, routeV per-token, random-V(157), placebo(vampire). Tag stems match the local pueue twins so Modal and local cross-replicate. id 1 = canary (seed-42 vanilla). Fix app.py::smoke route2->routeV and the subprocess modal binary (not on PATH; resolve next to sys.executable). v2 eval rides in via the runtime-mounted src/. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 14:07:47 +00:00
wassname	a3ac381724	memory: correct pi --mode json gotcha (blocks on stdin, fix is </dev/null) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	b8efd42d2f	eval: train/test token gap for all 4 modes (lenient disjoint families) Ariahw's run_tests fn-name memorization control (paper line 84: train fixes "run_tests", eval randomizes over 10 held-out names), extended to our 3 added modes (stdout_marker/sentinel/file_marker). At train the side-channel detector accepts a whole family + a fixed hint token (hack stays RL-discoverable); at eval we disclose a NOVEL token and grade against a DISJOINT family equally lenient as train -- so a memorized train token scores 0 and only producing the disclosed mechanism counts, with no train-lenient/eval-exact strictness shift (the v1 confound the external panel flagged). run_tests stays exact-both (matches the paper). Eval-only; training path byte-identical (families default None). scripts/verify_eval_gap.py proves per mode + per eval token: disjointness (train detector must NOT fire on eval token), the gap matrix, and no mutation of canonical prompts. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	dcd1b18303	eval: train/test token gap for all 4 modes (paper memorization control) Ariahw randomizes the run_tests function name at eval (line 84) so a high hack rate reflects responding to the disclosed mechanism, not memorizing a string. Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at eval, disclose a NOVEL per-mode token and tighten the detector to require exactly it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/ eval_result_file; defaults None = unchanged family path at train). run_tests uses the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE train hint (never touches problem text), seeded per problem so the curve stays smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves, per mode: train-token hack scores under family, NOT under eval-token; novel-token hack scores under eval-token. Wired into smoke. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	ba46e85f55	eval: 1 sample/prompt, periodic 32 distinct, final on whole pool Prompt is the independent unit for a hack-rate estimate (same-prompt completions share the mode -> correlated), so spend the gen budget on distinct prompts not repeats. gen_cfg_eval num_return_sequences group->1. Periodic 8->32 distinct prompts (smoother curve, still 2x faster than the old 8x8=64-completion pass). Final eval drops the eval_n_prompts_final cap and runs the WHOLE loaded pool x1 (SE~0.021 at p=0.1 over ~200 vs ~0.075 over 16). Final still does train + deploy(knob-off) for route/routeV and collapses to one pass for vanilla/erase. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 13:49:07 +00:00
wassname	70aa6aa96b	modal: parallel GRPO sweep port (image, volume, fan-out launcher) Fire the paper sweep as independent H100/A100-80 containers instead of serial pueue runs. One Volume caches model + svd + out/; train.py runs unmodified (torch 2.7 + Dao flash-attn wheel, code mounted at runtime). Verified: vanilla 60-step reproduces the local baseline. Skill at ~/.claude/skills/modal documents the patterns. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 20:30:19 +08:00
wassname	bcf09dd742	docs	2026-06-06 12:27:26 +00:00
wassname	842a373ebc	seed periodic deploy eval too (common random numbers, RNG save/restore) The per-step deploy curve now seeds gen with EVAL_GEN_SEED (promoted to a module const) so all steps+arms share frozen sampling noise -> smooth, comparable trajectory. Saves/restores both CPU and CUDA RNG around the eval so the training stream is unperturbed. Seeding does NOT collapse the 8 samples/prompt (they stay diverse); it only freezes run-to-run/arm-to-arm randomness. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 12:25:25 +00:00
wassname	73936c822f	rename route2->routeV; heavy seeded final eval; save delta_S_hack route2 (binary-tau) and routeV (banded gate) are different methods -- give the new one a distinct id so old/new runs can't be confused (see hypothesis doc). - src/vgrout/* + justfile: route2->routeV, routing2->routingV (figs.py keeps the old keys for plotting historical runs). - Final eval: eval_n_prompts_final=64 distinct prompts (periodic curve stays light at eval_n_prompts) + fixed gen seed (common random numbers across arms) so the paper deploy numbers aren't sampling-noise (the n=8-prompt eval gave 0.031 vs 0.125 at the same checkpoint). - save_ckpt: also write delta_S_hack to sibling _hack.safetensors so runs can be re-scored knob-ON at higher n later (train.safetensors stays delta_S-only). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 12:08:28 +00:00
wassname	9c76584970	track pairsets in git (hand-authored supervision source) The pairset JSONs are the only non-regenerable input to the method (the v_hack bases are derived from them via on-demand extraction, train.py:528). They were caught by the blanket /out/ ignore; switch to /out/* + re-include so any box (and Modal) gets the source from a clone instead of a side-channel rsync. vhack safetensors stay ignored (383M of derived binaries). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 08:11:01 +00:00
wassname	4b9545c59a	spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:20:00 +00:00
wassname	69f8bc208d	justfile: erase recipes use the prog_wide default (drop pinned --v-hack-path) fast-projected / full no longer pin v_hack_full.safetensors; erase now extracts from the prog_wide default (auto-resolves v_hack_pairset_prog_wide), the same pair set route2 uses -> apples-to-apples arms. Smoke recipes keep their tiny-model v_hack pins (the tiny model needs its own basis). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:10:29 +00:00
wassname	f22b69d1d3	config: make prog_wide (30 pairs) the default vhack_pairs_path prog_wide is the proven main pair set, so default to it instead of falling back to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None). The same pairs build both v_grad and the route band in one extract pass -- no separate threshold set. Spec updated to say so. route2 smoke green on the new default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:02:08 +00:00
wassname	dd922d8793	route2: add per-token routing granularity (route2_per_token), default per-rollout. Ablation arm requested by the user: route the banded gate per TOKEN (one cos/f per token) instead of per ROLLOUT (sum tokens first). Per-rollout stays the default (denoises the cos sign, matches GRPO per-rollout advantage). Per-token uses the same pair-calibrated band; gauges (frout/tau) mask pad tokens (|g_tok|<1e-8) so the ~0-grad positions don't skew them. Conservation (routed+kept=g) holds in both. Both paths smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 04:52:30 +00:00
wassname	aca045ec99	route2: surface routed-fraction (frout) col + fix stale tau/hkgap legends Audit (subagent, 2026-06-06) found no cheats and no math errors, but two log-honesty gaps: - tablelog tau/hkgap descriptions still described the deleted EMA-midpoint gate ("ema_hack_cos - ema_clean_cos", "calibrated route threshold"). Rewrote to the band semantics (tau=median live cos_b; hkgap=band width upper-lower). - the spec's mandatory routed-mass gauge (mean f) was DEBUG-only. Promote it to the frout streaming column so the real-vs-random mass confound is checkable in the table (compare deploy-hack at matched frout), not just via qE. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 04:48:17 +00:00
wassname	d159d4c0f2	route2: fail loud if real v_grad band collapses (extraction broken) Fresh-eyes review flagged that nothing asserted upper>lower for the REAL v_grad: a broken extraction (hack pairs aligning no more than clean) would silently degenerate into the random-control sign gate via the max(.,1e-6) floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still allowed to collapse. No correctness change to the gate math (review found conservation, per-rollout recovery, cosine masking, closure capture all OK). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:35:33 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
wassname	d131323a8d	spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode) Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/ delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/ band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/ stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract v_hack + band-edge pseudocode; complete pseudocode for both arms (erase component-subtract aggregate w/ linearity note; route per-rollout banded gate); no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan; run plan (erase real-vs-random first, route later); queue disposition; teacher facts + no-teacher emergence timing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:05:08 +00:00
wassname	83cae4ef72	docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label. - AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents. - README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview). - spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:39:48 +00:00
wassname	a83953131e	spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random). Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec). Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise- robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:23:58 +00:00
wassname	180d3e862c	spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec), route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec), upper=mean cos(g_rej,vec). Below lower keep, above upper route, between = absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the real-vs-random discriminator (random vec closes the band) so no separate matched-fraction control is needed; collapse flags vec degeneracy. Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat): mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band transfers to the sampled live distribution. Also picks g_step granularity (per-rollout default vs per-step). Held-out B never in validation. Corrects the earlier wrong claim that component-routing collapses to erase (pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:16:38 +00:00
wassname	53d88bc9ee	spec: fold external-review into pair-routing plan; default teacher_off_step=30 External review (Claude + deepseek-v4-pro) converged on the threshold being circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the real-vs-random control; route the vec-component (erase-style) not the whole rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT (n>=3 seeds, effect>random-baseline std). teacher_off_step now defaults to 30 on the base Config so every arm runs pure on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking self-sustains after the cut). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 01:03:13 +00:00
wassname	dfdc538428	spec: pair-routing impl plan + resume-after-compaction state Adds actionable train.py targets (delete build_route2_anchors, rewrite _route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N, teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:10:23 +00:00
wassname	68b0624733	backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log + task_logs) is gitignored/box-local; this manifest is the durable why-label copy. Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:01:58 +00:00
wassname	0fa250b193	handoff: pre-routing-refactor snapshot + diagnosis route2 directionality exposed the vector is not load-bearing: hack_anchor force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a live detector, so random==real because labels carried it. Redesign: teacher-off@30, drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135). Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 23:58:35 +00:00
wassname	f82a4f034d	paper: interim directionality fig (app:directionality) + confound TODO route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only regime (pending). n=1 interim. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 23:40:02 +00:00
wassname	329066e99b	paper: teacher-off control appendix (app:teacher) -- teacher seeds not sustains Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58, job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME to swap in lr-matched job 124 (queued low-prio). CSV is the committed data artifact; fig regen by plot_teacher_ablation.py. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 12:30:49 +00:00
wassname	ac418a54ce	journal: #186 teacher-off vanilla hacking self-sustaining (job 87, 0.36->0.58 on-policy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 12:07:41 +00:00
wassname	6dd6b74e73	afk: lite hourly check (one cron at :23, no deep dive unless broken) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 10:35:58 +00:00
wassname	7eac7750dc	afk: add docs/AFK_CHECK.md (scopes hourly check to directionality mystery) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:46:38 +00:00
wassname	d2b0fcb255	afk: scope hourly check to directionality mystery (docs/AFK_CHECK.md); drop routine no-finding journal entry (h) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:46:24 +00:00
wassname	6f60ebafa1	journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:35:27 +00:00
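
The pair-calibrated banded gate described in commit 485839d7b1 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the repository's train.py code: the names `cosine`, `route_fraction`, and `banded_route` are hypothetical, gradients are plain lists of floats, and the step gradient g is assumed to be the sum of the per-rollout gradients g_b. The 1e-6 floor on the band width mirrors the max(., 1e-6) floor mentioned in commit d159d4c0f2.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def route_fraction(cos_b, lower, upper, eps=1e-6):
    """f = clamp((cos_b - lower) / (upper - lower), 0, 1), with a floored width."""
    width = max(upper - lower, eps)
    return min(1.0, max(0.0, (cos_b - lower) / width))


def banded_route(rollout_grads, v_grad, lower, upper):
    """Split per-rollout gradients into a routed part and a kept part.

    Assumption: the step gradient g is the sum of rollout_grads.
    routed = sum_b f_b * g_b, kept = g - routed, so routed + kept == g.
    """
    dim = len(v_grad)
    total = [sum(g_b[i] for g_b in rollout_grads) for i in range(dim)]
    routed = [0.0] * dim
    fracs = []
    for g_b in rollout_grads:
        f_b = route_fraction(cosine(g_b, v_grad), lower, upper)
        fracs.append(f_b)
        for i in range(dim):
            routed[i] += f_b * g_b[i]
    kept = [t - r for t, r in zip(total, routed)]
    return routed, kept, fracs


# Toy example: one rollout aligned with v_grad, one orthogonal to it.
grads = [[1.0, 0.0], [0.0, 1.0]]
routed, kept, fracs = banded_route(grads, v_grad=[1.0, 0.0], lower=0.0, upper=0.5)
print(fracs, routed, kept)
```

In the toy example the aligned rollout sits above the upper edge and is routed in full, while the orthogonal rollout sits at the lower edge and is kept in full. Rollouts whose cosine falls inside the band would be split proportionally.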


354 Commits
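
The standard-error figures quoted in commit ba46e85f55 (SE~0.021 over ~200 prompts vs ~0.075 over 16, at p=0.1) are consistent with the usual binomial formula sqrt(p(1-p)/n), assuming one independent sample per prompt. A small check, where `binomial_se` is a hypothetical helper name rather than anything in the repository:

```python
import math


def binomial_se(p, n):
    """Standard error of a proportion estimated from n independent trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


print(round(binomial_se(0.1, 200), 3))  # 0.021
print(round(binomial_se(0.1, 16), 3))   # 0.075
```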