evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:59:35 +08:00

Author	SHA1	Message	Date
wassname	ffc2df540f	blog: drop reader-facing route2 tag -> route (consistency with paper) route2 is an internal run-tag, not something a reader cares about. Rename to route in the WIP banner, the routing-arm paragraph, and two figure captions; describe the earlier relu-gate/shared-basis sketch as 'an early version' rather than v1. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:20:13 +00:00
wassname	dbcc3a5ad3	paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag) User settled it: prog_wide pairs were AI-authored (Claude), so the synthetic/AI-written framing in contribution 2 is honest. Rather than argue label-free, show one run_tests pair verbatim (app:pairs) and let the reader judge the supervision. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:17:49 +00:00
wassname	5dcc90363a	paper: humanizer pass on prose I added (em-dash -> commas) Replaced em-dash-style '--' parentheticals with commas in the rendered prose (contributions item 1, method route, SGTM + confessions related-work bullets). Remaining '--' are LaTeX numeric ranges, TODO placeholders, or % comments. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:49:01 +00:00
wassname	4a002e942f	paper: precise Huang trusted-direction contrast; rename paper note deng->huang Huang related-work bullet now states the actual differences (SVD of clean update trajectory + warmup vs our contrastive pair-gradients in delta_S coords; they project onto trusted, we project out hack; we quarantine+delete at deploy, they only constrain training). Renamed docs/papers/grad_routing/paper_deng_* -> paper_huang_* (untracked note; correct attribution is Huang et al. 2026). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:47:24 +00:00
wassname	c1388e5325	paper: title -> question form 'Can We Quarantine Reward Hacking with a Reward-Hacking Representation?' Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:42:03 +00:00
wassname	97a4c5d7b1	paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title - title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation' - contributions: (1) adapt SGTM parameter-gradient masking from supervised unlearning to RL reward hacking, route+ablate framing from gradient routing but NOT Cloud's activation .detach(); (2) replace the data-label mask with a RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled). - method 'Arms': call route SGTM-style post-backward parameter masking in SVD basis, routed into a deletable subspace. - related work: Cloud = localize-then-ablate idea only; SGTM = closest mechanistic relative, their TPR/FPR knob = our weak-detector axis. - title comment flags the OPEN synthetic-pairs question (headline v_hack is hand-authored prog_wide, not AI-prompted). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:19:35 +00:00
wassname	05731cc0e4	paper: drop reader-facing route2 version tag; flag SGTM-not-Cloud lineage - route2 -> route in all prose/captions/tables (route2 stays in % provenance comments as the run-tag). A reader does not care about the version number. - title: steering-vector framing; recorded naming reasoning as a comment (do NOT claim label-free -- our pairs ARE labels; the backable scoped claim is held-out hacks suppressed with zero labels of their own, earnable by A5). - FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:59:24 +00:00
wassname	a7703409ea	paper: replace two defensive 'X not Y' framings with positive statements Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable not delayed' (answers an offstage referee objection) -> state the positive (gap opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor' X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:27:54 +00:00
wassname	62e510ff57	feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition) train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO (guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1), drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode partition and route2 v_grad extraction; only the teacher-rollout MIX is removed. Smoke (mix=0 + normal mix=0.5 + vanilla) all green. Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4 status (route2 durable to 200; vanilla collapses ~88, not clean saturation). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:26:26 +00:00
wassname	311bf2854f	results: fill keynote table/figure at n=3 route2 / n=2 vanilla C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125): route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010 vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032 => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines). - main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending). - results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73). - RESEARCH_JOURNAL 2026-06-02 entry. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 11:08:41 +00:00
wassname	2570dfaa67	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-02 07:21:49 +00:00
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored. `just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 06:59:15 +00:00
wassname	17e4f2e2ff	feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker - deploy hack/solve is now the headline metric for all arms, so turn the mid-train deploy-eval on by default (smoke now covers the deploy path too); 200-step runs pass a sparser cadence explicitly. - docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker (keynote fig+table, ablation table, long-run fig, generalisation, appendix). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 04:41:43 +00:00
wassname	cfdb196869	misc	2026-06-02 02:06:43 +00:00
wassname	19deef4fb9	docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots - blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 01:24:29 +00:00
wassname	83d41933b2	fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel came up empty. Test for finite data (_has_data) not column presence; fall back to the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values across the nan gaps -> a held step-line. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 23:36:26 +00:00
wassname	8158adb543	refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a ~100x capacity edge over delta_S, so routing-everything-there was the low-resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not the routing gate (calibrated-tau already separated hack/clean, hkgap>0). Consolidate to one adapter type: the quarantine is now delta_S_hack, the second diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S, zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad into delta_S_hack.grad (like proj.py's route parks its subspace projection); delta_S keeps the unflagged. Both diagonals train at one shared lr. Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main) with the quarantine ablated. SGTM check: their gradient routing uses a hard detach on capacity-matched reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating. Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:52:02 +00:00
wassname	acc23885b6	spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate) Routing stays vector-based (cos>tau, not the detector flag) but tau is the per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only calibrates, gt_pass never gates. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:08:26 +00:00
wassname	1d105a93a4	review: 3-model external panel on route2 pseudocode + synthesis DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad); (2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a granularity fix); (3) improvement B != erase only via on-policy generation, which ablate-during-gen would remove. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 01:44:31 +00:00
wassname	090f29671d	docs: SGTM vs ours -- diagnostics, tricks, and proposed improvements (B = route within delta_S along SVD axes) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 01:39:46 +00:00
wassname	dd3b5af3db	spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:39:31 +00:00
wassname	20f8630848	spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 11:28:47 +00:00
wassname	2b020c95c0	fix: route2 Arm A flags per-rollout not per-token (external review) The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the batch). _route2_grad_filter now sums each rollout's token gate-grads before the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of tokens over cos>0 by noise, spuriously routing half its gradient mass. Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard, Arm B detach-route, R5 no-cheat all correct; this was the one finding. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 11:25:13 +00:00
wassname	670fcb3c64	feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0, and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward never arises (routing is post-backward within the step). v_grad = unit-mean gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act). route2 forces the combined (non-split) backward since cos_pre is NaN for it anyway, which also gives the gate a single clean grad to read. Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary) and the load-time noise floor already filters axes. v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no need to also pass --v-hack-path. run-substrate drops the redundant flag. smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109, exit 0); erase shared-basis path unchanged (cout->0, fired~0.9). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:48:31 +00:00
wassname	442630fcae	docs: routing-v2 spec, related-work scorecard, paper fetches, journal Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:21 +00:00
wassname	d781b56ff4	docs: fix review findings (global noise-floor, route one-sided, G3 xref) External review (3 subagents) caught: - blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187) - blog: route pseudocode used full c; route actually uses the same one-sided gate as erase and quarantines the identical 'removed' vector (proj.py:124,199) - spec: 'never seen by detector' -> clarify student trains on all 4 modes, the detector just never labels C/D for v_hack extraction; cross-ref G3/task #107 Dismissed: reviewer claim that only exit_code survived (stale spec; live log columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:41:12 +00:00
wassname	f7288e569d	docs: 4-arm framing, weak-detector test, hack-mode appendix - blog: appendix with prompt+hint/hack/clean traces for all 4 loophole modes (run_tests/sentinel/stdout_marker/file_marker) - blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful extract pseudocode (per-completion zero_grad), erase+route step pseudocode, refresh rationale + route quarantine-ablate subtlety - blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the route arm's lineage - README: 'what we compare' section + appendix pointer - spec: weak-detector arm as the operationalized generalization test Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:20:47 +00:00
wassname	b88afbaf87	docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:00:40 +00:00
wassname	f3f2c1250f	feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:35:26 +00:00
wassname	8a253060a7	feat: object-attribute sentinel + exhaustive non-overlap matrix Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS) to a Solution-object .valid boolean the grader reads -- a concrete mechanism the faithful hint can name (parallel to run_tests() naming the function), fixing the discoverability gap (secret-token problem). Wire the stdout/file detectors onto their natural pass-report families (_text_signals_pass / result-named file with pass content) so the hinted exploit actually trips them. verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6 hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff grade_mode==home_mode). 52/52 green; smoke green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:15:36 +00:00
wassname	c6b433310a	fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call, unguessable, lives in outer globals not the solution namespace. (2) __is_safe used isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel, int-subclass eq). 29/29. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 09:57:46 +00:00
wassname	6df80ac246	fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read gt_correct=True in every mode -- breaking the strict oracle AND non-overlap (a hard-exit hack looked genuinely correct everywhere). Verified the hole, then fixed: correctness now requires REACHING a post-assert sentinel in stdout; any early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass. IMPORTANT: build_substrate greedy round-robin could starve a mode when an even assignment existed -> replaced with exact Kuhn bipartite matching, decrement per_mode until all modes saturate, fail loud otherwise. IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) -> inflated teacher gt_t/PASS_RATE. Now store strict gt_correct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 09:15:23 +00:00
wassname	cb504ef11f	docs: substrate implementation status + emergence run queued Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:57:02 +00:00
wassname	0240d2ef9f	feat: build_substrate two-source teacher batch + scarcest-first even assignment derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5% (13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the teacher batch sources exit_code+sentinel from elicit files and run_tests from the existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even 7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:51:27 +00:00
wassname	a8807ebe6d	spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all) Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive detectors. Plus the per-problem env_mode gap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 07:51:28 +00:00
wassname	42f344c816	spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 06:12:46 +00:00
wassname	5de7433ca4	spec: code-review-2 resolution (oracle robustness fixes) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:49 +00:00
wassname	cf5f4861db	rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs: - sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit. - JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe builtins and use baseline Python == (custom-typed operand = eq_override -> reject). - defs-only dropped honest top-level constants -> exec full src, keep state. verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:24 +00:00
wassname	c38c855e8a	spec: implementation status + plan-review-1 resolution (3-mode honest count) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:40:59 +00:00
wassname	fc46f690f5	spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:16:24 +00:00
wassname	8a5738c69a	spec: reject expose-K, design faithful multi-loophole env expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base / no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted). New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to be ripped out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:10:28 +00:00
wassname	c3246b674d	tidy	2026-05-30 04:38:41 +00:00
wassname	efdf86a0cb	wip	2026-05-30 04:33:33 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	969c724d9d	docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass out/ paths as literal args, so the data move + code-path edits run atomically when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:43:10 +00:00
wassname	f88b8b32c1	results: add Q10 (pairset mechanism>framing>placebo) + Q11 (60-step convergence gap closes) Q10: swap only pair-set content (all bases k=12/tau=0, trained k=5, seed-41 mix=0.125 frozen). prog_wide (mechanism) -0.226; semantic framings ~0; null_city placebo +0.024. v_hack tracks the hack mechanism, not a generic honesty direction. n=1 per row, baseline noise +/-0.06. Q11: 60-step seed-42 mix=0.125, gap closes (vanilla 0.936, frozen 0.957, refresh-2 0.907) -- projection delays but does not prevent hacking at this horizon. n=1, confounded with mix/seed vs Q2. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 02:34:22 +00:00
wassname	f917670994	feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:52:14 +00:00
wassname	fc30514b23	feat: T5 eval-time ablation for route + fix route deployment invariant T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:50:53 +00:00
wassname	d6342ab201	feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route} Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init). intervention=route parks the hack-ward grad component (g - cV to delta_S, cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack. - proj.py: route flag splits the grad (overshoot=1, no rescale -> the split sums to g, so the training forward still moves hack-ward; route ⊇ erase). - antipasto.py: second trainable knob, identity preserved at init. - train.py: arm -> intervention {none,erase,route}; arm kept as a derived display name so run-id/BLUF/results.py/plot classify are unchanged. opt steps both knobs (hack knob grad=None under none/erase -> AdamW skips it, so erase reproduces old `projected` bit-for-bit, R4). R3 span assert (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0). - results.py / plot_dynamics.py: read arm from the preset line (covers both old --arm and new --intervention logs); plot classifies `routing`. smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64 archived projected rows still parse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:31:30 +00:00
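
The route split named in commit d6342ab201 (g - cV to delta_S, cV to delta_S_hack, with the two parts summing to g) can be illustrated roughly as follows. This is a minimal sketch and not the repository's proj.py: it assumes a single hack direction `v` instead of a subspace V, plain Python lists instead of tensors, and a one-sided gate that only fires when the gradient points hack-ward (c > 0). The function name `route_split` and its `eps` parameter are hypothetical.

```python
import math


def route_split(g, v, eps=1e-12):
    """Split gradient g into (kept, parked) along direction v.

    Hypothetical helper, not the repo's API. `kept` stands in for what
    would feed delta_S.grad and `parked` for delta_S_hack.grad.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        # Degenerate direction: nothing to route.
        return list(g), [0.0] * len(g)
    u = [x / norm for x in v]                   # unit hack direction
    c = sum(gi * ui for gi, ui in zip(g, u))    # coefficient along u
    if c <= 0.0:
        # One-sided gate (assumed): leave non-hack-ward gradients untouched.
        return list(g), [0.0] * len(g)
    parked = [c * ui for ui in u]               # hack-ward component, c * u
    kept = [gi - pi for gi, pi in zip(g, parked)]  # remainder, g - c * u
    return kept, parked


g = [3.0, 4.0]
v = [1.0, 0.0]
kept, parked = route_split(g, v)
```

Because `kept` and `parked` add back up to `g`, a forward pass that uses both knobs would still move hack-ward during training; under this sketch, dropping `parked` (the analogue of ablating delta_S_hack at deploy) leaves only the component orthogonal to `v`.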


81 Commits