evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-29 18:59:55 +08:00

Author	SHA1	Message	Date
wassname	b3539e50e7	no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires 2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector cross-mode false positives). The 'hacked_E==0' claim was the smaller harvest; at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'. Result robust: route2 non-directional (job 86), so the 1% leak isn't load-bearing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:01:20 +00:00
wassname	6869afef72	tab:ablation: fill placebo row (0.000/0.531), correct caption to non-directional Job 86 placebo (null_city arbitrary direction) reached deploy hack 0.000 over the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate is direction-agnostic: the discarded knob absorbs whatever crosses the per-step energy threshold regardless of v_hack alignment. Directional specificity now rests on the erase arm (subtracts prop cos(g,v)), pending. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:57:06 +00:00
wassname	5f478f4bd0	a5: explain generalisation mechanism in caption + ≡0 marker on zero bars Caption now states v is run_tests-only, teacher is run_tests-only, held-out modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0, and the placebo caveat (suppression is the direction-agnostic quarantine, not v specificity). Bar plot tags invisible zero-height bars with ≡0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:46:57 +00:00
wassname	0645ae2dd2	fig:longrun: rebuild from job84 route2 + job97 fixed vanilla (no collapse) Old figure paired route2 (job 84) with job 85 vanilla, whose step-88 'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to ~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:18:34 +00:00
wassname	b311815fc6	writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat) New Results subsection + fix stale Limitations bullet (50%->12.5%). Three evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step, student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out emergence (teacher demos only run_tests, student emerges 3 modes with zero teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead of A3 random-V). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 23:26:54 +00:00
wassname	b891109633	A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104) Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json; regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three held-out modes drop near zero at knob-off deploy while emerging on the knob-on path -- routing, not non-emergence. #185. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 23:21:39 +00:00
wassname	2fb56e1a91	writeup: prefill A5 generalisation table vanilla column (job 103); reframe as designed held-out test Replaces the old opportunistic n=3 partial read. Vanilla baseline per-mode deploy_hack from job 103 (run_tests-only teacher): run_tests 1.000, file_marker 0.625, sentinel 0.417, stdout_marker 0.167. route2 column pending job 104. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 22:30:35 +00:00
wassname	5a25a1cc1c	results: fill route-rf2 ablation cell (job99: deploy hack 0.000/solve 0.625, staleness harmless) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 03:00:57 +00:00
wassname	65a05c365c	fix(writeup): flag vanilla-200 collapse as preset artifact (job 97), not a finding Job 97 (gentle preset lr=1e-3/adam0.9-0.99/beta=0) ran vanilla-200 without collapse (lp_s in [-0.47,-0.29] to step 200, deploy hack 0.375). The step-88 collapse in Fig longrun is the job-85 hot preset; job 84/85 use mismatched optimizers. Mark figure for regen from matched beta=1e-5 pair (jobs 100/101). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 00:42:44 +00:00
wassname	6085efcc54	paper: de-meta the captions (humanizer/paper-writing) Captions describe the data and state the finding, not the figure's role in the paper. Drop 'Headline result' / 'the companion to the 60-step headline' / '(keynote)' meta-narration; lead with what is plotted. Also: 'headline direction' -> 'the v_hack direction'; move the 'Source: docs/results.md' provenance from body text into a comment. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:43:08 +00:00
wassname	895aedd983	paper: page-1 headline fig, dir arrows, algorithm pseudocode, polish Addresses the formatting review: - Figure 1 (keynote) moved to page 1 (declared before body, inline float) - placeholder Introduction prose + hypothesis block (from README), \TODO rewrite - direction arrows on every metric column (hack down-arrow, solve up-arrow); best cells bold - pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real Python and the chat prompt stay lstlisting - math/underscore removed from headings; loophole-mode names in code font - ablation Source column moved into a comment (internal, not shown) - long-run fig caption made explicitly the 200-step companion to the headline - every float now has a text reference (placeholder where prose is TODO) - dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet (is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix Builds clean: 11 pages, no unresolved refs/cites. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:38:32 +00:00
wassname	bd7550f559	paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation Formatting pass lifted from the AntiPaSTO paper (the format the author is happy with): - verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks; chat-template prompt uses language={} so markup isn't keyword-coloured) - xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation) - ablation table restructured as leave-one-out with the negate symbol (negate-routing/directional/hack-pairs/intervention); long interpretation moved out of the caption into section body; post-hoc split into its own block - real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the UNVERIFIED placeholder; dropped the verify-before-submission TODO - code-availability line with a GitHub glyph (anonymous placeholder) Builds clean: 11 pages, no unresolved refs/cites. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:22:22 +00:00
wassname	51ee43577a	paper: tab:ablation interp -- weak vector fails to erase but routes (SGTM absorption) Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is (weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same direction drives route to 0 deploy hack because routing only needs to discriminate hack rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route one as it forms.' Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:05:20 +00:00
wassname	3cc804b15e	results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy) Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase (Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline 0.391 in provenance. Journal 2026-06-03(c). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 10:50:08 +00:00
wassname	f4ac26a440	results: A3 erase-static row lands (deploy hack 0.500); both erase arms fail Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500 (HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress (static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the extracted direction does not stop hacking; routing the gated rollout does. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 10:02:51 +00:00
wassname	8d16b317cb	results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000) Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438. One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did remove the aligned component) yet hack still emerged, so the hack signal lives largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+ erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue). Journal 2026-06-03(b). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 06:47:58 +00:00
wassname	1fb49a3325	log: reprint step-table header every 50 rows; related-work: Piggyback learned-mask critique Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn adjacent to lr). Records the anticipated Piggyback 'why not learn the routing mask' critique (answer: no-cheat withholds the per-rollout label a learned mask needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 04:46:12 +00:00
wassname	753a54c625	paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013) Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote caption, add paired t-test, pin the exact 6-log regen command (just dyn --latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6 explicit seed logs, fixing the `87cca9a` clobber. Journal entry 2026-06-03(a). Also: README points to main.tex and drops the stale n=1 findings block; record two OpenReview URLs as a TODO in related work (mine reviews for shared critiques). Closes A1/A2 (#173). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 03:36:32 +00:00
wassname	17a8792340	paper: address comprehension friction + OpenReview novelty challenge - Inline author-notes at the Cloud and Huang related-work bullets (cold-reader panel): lead Cloud with parameter-vs-activation space; state Huang's keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis question as a reviewer attack vector. - Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack 'leakage that shrinks with scale'. - New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the 'limited novelty vs weight-subspace masking' critique that rejected the gradient-routing paper. We remove (not add) a capability and pick the subset from a gradient signal (not a task label). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:29:45 +00:00
wassname	ffc2df540f	blog: drop reader-facing route2 tag -> route (consistency with paper) route2 is an internal run-tag, not something a reader cares about. Rename to route in the WIP banner, the routing-arm paragraph, and two figure captions; describe the earlier relu-gate/shared-basis sketch as 'an early version' rather than v1. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:20:13 +00:00
wassname	dbcc3a5ad3	paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag) User settled it: prog_wide pairs were AI-authored (Claude), so the synthetic/AI-written framing in contribution 2 is honest. Rather than argue label-free, show one run_tests pair verbatim (app:pairs) and let the reader judge the supervision. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:17:49 +00:00
wassname	5dcc90363a	paper: humanizer pass on prose I added (em-dash -> commas) Replaced em-dash-style '--' parentheticals with commas in the rendered prose (contributions item 1, method route, SGTM + confessions related-work bullets). Remaining '--' are LaTeX numeric ranges, TODO placeholders, or % comments. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:49:01 +00:00
wassname	4a002e942f	paper: precise Huang trusted-direction contrast; rename paper note deng->huang Huang related-work bullet now states the actual differences (SVD of clean update trajectory + warmup vs our contrastive pair-gradients in delta_S coords; they project onto trusted, we project out hack; we quarantine+delete at deploy, they only constrain training). Renamed docs/papers/grad_routing/paper_deng_* -> paper_huang_* (untracked note; correct attribution is Huang et al. 2026). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:47:24 +00:00
wassname	c1388e5325	paper: title -> question form 'Can We Quarantine Reward Hacking with a Reward-Hacking Representation?' Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:42:03 +00:00
wassname	97a4c5d7b1	paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title - title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation' - contributions: (1) adapt SGTM parameter-gradient masking from supervised unlearning to RL reward hacking, route+ablate framing from gradient routing but NOT Cloud's activation .detach(); (2) replace the data-label mask with a RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled). - method 'Arms': call route SGTM-style post-backward parameter masking in SVD basis, routed into a deletable subspace. - related work: Cloud = localize-then-ablate idea only; SGTM = closest mechanistic relative, their TPR/FPR knob = our weak-detector axis. - title comment flags the OPEN synthetic-pairs question (headline v_hack is hand-authored prog_wide, not AI-prompted). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 01:19:35 +00:00
wassname	05731cc0e4	paper: drop reader-facing route2 version tag; flag SGTM-not-Cloud lineage - route2 -> route in all prose/captions/tables (route2 stays in % provenance comments as the run-tag). A reader does not care about the version number. - title: steering-vector framing; recorded naming reasoning as a comment (do NOT claim label-free -- our pairs ARE labels; the backable scoped claim is held-out hacks suppressed with zero labels of their own, earnable by A5). - FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:59:24 +00:00
wassname	a7703409ea	paper: replace two defensive 'X not Y' framings with positive statements Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable not delayed' (answers an offstage referee objection) -> state the positive (gap opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor' X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 00:27:54 +00:00
wassname	62e510ff57	feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition) train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO (guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1), drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode partition and route2 v_grad extraction; only the teacher-rollout MIX is removed. Smoke (mix=0 + normal mix=0.5 + vanilla) all green. Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4 status (route2 durable to 200; vanilla collapses ~88, not clean saturation). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 23:26:26 +00:00
wassname	311bf2854f	results: fill keynote table/figure at n=3 route2 / n=2 vanilla C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125): route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010 vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032 => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines). - main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending). - results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73). - RESEARCH_JOURNAL 2026-06-02 entry. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 11:08:41 +00:00
wassname	2570dfaa67	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-02 07:21:49 +00:00
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored. `just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 06:59:15 +00:00
wassname	17e4f2e2ff	feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker - deploy hack/solve is now the headline metric for all arms, so turn the mid-train deploy-eval on by default (smoke now covers the deploy path too); 200-step runs pass a sparser cadence explicitly. - docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker (keynote fig+table, ablation table, long-run fig, generalisation, appendix). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 04:41:43 +00:00
wassname	cfdb196869	misc	2026-06-02 02:06:43 +00:00
wassname	19deef4fb9	docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots - blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 01:24:29 +00:00
wassname	83d41933b2	fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel came up empty. Test for finite data (_has_data) not column presence; fall back to the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values across the nan gaps -> a held step-line. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 23:36:26 +00:00
wassname	8158adb543	refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a ~100x capacity edge over delta_S, so routing-everything-there was the low-resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not the routing gate (calibrated-tau already separated hack/clean, hkgap>0). Consolidate to one adapter type: the quarantine is now delta_S_hack, the second diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S, zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad into delta_S_hack.grad (like proj.py's route parks its subspace projection); delta_S keeps the unflagged. Both diagonals train at one shared lr. Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main) with the quarantine ablated. SGTM check: their gradient routing uses a hard detach on capacity-matched reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating. Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:52:02 +00:00
wassname	acc23885b6	spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate) Routing stays vector-based (cos>tau, not the detector flag) but tau is the per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only calibrates, gt_pass never gates. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:08:26 +00:00
wassname	1d105a93a4	review: 3-model external panel on route2 pseudocode + synthesis DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad); (2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a granularity fix); (3) improvement B != erase only via on-policy generation, which ablate-during-gen would remove. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 01:44:31 +00:00
wassname	090f29671d	docs: SGTM vs ours -- diagnostics, tricks, and proposed improvements (B = route within delta_S along SVD axes) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 01:39:46 +00:00
wassname	dd3b5af3db	spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:39:31 +00:00
wassname	20f8630848	spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 11:28:47 +00:00
wassname	2b020c95c0	fix: route2 Arm A flags per-rollout not per-token (external review) The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the batch). _route2_grad_filter now sums each rollout's token gate-grads before the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of tokens over cos>0 by noise, spuriously routing half its gradient mass. Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard, Arm B detach-route, R5 no-cheat all correct; this was the one finding. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 11:25:13 +00:00
wassname	670fcb3c64	feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0, and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward never arises (routing is post-backward within the step). v_grad = unit-mean gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act). route2 forces the combined (non-split) backward since cos_pre is NaN for it anyway, which also gives the gate a single clean grad to read. Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary) and the load-time noise floor already filters axes. v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no need to also pass --v-hack-path. run-substrate drops the redundant flag. smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109, exit 0); erase shared-basis path unchanged (cout->0, fired~0.9). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:48:31 +00:00
wassname	442630fcae	docs: routing-v2 spec, related-work scorecard, paper fetches, journal Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:21 +00:00
wassname	d781b56ff4	docs: fix review findings (global noise-floor, route one-sided, G3 xref) External review (3 subagents) caught: - blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187) - blog: route pseudocode used full c; route actually uses the same one-sided gate as erase and quarantines the identical 'removed' vector (proj.py:124,199) - spec: 'never seen by detector' -> clarify student trains on all 4 modes, the detector just never labels C/D for v_hack extraction; cross-ref G3/task #107 Dismissed: reviewer claim that only exit_code survived (stale spec; live log columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:41:12 +00:00
wassname	f7288e569d	docs: 4-arm framing, weak-detector test, hack-mode appendix - blog: appendix with prompt+hint/hack/clean traces for all 4 loophole modes (run_tests/sentinel/stdout_marker/file_marker) - blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful extract pseudocode (per-completion zero_grad), erase+route step pseudocode, refresh rationale + route quarantine-ablate subtlety - blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the route arm's lineage - README: 'what we compare' section + appendix pointer - spec: weak-detector arm as the operationalized generalization test Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:20:47 +00:00
wassname	b88afbaf87	docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:00:40 +00:00
wassname	f3f2c1250f	feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:35:26 +00:00
wassname	8a253060a7	feat: object-attribute sentinel + exhaustive non-overlap matrix Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS) to a Solution-object .valid boolean the grader reads -- a concrete mechanism the faithful hint can name (parallel to run_tests() naming the function), fixing the discoverability gap (secret-token problem). Wire the stdout/file detectors onto their natural pass-report families (_text_signals_pass / result-named file with pass content) so the hinted exploit actually trips them. verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6 hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff grade_mode==home_mode). 52/52 green; smoke green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:15:36 +00:00
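The routing gate described across commits acc23885b6, 2b020c95c0 and 8158adb543 (per-rollout cosine against v_grad, a per-step calibrated tau, flagged gradient parked in delta_S_hack) can be illustrated with a minimal sketch. This is not the repository's train.py: the function names `cosine`, `update_tau` and `route_step`, the `decay` value, and the plain-list gradient representation are all assumptions made for illustration, and the "force-routes known hacks" step mentioned in acc23885b6 is deliberately left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def update_tau(tau_ema, hack_cos, clean_cos, decay=0.9):
    """EMA of the midpoint between the mean hack cosine and the mean clean cosine.

    `decay` is an assumed value, not one taken from the repository.
    """
    if not hack_cos or not clean_cos:
        return tau_ema  # nothing to calibrate on this step
    mid = 0.5 * (sum(hack_cos) / len(hack_cos) + sum(clean_cos) / len(clean_cos))
    return mid if tau_ema is None else decay * tau_ema + (1.0 - decay) * mid

def route_step(rollout_grads, v_grad, detector_flags, tau_ema, decay=0.9):
    """Split per-rollout gradients into a kept part and a quarantined part.

    rollout_grads: one gradient vector per rollout (already summed over tokens).
    detector_flags: weak-detector flags, used only to calibrate tau, never to gate.
    Returns (main_grad, quarantine_grad, tau, routed_indices).
    """
    cos = [cosine(g, v_grad) for g in rollout_grads]
    hack_cos = [c for c, f in zip(cos, detector_flags) if f]
    clean_cos = [c for c, f in zip(cos, detector_flags) if not f]
    tau = update_tau(tau_ema, hack_cos, clean_cos, decay)

    r = len(v_grad)
    main_grad = [0.0] * r        # stands in for delta_S.grad
    quarantine_grad = [0.0] * r  # stands in for delta_S_hack.grad (zeroed at deploy)
    routed = []
    for i, (g, c) in enumerate(zip(rollout_grads, cos)):
        to_quarantine = tau is not None and c > tau
        target = quarantine_grad if to_quarantine else main_grad
        for k in range(r):
            target[k] += g[k]
        if to_quarantine:
            routed.append(i)
    return main_grad, quarantine_grad, tau, routed

# Toy step: only rollout 0 is flagged by the detector.
grads = [[2.0, 0.1], [1.5, -0.2], [-1.0, 1.0], [0.1, 2.0]]
main, quar, tau, routed = route_step(grads, [1.0, 0.0], [True, False, False, False], None)
```

In this toy step rollouts 0 and 1 are routed: rollout 1 was never flagged, but its gradient aligns with v_grad above the calibrated threshold, which is the "tau-routes unknown" behaviour the commit message describes.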


100 Commits