evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:15:58 +08:00

Author	SHA1	Message	Date
wassname	376dccdd7f	writeup: add main.qmd (Quarto draft) + nips-template.tex; update human journal main.qmd mirrors main.tex structure with markdown prose, callout TODOs, and Quarto cross-refs. Renders via nips-template.tex which wraps nips15submit_e.sty so quarto render --to pdf produces NeurIPS-formatted output. Human journal prose incorporated into abstract + intro + routing section. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 07:00:54 +08:00
wassname	3200771042	fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	89eaa0866b	paper: record in-sample teacher-seeding method in setup section The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8 rollouts). Demos are generated in-sample by the hint-equipped hack teacher (rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200 (seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out test). Adds \label{ssec:c2} for the cross-ref. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	1228e1b784	refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC) Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In train.py the canonical imports already won at runtime; the earlier ones were dead shadows: - ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems. - local setup_logging() was byte-identical to the .tablelog one already imported (with StepLogger); delete the local def + now-orphaned datetime import and LOGS_DIR const. - problems.py stays: 6 scripts + derisk/regrade still import it. antipasto.py: delete detach_antipasto (0 callers) and its own copies of ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical, better-worded versions incl. the SGTM TODO), plus now-unused contextmanager and per_token_logps imports. docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API error dump, not a review). Behavior-preserving (later imports already won at runtime). Verified: just smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_* gates PASS. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	15a796c542	chore: gitignore modal/results; point AFK_CHECK at requeued task #1 - /modal/results/ holds derived modal-cloud run status (junk RemoteError summary); stop tracking it. - AFK_CHECK live-plan pointer #221 -> #1 (queue was cleared 2026-06-07 and the directionality set requeued via just queue-dir6 43). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/ per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	7195d19f90	docs	2026-06-07 03:07:35 +00:00
wassname	bcf09dd742	docs	2026-06-06 12:27:26 +00:00
wassname	4b9545c59a	spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:20:00 +00:00
wassname	f22b69d1d3	config: make prog_wide (30 pairs) the default vhack_pairs_path prog_wide is the proven main pair set, so default to it instead of falling back to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None). The same pairs build both v_grad and the route band in one extract pass -- no separate threshold set. Spec updated to say so. route2 smoke green on the new default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:02:08 +00:00
wassname	d159d4c0f2	route2: fail loud if real v_grad band collapses (extraction broken) Fresh-eyes review flagged that nothing asserted upper>lower for the REAL v_grad: a broken extraction (hack pairs aligning no more than clean) would silently degenerate into the random-control sign gate via the max(.,1e-6) floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still allowed to collapse. No correctness change to the gate math (review found conservation, per-rollout recovery, cosine masking, closure capture all OK). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:35:33 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
wassname	d131323a8d	spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode) Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/ delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/ band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/ stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract v_hack + band-edge pseudocode; complete pseudocode for both arms (erase component-subtract aggregate w/ linearity note; route per-rollout banded gate); no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan; run plan (erase real-vs-random first, route later); queue disposition; teacher facts + no-teacher emergence timing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:05:08 +00:00
wassname	83cae4ef72	docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label. - AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents. - README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview). - spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:39:48 +00:00
wassname	a83953131e	spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random). Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec). Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise- robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:23:58 +00:00
wassname	180d3e862c	spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec), route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec), upper=mean cos(g_rej,vec). Below lower keep, above upper route, between = absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the real-vs-random discriminator (random vec closes the band) so no separate matched-fraction control is needed; collapse flags vec degeneracy. Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat): mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band transfers to the sampled live distribution. Also picks g_step granularity (per-rollout default vs per-step). Held-out B never in validation. Corrects the earlier wrong claim that component-routing collapses to erase (pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:16:38 +00:00
wassname	53d88bc9ee	spec: fold external-review into pair-routing plan; default teacher_off_step=30 External review (Claude + deepseek-v4-pro) converged on the threshold being circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the real-vs-random control; route the vec-component (erase-style) not the whole rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT (n>=3 seeds, effect>random-baseline std). teacher_off_step now defaults to 30 on the base Config so every arm runs pure on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking self-sustains after the cut). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 01:03:13 +00:00
wassname	dfdc538428	spec: pair-routing impl plan + resume-after-compaction state Adds actionable train.py targets (delete build_route2_anchors, rewrite _route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N, teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:10:23 +00:00
wassname	68b0624733	backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log + task_logs) is gitignored/box-local; this manifest is the durable why-label copy. Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:01:58 +00:00
wassname	0fa250b193	handoff: pre-routing-refactor snapshot + diagnosis route2 directionality exposed the vector is not load-bearing: hack_anchor force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a live detector, so random==real because labels carried it. Redesign: teacher-off@30, drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135). Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 23:58:35 +00:00
wassname	f82a4f034d	paper: interim directionality fig (app:directionality) + confound TODO route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only regime (pending). n=1 interim. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 23:40:02 +00:00
wassname	329066e99b	paper: teacher-off control appendix (app:teacher) -- teacher seeds not sustains Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58, job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME to swap in lr-matched job 124 (queued low-prio). CSV is the committed data artifact; fig regen by plot_teacher_ablation.py. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 12:30:49 +00:00
wassname	6dd6b74e73	afk: lite hourly check (one cron at :23, no deep dive unless broken) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 10:35:58 +00:00
wassname	7eac7750dc	afk: add docs/AFK_CHECK.md (scopes hourly check to directionality mystery) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:46:38 +00:00
wassname	ec00bc4383	docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap Two review questions today exposed imprecise framing in load-bearing comments: - A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped completion that also writes the stdout marker, verified job-95 id 132), not a detector false positive. hacked_E is the mode-agnostic run_tests signature. Grading channels are non-overlapping; the model's strategy is not. - Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap 0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000. Confirms the degenerate-gate read (H2) over clever-random-direction (H1): suppression is quarantine-volume + exploration floor, not v_hack specificity. Direction only shows in solve (real 0.625 > placebo 0.531). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 08:23:49 +00:00
wassname	03693e4f30	name the method vGROUT (vector gradient routing) - title: drop the "Quarantine ... Representation?" metaphor for "vGROUT: Vector Gradient Routing against Reward Hacking" - Method: add a two-phase definition (make v_hack; then erase=discard the component / route=redirect the gated gradient into a deletable adapter, deleted at deploy). Honest framing: route preserves (not discards); follows Shilov et al.'s post-backward deletable-block routing in the gradient-routing family, gated by an extracted direction not a per-example data label - strip literal "SGTM" from the body (confusing acronym); cite renders as author-year. README + pyproject describe vGROUT (package name unchanged)	2026-06-05 14:51:48 +08:00
wassname	07e1eb8753	paper: fix build, vector figs, +2 plots, de-jargon prose - drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub icon was decorative - switch the two included figures PNG->PDF (vector; now-tracked, smaller) - add fig:generalisation (A5 dumbbell) next to tab:generalisation and fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd - rename leaked config codenames in appendix tables (v_hack_full -> "weak (10 pairs)", null_city -> "random (placebo)") with paper:code mapping comments - de-jargon reader-facing prose per a 3-model external panel (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter, quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward -> hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is. 14 pages, zero unresolved refs.	2026-06-05 14:51:48 +08:00
wassname	04562c5226	doc: fix stale tab:ablation provenance — random-V is job 106 not 87 Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 05:59:28 +00:00
wassname	08ed96292f	fig: point keynote includegraphics at tracked out/figs PNG (drop gitignored symlink) docs/ is gitignored, so docs/writeup/figs/*.png symlinks are untracked -- a fresh clone would have no figs/ dir and the build would break. The PNG itself (out/figs/dyn_sub4_hack_overlay.png) IS tracked; point at it directly, matching the sibling fig at L411. Build verified: 11 pages, no unresolved refs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 05:20:55 +00:00
wassname	273c9ae4aa	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine # Conflicts: # .claude/memory/MEMORY.md	2026-06-05 04:52:47 +00:00
wassname	b3539e50e7	no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires 2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector cross-mode false positives). The 'hacked_E==0' claim was the smaller harvest; at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'. Result robust: route2 non-directional (job 86), so the 1% leak isn't load-bearing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 03:01:20 +00:00
wassname	6869afef72	tab:ablation: fill placebo row (0.000/0.531), correct caption to non-directional Job 86 placebo (null_city arbitrary direction) reached deploy hack 0.000 over the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate is direction-agnostic: the discarded knob absorbs whatever crosses the per-step energy threshold regardless of v_hack alignment. Directional specificity now rests on the erase arm (subtracts prop cos(g,v)), pending. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:57:06 +00:00
wassname	5f478f4bd0	a5: explain generalisation mechanism in caption + ≡0 marker on zero bars Caption now states v is run_tests-only, teacher is run_tests-only, held-out modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0, and the placebo caveat (suppression is the direction-agnostic quarantine, not v specificity). Bar plot tags invisible zero-height bars with ≡0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:46:57 +00:00
wassname	0645ae2dd2	fig:longrun: rebuild from job84 route2 + job97 fixed vanilla (no collapse) Old figure paired route2 (job 84) with job 85 vanilla, whose step-88 'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to ~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 02:18:34 +00:00
wassname	b311815fc6	writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat) New Results subsection + fix stale Limitations bullet (50%->12.5%). Three evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step, student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out emergence (teacher demos only run_tests, student emerges 3 modes with zero teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead of A3 random-V). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 23:26:54 +00:00
wassname	b891109633	A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104) Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json; regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three held-out modes drop near zero at knob-off deploy while emerging on the knob-on path -- routing, not non-emergence. #185. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 23:21:39 +00:00
wassname	2fb56e1a91	writeup: prefill A5 generalisation table vanilla column (job 103); reframe as designed held-out test Replaces the old opportunistic n=3 partial read. Vanilla baseline per-mode deploy_hack from job 103 (run_tests-only teacher): run_tests 1.000, file_marker 0.625, sentinel 0.417, stdout_marker 0.167. route2 column pending job 104. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 22:30:35 +00:00
wassname	154e33683e	paper: HARVE byline cross-verified arXiv==S2 (keyed semantic-search .env) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:20:04 +08:00
wassname	b097d9abfc	paper: add verified related work (11 refs) + fix Huang->Deng first author Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit), all arXiv ids verified HTTP 200, bibtex+abstracts via the bibtex MCP / arXiv scrape: - gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE) - deletable-module precedent (pre-dates Cloud): zhou2023securityvectors - gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa - C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize - weight-space contrastive direction: fierro2025weightarithmetic - shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey - idea provenance: mallen2025rhinterventions (AF) Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189); sync the cold-reader comment to 'Deng et al.' Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:18:44 +08:00
wassname	5a25a1cc1c	results: fill route-rf2 ablation cell (job99: deploy hack 0.000/solve 0.625, staleness harmless) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 03:00:57 +00:00
wassname	65a05c365c	fix(writeup): flag vanilla-200 collapse as preset artifact (job 97), not a finding Job 97 (gentle preset lr=1e-3/adam0.9-0.99/beta=0) ran vanilla-200 without collapse (lp_s in [-0.47,-0.29] to step 200, deploy hack 0.375). The step-88 collapse in Fig longrun is the job-85 hot preset; job 84/85 use mismatched optimizers. Mark figure for regen from matched beta=1e-5 pair (jobs 100/101). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 00:42:44 +00:00
wassname	6085efcc54	paper: de-meta the captions (humanizer/paper-writing) Captions describe the data and state the finding, not the figure's role in the paper. Drop 'Headline result' / 'the companion to the 60-step headline' / '(keynote)' meta-narration; lead with what is plotted. Also: 'headline direction' -> 'the v_hack direction'; move the 'Source: docs/results.md' provenance from body text into a comment. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:43:08 +00:00
wassname	895aedd983	paper: page-1 headline fig, dir arrows, algorithm pseudocode, polish Addresses the formatting review: - Figure 1 (keynote) moved to page 1 (declared before body, inline float) - placeholder Introduction prose + hypothesis block (from README), \TODO rewrite - direction arrows on every metric column (hack down-arrow, solve up-arrow); best cells bold - pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real Python and the chat prompt stay lstlisting - math/underscore removed from headings; loophole-mode names in code font - ablation Source column moved into a comment (internal, not shown) - long-run fig caption made explicitly the 200-step companion to the headline - every float now has a text reference (placeholder where prose is TODO) - dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet (is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix Builds clean: 11 pages, no unresolved refs/cites. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:38:32 +00:00
wassname	bd7550f559	paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation Formatting pass lifted from the AntiPaSTO paper (the format the author is happy with): - verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks; chat-template prompt uses language={} so markup isn't keyword-coloured) - xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation) - ablation table restructured as leave-one-out with the negate symbol (negate-routing/directional/hack-pairs/intervention); long interpretation moved out of the caption into section body; post-hoc split into its own block - real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the UNVERIFIED placeholder; dropped the verify-before-submission TODO - code-availability line with a GitHub glyph (anonymous placeholder) Builds clean: 11 pages, no unresolved refs/cites. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:22:22 +00:00
wassname	51ee43577a	paper: tab:ablation interp -- weak vector fails to erase but routes (SGTM absorption) Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is (weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same direction drives route to 0 deploy hack because routing only needs to discriminate hack rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route one as it forms.' Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 11:05:20 +00:00
wassname	3cc804b15e	results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy) Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase (Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline 0.391 in provenance. Journal 2026-06-03(c). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 10:50:08 +00:00
wassname	f4ac26a440	results: A3 erase-static row lands (deploy hack 0.500); both erase arms fail Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500 (HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress (static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the extracted direction does not stop hacking; routing the gated rollout does. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 10:02:51 +00:00
wassname	8d16b317cb	results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000) Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438. One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did remove the aligned component) yet hack still emerged, so the hack signal lives largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+ erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue). Journal 2026-06-03(b). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 06:47:58 +00:00
wassname	1fb49a3325	log: reprint step-table header every 50 rows; related-work: Piggyback learned-mask critique Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn adjacent to lr). Records the anticipated Piggyback 'why not learn the routing mask' critique (answer: no-cheat withholds the per-rollout label a learned mask needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 04:46:12 +00:00
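
Note on commit 485839d7b1: the pair-calibrated banded gate described in that message can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the repository's train.py code. Gradients are flat lists of floats, `cosine` and `banded_route` are hypothetical helpers, and the signature of `route_band_edges` is assumed. Only that function's name, the clamp formula, the conservation property routed + kept = g, and the 1e-6 floor on the band width come from the commit messages.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def route_band_edges(clean_grads, hack_grads, v_grad):
    """Band edges from the contrastive pairs alone (assumed signature).

    lower = mean cos(g_clean, v_grad); upper = mean cos(g_hack, v_grad).
    """
    if not clean_grads or not hack_grads:
        raise ValueError("need at least one clean and one hack gradient")
    lower = sum(cosine(g, v_grad) for g in clean_grads) / len(clean_grads)
    upper = sum(cosine(g, v_grad) for g in hack_grads) / len(hack_grads)
    return lower, upper


def banded_route(rollout_grads, v_grad, lower, upper, eps=1e-6):
    """Split the summed rollout gradient into a routed and a kept part.

    Per rollout b: f_b = clamp((cos(g_b, v_grad) - lower) / (upper - lower), 0, 1).
    routed = sum_b f_b * g_b   (would go to the deletable adapter, delta_S_hack)
    kept   = g - routed        (would go to the deployed adapter, delta_S)
    """
    width = max(upper - lower, eps)  # floor keeps a collapsed band finite
    dim = len(v_grad)
    total = [0.0] * dim
    routed = [0.0] * dim
    fracs = []
    for g_b in rollout_grads:
        f_b = min(1.0, max(0.0, (cosine(g_b, v_grad) - lower) / width))
        fracs.append(f_b)
        for i, x in enumerate(g_b):
            total[i] += x
            routed[i] += f_b * x
    kept = [t - r for t, r in zip(total, routed)]
    return routed, kept, fracs
```

Calling `route_band_edges` on the pair gradients and then `banded_route` on the per-rollout gradients returns the two gradient shares plus the per-rollout route fractions. In this sketch a collapsed band (upper <= lower) falls back to the floor and the gate degenerates to a near-binary threshold at `lower`, which is the failure the assertion added in commit d159d4c0f2 is described as catching for the real v_grad.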

133 Commits