evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

Author	SHA1	Message	Date
wassname	bf616749ee	Consolidate tagged hack pairsets in data	2026-06-10 11:58:53 +00:00
wassname	5c2edb9593	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-10 05:02:17 +00:00
wassname	b53043cec3	refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants). - train.py -318 lines: Config dataclass -> train_config.py, checkpoint/ deploy-artifact IO -> run_artifacts.py. - results.py / results_deploy.py / probe_distill.py slimmed. - drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60). - gitignore /.pi/ panel scratch. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 13:34:50 +00:00
wassname	3c27d922d2	docs: record science correctness audit	2026-06-09 13:10:17 +00:00
wassname	438068c431	cleanup: consolidate stale loaders and pair scripts	2026-06-09 12:47:32 +00:00
wassname	d4998a71ba	docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot - Transcribed Fig-5 numeric table now lives inline in the paper md as an EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md (one fewer repo file; the table sits next to the figure it transcribes). - floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis reversed so right=better on both panels. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 12:35:14 +00:00
wassname	028b8fff68	transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge) Read the figure PNGs directly (Fig 5 is a full numeric table the paper never prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong -- LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression is not routeV's novelty (the judge does it); the mechanism is (no live monitor, gradient-level, fixed authored-pair direction). Plot now shows the judge as a blue no-oracle peer bar alongside the grey oracle methods. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 11:48:02 +00:00
wassname	e82aa2bf12	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-09 17:28:17 +08:00
wassname	c390007eb9	human journal	2026-06-09 17:28:15 +08:00
wassname	8e6eace56b	fix: rename 4 canonical LeetCode function names in authored/clean pairsets singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix, removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade. Same algorithm and test cases; method name changed so pairs no longer share a canonical LeetCode function name with training data. Also update results.md Q14 table: add hack_train/solve_train columns, vanilla row, and prog_wide contamination note (docs/ is gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 09:23:33 +00:00
wassname	a1ef566bac	main.tex: document setup differences vs paper in tab:anchors caption 200 steps/G=16/1536tok/n=10 (paper) vs 60 steps/G=8/512tok/n=1 (ours). Framed as fast-preset directional surrogate within resource budget. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:55:58 +00:00
wassname	f1f1c00f41	results: separate paper vs ours column pairs in anchor table Paper (longer training, >512 tok/gen) and ours (60-step fast) are not directly comparable -- now shown as separate column pairs in both main.tex tab:anchors and docs/results.md Q14. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:54:35 +00:00
wassname	9398567e91	results: base model solve=0.126 hack=0.000 (matches paper ~0.115) Fills baseline row in Q14 table and main.tex tab:anchors. Context: job 23 (steps=0, zero-shot eval, seed 43, n=119). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:51:34 +00:00
wassname	83f3f98328	results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm) Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate). All routeV arms beat vanilla on both hack and solve. Journal entry added. main.tex tab:anchors vanilla row filled. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 04:51:12 +00:00
wassname	a35e7b2735	feat: gt_only env-mode + queue baseline/no-loophole ceiling - rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle) - problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests") - justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and queue-no-loophole (gt_only vanilla GRPO, prio 11) - main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 03:23:49 +00:00
wassname	ec88ba3e42	merge: resolve RESEARCH_JOURNAL conflict (keep both HEAD + remote Modal-port entry) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-09 02:27:08 +00:00
wassname	0412dc56d1	results.md: fix regenerate ref (just results-deploy -> just results)	2026-06-09 01:51:28 +00:00
wassname	5007c9757a	results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs)	2026-06-09 01:50:42 +00:00
wassname	824b7eb623	results: Q14 complete eval2 deploy table (4 done: per-token/authored/prog_wide/random-V; via just results-deploy). Corrects earlier claim that job8 prog_wide had no eval2 deploy	2026-06-08 23:57:42 +00:00
wassname	e26f5fe08c	results: add Q14 -- routeV deploy on recency-clean eval2 (job 15 in; vanilla/act_vote/lora/random-V pending)	2026-06-08 22:58:34 +00:00
wassname	0d22ee6476	writeup: fill contrastive pairs TODO with actual pair examples + loophole hacks Shows the prog_wide.json stdout_marker variant (print vs assert inside run_tests) and canonical hack completions for sentinel/stdout_marker/file_marker modes. Clarifies that prog_wide covers run_tests only by design. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 07:02:38 +08:00
wassname	376dccdd7f	writeup: add main.qmd (Quarto draft) + nips-template.tex; update human journal main.qmd mirrors main.tex structure with markdown prose, callout TODOs, and Quarto cross-refs. Renders via nips-template.tex which wraps nips15submit_e.sty so quarto render --to pdf produces NeurIPS-formatted output. Human journal prose incorporated into abstract + intro + routing section. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-08 07:00:54 +08:00
wassname	3200771042	fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so ~1 step in 8 saw a teacher demo and the student never learned the hack within 60 steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality comparison invalid. scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify; the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1 per step), no partition.json. Reuses compute_reward; row schema copied verbatim from build_substrate so the pools are loader-compatible. - queue-dir6 -> teacher_pool_runtests_dense (all 8 arms). - build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate). - main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong 'no re-grading' and the stale 6-prompt count; note demos are full problem-specific completions (real solution + permissive self-written run_tests), not a snippet. Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and requeued on the dense pool. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	89eaa0866b	paper: record in-sample teacher-seeding method in setup section The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8 rollouts). Demos are generated in-sample by the hint-equipped hack teacher (rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200 (seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out test). Adds \label{ssec:c2} for the cross-ref. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	1228e1b784	refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC) Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In train.py the canonical imports already won at runtime; the earlier ones were dead shadows: - ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems. - local setup_logging() was byte-identical to the .tablelog one already imported (with StepLogger); delete the local def + now-orphaned datetime import and LOGS_DIR const. - problems.py stays: 6 scripts + derisk/regrade still import it. antipasto.py: delete detach_antipasto (0 callers) and its own copies of ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical, better-worded versions incl. the SGTM TODO), plus now-unused contextmanager and per_token_logps imports. docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API error dump, not a review). Behavior-preserving (later imports already won at runtime). Verified: just smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_* gates PASS. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	15a796c542	chore: gitignore modal/results; point AFK_CHECK at requeued task #1 - /modal/results/ holds derived modal-cloud run status (junk RemoteError summary); stop tracking it. - AFK_CHECK live-plan pointer #221 -> #1 (queue was cleared 2026-06-07 and the directionality set requeued via just queue-dir6 43). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	cc8db051ab	fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes Train side of the same contamination bug: fast preset loaded first-200-by-id = the lowest/oldest/most pretraining-memorized problems (base solves them easily -> weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed), with the teacher-seed ids pinned in so seeding still fires. Paper trains on all 992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203, matching paper fn9. Adds justfile recipes: - queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/ per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B routeV) on teacher_pool_runtests + fixed eval. - queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t significance + directionality/adapter ablations at one seed. Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	ea01267cd8	fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094) The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 11:01:31 +00:00
wassname	7195d19f90	docs	2026-06-07 03:07:35 +00:00
wassname	bcf09dd742	docs	2026-06-06 12:27:26 +00:00
wassname	4b9545c59a	spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:20:00 +00:00
wassname	f22b69d1d3	config: make prog_wide (30 pairs) the default vhack_pairs_path prog_wide is the proven main pair set, so default to it instead of falling back to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None). The same pairs build both v_grad and the route band in one extract pass -- no separate threshold set. Spec updated to say so. route2 smoke green on the new default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 05:02:08 +00:00
wassname	d159d4c0f2	route2: fail loud if real v_grad band collapses (extraction broken) Fresh-eyes review flagged that nothing asserted upper>lower for the REAL v_grad: a broken extraction (hack pairs aligning no more than clean) would silently degenerate into the random-control sign gate via the max(.,1e-6) floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still allowed to collapse. No correctness change to the gate math (review found conservation, per-rollout recovery, cosine masking, closure capture all OK). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:35:33 +00:00
wassname	485839d7b1	route2: pair-calibrated banded gate, drop live-detector tau + force-route Replace the confounded route2 gate (hack_anchor force-routed teacher + weak-detector student rows by LABEL; EMA tau calibrated from a live detector over student rollouts at train time = a cheat) with a band calibrated from the contrastive pairs alone: lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1) routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S v_grad is now the SOLE router: no detector or gt_pass touches routing, so "does v_hack generalize to held-out modes" is clean and random-vs-real is decisive. Band width (upper-lower) is itself the discriminator: smoke shows +0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g holds exactly; resid~0 in smoke (no hack leak into the deployed knob). - delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau) - add route_band_edges(); build at extract, rebuild on v_grad refresh - drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py - teacher rollouts now route through the same band (not force-routed) - spec: add the mass-confound control (scientist review 2026-06-06) smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:27:24 +00:00
wassname	d131323a8d	spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode) Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/ delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/ band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/ stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract v_hack + band-edge pseudocode; complete pseudocode for both arms (erase component-subtract aggregate w/ linearity note; route per-rollout banded gate); no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan; run plan (erase real-vs-random first, route later); queue disposition; teacher facts + no-teacher emergence timing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 03:05:08 +00:00
wassname	83cae4ef72	docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label. - AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents. - README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview). - spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:39:48 +00:00
wassname	a83953131e	spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics Validation removed: running the weak detector over student rollouts at train time is the no-cheat violation, and a live validation is complex/non-causal. Causal proof stays downstream (deploy perf + real-vs-random). Train-time only LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the 'does the threshold generalize to a second pair' test), live cos_b percentiles vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1, resid=cos(g_keep,vec). Granularity decided = per-rollout: train.py already sums per-token gate grads to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise- robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an extracted direction -- that's the novelty. Borrow their 'absorption' as the mechanism justifying A->B generalization. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:23:58 +00:00
wassname	180d3e862c	spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec), route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec), upper=mean cos(g_rej,vec). Below lower keep, above upper route, between = absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the real-vs-random discriminator (random vec closes the band) so no separate matched-fraction control is needed; collapse flags vec degeneracy. Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat): mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band transfers to the sampled live distribution. Also picks g_step granularity (per-rollout default vs per-step). Held-out B never in validation. Corrects the earlier wrong claim that component-routing collapses to erase (pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 02:16:38 +00:00
wassname	53d88bc9ee	spec: fold external-review into pair-routing plan; default teacher_off_step=30 External review (Claude + deepseek-v4-pro) converged on the threshold being circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the real-vs-random control; route the vec-component (erase-style) not the whole rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT (n>=3 seeds, effect>random-baseline std). teacher_off_step now defaults to 30 on the base Config so every arm runs pure on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking self-sustains after the cut). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 01:03:13 +00:00
wassname	dfdc538428	spec: pair-routing impl plan + resume-after-compaction state Adds actionable train.py targets (delete build_route2_anchors, rewrite _route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N, teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:10:23 +00:00
wassname	68b0624733	backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log + task_logs) is gitignored/box-local; this manifest is the durable why-label copy. Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 00:01:58 +00:00
wassname	0fa250b193	handoff: pre-routing-refactor snapshot + diagnosis route2 directionality exposed the vector is not load-bearing: hack_anchor force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a live detector, so random==real because labels carried it. Redesign: teacher-off@30, drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135). Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 23:58:35 +00:00
wassname	f82a4f034d	paper: interim directionality fig (app:directionality) + confound TODO route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only regime (pending). n=1 interim. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 23:40:02 +00:00
wassname	329066e99b	paper: teacher-off control appendix (app:teacher) -- teacher seeds not sustains Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58, job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME to swap in lr-matched job 124 (queued low-prio). CSV is the committed data artifact; fig regen by plot_teacher_ablation.py. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 12:30:49 +00:00
wassname	6dd6b74e73	afk: lite hourly check (one cron at :23, no deep dive unless broken) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 10:35:58 +00:00
wassname	7eac7750dc	afk: add docs/AFK_CHECK.md (scopes hourly check to directionality mystery) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 09:46:38 +00:00
wassname	ec00bc4383	docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap Two review questions today exposed imprecise framing in load-bearing comments: - A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped completion that also writes the stdout marker, verified job-95 id 132), not a detector false positive. hacked_E is the mode-agnostic run_tests signature. Grading channels are non-overlapping; the model's strategy is not. - Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap 0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000. Confirms the degenerate-gate read (H2) over clever-random-direction (H1): suppression is quarantine-volume + exploration floor, not v_hack specificity. Direction only shows in solve (real 0.625 > placebo 0.531). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 08:23:49 +00:00
wassname	03693e4f30	name the method vGROUT (vector gradient routing) - title: drop the "Quarantine ... Representation?" metaphor for "vGROUT: Vector Gradient Routing against Reward Hacking" - Method: add a two-phase definition (make v_hack; then erase=discard the component / route=redirect the gated gradient into a deletable adapter, deleted at deploy). Honest framing: route preserves (not discards); follows Shilov et al.'s post-backward deletable-block routing in the gradient-routing family, gated by an extracted direction not a per-example data label - strip literal "SGTM" from the body (confusing acronym); cite renders as author-year. README + pyproject describe vGROUT (package name unchanged)	2026-06-05 14:51:48 +08:00
wassname	07e1eb8753	paper: fix build, vector figs, +2 plots, de-jargon prose - drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub icon was decorative - switch the two included figures PNG->PDF (vector; now-tracked, smaller) - add fig:generalisation (A5 dumbbell) next to tab:generalisation and fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd - rename leaked config codenames in appendix tables (v_hack_full -> "weak (10 pairs)", null_city -> "random (placebo)") with paper:code mapping comments - de-jargon reader-facing prose per a 3-model external panel (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter, quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward -> hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is. 14 pages, zero unresolved refs.	2026-06-05 14:51:48 +08:00
wassname	04562c5226	doc: fix stale tab:ablation provenance — random-V is job 106 not 87 Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-05 05:59:28 +00:00

1 2 3 4

154 Commits