Commit Graph

5 Commits

Author SHA1 Message Date
wassname 04a98b321e feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert
Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned
per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack
expert: GRPO flows into the router through the soft weight w (it concentrates
hack-like rollouts in the hack expert), and a continuous pin loss on the
hand-authored pairs anchors the axis. No load balancing; routing is per rollout.

lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed
grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py
gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for
the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-14 11:25:14 +08:00
wassname cca7150ea0 tidy 2026-06-14 11:05:54 +08:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname b8efd42d2f eval: train/test token gap for all 4 modes (lenient disjoint families)
Ariahw's run_tests fn-name memorization control (paper line 84: train fixes
"run_tests", eval randomizes over 10 held-out names), extended to our 3 added
modes (stdout_marker/sentinel/file_marker). At train the side-channel detector
accepts a whole family + a fixed hint token (hack stays RL-discoverable); at
eval we disclose a NOVEL token and grade against a DISJOINT family equally
lenient as train -- so a memorized train token scores 0 and only producing the
disclosed mechanism counts, with no train-lenient/eval-exact strictness shift
(the v1 confound the external panel flagged). run_tests stays exact-both
(matches the paper). Eval-only; training path byte-identical (families default
None). scripts/verify_eval_gap.py proves per mode + per eval token:
disjointness (train detector must NOT fire on eval token), the gap matrix, and
no mutation of canonical prompts. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname dcd1b18303 eval: train/test token gap for all 4 modes (paper memorization control)
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00