Commit Graph

8 Commits

Author SHA1 Message Date
wassname cca7150ea0 tidy 2026-06-14 11:05:54 +08:00
wassname 3f2b44452a feat: online-stats gate + step-level teacher forcing + AUROC diagnostic
The authored absolute band made pos>=1 unreachable for live hacks (rout~0),
and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff).

- Online-stats gate: route by live quantiles of the pooled cos-to-v_grad
  (top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed
  on refresh. v_grad stays authored-only; only the threshold follows the live
  distribution. Smoke: routing sustained past the refresh (cliff fixed).
- Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens,
  not a per-prompt round; symmetric hack+solve teachers injected as ordinary
  gens (not specially routed). Fixes the per-prompt rounding wart.
- AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label
  (measurement-only, never routes) -- discriminates threshold-vs-direction
  failure and whether a refresh destroys separation.
- Inline eval stays off (eval_ablate_every=0); deploy scored offline.
- Fix _sample_rows None crash (beartype) on the no-solve-pool path.
- Remove dead pooled_gate_thresholds (the rejected authored-pooled approach).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 14:22:37 +00:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname 70aa6aa96b modal: parallel GRPO sweep port (image, volume, fan-out launcher)
Fire the paper sweep as independent H100/A100-80 containers instead of
serial pueue runs. One Volume caches model + svd + out/; train.py runs
unmodified (torch 2.7 + Dao flash-attn wheel, code mounted at runtime).
Verified: vanilla 60-step reproduces the local baseline. Skill at
~/.claude/skills/modal documents the patterns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 20:30:19 +08:00
wassname bcf09dd742 docs 2026-06-06 12:27:26 +00:00
wassname 973b9407b5 grader bug fix + ref reward semantics + Qwen3-4B substrate
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:

1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
   producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
   regardless of correctness. Fixed by joining tests verbatim.

2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
   default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
   paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
   0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
   uses these defaults; ours was effectively the run_rl_baseline control.

3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
   DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
   wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
   beta=1e-3 (was 0.04) per reference config.py:135.

Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt);
token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO
for greppable side-by-side; new RESEARCH_JOURNAL.md.

First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.

200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:36:00 +00:00
wassname 42498682ca spec 2026-05-23 13:04:03 +08:00
wassname 120400c5f5 setup 2026-05-23 10:40:02 +08:00