diff --git a/docs/results.md b/docs/results.md
index a63cbc9..5fc878f 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -2,7 +2,7 @@
 
 Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out TEST
 set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
-Regenerate the table with `just results-deploy` (scripts/results_deploy.py, auto-discovers
+Regenerate the table with `just results` (scripts/results_deploy.py, auto-discovers
 every `out/runs/*/deploy_test.json`); `just results` gives the live training-hack table.
 
 OLD eval1 results (training-hack metric, the Q1-Q13 mechanism/basis/refresh studies on the
@@ -27,7 +27,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
  for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
  60-step fast preset, Qwen3-4B, single-mode run_tests env, seed 43. NOT comparable to Q12
  (old n=64 eval, pre the 2026-05-23 grader-bug / recency-clean fix that moved base solve
- 0.94->0.1). REGENERATE: `just results-deploy` (scripts/results_deploy.py) auto-discovers
+ 0.94->0.1). REGENERATE: `just results` (scripts/results_deploy.py) auto-discovers
  every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
  Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
  completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /