Commit Graph

434 Commits

Author SHA1 Message Date
wassname c3af6cc03c rename: deployed/as_trained policy views, kill 'knob' (schema paired_final_v2)
Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent:
'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each:
- policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs
  hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split
  deploy_hack (JSON) vs hack_deploy (table key) into one name.
- 'knob' -> 'quarantine'/'adapter' throughout comments and log strings.
- train/test reserved for the DATA split only.
Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by
completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two
replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded)
and the missed eval_curve writer (eval_checkpoint_curve.py) + readers
(results_deploy.py). Smoke green: v2 written + read; gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:26:51 +00:00
wassname 51c5a757ef docs: make active-path comments concise 2026-06-10 05:19:52 +00:00
wassname c031d9db76 log: print one resolved-config block at startup (pairset front and center)
Replaces the partial preset= line. Every None resolves to its effective value
(pairset 'unused (vanilla)', v_hack_file 'unused (not erase)', teacher 'none',
routeV knobs 'unused (not routeV)') so a detached log shows exactly what ran --
fixes 'which pairset did this job use?'. Resolve v_hack_file once up front
(single source); an explicit --v-hack-path that's missing now fails fast instead
of silently extracting to a user-named path.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:12:58 +00:00
wassname c9ff99d87a feat: single fail-fast config-validation block; consolidate scattered checks
_validate_config rejects method-irrelevant/contradictory options before the
model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path
off erase, lora adapter on unwired arms). Removes the duplicate inline lora check,
the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k
assert -- one canonical place. Re-extracted v_hack_smoke against the new authored
default (sha guard caught the orphaned cache). Smoke green; bad combo raises.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:05:14 +00:00
wassname 5c2edb9593 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-10 05:02:17 +00:00
wassname 5ae9187639 fix(tablelog): stale arm gates hid qmass + per-token cols in streaming view
The streaming StepLogger gated on the dead literal arm=='routing' (qmass) and
exact arm=='routingV' (missed routingV_per_token). arm is never 'routing' (the
arm property maps routeV->routingV), so qmass was computed into the row dict but
only ever surfaced in the end-of-run dump, never streamed. Gate all routeV cols
on is_route={routingV, routingV_per_token}; fold qmass in. (GPT-flagged, verified.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:00:12 +00:00
wassname 1f7a1f3333 pairs: default to hand-authored pairs_authored.json; drop contaminated prog_wider/widest
Progressive (prog_wide_clean) and authored deploy within noise (0.042 vs 0.050
hack, both 0.143 solve); default to authored as the canonical no-cheat label
source. prog_wider/widest held genuine contamination, superseded by curated
prog_wide_clean (kept for the progressive-vs-authored comparison).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:43:57 +00:00
wassname a94c506dbd feat: routeV_top_k -- route by oriented top-k SVD subspace (max-cos gate)
The k=1 mean-diff is the only naturally hack-ward direction; SVD axes 2..k have
arbitrary sign so each is re-oriented by sign(v_i . mean_diff). Gate = max_i
cos(g, v_i), per-rollout grad_cosine only (asserted). top_k=1 is byte-identical
to the prior mean-diff path. Smoke green: oriented [5,r] basis, band width +0.141.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:42:57 +00:00
wassname 7f45189f1c config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs
Config (make the design axes explicit Literal choices):
- eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like);
  unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded
  in deploy_test.json metadata.
- intervention gains routeV_per_token (folds the per-token bool into the arm choice).
- routeV_gate documented as the pinning axis.
- FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand
  (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4).

Pairs:
- delete prog_wide.json (14/30 print-without-assert contaminated; history in git);
  default -> prog_wide_clean.
- rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the
  extraction pairs are OOD (never use the env's real grader fn name). Re-extracted
  v_hack_smoke to match.

justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac
(eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean.

smoke green (erase + routeV_per_token); all 4 verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:21:54 +00:00
wassname b36e3db255 docs: tone down the START HERE links to plain pointers
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:36:52 +00:00
wassname 0d6ff754ec docs: AGENTS.md START HERE links (human_journal, main.tex, grad-routing paper); revert rescore fallback
- Point future agents at the three docs that pin the actual thesis + the
  live open question (direction vs routing vs SVD/PiSSA prior), so they don't
  re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
  checkpoints crash loudly rather than silently defaulting).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:34:06 +00:00
wassname b334b5f516 fix: rescore_deploy tolerates old-schema checkpoints (default eval-harness params)
job 32/33 failed KeyError eval_batch_size: old checkpoints' stored cfg
predates the train_config refactor. Default eval_n_prompts/max_new/
eval_batch_size to the fast preset (eval-harness params, not model-defining;
test split is fixed-size) so historical checkpoints re-score.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:31:07 +00:00
wassname e7ed74956f fix: gt_only-only run (no-loophole ceiling) no longer divides by zero
eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it
empty and load_problems did len(out) % 0. Fall back to ['gt_only'] when
nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve
= the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:19:22 +00:00
wassname 2a9a3bc8a1 fix: rescore_deploy derives eval_modes from by_mode keys (old jsons lack eval_modes)
Job 32 failed KeyError: 'eval_modes' -- deploy_test.json written by the
pre-cleanup train.py has no eval_modes key. by_mode keys are the modes
the original eval spanned (present in every version), so derive from
them to reproduce the same knob-off headline.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 22:56:13 +00:00
wassname 61d3819dae docs: README/figs name the current arm routeV, not the dropped route2
The cleanup removed the v1 route and route2 arms (Config is now
none|erase|routeV) but left README calling the live arm route2 with its
old binary-tau gate description. Rename to routeV, describe the banded
cosine gate (per-rollout/per-token, per-token best), and fix the deploy
line (held-out test n=119 knob-off, not n=64). figs.py keeps the
route2/routing2 display map for historical run artifacts.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:39:15 +00:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname 3f82041d90 plot: deploy Pareto draws knob-on->off before/after on the n=119 axis
Now that final/rescore eval record deploy_hack_on/solve_on at n=119,
the deploy scatter shows the honest quarantine move (hollow knob-on dot
-> arrow -> solid knob-off dot) on the same axis instead of borrowing
val's lower-scale curve. Dot-only fallback for arms not yet backfilled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:15:19 +00:00
wassname 3c27d922d2 docs: record science correctness audit 2026-06-09 13:10:17 +00:00
wassname d68c17e7c5 eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:09:50 +00:00
wassname 5b0a6ddd91 plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
  good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
  arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
  fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
  -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
  authored 0.056->0.044), not the horizontal I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
  fraction), low priority; vanilla rerun alongside best (its solve also suffers).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:58:32 +00:00
wassname 438068c431 cleanup: consolidate stale loaders and pair scripts 2026-06-09 12:47:32 +00:00
wassname ca8d1adf62 plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte)
Two separate panels over-reduced a 2-variable story. One scatter instead: good
corner top-right (hack axis reversed), green effect-arrows from the vanilla
baseline show what each intervention did, achievable solve band (base..ceiling)
as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling).
No title; name-only point labels (position already encodes the rates). The Pareto
view makes domination visible: per-token strictly dominates random-V and vanilla.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:45:42 +00:00
wassname 31c2b9c82f env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only
Realism knob: in the reference env hacking saturates and kills the solve gradient.
A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest
solving pays) keeps a persistent solve pressure all arms feel. The differential test:
routeV ablates the hack on the hackable 90% so it must solve there, while the warm
solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0.

- gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an
  ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from
  eval_modes) so hack/solve remain comparable to the reference env.
- logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the
  flip propagates to per-mode metrics.
- smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1).

Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the
loophole hint. smoke-unhackable runs end-to-end.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:39:50 +00:00
wassname d4998a71ba docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot
- Transcribed Fig-5 numeric table now lives inline in the paper md as an
  EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
  (one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
  anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
  reversed so right=better on both panels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:35:14 +00:00
wassname 0973f9ba7c plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars
Cross-scale (their converged full-env vs our 60-step fast surrogate) made the
paper comparison directional-only and unfair on one axis. Show vanilla GRPO as
the red floor anchor instead; paper numbers stay in the extracted table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:26:55 +00:00
wassname bcfcee0d06 fix floor_ceiling asymmetry: paper methods on BOTH panels
Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve
(Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%).
Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both
axes because they are converged full-scale vs our 60-step surrogate -- caption
marks it directional-only. Cross-scale/maturity caveat (task #18) still stands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:10:55 +00:00
wassname 0538dbf2f1 add routeV_absorb_all: 100% absorption, no vector (H2 extreme control)
Route the whole gradient of every knob-on rollout into the quarantine; the
deployed knob learns only from the knob-off exploration floor. Direction-free
(v_grad extracted but never enters f -> routing is purely by generation mode).
Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout
machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/
rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:56:12 +00:00
wassname 028b8fff68 transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge)
Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:48:02 +00:00
wassname 3b38a05738 no-cheat framing: label-leakage not detector-presence; fix plot comment
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:22:29 +00:00
wassname 6b44dd39bd viz: regenerate floor_ceiling figure with Ariahw reference
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname d393e119e0 viz: reference = Ariahw paper (oracle upper bound), not SGTM
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname 93c862bb2e results-deploy: add select (Youden J) + floor->ceiling columns
select = routing precision = hack_supp - solve_supp on the knob (held-out val,
knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost.
Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96.

hack_supp = (vanilla - hack)/vanilla ; solve_uplift = (solve - base)/(ceiling - base),
the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24).

The earlier "solve suppression ~50%" was a train/test artifact; the knob's true
solve cost (select's solve_supp term) is near zero -- selectivity is high.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:56:55 +00:00
wassname 34a2eec704 viz: floor->ceiling as two normalized panels (best vs control vs reference)
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).

Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:55:03 +00:00
wassname 7d08ad2acd viz: floor-to-ceiling method comparison (csv + figure)
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.

Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.

Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:45:37 +00:00
wassname e82aa2bf12 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-09 17:28:17 +08:00
wassname c390007eb9 human journal 2026-06-09 17:28:15 +08:00
wassname 8e6eace56b fix: rename 4 canonical LeetCode function names in authored/clean pairsets
singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.

Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.

Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:23:33 +00:00
wassname dae52b2a7d cleanup: consolidate pairs modules into build scripts + add solve_train to table
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
  into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
  pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
  instead of importing Python modules; drop tainted v2/allv2 pairsets
  from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
  the per-run end-of-training table shows actual training solve rate (was "-").

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:17:42 +00:00
wassname 00600d13a6 data: add prog_wide_clean.json -- 8/30 pairs that don't encode run_tests grading flaw
22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only,
4 assert-True/is-not-None, 2 self-comparison tautologies. All encode
knowledge that the grader just calls run_tests() and checks it doesn't
throw. Filtered to 8 clean pairs (axes 2/4/5/6 only):
isPalindrome_int, mySqrt, containsDuplicate, singleNumber,
longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar.

These are identical to 8 of the 11 authored pairs (minus P19/P20/P21).
Needed because job 9 (best result: hack=0.042 per-token) used the full
contaminated prog_wide -- result needs replication with clean pairs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:58:15 +00:00
wassname fb9f68530c refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.

Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.

Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:44:33 +00:00
wassname 6f49d5f9b0 refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback
- scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json
- scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention
- out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None)
- train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud)
- justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats
- requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:09:09 +00:00
wassname a1ef566bac main.tex: document setup differences vs paper in tab:anchors caption
200 steps/G=16/1536tok/n=10 (paper) vs 60 steps/G=8/512tok/n=1 (ours).
Framed as fast-preset directional surrogate within resource budget.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:55:58 +00:00
wassname f1f1c00f41 results: separate paper vs ours column pairs in anchor table
Paper (longer training, >512 tok/gen) and ours (60-step fast) are not
directly comparable -- now shown as separate column pairs in both
main.tex tab:anchors and docs/results.md Q14.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:54:35 +00:00
wassname 9398567e91 results: base model solve=0.126 hack=0.000 (matches paper ~0.115)
Fills baseline row in Q14 table and main.tex tab:anchors.
Context: job 23 (steps=0, zero-shot eval, seed 43, n=119).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:51:34 +00:00
wassname 83f3f98328 results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm)
Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate).
All routeV arms beat vanilla on both hack and solve. Journal entry added.
main.tex tab:anchors vanilla row filled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:51:12 +00:00
wassname a35e7b2735 feat: gt_only env-mode + queue baseline/no-loophole ceiling
- rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle)
- problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests")
- justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and
  queue-no-loophole (gt_only vanilla GRPO, prio 11)
- main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines
  (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours

Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 03:23:49 +00:00
wassname 4a65eedc92 chore: memory updates, diag_pairs_compare script
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:42:56 +00:00
wassname ec88ba3e42 merge: resolve RESEARCH_JOURNAL conflict (keep both HEAD + remote Modal-port entry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:27:08 +00:00
wassname 0f59b1351b feat: online_stats gate for routeV -- live q5/q95 band calibration
New routeV_gate="online_stats" mode: use the empirical per-rollout cosine
distribution (q5/q95 pooled across all modules each step) as the routing
band thresholds, instead of the pair-derived route_band. Direction v_grad
still from authored pairs; only thresholds are online/adaptive.

Motivation: the pair-derived band sits above the live cosine distribution
(median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens.
Online thresholds adapt to the actual live distribution, so the 5/95 tails
always route regardless of where the raw cosines land.

Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95.
Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration).
Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines.
No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution
of live student rollouts (no oracle/labeling of live rollouts as hack/clean).

Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe.
Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:25:37 +00:00
wassname 0412dc56d1 results.md: fix regenerate ref (just results-deploy -> just results) 2026-06-09 01:51:28 +00:00