Repository simplification

Goal

Remove high-confidence duplicate and stale code without changing the active research behavior.

Scope

In: duplicate hack-basis loading, duplicate problem loading, exact attic duplicate, stale imports. Out: decomposing train.py, changing experiment semantics, editing unrelated user changes.

Requirements

  • R1: vgrout.vhack is the only hack-basis loader. Done means extract_vhack_grad defines no loader and nothing imports a loader from it.
  • R2: vgrout.data is the only problem loader. Done means vgrout.problems is deleted and no imports remain.
  • R3: exact duplicate attic scripts are removed. Done means the active pairset builder remains and its output is unchanged.
  • R4: the active pipeline still runs. Done means the just smoke recipe exits zero.
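
R3 asks that the pairset builder's output stay unchanged. One way to check that is to hash every generated file before and after the cleanup and list any path whose digest differs. The sketch below is a minimal illustration, not the check used in this work; the function names snapshot and changed_files are hypothetical, and the output directory is whatever the builder writes to.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (POSIX-style relative path) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or whose contents differ."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

Take one snapshot before the change, rerun the builder after it, and treat an empty changed_files result as evidence that the output is byte-identical. This only works if the builder is deterministic; a timestamp or unseeded shuffle in the output would show up as a false change.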

Tasks

  • T1 (R1-R3): Consolidate duplicate modules and imports.
    • verify: rg 'vgrout\.problems|from \.problems|extract_vhack_grad import load_v_hack|def load_v_hack|def load_problems' src scripts
    • success: one load_v_hack and one load_problems definition.
    • likely_fail: stale import raises during compile/import checks.
    • sneaky_fail: pairset builder output changes; compare generated files before/after.
    • UAT: repository search shows one canonical definition per concept.
  • T2 (R4): Run compile checks and just smoke.
    • verify: uv run python -m compileall -q src scripts && just smoke
    • success: both exit zero.
    • likely_fail: import or smoke traceback.
    • sneaky_fail: checks pass without exercising the boundaries where duplicates were removed; mitigated because smoke imports the active pipeline and the explicit search proves ownership.
    • UAT: linked verification log shows commands and exit status.
  • T3: Fresh-eyes review and address valid findings.
    • verify: external review of the diff.
    • success: no unresolved correctness finding.
    • likely_fail: stale caller or changed semantics found.
    • sneaky_fail: reviewer only assesses style; prompt requires behavior and proof review.
    • UAT: linked review artifact.
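
The T1 success criterion (one load_v_hack and one load_problems definition) can also be checked structurally rather than by text search. The sketch below counts function definitions with Python's ast module, so a match inside a comment or string is not counted. It is an illustration under assumptions: count_definitions is a hypothetical helper, and the roots passed in are assumed to be the src and scripts directories named in the verify command.

```python
import ast
from collections import Counter
from pathlib import Path

# Names taken from the T1 success criterion.
TARGETS = frozenset({"load_v_hack", "load_problems"})


def count_definitions(roots: list[Path], targets: frozenset[str] = TARGETS) -> Counter:
    """Count def statements for each target name across all .py files under roots."""
    counts: Counter = Counter()
    for root in roots:
        for path in sorted(root.rglob("*.py")):
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
            for node in ast.walk(tree):
                is_def = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
                if is_def and node.name in targets:
                    counts[node.name] += 1
    return counts
```

A result of exactly 1 for each name matches the success line; 0 means the canonical definition was deleted by mistake, and 2 or more means a duplicate survived.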

Context

  • Existing user changes in src/vgrout/data.py, src/vgrout/eval.py, plotting/results files, and docs are preserved.
  • scripts/attic/make_pairsets.py differs from scripts/pairset_build_progsets.py only in the documented invocation path.
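
The claim that the two scripts differ only in the invocation path is easy to confirm before deleting the attic copy. The sketch below uses difflib.ndiff from the standard library and keeps only the lines unique to one side; differing_lines is a hypothetical helper name, and reading the two files is left to the caller.

```python
import difflib


def differing_lines(a_text: str, b_text: str) -> list[str]:
    """Return lines present in only one text, prefixed '- ' (first) or '+ ' (second)."""
    diff = difflib.ndiff(a_text.splitlines(), b_text.splitlines())
    # ndiff also emits '  ' (common) and '? ' (intraline hint) lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

If the result contains anything other than the one usage line from each file, the scripts are not exact duplicates and the attic copy deserves a closer look before removal.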

Log

  • src/vgrout/extract_vhack_grad.py and src/vgrout/vhack.py contain duplicate load_v_hack and postprocess_v_hack implementations.
  • src/vgrout/problems.py is the older problem loader; src/vgrout/data.py is the active superset.
  • Fresh-eyes review found scripts/verify_vhack_heldout.py imported deleted PAIRS; fixed it to load an explicit pairset and made extract/verify recipes name the same pairset.

Results

  • Ownership search: one load_v_hack, one postprocess_v_hack, and one load_problems.
  • Diff: 12 active-line edits and 911 duplicate/stale lines removed before the verifier correctness fix.
  • Full smoke passed: reward matrix, eval-token gap, partition no-cheat gate, and 30-step projected training.

Verify

  • uv run python -m compileall -q src scripts: PASS
  • explicit import check for every repointed caller: PASS
  • just smoke: PASS, full log at /tmp/projected_grpo_repo_simplification_smoke.log
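
The explicit import check matters because compileall only byte-compiles files; it does not execute import statements, so an import of a deleted module or name still compiles cleanly. A minimal sketch of such a check follows. The helper name check_imports is hypothetical, and the module list would be the repointed callers, which are not enumerated here.

```python
import importlib


def check_imports(module_names: list[str]) -> dict[str, str]:
    """Import each module; return {name: error summary} for those that fail."""
    failures: dict[str, str] = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # broad on purpose: any import-time error counts
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

An empty dictionary corresponds to PASS. Note that importing a module runs its top-level code, so scripts without a main guard should be checked in a subprocess instead.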

Failure mode check

  • likely_fail: stale import after deleting duplicate modules -> explicit import check passes.
  • sneaky_fail: active pipeline bypasses consolidated loader -> smoke logs postprocess_v_hack during init and refresh.
  • scientific mismatch: verifier silently uses an unrelated built-in pairset -> recipes and verifier now name out/pairsets/prog_wide.json.

Review

  • /tmp/projected_grpo_cleanup_review.md
  • Valid finding: broken PAIRS import in held-out verifier. Fixed.
  • Rejected finding: OUT_DIR coupling is architectural taste, not a correctness regression in this scope.

TODO

  • Review whether scripts/probe_distill.py is still a maintained recipe; its load_problems(cfg.n_problems) calls currently omit required env_modes.
  • Decompose src/vgrout/train.py only with dedicated behavioral gates; it is noisy but load-bearing.
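
For the first TODO item, calls that omit env_modes can be located statically. The sketch below walks a file's syntax tree and reports calls to a named function that pass no matching keyword argument. It assumes env_modes is passed by keyword; the real signature of load_problems is not shown in this document, so a positional env_modes would be reported as missing. calls_missing_keyword is a hypothetical helper.

```python
import ast


def calls_missing_keyword(source: str, func_name: str, keyword: str) -> list[int]:
    """Return line numbers of calls to func_name that pass no keyword= argument."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Handle both bare calls, f(...), and attribute calls, module.f(...).
        name = callee.id if isinstance(callee, ast.Name) else getattr(callee, "attr", None)
        if name != func_name:
            continue
        # A **kwargs splat has arg=None and might supply the keyword, so it is not flagged.
        if not any(kw.arg == keyword or kw.arg is None for kw in node.keywords):
            missing.append(node.lineno)
    return sorted(missing)
```

Running it over the text of scripts/probe_distill.py with func_name load_problems and keyword env_modes would list the call sites to fix or retire.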

Errors

Task Error Resolution