autoresearch_template/program.md
wassname fc46d878cf init
2026-04-04 23:40:34 +08:00


Research Program

Project: {FILL_IN: one sentence describing the research problem}

Metric: {FILL_IN: what we optimize, e.g. val_bpb, accuracy, F1} (lower/higher is better)

Metric design requirements (enforce before first real experiment):

  • Train + eval runs in 5-40 minutes on your GPU
  • Variance across seeds < effect size of a meaningful improvement (run the baseline 3 times, check the std)
  • Deterministic given same seed (fixed data order, fixed eval split)
  • If variance is too high: use more eval data, smaller model, or a proxy metric with less noise
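
The seed-variance requirement can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the template: the function name `metric_is_usable`, the `min_effect` parameter, and the example scores are all hypothetical.

```python
from statistics import mean, stdev


def metric_is_usable(baseline_scores: list[float], min_effect: float) -> bool:
    """Return True when seed-to-seed noise is smaller than the effect we care about.

    baseline_scores: metric values from repeated baseline runs (one per seed).
    min_effect: smallest metric change that would count as a meaningful improvement.
    """
    if len(baseline_scores) < 3:
        # Fail fast: fewer than 3 runs gives no usable variance estimate.
        raise ValueError("need at least 3 baseline runs to estimate seed variance")
    return stdev(baseline_scores) < min_effect


# Hypothetical numbers for illustration only (not measured results).
baseline = [3.21, 3.19, 3.20]
print(f"mean={mean(baseline):.3f} std={stdev(baseline):.4f}")
print("usable:", metric_is_usable(baseline, min_effect=0.05))
```

If the check returns False, the options in the last bullet apply: more eval data, a smaller model, or a less noisy proxy metric.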

Hypothesis space: {FILL_IN: what class of approaches are in scope}

Read 0_docs/problem.md for full context.


File Taxonomy

| Type | Files | Rule |
|------|-------|------|
| FROZEN | program.md, eval.py, meta_journal.md | Never edit without META_MODE=1 |
| GLOBAL | RESEARCH_JOURNAL.md, results.tsv | Only commit from main; worktrees append to root copy |
| APPEND-ONLY | *_journal.md | New entries at top, never edit old ones |
| REGULAR | everything else | Modify freely in your worktree |
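
The FROZEN rule could be enforced mechanically, for example in a pre-commit check. The sketch below is an assumption about how that might look: `frozen_violations` is a hypothetical helper, it matches files by basename only, and it takes `meta_mode` as a plain boolean rather than assuming how META_MODE is read.

```python
from pathlib import PurePosixPath

# File names taken from the FROZEN row of the taxonomy above.
FROZEN = frozenset({"program.md", "eval.py", "meta_journal.md"})


def frozen_violations(changed_paths: list[str], meta_mode: bool) -> list[str]:
    """Return the changed paths that touch FROZEN files while meta-mode is off."""
    if meta_mode:
        return []
    # Assumption: any file with a frozen basename counts, wherever it lives.
    return [p for p in changed_paths if PurePosixPath(p).name in FROZEN]


changed = ["train.py", "eval.py", "5_worktrees/demo/model.py"]
violations = frozen_violations(changed, meta_mode=False)
if violations:
    print("refusing to commit, FROZEN files modified:", violations)
```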

Agent Algorithm

YOU ARE AN AGENT. Follow this loop:

read RESEARCH_JOURNAL.md          # what has been tried
read 0_docs/problem.md            # what we're solving

n_ideas = count files in 1_ideas/ (excluding _TEMPLATE.md)

if n_ideas < 30:
    ## IDEATE
    - Read at least one file from 0_docs/papers/ (or fetch a new paper)
    - Do at least one web search for recent approaches
    - Fetch papers: use /semantic-search or /exa-search skills
      -> save FULL paper text to 0_docs/papers/{slug}.md (not summaries -- full text)
      -> optionally add a vargdown-style argument map to 0_docs/papers/{slug}_analysis.argdown
      -> add key insight (1-3 observations with sources) to RESEARCH_JOURNAL.md
    - Brainstorm ideas. Quality bar:
        * Novel (not in RESEARCH_JOURNAL.md already)
        * Mechanistically grounded (not just hyperparameter tuning)
        * Not sklearn slop -- must be a real ML research contribution
        * Bold enough that it could be a paper contribution
    - For each idea:
        write 1_ideas/{YYYY-MM-DD}_{slug}.md  (use _TEMPLATE.md format)
        spawn subagent to critique the idea (prompt: "Is this idea sound?
          What are the failure modes? Is the hypothesis testable?")
        append subagent feedback to the idea file
    - Append summary of new ideas + paper insights to RESEARCH_JOURNAL.md

else:
    ## IMPLEMENT
    pick the best idea from 1_ideas/ based on:
        - subagent rating (see feedback section in idea file)
        - novelty relative to RESEARCH_JOURNAL.md
        - expected impact on metric
        - implementation feasibility

    slug = idea filename slug
    run: git worktree add 5_worktrees/{slug} -b exp/{slug}
    cd 5_worktrees/{slug}

    implement the idea (modify train.py, model.py, etc.)
    do NOT modify: eval.py, program.md, meta_journal.md

    ## TEST
    spawn subagent: "Code review this against the idea doc 1_ideas/{slug}.md.
      Does the implementation match the hypothesis? Any bugs?"
    run: just smoke                    # fast sanity check
    run: just eval                     # appends to results.tsv

    ## REPORT
    write 9_reports/{YYYY-MM-DD}_{slug}.md  (use _TEMPLATE.md format)
    append short summary to RESEARCH_JOURNAL.md:
        - what was tried, what metric changed, what you learned
        - which statements are observations and which are inferences

    ## SUBMIT
    git commit -m "exp({slug}): {one-line description}"
    git push origin exp/{slug}
    if result beats best in results.tsv:
        create PR for human to merge
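
The two branch decisions in the loop above (IDEATE vs IMPLEMENT, and whether a result warrants a PR) can be written as plain Python. This is a sketch under assumptions: the helper names are hypothetical, idea files are assumed to end in `.md` as in the naming pattern above, and treating the first ever result as "beating best" is a choice made here, not something the program specifies.

```python
import tempfile
from pathlib import Path

IDEA_THRESHOLD = 30  # from the loop above: ideate while n_ideas < 30


def count_ideas(ideas_dir: Path) -> int:
    """Count idea files in ideas_dir, skipping the template."""
    return sum(1 for p in ideas_dir.glob("*.md") if p.name != "_TEMPLATE.md")


def next_phase(ideas_dir: Path) -> str:
    """Pick the branch of the agent loop based on how many ideas exist."""
    return "IDEATE" if count_ideas(ideas_dir) < IDEA_THRESHOLD else "IMPLEMENT"


def beats_best(new: float, previous: list[float], lower_is_better: bool) -> bool:
    """True when `new` strictly improves on every earlier result."""
    if not previous:
        return True  # assumption: the first result is trivially the best
    best = min(previous) if lower_is_better else max(previous)
    return new < best if lower_is_better else new > best


# Demo on a throwaway directory standing in for 1_ideas/.
with tempfile.TemporaryDirectory() as tmp:
    ideas = Path(tmp)
    (ideas / "_TEMPLATE.md").write_text("template")
    (ideas / "2026-04-04_example-idea.md").write_text("hypothetical idea")
    print(count_ideas(ideas), next_phase(ideas))
```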

## QUEUING EXPERIMENTS (pueue)

Use pueue to queue experiments for the single GPU -- one at a time, no collision:

    # Queue with a label showing the question and expected resolution
    pueue add --label "Q: does X help? H: expect +0.05 metric" -- just eval --config=path

    # Check queue / status / logs
    pueue status
    pueue log {task_id}       # full stdout
    pueue follow {task_id}    # live tail

Labels encode the hypothesis being tested. After the run, append observed vs expected
to RESEARCH_JOURNAL.md. The label shows up in `pueue status` so you can track what
question each running/queued job is answering.

    # Example: multiple experiments queued with different hypotheses
    pueue add --label "Q: rotary vs sinusoidal? H: rotary saves 0.1 bpb" -- just eval rotary
    pueue add --label "Q: flash-attn memory? H: 2x batch size same speed" -- just eval flash
    pueue add --label "Q: does layer norm placement matter? H: pre-norm better" -- just eval prenorm
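
When experiments are queued from a script rather than by hand, the same labelled command can be assembled in Python. A minimal sketch: `pueue_add_argv` is a hypothetical helper, and it only builds the argument list, so nothing is enqueued unless the result is passed to `subprocess.run`.

```python
import shlex


def pueue_add_argv(question: str, hypothesis: str, command: list[str]) -> list[str]:
    """Build the argv for `pueue add`, encoding question and hypothesis in the label."""
    label = f"Q: {question} H: {hypothesis}"
    # Everything after "--" is the command pueue will run.
    return ["pueue", "add", "--label", label, "--", *command]


argv = pueue_add_argv(
    question="rotary vs sinusoidal?",
    hypothesis="rotary saves 0.1 bpb",
    command=["just", "eval", "rotary"],
)
print(shlex.join(argv))  # shell-quoted form, for copy/paste or logging
# To actually enqueue: subprocess.run(argv, check=True)
```

Building the command as a list avoids shell-quoting problems when a label contains quotes or question marks.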

Coding Conventions

Fail fast. No defensive programming. No silent fallbacks.
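
One way to read this rule, sketched with hypothetical helper names and a hypothetical config dict: prefer code that raises at the point of error over code that substitutes a default and keeps going.

```python
def get_lr_fail_fast(cfg: dict) -> float:
    # A missing key raises KeyError immediately; a malformed value raises ValueError.
    return float(cfg["lr"])


def get_lr_silent_fallback(cfg: dict) -> float:
    # Discouraged: a misspelled key silently trains with the default instead.
    return float(cfg.get("lr", 3e-4))


cfg = {"learning_rate": 1e-3}  # hypothetical config with the wrong key name

print("silent fallback used lr =", get_lr_silent_fallback(cfg))
try:
    get_lr_fail_fast(cfg)
except KeyError as err:
    print("fail fast caught missing key:", err)
```

With the fallback version the run completes with the wrong learning rate and the mistake only shows up, if at all, in the results.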

# shape ops: einops for clarity
from einops import rearrange, reduce
x = rearrange(x, 'b s h d -> b h s d')

# einsum for explicit contraction
import torch
out = torch.einsum('b h s d, b h d v -> b h s v', q, k)

# jaxtyping on function boundaries (docs + smoke-test checking)
from jaxtyping import Float
from torch import Tensor
def encode(x: Float[Tensor, 'b s d']) -> Float[Tensor, 'b s h']:
    ...

# logging: loguru not print
from loguru import logger
logger.info(f"loss={loss:.4f}")

# dataframes: polars v1
import polars as pl
df.group_by("exp").agg(pl.col("metric").mean())

# config: tyro dataclass
import tyro
from dataclasses import dataclass

@dataclass
class Config:
    lr: float = 3e-4
    # {FILL_IN}

cfg = tyro.cli(Config)

Research Epistemics

Separate observations from inferences:

  • Observation: "val_bpb dropped from 3.2 to 2.9 on run X" (measured fact)
  • Inference: "this suggests the attention head is learning positional structure" (interpretation)
  • Claim from paper: "authors claim X" -- not "X is true" unless you verified it
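
A journal entry can keep these three kinds of statement in separate fields so they are never mixed when written out. The sketch below is illustrative only: the `JournalEntry` class, the `## {date}` heading format, and the bullet labels are assumptions, not the template's actual journal format. `prepend_entry` follows the APPEND-ONLY rule that new entries go at the top.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JournalEntry:
    day: date
    observation: str        # measured fact
    inference: str          # interpretation of the observation
    paper_claim: str = ""   # what authors claim, not verified by us

    def render(self) -> str:
        lines = [
            f"## {self.day.isoformat()}",  # hypothetical heading format
            f"- Observation: {self.observation}",
            f"- Inference: {self.inference}",
        ]
        if self.paper_claim:
            lines.append(f"- Claim from paper: {self.paper_claim}")
        return "\n".join(lines) + "\n"


def prepend_entry(journal_text: str, entry: JournalEntry) -> str:
    """Put the new entry at the top; existing text is kept unchanged below it."""
    return entry.render() + "\n" + journal_text


entry = JournalEntry(
    day=date(2026, 4, 4),
    observation="val_bpb dropped from 3.2 to 2.9 on run X",
    inference="this suggests the attention head is learning positional structure",
)
print(prepend_entry("## 2026-04-03\n- Observation: baseline recorded\n", entry))
```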

For complex arguments, use /vargdown skill: verified argument maps with credences.

Trust signals: community adoption > papers citing it > open source code > author reputation.


Available Skills

Assume installed at ~/.claude/skills/ (from https://github.com/wassname/skills):

| Skill | Use for |
|-------|---------|
| /semantic-search | Search arXiv, Semantic Scholar, DBLP, OpenAlex |
| /arxiv-fetch | Download full paper text given arXiv ID/URL |
| /exa-search | Neural web search for recent approaches |
| /vargdown | Verified argument maps with credences for complex reasoning |
| /gsd | Get Shit Done: spec -> implement -> test -> review -> wrap |
| /jaxtyping | Runtime tensor shape/dtype checking |
| /justfile | Project recipes (just smoke, just eval, just queue) |
| /ml_debug | ML convergence, gradient analysis, sweep methodology |
| /brainstorm | Wide + deep ideation without tunnel vision |
| /external-review | Code/plan review via a different model |
| pueue | Queue GPU jobs sequentially; label each with Q/hypothesis |

Also available: bibtex MCP (search_reference, fetch), wandb MCP (query runs).


Meta-Mode

Human sets META_MODE=1 to enable editing of FROZEN files and committing to main.

Use meta-mode to:

  • Revise this program.md (agent instructions)
  • Update eval.py (e.g., add new metric columns)
  • Reflect on the overall research process in meta_journal.md
  • Exit-interview style: what worked, what didn't, what would you change?

To enter: the human writes META_MODE=1 in a human_journal.md entry before prompting the agent.