# Research Program

**Project**: {FILL_IN: one sentence describing the research problem}

**Metric**: {FILL_IN: what we optimize, e.g. val_bpb, accuracy, F1} (lower/higher is better)

**Metric design requirements** (enforce before first real experiment):
- Train + eval runs in 5-40 minutes on your GPU
- Variance across seeds < effect size of a meaningful improvement (run baseline x3, check std)
- Deterministic given same seed (fixed data order, fixed eval split)
- If variance is too high: use more eval data, smaller model, or a proxy metric with less noise

**Hypothesis space**: {FILL_IN: what class of approaches are in scope}

Read `0_docs/problem.md` for full context.

---

## File Taxonomy

| Type | Files | Rule |
|------|-------|------|
| FROZEN | `program.md`, `eval.py`, `meta_journal.md` | Never edit without `META_MODE=1` |
| GLOBAL | `RESEARCH_JOURNAL.md`, `results.tsv` | Only commit from main; worktrees append to root copy |
| APPEND-ONLY | `*_journal.md` | New entries at top, never edit old ones |
| REGULAR | everything else | Modify freely in your worktree |

---

## Agent Algorithm

```
YOU ARE AN AGENT. Follow this loop:

read RESEARCH_JOURNAL.md    # what has been tried
read 0_docs/problem.md      # what we're solving
n_ideas = count files in 1_ideas/ (not _TEMPLATE.md)

if n_ideas < 30:
    ## IDEATE
    - Read at least one file from 0_docs/papers/ (or fetch a new paper)
    - Do at least one web search for recent approaches
    - Fetch papers: use /semantic-search or /exa-search skills
        -> save FULL paper text to 0_docs/papers/{slug}.md (not summaries -- full text)
        -> optionally add a vargdown-style argument map to 0_docs/papers/{slug}_analysis.argdown
        -> add key insight (1-3 observations with sources) to RESEARCH_JOURNAL.md
    - Brainstorm ideas.
      Quality bar:
        * Novel (not in RESEARCH_JOURNAL.md already)
        * Mechanistically grounded (not just hyperparameter tuning)
        * Not sklearn slop -- must be a real ML research contribution
        * Bold enough that it could be a paper contribution
    - For each idea:
        write 1_ideas/{YYYY-MM-DD}_{slug}.md (use _TEMPLATE.md format)
        spawn subagent to critique the idea
          (prompt: "Is this idea sound? What are the failure modes? Is the hypothesis testable?")
        append subagent feedback to the idea file
    - Append summary of new ideas + paper insights to RESEARCH_JOURNAL.md

else:
    ## IMPLEMENT
    pick the best idea from 1_ideas/ based on:
      - subagent rating (see feedback section in idea file)
      - novelty relative to RESEARCH_JOURNAL.md
      - expected impact on metric
      - implementation feasibility
    slug = idea filename slug
    run: git worktree add 5_worktrees/{slug} -b exp/{slug}
    cd 5_worktrees/{slug}
    implement the idea (modify train.py, model.py, etc.)
    do NOT modify: eval.py, program.md, meta_journal.md

    ## TEST
    spawn subagent: "Code review this against the idea doc 1_ideas/{slug}.md.
                     Does the implementation match the hypothesis? Any bugs?"
    run: just smoke    # fast sanity check
    run: just eval     # appends to results.tsv

    ## REPORT
    write 9_reports/{YYYY-MM-DD}_{slug}.md (use _TEMPLATE.md format)
    append short summary to RESEARCH_JOURNAL.md:
      - what was tried, what metric changed, what you learned
      - key observation vs inference distinction

    ## SUBMIT
    git commit -m "exp({slug}): {one-line description}"
    git push origin exp/{slug}
    if result beats best in results.tsv:
        create PR for human to merge

## QUEUING EXPERIMENTS (pueue)
Use pueue to queue experiments for the single GPU -- one at a time, no collision:

    # Queue with a label showing the question and expected resolution
    pueue add --label "Q: does X help? H: expect +0.05 metric" -- just eval --config=path

    # Check queue / status / logs
    pueue status
    pueue log {task_id}       # full stdout
    pueue follow {task_id}    # live tail

Labels encode the hypothesis being tested.
After the run, append observed vs expected to RESEARCH_JOURNAL.md.
The label shows up in `pueue status` so you can track what question each
running/queued job is answering.

    # Example: multiple experiments queued with different hypotheses
    pueue add --label "Q: rotary vs sinusoidal? H: rotary saves 0.1 bpb" -- just eval rotary
    pueue add --label "Q: flash-attn memory? H: 2x batch size same speed" -- just eval flash
    pueue add --label "Q: does layer norm placement matter? H: pre-norm better" -- just eval prenorm
```

---

## Coding Conventions

Fail fast. No defensive programming. No silent fallbacks.

```python
# shape ops: einops for clarity
from einops import rearrange, reduce
x = rearrange(x, 'b s h d -> b h s d')

# einsum for explicit contraction
import torch
out = torch.einsum('b h s d, b h d v -> b h s v', q, k)

# jaxtyping on function boundaries (docs + smoke-test checking)
from jaxtyping import Float
from torch import Tensor
def encode(x: Float[Tensor, 'b s d']) -> Float[Tensor, 'b s h']: ...

# logging: loguru not print
from loguru import logger
logger.info(f"loss={loss:.4f}")

# dataframes: polars v1
import polars as pl
df.group_by("exp").agg(pl.col("metric").mean())

# config: tyro dataclass
import tyro
from dataclasses import dataclass
@dataclass
class Config:
    lr: float = 3e-4  # {FILL_IN}
cfg = tyro.cli(Config)
```

---

## Research Epistemics

Separate observations from inferences:
- **Observation**: "val_bpb dropped from 3.2 to 2.9 on run X" (measured fact)
- **Inference**: "this suggests the attention head is learning positional structure" (interpretation)
- **Claim from paper**: "authors claim X" -- not "X is true" unless you verified it

For complex arguments, use `/vargdown` skill: verified argument maps with credences.

Trust signals: community adoption > papers citing it > open source code > author reputation.

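
A measured change only counts as a reportable observation if it is larger than seed-to-seed noise, which is what the "run baseline x3, check std" requirement is for. A minimal sketch of that check, using only the standard library; the helper names `seed_noise` and `exceeds_noise` are hypothetical and the numbers are illustrative, not real results:

```python
import statistics


def seed_noise(baseline_scores: list[float]) -> float:
    """Sample standard deviation of the metric across baseline seeds."""
    # statistics.stdev raises StatisticsError for fewer than 2 values (fail fast).
    return statistics.stdev(baseline_scores)


def exceeds_noise(delta: float, baseline_scores: list[float]) -> bool:
    """True if the metric change is larger in magnitude than the seed std."""
    # abs() because the metric may be lower-is-better or higher-is-better.
    return abs(delta) > seed_noise(baseline_scores)


# Illustrative numbers only (assumed, not measured): three baseline seeds.
baseline = [3.21, 3.19, 3.20]
noise = seed_noise(baseline)

big_change = exceeds_noise(-0.3, baseline)     # well above the seed noise
tiny_change = exceeds_noise(0.005, baseline)   # within the seed noise
```

Comparing against one standard deviation is the loosest reasonable bar; a stricter threshold (for example two standard deviations) may be more appropriate when only three seeds are available.
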

---

## Available Skills

Assume installed at `~/.claude/skills/` (from https://github.com/wassname/skills):

| Skill | Use for |
|-------|---------|
| `/semantic-search` | Search arXiv, Semantic Scholar, DBLP, OpenAlex |
| `/arxiv-fetch` | Download full paper text given arXiv ID/URL |
| `/exa-search` | Neural web search for recent approaches |
| `/vargdown` | Verified argument maps with credences for complex reasoning |
| `/gsd` | Get Shit Done: spec -> implement -> test -> review -> wrap |
| `/jaxtyping` | Runtime tensor shape/dtype checking |
| `/justfile` | Project recipes (`just smoke`, `just eval`, `just queue`) |
| `/ml_debug` | ML convergence, gradient analysis, sweep methodology |
| `/brainstorm` | Wide + deep ideation without tunnel vision |
| `/external-review` | Code/plan review via a different model |
| `pueue` | Queue GPU jobs sequentially; label each with Q/hypothesis |

Also available: bibtex MCP (search_reference, fetch), wandb MCP (query runs).

---

## Meta-Mode

Human sets `META_MODE=1` to enable editing of FROZEN files and committing to main.

Use meta-mode to:
- Revise this program.md (agent instructions)
- Update eval.py (e.g., add new metric columns)
- Reflect on the overall research process in meta_journal.md
- Exit-interview style: what worked, what didn't, what would you change?

To enter: human writes `META_MODE=1` in human_journal.md entry before asking agent.
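
The FROZEN rule could also be checked mechanically, for instance before a commit. A minimal sketch, assuming the caller has already decided whether meta-mode is active (how `META_MODE=1` is detected from the journal entry is left out) and that `frozen_violations` is a hypothetical helper, not something that exists in the repo:

```python
from pathlib import PurePosixPath

# FROZEN files as listed in the File Taxonomy table.
FROZEN = ("program.md", "eval.py", "meta_journal.md")


def frozen_violations(changed_files: list[str], meta_mode: bool) -> list[str]:
    """Return changed paths that touch FROZEN files while meta-mode is off."""
    if meta_mode:
        return []
    # Match on the file name so worktree-relative paths are caught too.
    return [p for p in changed_files if PurePosixPath(p).name in FROZEN]


# Hypothetical list of changed paths, e.g. collected from `git diff --name-only`.
changed = ["train.py", "eval.py", "5_worktrees/demo/program.md"]
violations = frozen_violations(changed, meta_mode=False)
allowed = frozen_violations(changed, meta_mode=True)
```

In keeping with the fail-fast convention, a caller would typically raise as soon as `violations` is non-empty rather than silently skipping the offending files.
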