diff --git a/.gitignore b/.gitignore index c3d6dea..9b9e334 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,6 @@ node_modules/ dist/ *.log +.pi/ docs/reviews/raw.jsonl docs/reviews/err.txt diff --git a/README.md b/README.md index b08dbd6..e6edef7 100644 --- a/README.md +++ b/README.md @@ -1,17 +1,16 @@ # pi-goals -A [pi](https://github.com/badlogic/pi-mono) extension for plan-driven, goal-tracked work in one -`goals.md`. Set up goals (with evidence and failure modes) in plan mode, work them, and sign a goal -off only when a read-only subagent has checked the evidence. +Plan mode for agreeing on goals before any code gets written. Each goal names the subtle failure mode +that could fake a "done" and the discriminator that tells real success from it, plus subtasks and the +evidence that gets checked at sign-off. It all lives in one markdown file you can read and print. A +widget keeps the goals in front of you through compaction, a reminder nudges the agent to keep the +file current and work toward the goals on its own, and a goal is signed off only after a read-only +subagent has checked its evidence. -Successor to [pi-lgtm](https://github.com/wassname/pi-lgtm), kept deliberately small: about -[burneikis/pi-plan](https://github.com/burneikis/pi-plan) plus the additions, goals with evidence, -a sign-off check, a widget, and a reminder. - -The form guides; it does not gate. The agent edits `goals.md` with its normal Edit tool. The one -blessed tool is `CompleteGoal`, which runs the sign-off check and records the result. The reminder, -the injected plan summary, and git/widget visibility carry the process. It trusts the agent's -judgement rather than guarding it. +It guides rather than guards. Like [pi-milestones](https://github.com/Neuron-Mr-White/UniPi/tree/main/packages/milestone) +and [burneikis/pi-plan](https://github.com/burneikis/pi-plan), it leans on a form and a process to +steer the agent and trust its judgement. 
[pi-lgtm](https://github.com/wassname/pi-lgtm) was my earlier +attempt and got too complex; this one stays small and maintainable. ## Install @@ -19,7 +18,7 @@ judgement rather than guarding it. pi install npm:@wassname2/pi-goals ``` -Or run without installing: +Or run it without installing: ```bash pi -e npm:@wassname2/pi-goals @@ -28,78 +27,133 @@ pi -e npm:@wassname2/pi-goals ## Use ``` -/goals add CSV export to the report view +/goals CSV export for the report view ``` -1. Plan. The agent explores read-only and writes goals into `goals.md` (see format below). -2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel. - On Ready you choose whether to keep the current context or start fresh and compacted. -3. Work. Each turn the active goal is injected (so it survives compaction) and a reminder nudges - the agent to keep `goals.md` current and work autonomously. When a goal's `done_when` is met the - agent calls `CompleteGoal`, which runs `verify` and a read-only judge and, on accept, marks it - done and logs it. +`/goals` enters plan mode and starts a conversation; the description is an optional seed, so plain +`/goals` works too. From there: -Other commands: `/goals` (print the goals), `/goals clear` (empty `goals.md`, history kept in git), -`/goals judge ` (use a specific model for the sign-off judge; default is your current -model). +1. Plan. The agent explores read-only, asks about anything unclear, and writes the goals into + `.pi/goals.md`. +2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel. On + Ready you choose whether to keep the current context or start fresh and compacted. +3. Work. Each turn the active goal is injected so it survives compaction, and a reminder nudges the + agent to keep `goals.md` current and keep going. 
When a goal's discriminator is satisfied the agent + calls `CompleteGoal`, which runs `verify` and a read-only judge, then marks the goal done and logs it. -## goals.md format +Other commands: `/goals clear` empties `.pi/goals.md`; `/goals judge ` picks a specific +model for the sign-off judge (the default is your current model). -One file holds the objective, the goals, and a short append-only log. +## Example + +Start plan mode with an optional seed: + +``` +/goals audit the papers dir metadata and clean up empty dirs +``` + +The agent explores read-only, then drafts the goal with a subtle failure mode and the discriminator +that beats it, and stops for review: ```markdown -# Goals: ship the cache layer +## Goals -## Goal: [/] Implement cache layer - -done_when: p95 < 50ms on bench-X -verify: pytest tests/cache -q && python bench/p95.py --max-ms 50 -- [x] wire cache client -- [ ] eviction policy +1. [ ] goal: Audit steering/ metadata and remove empty dirs + - subtle failure mode: report written but counts are zero (resolver errored silently) + - discriminator: report shows the XXXX count before/after AND a non-zero rename count + - tasks: + 1. [ ] dry-run the metadata resolve + 2. [ ] remove the empty _artifacts dirs + 3. [ ] write the report + - evidence: + - +``` -failure_modes: - - cache silently bypassed (hit-rate ~0, latency ok by luck) - - bench too small to exercise eviction -evidence: - - load-test.log p95=41ms; bench/p95.py exited 0 - - cache hit-rate 0.93 in load-test.log (not bypassed) +You choose Ready. 
The agent works the subtasks, then fills `evidence` (each item an artifact plus a +short read of it) and calls `CompleteGoal`: + +```markdown + - evidence: + - > scripts/metadata_report.txt: XXXX 52 -> 4, 146 empty _artifacts removed + - > 48 files renamed; almost certain done, the silent-resolver failure mode is ruled out +``` + +A fresh read-only subagent re-checks that evidence against the repo and the discriminator, then +returns its verdict and reasoning: + +``` +Signed off "Audit steering/ metadata and remove empty dirs". Marked done in goals.md. + +--- sign-off judge --- +metadata_report.txt present; counts 52 -> 4 confirmed; rename log shows 48 renamed (not zero). +VERDICT: accept +``` + +## The goals.md format + +One project-local file, `/.pi/goals.md` (gitignored, like pi-tasks), holds the title, a context +block, the goals, and a short append-only log. A fresh `/goals` draft replaces it. + +```markdown +# ship the cache layer + +Latency target came from the SLO review; keep the existing client API. + +## Goals + +1. [/] goal: Implement cache layer + - subtle failure mode: cache silently bypassed, latency ok by luck + - discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0) + - verify: pytest tests/cache -q && python bench/p95.py --max-ms 50 + - tasks: + 1. [x] wire cache client + 2. [/] eviction policy + - evidence: + - > load-test.log: p95=41ms, hit-rate 0.93 (not bypassed) + +# Future work / out of scope + +- distributed cache ## Log - 2026-06-15 14:02 cache client wired; eviction next ``` -- A goal is a `## Goal:` header whose checkbox carries its state (`[ ]` open, `[/]` active, `[x]` - done, `[-]` cancelled), then an ``, one falsifiable `done_when:`, an optional `verify:` - shell command, `- [ ]` subtasks, an optional short `failure_modes:` pre-mortem list, and an - `evidence:` list. -- `done_when` is the test, written at planning. 
`evidence` is the proof, a `- ` list the agent fills - at completion pointing at durable artifacts; `CompleteGoal` reads it from the file. `failure_modes` - is the pre-mortem. `verify`, when present, is the deterministic first stage of the sign-off. -- The agent ticks subtasks, appends to `## Log`, and sets the header checkbox (`[/]` when it starts - a goal) as it works. Only `CompleteGoal` writes `[x]`. Multiple goals may be active. +- A goal is a numbered checkbox line beginning `goal:`; the checkbox carries its state (`[ ]` open, + `[/]` active, `[x]` done, `[-]` cancelled). Goals are matched by their text, so the number is just + for you to reference. +- The `discriminator` is the success test, written while planning: the positive observation that the + goal actually succeeded and that none of the `subtle failure mode`s could fake. It has to show + something happened (a count moved, a test exercised the path, a metric beat noise), not just that a + failure was avoided. `evidence` is the proof, filled at sign-off: + each item pairs a durable artifact (a quoted and linked log, a table, a metric) with a short read of + it, not a bare claim. `verify`, when present, is the deterministic first stage of the sign-off. +- Subtasks are any checkbox without a `goal:` prefix, under `- tasks:` (`[/]` in progress, `[-]` + cancelled). The agent ticks them, appends to `## Log`, and sets a goal `[/]` when it starts it. Only + `CompleteGoal` writes `[x]`. Several goals can be active at once. -## The sign-off check (`CompleteGoal`) +## Signing off a goal (`CompleteGoal`) -`CompleteGoal(goal_id)` is the one blessed completion path. It reads the goal's `evidence:` block -from goals.md (so the proof is git-tracked and human-reviewable before sign-off, not buried in a tool -call): +`CompleteGoal(goal)` (matched by the goal's text) is the only tool that marks a goal done; everything +else is the agent editing the file. 
It reads the goal's `evidence:` block from `.pi/goals.md`, so the +proof stays in the file where you can review it, then: -1. If the goal has a `verify:` command, it is run. A non-zero exit rejects immediately, with no model +1. If the goal has a `verify:` command, it runs. A non-zero exit rejects right away, with no model call. -2. Otherwise a read-only `pi` subprocess (the judge) inspects the `evidence:` items against the repo - and the named failure modes and returns a verdict. It re-derives from the artifacts the evidence - points at rather than trusting the claim, so the `evidence:` list should name durable artifacts - (saved logs, committed diffs, files). -3. On accept, the goal's header checkbox flips to `[x]` and a `## Log` line is written. On reject, - the goal stays open and the agent is told what is missing. +2. Otherwise a read-only `pi` subprocess (a fresh `--no-session` context, so it never sees the working + agent's transcript) inspects the `evidence:` against the repo, the `discriminator`, and the + `subtle failure mode`, and returns a verdict. It re-derives from the cited artifacts rather than + trusting the claim, so list real artifacts, not assertions. +3. On accept, the goal flips to `[x]` and a `## Log` line is written. On reject, the goal stays open + and the agent is told what is missing. Either way the judge's reasoning comes back in the result. -The judge defaults to your current model (guaranteed authorized and capable). Set a different one -with `/goals judge ` for an independent cross-family check. +The judge defaults to your current model (a fresh context, same weights). Point it at another with +`/goals judge ` for an independent cross-family check. ## Prompts -All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so the process is -easy to review end to end. +All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so you can read the +whole process top to bottom. 
## Develop @@ -112,8 +166,9 @@ npm run lint ## Not (yet) included -No autonomous re-prompt loop (an until-done-style loop judge). Autonomy comes from the reminder, not -a harness. Plan-phase model stickiness is a documented next step. +- No autonomous re-prompt loop. The reminder nudges the agent within a turn, but the turn still ends + and hands back to you; nothing auto-re-prompts until the goals are done. +- The plan and execution phases can't yet run on different, sticky models. ## License diff --git a/package.json b/package.json index 3d5aab2..d2830b5 100644 --- a/package.json +++ b/package.json @@ -1,7 +1,7 @@ { "name": "@wassname2/pi-goals", "version": "0.0.1", - "description": "One goals.md: set goals in plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.", + "description": "One .pi/goals.md: set goals in plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.", "author": "wassname", "license": "MIT", "type": "module", diff --git a/src/index.ts b/src/index.ts index 90de066..573489b 100644 --- a/src/index.ts +++ b/src/index.ts @@ -1,7 +1,9 @@ /** - * pi-goals — plan mode that sets up goals with evidence, tracked in one goals.md, signed off by a + * pi-goals — plan mode that sets up goals with evidence, tracked in one .pi/goals.md, signed off by a * read-only subagent check. A successor to pi-lgtm, kept deliberately small (≈ burneikis/pi-plan - * plus the additions: goals + failure_modes + subtasks, a sign-off check, a widget, a reminder). + * plus the additions: goals + a discriminator + a subtle failure mode + subtasks, a sign-off check, + * a widget, a reminder). A goal's success test is its discriminator: the observation that tells real + * success from the named failure mode. * * Philosophy (spec D3): the form guides, it does not gate. The agent edits goals.md with its normal * Edit tool. The one blessed tool is CompleteGoal, which runs the sign-off check and records it. 
The @@ -9,19 +11,30 @@ * judgement rather than guarding it. * * Flow: - * /goals -> plan mode: agent explores, drafts goals into goals.md (planDrafting guides) + * /goals [objective] -> plan mode (conversational): objective is an optional seed; agent explores + * read-only, asks, then drafts goals into .pi/goals.md (planDrafting guides) * agent_end -> review menu (Ready / Edit / $EDITOR / Cancel); Ready offers compaction * execution -> each turn, inject the plan summary (survives compaction) + a reminder; * agent works goals, ticks subtasks, appends ## Log, calls CompleteGoal * CompleteGoal -> optional deterministic verify, then a read-only oracle judge -> accept * flips status:done + logs; reject returns what's missing * - * All model-facing text lives in prompts.tsx, in flow order. + * The plan file lives at /.pi/goals.md (project-local, gitignored, like pi-tasks), not in the + * repo. A fresh /goals draft just replaces it (the "overwrite" staleness rule). + * + * Plan mode is read-only: the tool_call hook blocks edit/write (except goals.md itself) and mutating + * bash while drafting, so code isn't written before the goals are agreed. Read-only bash exploration + * stays open (blocklist, not allowlist). + * + * Not built (FIXME): no plan-vs-exec model switch on accept (plan-model stickiness); noted at its + * call site below. + * + * All model-facing text lives in prompts.ts, in flow order. 
*/ import { spawn, spawnSync } from "node:child_process"; -import { existsSync, readFileSync, writeFileSync } from "node:fs"; -import { basename, join } from "node:path"; +import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs"; +import { basename, join, resolve } from "node:path"; import type { ExtensionAPI, ExtensionCommandContext, ExtensionContext } from "@earendil-works/pi-coding-agent"; import { Type } from "@sinclair/typebox"; import { counts, findGoal, type Goal, type PlanDoc, parse, recordSignOff, type SignOff } from "./plan-file.js"; @@ -32,6 +45,29 @@ const PLAN_CONTEXT = "pi-goals-context"; // injected plan-mode guidance, strippe const STATUS_KEY = "pi-goals"; const WIDGET_KEY = "pi-goals-widget"; const READ_ONLY_TOOLS = ["read", "grep", "find", "ls", "bash"]; +// File mutators blocked while drafting goals (read-only plan mode, like narumiruna/pi-plan-mode), so +// code isn't written before goals are agreed. The one allowed write is goals.md itself (the +// deliverable). A read-only task (a pure search) can still be explored in plan mode by nature. +const PLAN_MODE_BLOCKED_TOOLS = ["edit", "write"]; +// bash is dual-use, so block it only when the command looks mutating; read-only exploration (cat, rg, +// git log, running a script to inspect) stays open. Blocklist, not allowlist: keep exploration +// frictionless and just stop the obvious mutators. List adapted from narumiruna/pi-plan-mode; the +// redirect rule catches `> file` / `>> file` / `>| file` but not fd-dups like `2>&1` or `>&2`. 
+const MUTATING_BASH_PATTERNS: RegExp[] = [ + /\b(rm|rmdir|mv|cp|mkdir|touch|chmod|chown|chgrp|ln|tee|truncate|dd)\b/i, + />\s*[^&\s]/, // redirect to a file (write/append/clobber), excludes 2>&1 and >&2 + /\bnpm\s+(install|uninstall|update|ci|link|publish|version)\b/i, + /\byarn\s+(add|remove|install|publish|upgrade)\b/i, + /\bpnpm\s+(add|remove|install|publish|update)\b/i, + /\bbun\s+(add|remove|install|update|publish)\b/i, + /\bpip\s+(install|uninstall)\b/i, + /\buv\s+(add|remove|sync|lock|pip\s+install)\b/i, + /\bgit\s+(add|commit|push|pull|merge|rebase|reset|checkout|switch|stash|cherry-pick|revert|tag|init|clone)\b/i, + /\b(sudo|su|kill|pkill|killall|reboot|shutdown)\b/i, + /\bsystemctl\s+(start|stop|restart|enable|disable)\b/i, + /\b(vim?|nano|emacs|code|subl)\b/i, +]; +const PLAN_REL = ".pi/goals.md"; // project-local, gitignored (pi-tasks convention); shown in the widget interface PlanState { isPlanMode: boolean; @@ -47,8 +83,14 @@ export default function piPlanExtension(pi: ExtensionAPI): void { // newSession is only on the command-handler context; agent_end's ctx lacks it. Save it from /goals. let savedCmdCtx: ExtensionCommandContext | null = null; - const planPath = (ctx: ExtensionContext) => join(ctx.cwd, "goals.md"); + const planPath = (ctx: ExtensionContext) => join(ctx.cwd, ".pi", "goals.md"); const readPlan = (ctx: ExtensionContext): string => (existsSync(planPath(ctx)) ? readFileSync(planPath(ctx), "utf-8") : ""); + // Our programmatic writes (clear, CompleteGoal). The agent creates/edits the file with its own Edit + // tool; this just makes sure .pi/ exists for our writes. 
+ const writePlan = (ctx: ExtensionContext, content: string): void => { + mkdirSync(join(ctx.cwd, ".pi"), { recursive: true }); + writeFileSync(planPath(ctx), content); + }; function persist(): void { pi.appendEntry(STATE, state); @@ -57,7 +99,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void { function updateWidget(ctx: ExtensionContext): void { if (state.isPlanMode) { ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("warning", "planning")); - ctx.ui.setWidget(WIDGET_KEY, ["pi-goals: drafting goals", "Write goals to goals.md, then review."]); + ctx.ui.setWidget(WIDGET_KEY, ["pi-goals: drafting goals", `Write goals to ${PLAN_REL}, then review.`]); return; } const doc = parse(readPlan(ctx)); @@ -68,17 +110,17 @@ export default function piPlanExtension(pi: ExtensionAPI): void { } const c = counts(doc); ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("accent", `◷ ${c.done}/${doc.goals.length} goals`)); - ctx.ui.setWidget(WIDGET_KEY, goalWidgetLines(doc)); + ctx.ui.setWidget(WIDGET_KEY, [...goalWidgetLines(doc), ctx.ui.theme.fg("muted", PLAN_REL)]); } function goalWidgetLines(doc: PlanDoc): string[] { const mark: Record = { done: "✔", active: "▸", open: "◻", cancelled: "✗" }; - const lines = [`Goals: ${doc.objective || "(untitled)"}`]; + const lines = [`Goals: ${doc.title || "(untitled)"}`]; for (const g of doc.goals) { // Show every goal with its status glyph (✔ done, ▸ active, ◻ open, ✗ cancelled) so finished // goals read as checked off rather than vanishing. Plans are small, so this stays readable. const total = g.subtasks.length; - const done = g.subtasks.filter((s) => s.done).length; + const done = g.subtasks.filter((s) => s.status === "done").length; lines.push(`${mark[g.status]} ${g.subject}${total ? 
` (${done}/${total} tasks)` : ""}`); } return lines; @@ -99,24 +141,18 @@ export default function piPlanExtension(pi: ExtensionAPI): void { setJudge(arg.slice("judge".length).trim(), ctx); return; } - // Bare `/goals` enters plan mode by prompting for the objective (the common expectation). - // If the user cancels with no objective, fall back to showing the current plan. - let objective = arg; - if (!objective) { - objective = (ctx.hasUI ? await ctx.ui.input("Plan mode — what's the objective?", "Describe what you want to plan") : undefined)?.trim() ?? ""; - if (!objective) { - showPlan(ctx); - return; - } - } - + // Conversational entry (like narumiruna/pi-plan-mode): /goals enters plan mode and starts a + // dialogue. The objective is an optional seed, not a required arg, so there's no awkward + // "type your objective" prompt; the agent explores read-only and asks before drafting. A + // fresh draft just replaces .pi/goals.md (the "overwrite" staleness rule). + const objective = arg || null; state = { ...state, isPlanMode: true, objective }; persist(); updateWidget(ctx); - pi.sendUserMessage( - `Enter plan mode for this objective: ${objective}\n\nExplore read-only, then write the plan to ${planPath(ctx)}.`, - { deliverAs: "followUp" }, - ); + const seed = objective + ? `We're in plan mode. Objective: ${objective}\n\nExplore the repo read-only and ask me anything unclear. When the objective is nailed down, draft (or replace) the goals in ${planPath(ctx)}, then stop for review.` + : `We're in plan mode. Tell me what you want to plan. Explore read-only and ask questions as needed; when the objective is clear, draft the goals in ${planPath(ctx)} and stop for review.`; + pi.sendUserMessage(seed, { deliverAs: "followUp" }); }, }); @@ -132,23 +168,14 @@ export default function piPlanExtension(pi: ExtensionAPI): void { return; } if (ctx.hasUI) { - const ok = await ctx.ui.select("Clear goals.md? 
(it stays in git history)", ["Cancel", "Clear goals.md"]); + const ok = await ctx.ui.select(`Clear ${PLAN_REL}?`, ["Cancel", "Clear goals.md"]); if (ok !== "Clear goals.md") return; } - writeFileSync(planPath(ctx), ""); + writePlan(ctx, ""); state = { ...state, isPlanMode: false, objective: null }; persist(); updateWidget(ctx); - ctx.ui.notify("Cleared goals.md.", "info"); - } - - function showPlan(ctx: ExtensionContext): void { - const content = readPlan(ctx); - if (!content.trim()) { - ctx.ui.notify("No goals yet. Use /goals to start.", "info"); - return; - } - ctx.ui.notify(content, "info"); + ctx.ui.notify(`Cleared ${PLAN_REL}.`, "info"); } // --- review loop (after the agent drafts the plan) -------------------------------------------- @@ -190,6 +217,9 @@ export default function piPlanExtension(pi: ExtensionAPI): void { } async function startExecution(ctx: ExtensionContext): Promise { + // FIXME(model-switch): the plan phase should be able to run on a sticky plan model and execution + // on a different one (see README "Not yet included"). newSession can't switch the model yet; wire + // this when pi exposes a model override on newSession. // Offer a clean execution context (D13). newSession lives only on the saved command context. let fresh = false; if (ctx.hasUI && savedCmdCtx) { @@ -203,7 +233,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void { const planFile = planPath(ctx); const planContent = readPlan(ctx); // captured now: ctx is stale after newSession below const parentSession = ctx.sessionManager.getSessionFile(); - const startMsg = `Work the goals in ${planFile}. Pick an open goal, mark it active (set its header to [/]), work its subtasks, and when its done_when is met fill the goal's evidence: block then call CompleteGoal with the goal_id. Keep goals.md current as you go.`; + const startMsg = `Work the goals in ${planFile}. 
Pick an open goal, mark it active (set its checkbox to [/]), work its subtasks, and when its discriminator is satisfied fill the goal's evidence: block then call CompleteGoal with the goal's desc. Keep goals.md current as you go.`; exitPlanMode(ctx); if (fresh && savedCmdCtx) { @@ -223,7 +253,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void { } return; } - if (doc.objective) pi.setSessionName(`Goals: ${doc.objective}`); + if (doc.title) pi.setSessionName(`Goals: ${doc.title}`); ctx.ui.notify(planContent, "info"); pi.sendUserMessage(startMsg, { deliverAs: "followUp" }); } @@ -234,29 +264,33 @@ export default function piPlanExtension(pi: ExtensionAPI): void { name: "CompleteGoal", label: "Complete goal", description: - "Sign off a goal once its done_when is met. First fill the goal's evidence: block in goals.md " + - "(a '- ' list pointing at durable artifacts: saved logs, committed diffs, files, not claims), then " + - "call this with the goal_id. Runs the goal's verify command (if any) then a read-only subagent that " + - "inspects that evidence against the repo. On accept, the goal is marked done and logged; on reject, " + - "it stays open and you get what is missing.", + "Sign off a goal once its discriminator is satisfied. First fill the goal's evidence: block in " + + "goals.md: a list where each item pairs a durable artifact with a short read of it (a quoted+linked " + + "log, a table plus how to read it, or a metric plus what it shows; quote the key lines and link the " + + "rest, not a pasted blob or a bare claim). Then call this with the goal's desc (the text after " + + "'goal:'). Runs the goal's verify command (if any) then a read-only subagent that inspects that " + + "evidence against the repo and the discriminator. On accept, the goal is marked done and logged; on " + + "reject, it stays open and you get what is missing. 
The subagent's reasoning is returned either way.", parameters: Type.Object({ - goal_id: Type.String({ description: "The goal's from goals.md" }), + goal: Type.String({ description: "The goal's desc: the exact text after 'goal:' in its line." }), }), async execute(_id, params, signal, _onUpdate, ctx) { const content = readPlan(ctx); - const goal = findGoal(parse(content), params.goal_id); - if (!goal) return text(`No goal #${params.goal_id} in goals.md.`, true); + const goal = findGoal(parse(content), params.goal); + if (!goal) return text(`No goal "${params.goal}" in goals.md. Use the exact text after "goal:".`, true); if (goal.evidence.length === 0) { - return text(`Goal #${goal.id} has no evidence: block. Add a "- " evidence list to the goal in goals.md (what shows done_when is met, and where to verify it), then call CompleteGoal.`, true); + return text(`Goal "${goal.subject}" has no evidence yet. Add an evidence: list to the goal in goals.md (artifacts + a short read showing the discriminator is satisfied), then call CompleteGoal.`, true); } // Decide the outcome (the I/O); recordSignOff applies it to the file (the pure write). // Evidence and the artifacts to inspect both come from the goal's evidence: block (single source of truth). - const outcome = await decideSignOff(goal, goal.evidence.join("\n"), goal.evidence, state.judgeModel, ctx.cwd, signal); - const res = recordSignOff(content, goal.id, stamp(), outcome); - if (res.content !== content) writeFileSync(planPath(ctx), res.content); + const { outcome, reasoning } = await decideSignOff(goal, goal.evidence.join("\n"), goal.evidence, state.judgeModel, ctx.cwd, signal); + const res = recordSignOff(content, goal.subject, stamp(), outcome); + if (res.content !== content) writePlan(ctx, res.content); updateWidget(ctx); - return text(res.message, res.isError); + // Surface the sign-off judge's actual reasoning, not just the verdict, so it's visible (was a gap). + const detail = reasoning ? 
`\n\n--- sign-off judge ---\n${reasoning}` : ""; + return text(res.message + detail, res.isError); }, }); @@ -264,6 +298,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void { pi.on("before_agent_start", async (_event, ctx) => { if (state.isPlanMode) { + // Read-only is enforced in the tool_call hook below (blocks edit/write while planning). return { message: { customType: PLAN_CONTEXT, content: `${planDrafting}\n\nWrite the plan to ${planPath(ctx)}.`, display: false } }; } const doc = parse(readPlan(ctx)); @@ -272,9 +307,13 @@ export default function piPlanExtension(pi: ExtensionAPI): void { const active = doc.goals.find((g) => g.status === "active") ?? doc.goals.find((g) => g.status === "open") ?? null; const c = counts(doc); let body = planInjection({ - objective: doc.objective, + title: doc.title, activeGoal: active - ? { subject: active.subject, done_when: active.done_when, openSubtasks: active.subtasks.filter((s) => !s.done).map((s) => s.text) } + ? { + subject: active.subject, + discriminator: active.discriminator, + openSubtasks: active.subtasks.filter((s) => s.status !== "done" && s.status !== "cancelled").map((s) => s.text), + } : null, lastLogLine: doc.log.at(-1) ?? null, counts: { done: c.done, open: c.open + c.active }, @@ -286,6 +325,25 @@ export default function piPlanExtension(pi: ExtensionAPI): void { return { message: { customType: PLAN_CONTEXT, content: body, display: false } }; }); + // Enforce read-only planning: block file mutators while in plan mode so code isn't written before + // the goals are agreed. The agent draws back to read/grep/find/ls and read-only bash to explore. + pi.on("tool_call", async (event, ctx) => { + if (!state.isPlanMode) return; + // edit/write: blocked, except writing goals.md itself (the deliverable of plan mode). 
+ if (PLAN_MODE_BLOCKED_TOOLS.includes(event.toolName)) { + const target = (event.input as { path?: string }).path; + if (target && resolve(ctx.cwd, target) === resolve(planPath(ctx))) return; + return { block: true, reason: `Plan mode is read-only: agree the goals in ${PLAN_REL} and choose Ready before writing code (${event.toolName} is blocked while planning; only ${PLAN_REL} may be written).` }; + } + // bash: blocked only when the command looks mutating; read-only exploration stays open. + if (event.toolName === "bash") { + const command = (event.input as { command?: string }).command ?? ""; + if (MUTATING_BASH_PATTERNS.some((re) => re.test(command))) { + return { block: true, reason: `Plan mode is read-only: this bash command looks like it mutates state, so it's blocked while planning. Explore read-only, agree the goals in ${PLAN_REL}, then choose Ready.\nCommand: ${command}` }; + } + } + }); + pi.on("agent_end", async (_event, ctx) => { if (!state.isPlanMode || !ctx.hasUI) return; const doc = parse(readPlan(ctx)); @@ -327,7 +385,8 @@ function stamp(): string { return new Date().toISOString().slice(0, 16).replace("T", " "); } -/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge. */ +/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge. + * Returns the outcome plus the judge's (or verify's) reasoning so CompleteGoal can show WHY. 
*/ async function decideSignOff( goal: Goal, evidence: string, @@ -335,16 +394,20 @@ async function decideSignOff( judgeModel: string | null, cwd: string, signal: AbortSignal | undefined, -): Promise { +): Promise<{ outcome: SignOff; reasoning: string }> { let verifyResult: { command: string; exitCode: number; outputTail: string } | null = null; if (goal.verify) { verifyResult = runVerify(goal.verify, cwd, signal); if (verifyResult.exitCode !== 0) { - return { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail }; + return { + outcome: { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail }, + reasoning: `verify \`${goal.verify}\` exited ${verifyResult.exitCode}:\n${verifyResult.outputTail}`, + }; } } const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal); - return verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing }; + const outcome: SignOff = verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing }; + return { outcome, reasoning: verdict.reasoning }; } /** Run the goal's verify command. It is agent-authored and trusted (single-user machine, guide-not-guard). */ @@ -372,13 +435,13 @@ async function runJudge( judgeModel: string | null, cwd: string, signal: AbortSignal | undefined, -): Promise<{ accept: boolean; missing: string }> { +): Promise<{ accept: boolean; missing: string; reasoning: string }> { const task = evidenceJudgeUser({ subject: goal.subject, - done_when: goal.done_when, + discriminator: goal.discriminator, + failure_modes: goal.failure_modes, verify: goal.verify ?? null, verifyResult, - failure_modes: goal.failure_modes, evidence, paths, }); @@ -403,5 +466,9 @@ async function runJudge( const accept = /accept/i.test(verdictLine); const missingMatch = clean.match(/missing\s*:\s*([\s\S]*)$/i); const missing = accept ? 
"" : (missingMatch?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason"); - return { accept, missing }; + // The judge's own words (inspection + verdict), so CompleteGoal can show them. The verdict is at the + // end, so keep the tail when it's long. + const trimmed = clean.trim(); + const reasoning = trimmed.length > 1800 ? `...\n${trimmed.slice(-1800)}` : trimmed; + return { accept, missing, reasoning }; } diff --git a/src/plan-file.ts b/src/plan-file.ts index 8182d33..173a05a 100644 --- a/src/plan-file.ts +++ b/src/plan-file.ts @@ -2,82 +2,104 @@ * plan-file.ts — read goals.md, and the two writes CompleteGoal needs. That is all. * * Pure module, no pi deps, so it unit-tests without a runtime. The file is the canonical store and - * the agent edits it with its normal Edit tool (create goals, tick subtasks, append log), guided by - * the format in prompts.ts and the reminder -- the form guides, it does not gate (spec D3). So this - * module does NOT render or create goals; the format's single source of truth is the planDrafting - * prompt. The only programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to - * record an accepted sign-off; both touch one line so the git diff stays readable. + * the agent edits it with its normal Edit tool (create goals, tick subtasks, fill evidence), guided + * by the format in prompts.ts and the reminder -- the form guides, it does not gate. The only + * programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to record an accepted + * sign-off; both touch one line so the diff stays readable. * - * A goal's state lives in a checkbox on its header (single source of truth, renders natively): - * [ ] open [/] active (in progress) [x] done [-] cancelled - * Only CompleteGoal writes [x]; the agent sets [/] when it starts a goal. 
+ * Format (markdown, checkbox-first, made to be skim-reviewed by a human): * - * Format: + * # * - * # Goals: + * * - * ## Goal: [ ] - * - * done_when: - * verify: - * - [ ] + * ## Goals * - * failure_modes: - * - - * evidence: - * - + * 1. [ ] goal: <- state in the checkbox: [ ] open [/] active [x] done [-] cancelled + * - discriminator: + * - subtle failure mode: + * - verify: + * - tasks: + * 1. [x] <- a subtask is any checkbox WITHOUT a "goal:" prefix + * 2. [/] + * 3. [-] <- [-] or ~~[ ]~~ both read as cancelled + * - evidence: <- empty at planning; filled at sign-off, read by CompleteGoal + * - > + * 2. [ ] goal: + * + * # Future work / out of scope * * ## Log * - + * + * A goal/subtask's state lives in its checkbox (single source of truth, renders natively). Goals are + * matched by their (the text after "goal:"); the list number is human-facing only. Only + * CompleteGoal writes a goal's [x]; the agent sets [/] when it starts one. */ export type GoalStatus = "open" | "active" | "done" | "cancelled"; export interface Subtask { text: string; - done: boolean; + status: GoalStatus; } export interface Goal { - id: string; + /** The text after "goal:" in the header line; the handle CompleteGoal matches on. */ subject: string; status: GoalStatus; - done_when: string; - verify?: string; - /** Pre-mortem: ways a "done" could be wrong. Written at planning. */ + /** Positive observation(s) that the goal succeeded AND that no failure mode could fake. The success test. Written at planning. */ + discriminator: string[]; + /** Subtle ways a "done" could be wrong (look-like-success failures). Written at planning. */ failure_modes: string[]; - /** Proof the done_when is met, pointing at durable artifacts. Written at completion; read by CompleteGoal. */ + /** Optional command that exits 0 only when the discriminator passes (the cheap deterministic gate). */ + verify?: string; + /** Proof the discriminator passed, pointing at durable artifacts. 
Written at completion; read by CompleteGoal. */ evidence: string[]; subtasks: Subtask[]; } export interface PlanDoc { - objective: string; + title: string; goals: Goal[]; /** Verbatim ## Log lines, including the leading "- ". */ log: string[]; } -// Goal header carries the state checkbox: `## Goal: [x] subject`. The checkbox is optional so a -// header written without one parses as open (group 1 undefined -> " "). -const GOAL_HEADER = /^##\s+Goal:\s*(?:\[([ xX/-])\]\s+)?(.*)$/; -const ANY_HEADER = /^#{1,6}\s/; +const TITLE = /^#\s+(.+?)\s*$/; // the first single-# H1 +const GOALS_HEADER = /^##\s+Goals\s*$/i; const LOG_HEADER = /^##\s+Log\s*$/i; -const ID_COMMENT = /^$/; -const CHECKBOX = /^- \[([ xX])\]\s+(.*)$/; +const ANY_HEADER = /^#{1,6}\s/; +// A goal: a numbered or bulleted checkbox item whose text begins "goal:". +const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i; +// A section marker bullet under a goal (the trailing colon is optional, e.g. "- tasks"). +const KEY_LINE = /^\s*[-*]\s*(discriminator|subtle failure modes?|failure_modes?|verify|tasks?|evidence)\s*:?\s*(.*)$/i; +// Any list item (numbered or bulleted); used for subtasks and for list items inside the sections. +const LIST_ITEM = /^\s*(?:\d+\.|[-*])\s+(.*)$/; +// A checkbox inside a list-item body (subtask). A leading/trailing ~~ marks it cancelled. 
+const CHECKBOX_BODY = /^(~~)?\s*\[([ xX/-])\]\s*(.*)$/; const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" }; const STATUS_TO_CHAR: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" }; +function normalizeKey(raw: string): "discriminator" | "failure_modes" | "verify" | "tasks" | "evidence" { + const k = raw.toLowerCase(); + if (k.startsWith("discriminator")) return "discriminator"; + if (k.startsWith("verify")) return "verify"; + if (k.startsWith("task")) return "tasks"; + if (k.startsWith("evidence")) return "evidence"; + return "failure_modes"; // "subtle failure mode(s)" / "failure_mode(s)" +} + export function parse(text: string): PlanDoc { const lines = text.split("\n"); - let objective = ""; + let title = ""; const goals: Goal[] = []; const log: string[] = []; let cur: Goal | null = null; - // While inside a `failure_modes:`/`evidence:` block, points at the list the "- " items append to. - let curList: string[] | null = null; + let curList: string[] | null = null; // the discriminator/failure_modes/evidence list "- " items append to + let inGoals = false; let inLog = false; const flush = () => { @@ -87,30 +109,27 @@ export function parse(text: string): PlanDoc { }; for (const line of lines) { - const objMatch = /^#\s+Goals:\s*(.*)$/.exec(line); - if (objMatch) { - objective = objMatch[1].trim(); + const tM = TITLE.exec(line); + if (tM && !title && !GOALS_HEADER.test(line) && !LOG_HEADER.test(line)) { + title = tM[1].trim(); continue; } - - const goalMatch = GOAL_HEADER.exec(line); - if (goalMatch) { + if (GOALS_HEADER.test(line)) { flush(); + inGoals = true; inLog = false; - const status = CHAR_TO_STATUS[(goalMatch[1] ?? " ").toLowerCase()] ?? 
"open"; - cur = { id: "", subject: goalMatch[2].trim(), status, done_when: "", failure_modes: [], evidence: [], subtasks: [] }; continue; } - if (LOG_HEADER.test(line)) { flush(); + inGoals = false; inLog = true; continue; } - - // Any other header ends the current goal / log section. + // Any other header (e.g. "# Future work") ends the goals / log section. if (ANY_HEADER.test(line)) { flush(); + inGoals = false; inLog = false; continue; } @@ -119,50 +138,70 @@ export function parse(text: string): PlanDoc { if (/^\s*-\s+/.test(line)) log.push(line); continue; } + if (!inGoals) continue; // title + context prose between the title and ## Goals + const goalM = GOAL_ITEM.exec(line); + if (goalM) { + flush(); + cur = { + subject: goalM[2].trim(), + status: CHAR_TO_STATUS[goalM[1].toLowerCase()] ?? "open", + discriminator: [], + failure_modes: [], + evidence: [], + subtasks: [], + }; + continue; + } if (!cur) continue; - const idMatch = ID_COMMENT.exec(line.trim()); - if (idMatch) { - cur.id = idMatch[1]; + const keyM = KEY_LINE.exec(line); + if (keyM) { + const key = normalizeKey(keyM[1]); + const inlineVal = keyM[2].trim(); + if (key === "verify") { + cur.verify = inlineVal || undefined; + curList = null; + } else if (key === "tasks") { + curList = null; // subtasks are identified by being a checkbox; this marker is cosmetic + } else { + curList = cur[key]; // discriminator | failure_modes | evidence + if (inlineVal) curList.push(inlineVal); + } continue; } - // A checkbox (column 0) is a subtask; checked first so it is never read as a list item. 
- const checkbox = CHECKBOX.exec(line); - if (checkbox) { - curList = null; - cur.subtasks.push({ done: checkbox[1].toLowerCase() === "x", text: checkbox[2].trim() }); - continue; - } - - const kv = /^(done_when|verify|failure_modes|evidence)\s*:\s*(.*)$/.exec(line); - if (kv) { - const [, key, value] = kv; - if (key === "done_when") cur.done_when = value.trim(); - else if (key === "verify") cur.verify = value.trim() || undefined; - // failure_modes/evidence open a "- " block; done_when/verify close any open one. - curList = key === "failure_modes" ? cur.failure_modes : key === "evidence" ? cur.evidence : null; - continue; - } - - // Indented "- " items under failure_modes:/evidence: (a column-0 checkbox already returned above). - if (curList) { - const item = /^\s*-\s+(.*)$/.exec(line); - if (item) { - curList.push(item[1].trim()); + const listM = LIST_ITEM.exec(line); + if (listM) { + const body = listM[1]; + const cb = CHECKBOX_BODY.exec(body); + if (cb) { + // A checkbox without a "goal:" prefix is a subtask of the current goal. + const cancelled = cb[1] === "~~" || body.includes("~~"); + const status = cancelled ? "cancelled" : (CHAR_TO_STATUS[cb[2].toLowerCase()] ?? "open"); + cur.subtasks.push({ text: cb[3].replace(/~~/g, "").trim(), status }); + curList = null; continue; } - if (line.trim() !== "") curList = null; + // A plain "- " / "> " item belongs to the current section (discriminator/failure/evidence). + if (curList) curList.push(body.trim()); + continue; + } + + // A non-empty, non-"- " line continues the current item, so multi-line evidence (a block quote + // of a log, a table, an interpretation line) stays attached to its item. Blank lines are skipped. 
+ if (curList && line.trim() !== "" && curList.length > 0) { + curList[curList.length - 1] += `\n${line.trim()}`; } } flush(); - return { objective, goals, log }; + return { title, goals, log }; } -export function findGoal(doc: PlanDoc, id: string): Goal | undefined { - return doc.goals.find((g) => g.id === id); +export function findGoal(doc: PlanDoc, subject: string): Goal | undefined { + const want = subject.trim(); + return doc.goals.find((g) => g.subject === want); } export function counts(doc: PlanDoc): { done: number; open: number; active: number } { @@ -175,21 +214,18 @@ export function counts(doc: PlanDoc): { done: number; open: number; active: numb return c; } -/** Flip a goal's header checkbox in place (the one write CompleteGoal needs). Normalizes a header that - * lacks a checkbox by inserting one. */ -export function setGoalStatus(text: string, id: string, status: GoalStatus): string { +/** Flip a goal's checkbox in place, matched by its subject (the one write CompleteGoal needs). */ +export function setGoalStatus(text: string, subject: string, status: GoalStatus): string { const lines = text.split("\n"); - const idIdx = lines.findIndex((l) => ID_COMMENT.exec(l.trim())?.[1] === id); - if (idIdx === -1) throw new Error(`Goal #${id} not found`); - // The header sits just above the id comment; scan upward for it. 
- for (let i = idIdx; i >= 0; i--) { - const m = GOAL_HEADER.exec(lines[i]); - if (m) { - lines[i] = `## Goal: [${STATUS_TO_CHAR[status]}] ${m[2].trim()}`; + const want = subject.trim(); + for (let i = 0; i < lines.length; i++) { + const m = GOAL_ITEM.exec(lines[i]); + if (m && m[2].trim() === want) { + lines[i] = lines[i].replace(/\[[ xX/-]\]/, `[${STATUS_TO_CHAR[status]}]`); return lines.join("\n"); } } - throw new Error(`Goal #${id} has no ## Goal: header`); + throw new Error(`Goal "${subject}" not found`); } /** @@ -201,28 +237,28 @@ export type SignOff = | { kind: "rejected"; missing: string } | { kind: "accepted" }; -/** Apply a sign-off outcome to goals.md text: accept flips the header checkbox to [x] + logs; reject only logs. Pure. */ +/** Apply a sign-off outcome to goals.md text: accept flips the goal checkbox to [x] + logs; reject only logs. Pure. */ export function recordSignOff( text: string, - goalId: string, + subject: string, when: string, outcome: SignOff, ): { content: string; message: string; isError: boolean } { - const goal = findGoal(parse(text), goalId); - if (!goal) return { content: text, message: `No goal #${goalId} in goals.md.`, isError: true }; + const goal = findGoal(parse(text), subject); + if (!goal) return { content: text, message: `No goal "${subject}" in goals.md.`, isError: true }; if (outcome.kind === "verify_failed") { - const content = appendLog(text, `${when} reject #${goalId}: verify exit ${outcome.exitCode}`); + const content = appendLog(text, `${when} reject "${subject}": verify exit ${outcome.exitCode}`); return { content, message: `Sign-off rejected: verify failed (exit ${outcome.exitCode}).\n${outcome.outputTail}`, isError: true }; } if (outcome.kind === "rejected") { const oneLine = outcome.missing.replace(/\s+/g, " ").trim().slice(0, 200); - const content = appendLog(text, `${when} reject #${goalId}: ${oneLine}`); + const content = appendLog(text, `${when} reject "${subject}": ${oneLine}`); return { content, message: 
`Sign-off rejected. Missing:\n${outcome.missing}`, isError: true }; } - const flipped = setGoalStatus(text, goalId, "done"); - const content = appendLog(flipped, `${when} signed off #${goalId}: ${goal.subject} (oracle accept)`); - return { content, message: `Signed off #${goalId}: ${goal.subject}. Marked done in goals.md.`, isError: false }; + const flipped = setGoalStatus(text, subject, "done"); + const content = appendLog(flipped, `${when} signed off "${subject}" (judge accept)`); + return { content, message: `Signed off "${subject}". Marked done in goals.md.`, isError: false }; } /** Append one verbatim line to ## Log (creating the section if absent). The other CompleteGoal write. */ diff --git a/src/prompts.ts b/src/prompts.ts index 075ff3a..fac1cf0 100644 --- a/src/prompts.ts +++ b/src/prompts.ts @@ -8,7 +8,7 @@ * trapping it. Bypasses stay visible in the git diff and the widget. * * Flow: - * SETUP (plan mode) 1. planDrafting — strong/sticky model drafts goals + * SETUP (plan mode) 1. planDrafting — drafts goals (read-only phase) * EXEC, each turn start 2. planInjection — "here is your plan, where you are" * EXEC, periodic 3. reminder — the typed nudge that drives upkeep + autonomy * EXEC, loop continue 4. continuation — keep going toward the active goal @@ -22,61 +22,82 @@ * NOT YET WIRED: 4 continuation and 5 loopJudge define the autonomous re-prompt loop, which is * intentionally not built in v1 (an until-done-style loop was judged too complex). They stay here so * the full intended flow is reviewable; wire them if/when the loop is added. + * + * The goal's test is the DISCRIMINATOR: the concrete observation that tells real success from the + * named subtle failure mode. It replaces a vague "done_when". Evidence is empty at planning and + * filled at sign-off (you don't always know the exact artifacts up front; the judge checks them then). */ /* ───────────────────────────────────────────────────────────────────────── * 1. 
planDrafting — SETUP, plan mode * - * System guidance for the plan-phase agent. Runs on the plan model (may differ - * from the execution model; the choice is sticky — see oracle.json-style config). - * This phase is read-only: explore, then draft goals into goals.md. No code yet. - * The field requirements here are the whole "elicitation" — get them agreed up - * front, because the human reviews this output before any execution. + * System guidance for the plan-phase agent. This phase is read-only (edit/write + * and mutating bash are blocked by a tool hook): explore, then draft goals into + * goals.md. The fields here are the whole "elicitation"; the human reviews this + * output before any execution. * ──────────────────────────────────────────────────────────────────────── */ export const planDrafting = `\ -You are in plan mode. Explore the repository read-only, then draft goals into goals.md. -Do not write or run code in this phase. Produce a plan the human will review and approve. +You are in plan mode. The objective may arrive through conversation, not as one up-front command. +Explore the repository read-only first, then ask: resolve discoverable facts by looking them up, and +only ask the human when the answer is a genuine intent or preference choice that exploration can't +settle. Don't write goals that branch on something you could just check. Do not write or run code in +this phase (edit and write are blocked, and so is mutating bash). If the ask is itself read-only +(e.g. research, a search, a report), explore enough to scope it, but leave the actual deliverable for +after the human approves the plan. When the objective is clear, draft goals into goals.md and stop +for review. Produce a plan the human will review and approve. Right-size it, don't force structure that isn't there: -- Default to ONE goal. Add another only when it's a genuinely separate checkpoint you'd want - signed off on its own (its own done_when that can pass or fail independently). 
A long list of - near-identical goals should be one goal with subtasks. Most objectives are 1-2 goals. -- Subtasks are the steps inside a goal. Add them when a goal has 3+ distinct steps; skip them for - a single-action goal. Don't pad with trivial steps. -- Don't invent phases to look thorough. When in doubt, merge. +- Default to ONE goal. Add another only when it's a genuinely separate checkpoint you'd want signed + off on its own (it can pass or fail independently). Most objectives are 1-2 goals. +- Subtasks are the steps inside a goal. Add them when a goal has 3+ distinct steps; skip them for a + single-action goal. Don't pad with trivial steps. +- Don't invent goals to look thorough. When in doubt, merge. -Write the whole file in this shape: +Write the whole file in this shape (markdown checkboxes, made to be skim-reviewed): -# Goals: +# -## Goal: [ ] - -done_when: -verify: -- [ ] -- [ ] + -failure_modes: - - -evidence: - - +## Goals -Keep it lean: -- The goal's state is the checkbox in its header: [ ] open, [/] active, [x] done, [-] cancelled. - Leave it [ ] at planning. Every goal needs its line; CompleteGoal finds goals by it. -- The subtask checklist comes right under the goal; failure_modes and the (empty) evidence block - sit at the end, after a blank line. Don't let the dash-lists run together. -- evidence stays empty at planning. You fill it when the goal is actually done, just before calling - CompleteGoal, with a "- " list pointing at real artifacts (files, saved logs, committed diffs). -- done_when is ONE concrete, checkable condition, not a paragraph, no "if wrong" clause. - The symptom of failure goes in failure_modes, not here. -- done_when names a real artifact: a file, a test result, a committed diff, a program's output. - Never write it about goals.md's own checkbox or ## Log: CompleteGoal writes those when it accepts, - so a done_when about them is circular and the sign-off can never pass. 
-- failure_modes: 0-2 terse items, only the non-obvious ways a "done" could be wrong (a - pre-mortem). If you add a verify command, one mode can be "verify passes on a gamed file". -- subtasks: a short checklist of the real steps; omit them if the goal is a single action. -- Prefer a verify command when success is a test/build/threshold. A green check beats prose. +1. [ ] goal: + - subtle failure mode: + - discriminator: + - verify: + - tasks: + 1. [ ] + 2. [ ] + - evidence: + - +2. [ ] goal: <...> + +# Future work / out of scope + +- + +## Log + +Keep it lean and legible: +- A goal is a checkbox line beginning "goal:"; its state is the checkbox ([ ] open, [/] active, [x] + done, [-] cancelled). Leave goals [ ] at planning. The number is just for the human to reference. +- subtle failure mode + discriminator are the heart of this. List the ways a "done" could look + achieved but not be (empty/zero-count output, a silently-errored step, a gamed test, a flat/no-op + result that dodged every trap and still showed nothing; these are examples, find the ones that fit). +- The discriminator is the POSITIVE observation that the goal actually succeeded AND that none of + those failure modes could have produced. It must show success happened -- the count moved the right + way, the test really exercised the path, the metric beat noise -- not merely that a failure was + ruled out: avoiding every failure mode is necessary, not sufficient. Name the success signal first, + then check it isn't something a failure mode could fake. Keep it terse. +- The discriminator is the success test, written now, in place of a vague "done": make it a concrete, + checkable observation about a real artifact (a file, a test result, a committed diff, a metric), not + about goals.md's own checkbox. +- subtasks: any checkbox WITHOUT a "goal:" prefix, under "- tasks:". Use [/] for in progress and [-] + for cancelled/impossible. 
+- verify: prefer one when the discriminator is a test, build, threshold, or metric: a green check or + a printed number beats prose. Omit it otherwise. +- evidence stays empty at planning. You don't always know the exact artifacts up front, and that's + fine: you fill evidence at sign-off, and a fresh read-only judge checks it then. When the goals are drafted, present them and stop for review. Do not begin execution.`; @@ -85,25 +106,26 @@ When the goals are drafted, present them and stop for review. Do not begin execu * * A late user-role message, NOT a system-prompt mutation (keeps the prefix cache * valid). Built from the parsed plan. MUST be byte-identical when nothing changed: - * fixed field order, no volatile timestamps in the body. Pass only the active - * goal + its open subtasks + the last log line — not the whole file. + * fixed field order, no volatile timestamps. Pass only the active goal + its open + * subtasks + the last log line, not the whole file. * ──────────────────────────────────────────────────────────────────────── */ export function planInjection(p: { - objective: string; - activeGoal: { subject: string; done_when: string; openSubtasks: string[] } | null; + title: string; + activeGoal: { subject: string; discriminator: string[]; openSubtasks: string[] } | null; lastLogLine: string | null; counts: { done: number; open: number }; }): string { if (!p.activeGoal) { - return `Goals (goals.md): ${p.objective}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal (set its header to [/]) or run /goals.`; + return `Goals (goals.md): ${p.title}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal (set its checkbox to [/]) or run /goals.`; } const subtasks = p.activeGoal.openSubtasks.length ? p.activeGoal.openSubtasks.map((s) => ` - [ ] ${s}`).join("\n") : " (no open subtasks)"; + const disc = p.activeGoal.discriminator.length ? 
p.activeGoal.discriminator.join("; ") : "(none set)"; return `\ -Goals (goals.md): ${p.objective} +Goals (goals.md): ${p.title} Active goal: ${p.activeGoal.subject} -done_when: ${p.activeGoal.done_when} +discriminator (the success test): ${disc} Open subtasks: ${subtasks} Last log: ${p.lastLogLine ?? "(none yet)"} @@ -114,20 +136,20 @@ Progress: ${p.counts.done} done, ${p.counts.open} open.`; * 3. reminder — EXEC, periodic system-reminder * * The typed nudge. This is both the housekeeping and the autonomy engine — it is - * what makes the process get followed without a hard gate. Fires after N - * file-modifying turns since the last goals.md update while a goal is active. - * Keep the wording stable so it doesn't thrash the cache. + * what makes the process get followed without a hard gate. Fires after a turn that + * left goals.md untouched while a goal is active. Keep the wording stable so it + * doesn't thrash the cache. * ──────────────────────────────────────────────────────────────────────── */ export const reminder = `\ Keep goals.md current as you work: -- tasks: tick the subtasks you've finished; add any new ones you've discovered. +- tasks: tick the subtasks you've finished ([/] for in progress); add any you've discovered. - log: append ONE short line to ## Log (append, don't rewrite earlier lines). -- goal: when the active goal's done_when is met, fill its evidence: block in goals.md (a "- " list - pointing at durable artifacts), then call CompleteGoal with the goal_id. Don't tick the goal's - header [x] by hand; CompleteGoal reads the evidence, runs the check, and writes [x]. -- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely - blocked; if blocked, say what's blocking and why. +- goal: when the active goal's discriminator is satisfied, fill its evidence: block in goals.md (a + list pointing at durable artifacts), then call CompleteGoal with the goal's desc. 
Don't tick the + goal [x] by hand; CompleteGoal reads the evidence, runs the check, and writes [x]. +- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely blocked; + if blocked, say what's blocking it. `; /* ───────────────────────────────────────────────────────────────────────── @@ -137,9 +159,9 @@ Keep goals.md current as you work: * continue. Does not mutate the system prompt, so the cache holds. * ──────────────────────────────────────────────────────────────────────── */ export const continuation = `\ -Continue toward the active goal in goals.md. If it now meets its done_when, fill the goal's -evidence: block (durable artifacts: saved logs, committed diffs, files, not just claims) and then -call CompleteGoal with the goal_id. If you're blocked, state what's blocking it.`; +Continue toward the active goal in goals.md. If its discriminator is now satisfied, fill the goal's +evidence: block (durable artifacts, e.g. saved logs, committed diffs, files, not just claims) and +then call CompleteGoal with the goal's desc. If you're blocked, state what's blocking it.`; /* ───────────────────────────────────────────────────────────────────────── * 5. loopJudge — EXEC, runs after each turn to decide continue / pause @@ -154,12 +176,12 @@ You decide whether an autonomous coding agent should keep working or pause for t Be conservative: only pause when the work is plainly finished or plainly blocked. When in doubt, continue. You are not verifying correctness; a later read-only judge does that. Reply with ONLY a JSON object, no other text: {"done": boolean, "reason": ""}. 
-Set done=true only if the agent's last message shows the active goal's done_when is met, or -the agent says it is blocked and needs the human.`; +Set done=true only if the agent's last message shows the active goal's discriminator is satisfied, +or the agent says it is blocked and needs the human.`; -export function loopJudgeUser(p: { activeGoalDoneWhen: string; lastResponse: string }): string { +export function loopJudgeUser(p: { discriminator: string; lastResponse: string }): string { return `\ -Active goal done_when: ${p.activeGoalDoneWhen} +Active goal discriminator (the success test): ${p.discriminator} Agent's last message: """ @@ -172,22 +194,26 @@ ${p.lastResponse} /* ───────────────────────────────────────────────────────────────────────── * 6. evidenceJudge — SIGN-OFF, the one rigorous check * - * Runs inside CompleteGoal, on the read-only oracle subprocess (fresh context, - * strongest reasoning on the chosen provider; override to a different vendor for - * high-stakes goals). It re-derives from the repo rather than trusting the - * agent's transcription, and it judges whether a verify command actually tests - * the criterion or could pass while a named failure mode holds (gaming). + * Runs inside CompleteGoal, on a read-only pi subprocess (fresh context via + * --no-session, so it never sees the working agent's transcript; override to a + * different vendor for an independent cross-family check). It re-derives from the + * repo rather than trusting the agent's transcription, and judges whether the + * evidence satisfies the discriminator and rules out the named failure mode. * * The transport gives it read/grep/find/ls. The prompt below imposes the verdict - * contract — the oracle returns prose by default, so parse the VERDICT line. + * contract — the subprocess returns prose by default, so parse the VERDICT line. 
* ──────────────────────────────────────────────────────────────────────── */ export const evidenceJudgeSystem = `\ You are a read-only reviewer signing off a coding goal. Do not trust claims; verify. Use read/grep/find/ls to inspect the repository and the cited artifacts yourself. Re-read the files, logs, and diffs the evidence points to; if something it asserts isn't on disk, you can't -confirm it. If a verify command was run, judge whether it genuinely tests the criterion or -could pass while one of the listed failure modes still holds; a tautological or skipped test -is a reject. Check each failure mode is actually ruled out, not just unmentioned. +confirm it. Judge whether the evidence shows the goal POSITIVELY succeeded -- the discriminator's +success signal is actually present, not just that the failure modes were dodged. Avoiding every +failure mode is necessary but not sufficient: a run can rule out each trap and still have produced +nothing, so reject "no problems found" that lacks the positive result. Then check the named subtle +failure modes are genuinely ruled out, not just unmentioned. If a verify command was run, +judge whether it really tests the discriminator or could pass while the failure mode still holds; a +tautological or skipped test is a reject. Finish with exactly these two lines and nothing after: VERDICT: accept | reject @@ -195,10 +221,10 @@ missing: ` - ${f}`).join("\n")} +discriminator (must be satisfied): +${p.discriminator.map((d) => ` - ${d}`).join("\n") || " (none stated, note this)"} +subtle failure modes (must be ruled out): +${p.failure_modes.map((f) => ` - ${f}`).join("\n") || " (none stated)"} ${verifyBlock} @@ -219,5 +246,5 @@ ${p.evidence} Artifacts it points to (inspect these): ${p.paths.map((x) => ` - ${x}`).join("\n") || " (none listed, note this)"} -Verify the goal against its done_when. Then give your VERDICT.`; +Verify the evidence satisfies the discriminator and rules out the failure modes. 
Then give your VERDICT.`; } diff --git a/test/plan-file.test.ts b/test/plan-file.test.ts index d9c1daf..01e62d4 100644 --- a/test/plan-file.test.ts +++ b/test/plan-file.test.ts @@ -1,27 +1,30 @@ import { describe, expect, it } from "vitest"; import { appendLog, counts, findGoal, parse, recordSignOff, setGoalStatus } from "../src/plan-file.js"; -const SAMPLE = `# Goals: ship the cache layer +const SAMPLE = `# papers audit -## Goal: [/] Implement cache layer - -done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log -verify: pytest tests/cache -q -failure_modes: - - cache silently bypassed (hit-rate ~0, latency ok by luck) - - bench too small to exercise eviction -- [x] wire cache client -- [ ] eviction policy -- [ ] load test -evidence: - - load-test.log shows p95=41ms - - hit-rate 0.93 in load-test.log +Clean up steering/ metadata and kill empty dirs. Keep it read-only until I approve. -## Goal: [ ] Document the API - -done_when: every public fn has a docstring; else sphinx warns -failure_modes: - - docstrings exist but are stale +## Goals + +1. [/] goal: Implement cache layer + - discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0) + - subtle failure mode: cache silently bypassed, latency ok by luck + - verify: pytest tests/cache -q + - tasks: + 1. [x] wire cache client + 2. [/] eviction policy + 3. ~~[ ]~~ distributed cache, out of scope + - evidence: + - > load-test.log: p95=41ms + - > hit-rate 0.93 (not bypassed) +2. 
[ ] goal: Document the API + - discriminator: every public fn has a docstring; sphinx warns on none + - subtle failure mode: docstrings exist but are stale + +# Future work / out of scope + +- distributed cache ## Log - 2026-06-15 14:02 cache client wired; eviction next @@ -49,92 +52,74 @@ function lineDelta(a: string, b: string): { added: number; removed: number } { describe("parse", () => { const doc = parse(SAMPLE); - it("reads the objective and both goals", () => { - expect(doc.objective).toBe("ship the cache layer"); - expect(doc.goals.map((g) => g.id)).toEqual(["cache-layer-1", "document-the-api-1"]); + it("reads the title and both goals (matched by subject)", () => { + expect(doc.title).toBe("papers audit"); + expect(doc.goals.map((g) => g.subject)).toEqual(["Implement cache layer", "Document the API"]); }); - it("reads goal fields, with status from the header checkbox", () => { - const g = findGoal(doc, "cache-layer-1"); - expect(g?.subject).toBe("Implement cache layer"); - expect(g?.status).toBe("active"); // from the [/] in the header - expect(g?.done_when).toBe("p95 < 50ms on bench-X. 
If wrong: timeouts in load-test.log"); + it("reads goal status from the checkbox", () => { + expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // [/] + expect(findGoal(doc, "Document the API")?.status).toBe("open"); // [ ] + }); + + it("reads discriminator, subtle failure mode, and verify as separate fields", () => { + const g = findGoal(doc, "Implement cache layer"); + expect(g?.discriminator).toEqual(["hit-rate > 0.8 in load-test.log (a bypass reads ~0)"]); + expect(g?.failure_modes).toEqual(["cache silently bypassed, latency ok by luck"]); expect(g?.verify).toBe("pytest tests/cache -q"); - expect(findGoal(doc, "document-the-api-1")?.status).toBe("open"); // from [ ] }); - it("separates failure_modes from subtasks", () => { - const g = findGoal(doc, "cache-layer-1"); - expect(g?.failure_modes).toHaveLength(2); - expect(g?.failure_modes[0]).toContain("cache silently bypassed"); + it("reads subtasks with their checkbox state, strikethrough as cancelled", () => { + const g = findGoal(doc, "Implement cache layer"); expect(g?.subtasks).toEqual([ - { text: "wire cache client", done: true }, - { text: "eviction policy", done: false }, - { text: "load test", done: false }, + { text: "wire cache client", status: "done" }, + { text: "eviction policy", status: "active" }, + { text: "distributed cache, out of scope", status: "cancelled" }, ]); }); - it("reads the evidence block, separate from failure_modes and subtasks", () => { - const g = findGoal(doc, "cache-layer-1"); - expect(g?.evidence).toEqual(["load-test.log shows p95=41ms", "hit-rate 0.93 in load-test.log"]); - expect(g?.failure_modes).toHaveLength(2); // unchanged by the evidence block that follows the subtasks - const g2 = findGoal(doc, "document-the-api-1"); - expect(g2?.evidence).toEqual([]); // a goal with no evidence block parses to [] + it("reads the evidence block separate from the other lists", () => { + const g = findGoal(doc, "Implement cache layer"); + 
expect(g?.evidence).toEqual(["> load-test.log: p95=41ms", "> hit-rate 0.93 (not bypassed)"]); + expect(findGoal(doc, "Document the API")?.evidence).toEqual([]); // a goal with no evidence parses to [] + }); + + it("keeps a multi-line evidence item together (quote + interpretation)", () => { + const doc2 = parse( + `# x\n\n## Goals\n\n1. [ ] goal: G\n - discriminator: report has non-zero counts\n - evidence:\n - > report.txt: counts 52 -> 4\n remaining 4 = index + 3 notes\n almost certain the discriminator passes\n - > second item, single line\n`, + ); + expect(findGoal(doc2, "G")?.evidence).toEqual([ + "> report.txt: counts 52 -> 4\nremaining 4 = index + 3 notes\nalmost certain the discriminator passes", + "> second item, single line", + ]); }); it("reads the log verbatim and counts by status", () => { expect(doc.log).toEqual(["- 2026-06-15 14:02 cache client wired; eviction next"]); expect(counts(doc)).toEqual({ done: 0, open: 1, active: 1 }); }); -}); -describe("failure_modes vs subtask disambiguation", () => { - it("a column-0 checkbox right after failure_modes: is a SUBTASK", () => { - const doc = parse( - `# Goals: x\n\n## Goal: [ ] G\n\ndone_when: z\nfailure_modes:\n- [ ] first subtask\n- [x] second subtask\n`, - ); - const g = findGoal(doc, "g-1"); - expect(g?.failure_modes).toEqual([]); - expect(g?.subtasks).toEqual([ - { text: "first subtask", done: false }, - { text: "second subtask", done: true }, - ]); - }); - - it("an indented checkbox-shaped item inside failure_modes is a FAILURE MODE", () => { - const doc = parse( - `# Goals: x\n\n## Goal: [ ] G\n\ndone_when: z\nfailure_modes:\n - [ ] prose that looks like a checkbox\n- [ ] real subtask\n`, - ); - const g = findGoal(doc, "g-2"); - expect(g?.failure_modes).toEqual(["[ ] prose that looks like a checkbox"]); - expect(g?.subtasks).toEqual([{ text: "real subtask", done: false }]); - }); - - it("a goal with no failure_modes keeps its subtasks", () => { - const doc = parse(`# Goals: x\n\n## Goal: [ ] 
G\n\ndone_when: z\n- [ ] only subtask\n`); - const g = findGoal(doc, "g-3"); - expect(g?.failure_modes).toEqual([]); - expect(g?.subtasks).toEqual([{ text: "only subtask", done: false }]); + it("ignores the Future work section, does not read it as goals or log", () => { + expect(doc.goals).toHaveLength(2); + expect(doc.log).toHaveLength(1); }); }); describe("the two CompleteGoal writes (minimal diff)", () => { it("setGoalStatus replaces exactly one line, scoped to the right goal", () => { - const next = setGoalStatus(SAMPLE, "cache-layer-1", "done"); + const next = setGoalStatus(SAMPLE, "Implement cache layer", "done"); expect(lineDelta(SAMPLE, next)).toEqual({ added: 1, removed: 1 }); - expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("done"); - expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("open"); // untouched + expect(findGoal(parse(next), "Implement cache layer")?.status).toBe("done"); + expect(findGoal(parse(next), "Document the API")?.status).toBe("open"); // untouched }); - it("setGoalStatus targets the second goal without touching the first", () => { - const next = setGoalStatus(SAMPLE, "document-the-api-1", "active"); - expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("active"); - expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("active"); + it("setGoalStatus keeps the number and goal: prefix, flips only the checkbox", () => { + expect(setGoalStatus(SAMPLE, "Implement cache layer", "done")).toContain("1. [x] goal: Implement cache layer"); + expect(setGoalStatus(SAMPLE, "Document the API", "cancelled")).toContain("2. 
[-] goal: Document the API"); }); - it("setGoalStatus writes the checkbox char into the header line", () => { - expect(setGoalStatus(SAMPLE, "cache-layer-1", "done")).toContain("## Goal: [x] Implement cache layer"); - expect(setGoalStatus(SAMPLE, "document-the-api-1", "cancelled")).toContain("## Goal: [-] Document the API"); + it("setGoalStatus throws on an unknown subject", () => { + expect(() => setGoalStatus(SAMPLE, "no such goal", "done")).toThrow(); }); it("appendLog adds exactly one line under ## Log", () => { @@ -147,7 +132,7 @@ describe("the two CompleteGoal writes (minimal diff)", () => { }); it("appendLog creates the section when absent", () => { - const noLog = "# Goals: x\n\n## Goal: [ ] y\n\ndone_when: z\n"; + const noLog = "# x\n\n## Goals\n\n1. [ ] goal: y\n - discriminator: z\n"; expect(parse(appendLog(noLog, "first entry")).log).toEqual(["- first entry"]); }); }); @@ -156,30 +141,30 @@ describe("recordSignOff (CompleteGoal's pure record logic)", () => { const WHEN = "2026-06-15 16:00"; it("accept flips status:done and logs a sign-off line", () => { - const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "accepted" }); + const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "accepted" }); expect(r.isError).toBe(false); const doc = parse(r.content); - expect(findGoal(doc, "cache-layer-1")?.status).toBe("done"); - expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off #cache-layer-1: Implement cache layer (oracle accept)`); + expect(findGoal(doc, "Implement cache layer")?.status).toBe("done"); + expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off "Implement cache layer" (judge accept)`); }); it("verify_failed only logs a reject line, status stays active", () => { - const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" }); + const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" }); expect(r.isError).toBe(true); 
const doc = parse(r.content); - expect(findGoal(doc, "cache-layer-1")?.status).toBe("active"); // NOT marked done - expect(doc.log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: verify exit 1`); + expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // NOT marked done + expect(doc.log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": verify exit 1`); }); it("rejected logs the (one-lined) missing reason, status stays", () => { - const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" }); + const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" }); expect(r.isError).toBe(true); - expect(findGoal(parse(r.content), "cache-layer-1")?.status).toBe("active"); - expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: no saved bench log`); + expect(findGoal(parse(r.content), "Implement cache layer")?.status).toBe("active"); + expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": no saved bench log`); }); it("unknown goal returns an error and does not touch the file", () => { - const r = recordSignOff(SAMPLE, "nope-1", WHEN, { kind: "accepted" }); + const r = recordSignOff(SAMPLE, "nope", WHEN, { kind: "accepted" }); expect(r.isError).toBe(true); expect(r.content).toBe(SAMPLE); });
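Reviewer note: the `setGoalStatus` tests above pin down a small contract. The goal is matched by subject, only the checkbox character changes, the number and `goal:` prefix survive, and an unknown subject throws. A minimal sketch of one way to satisfy that contract follows. It is an assumption for illustration and not the code in `src/plan-file.ts`; the name `sketchSetGoalStatus`, the `GOAL_LINE` regex, and the `STATUS_CHAR` table are hypothetical, with the status characters read off the test expectations (`[ ]`, `[/]`, `[x]`, `[-]`).

```typescript
// Hypothetical sketch; NOT the implementation in src/plan-file.ts.
type Status = "open" | "active" | "done" | "cancelled";

// Checkbox characters inferred from the tests: [ ] open, [/] active, [x] done, [-] cancelled.
const STATUS_CHAR: Record<Status, string> = {
  open: " ",
  active: "/",
  done: "x",
  cancelled: "-",
};

// Matches a top-level goal line such as "1. [/] goal: Implement cache layer".
// Groups: 1 = number prefix up to "[", 2 = "] goal: ", 3 = subject text.
// Anchored at column 0, so indented subtask lines are never matched.
const GOAL_LINE = /^(\d+\.\s+\[)[ \/x-](\]\s+goal:\s+)(.*)$/;

function sketchSetGoalStatus(content: string, subject: string, status: Status): string {
  const lines = content.split("\n");

  // Find the first goal line whose subject matches exactly.
  const idx = lines.findIndex((line) => {
    const m = GOAL_LINE.exec(line);
    return m !== null && m[3].trim() === subject;
  });
  if (idx === -1) {
    throw new Error(`no goal with subject "${subject}"`);
  }

  // Rewrite only the checkbox character; every other line is left untouched,
  // which keeps the resulting diff to one removed and one added line.
  lines[idx] = lines[idx].replace(
    GOAL_LINE,
    (_match: string, pre: string, mid: string, subj: string) =>
      `${pre}${STATUS_CHAR[status]}${mid}${subj}`,
  );
  return lines.join("\n");
}
```

Matching on the subject rather than a derived id keeps the file hand-editable, at the cost that two goals with the same subject would be ambiguous; this sketch simply takes the first match.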