pi-goals: discriminator/failure-mode format + visible sign-off judge

Replace done_when with a discriminator + subtle-failure-mode pair as the
heart of each goal. The discriminator is the POSITIVE success observation
that no failure mode could fake, not just failure-avoidance: a run can
dodge every trap and still produce nothing. Carried through planDrafting,
the sign-off judge, README, and the parser doc.

Format migration: flat numbered markdown goals (`1. [/] goal: ...`),
keyword-anchored parsing (indentation cosmetic), goals matched by text,
subtask states [ ]/[/]/[x]/[-] plus ~~strike~~. Evidence empty at
planning, filled at sign-off, multi-line supported.

CompleteGoal now returns the judge's reasoning under a
`--- sign-off judge ---` block (was just "Signed off"), so the verdict is
visible. Plan mode is read-only: edit/write (except goals.md) and
mutating bash are blocked by a tool hook.
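
A minimal sketch of how such a hook could decide what to block, assuming a regex blocklist like the `MUTATING_BASH_PATTERNS` list added in this commit. The names `looksMutating` and `blockedInPlanMode` are hypothetical, the pattern list is a subset, and the `endsWith` test for the goals.md exemption is an assumption rather than the extension's actual check:

```typescript
// Illustrative subset of the blocklist added in this commit.
const MUTATING_PATTERNS: RegExp[] = [
  /\b(rm|rmdir|mv|cp|mkdir|touch|chmod|chown|chgrp|ln|tee|truncate|dd)\b/i,
  />\s*[^&\s]/, // redirect to a file; skips fd-dups such as 2>&1
  /\bgit\s+(add|commit|push|pull|merge|rebase|reset|checkout|switch|stash|cherry-pick|revert|tag|init|clone)\b/i,
];

// Hypothetical helper: true when any blocklist pattern matches the command.
function looksMutating(command: string): boolean {
  return MUTATING_PATTERNS.some((pattern) => pattern.test(command));
}

// Hypothetical decision function for a plan-mode tool hook.
// `targetPath` only matters for edit/write calls, `command` only for bash.
function blockedInPlanMode(tool: string, command: string | null, targetPath: string | null): boolean {
  if (tool === "edit" || tool === "write") {
    // Assumption: the goals file is recognised by its path suffix.
    const isGoalsFile = targetPath !== null && targetPath.endsWith(".pi/goals.md");
    return !isGoalsFile;
  }
  if (tool === "bash") return command !== null && looksMutating(command);
  return false; // read, grep, find, ls stay open
}
```

Because this is a blocklist, a read-only command that merely mentions a blocked word (for example `grep cp notes.txt`) is flagged as well; that is the trade-off for keeping ordinary exploration frictionless.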

17 parser tests, typecheck + biome clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-16 11:45:08 +08:00
parent a65c822bf9
commit 838c42d7bd
7 changed files with 565 additions and 394 deletions
+1
@@ -1,5 +1,6 @@
 node_modules/
 dist/
 *.log
+.pi/
 docs/reviews/raw.jsonl
 docs/reviews/err.txt
+119 -64
@@ -1,17 +1,16 @@
 # pi-goals
-A [pi](https://github.com/badlogic/pi-mono) extension for plan-driven, goal-tracked work in one
-`goals.md`. Set up goals (with evidence and failure modes) in plan mode, work them, and sign
-off only when a read-only subagent has checked the evidence.
+Plan mode for agreeing on goals before any code gets written. Each goal names the subtle failure mode
+that could fake a "done" and the discriminator that tells real success from it, plus subtasks and the
+evidence that gets checked at sign-off. It all lives in one markdown file you can read and print. A
+widget keeps the goals in front of you through compaction, a reminder nudges the agent to keep the
+file current and work toward the goals on its own, and a goal is signed off only after a read-only
+subagent has checked its evidence.
-Successor to [pi-lgtm](https://github.com/wassname/pi-lgtm), kept deliberately small: about
-[burneikis/pi-plan](https://github.com/burneikis/pi-plan) plus the additions, goals with evidence,
-a sign-off check, a widget, and a reminder.
+It guides rather than guards. Like [pi-milestones](https://github.com/Neuron-Mr-White/UniPi/tree/main/packages/milestone)
+and [burneikis/pi-plan](https://github.com/burneikis/pi-plan), it leans on a form and a process to
+steer the agent and trust its judgement. [pi-lgtm](https://github.com/wassname/pi-lgtm) was my earlier
+attempt and got too complex; this one stays small and maintainable.
-The form guides; it does not gate. The agent edits `goals.md` with its normal Edit tool. The one
-blessed tool is `CompleteGoal`, which runs the sign-off check and records the result. The reminder,
-the injected plan summary, and git/widget visibility carry the process. It trusts the agent's
-judgement rather than guarding it.
 ## Install
@@ -19,7 +18,7 @@ judgement rather than guarding it.
 pi install npm:@wassname2/pi-goals
 ```
-Or run without installing:
+Or run it without installing:
 ```bash
 pi -e npm:@wassname2/pi-goals
@@ -28,78 +27,133 @@ pi -e npm:@wassname2/pi-goals
 ## Use
 ```
-/goals add CSV export to the report view
+/goals CSV export for the report view
 ```
-1. Plan. The agent explores read-only and writes goals into `goals.md` (see format below).
-2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel.
-On Ready you choose whether to keep the current context or start fresh and compacted.
-3. Work. Each turn the active goal is injected (so it survives compaction) and a reminder nudges
-the agent to keep `goals.md` current and work autonomously. When a goal's `done_when` is met the
-agent calls `CompleteGoal`, which runs `verify` and a read-only judge and, on accept, marks it
-done and logs it.
+`/goals` enters plan mode and starts a conversation; the description is an optional seed, so plain
+`/goals` works too. From there:
-Other commands: `/goals` (print the goals), `/goals clear` (empty `goals.md`, history kept in git),
-`/goals judge <model-ref>` (use a specific model for the sign-off judge; default is your current
-model).
+1. Plan. The agent explores read-only, asks about anything unclear, and writes the goals into
+`.pi/goals.md`.
+2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel. On
+Ready you choose whether to keep the current context or start fresh and compacted.
+3. Work. Each turn the active goal is injected so it survives compaction, and a reminder nudges the
+agent to keep `goals.md` current and keep going. When a goal's discriminator is satisfied the agent
+calls `CompleteGoal`, which runs `verify` and a read-only judge, then marks the goal done and logs it.
-## goals.md format
+Other commands: `/goals clear` empties `.pi/goals.md`; `/goals judge <model-ref>` picks a specific
+model for the sign-off judge (the default is your current model).
-One file holds the objective, the goals, and a short append-only log.
+## Example
+Start plan mode with an optional seed:
+```
+/goals audit the papers dir metadata and clean up empty dirs
+```
+The agent explores read-only, then drafts the goal with a subtle failure mode and the discriminator
+that beats it, and stops for review:
 ```markdown
-# Goals: ship the cache layer
+## Goals
-## Goal: [/] Implement cache layer
-<!-- id: cache-layer-1 -->
-done_when: p95 < 50ms on bench-X
-verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
-- [x] wire cache client
-- [ ] eviction policy
+1. [ ] goal: Audit steering/ metadata and remove empty dirs
+- subtle failure mode: report written but counts are zero (resolver errored silently)
+- discriminator: report shows the XXXX count before/after AND a non-zero rename count
+- tasks:
+1. [ ] dry-run the metadata resolve
+2. [ ] remove the empty _artifacts dirs
+3. [ ] write the report
+- evidence:
+- <empty until sign-off>
+```
-failure_modes:
-- cache silently bypassed (hit-rate ~0, latency ok by luck)
-- bench too small to exercise eviction
+You choose Ready. The agent works the subtasks, then fills `evidence` (each item an artifact plus a
+short read of it) and calls `CompleteGoal`:
-evidence:
-- load-test.log p95=41ms; bench/p95.py exited 0
-- cache hit-rate 0.93 in load-test.log (not bypassed)
+```markdown
+- evidence:
+- > scripts/metadata_report.txt: XXXX 52 -> 4, 146 empty _artifacts removed
+- > 48 files renamed; almost certain done, the silent-resolver failure mode is ruled out
+```
+A fresh read-only subagent re-checks that evidence against the repo and the discriminator, then
+returns its verdict and reasoning:
+```
+Signed off "Audit steering/ metadata and remove empty dirs". Marked done in goals.md.
+--- sign-off judge ---
+metadata_report.txt present; counts 52 -> 4 confirmed; rename log shows 48 renamed (not zero).
+VERDICT: accept
+```
+## The goals.md format
+One project-local file, `<cwd>/.pi/goals.md` (gitignored, like pi-tasks), holds the title, a context
+block, the goals, and a short append-only log. A fresh `/goals` draft replaces it.
+```markdown
+# ship the cache layer
+Latency target came from the SLO review; keep the existing client API.
+## Goals
+1. [/] goal: Implement cache layer
+- subtle failure mode: cache silently bypassed, latency ok by luck
+- discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0)
+- verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
+- tasks:
+1. [x] wire cache client
+2. [/] eviction policy
+- evidence:
+- > load-test.log: p95=41ms, hit-rate 0.93 (not bypassed)
+# Future work / out of scope
+- distributed cache
 ## Log
 - 2026-06-15 14:02 cache client wired; eviction next
 ```
-- A goal is a `## Goal:` header whose checkbox carries its state (`[ ]` open, `[/]` active, `[x]`
-done, `[-]` cancelled), then an `<!-- id -->`, one falsifiable `done_when:`, an optional `verify:`
-shell command, `- [ ]` subtasks, an optional short `failure_modes:` pre-mortem list, and an
-`evidence:` list.
-- `done_when` is the test, written at planning. `evidence` is the proof, a `- ` list the agent fills
-at completion pointing at durable artifacts; `CompleteGoal` reads it from the file. `failure_modes`
-is the pre-mortem. `verify`, when present, is the deterministic first stage of the sign-off.
-- The agent ticks subtasks, appends to `## Log`, and sets the header checkbox (`[/]` when it starts
-a goal) as it works. Only `CompleteGoal` writes `[x]`. Multiple goals may be active.
+- A goal is a numbered checkbox line beginning `goal:`; the checkbox carries its state (`[ ]` open,
+`[/]` active, `[x]` done, `[-]` cancelled). Goals are matched by their text, so the number is just
+for you to reference.
+- The `discriminator` is the success test, written while planning: the positive observation that the
+goal actually succeeded and that none of the `subtle failure mode`s could fake. It has to show
+something happened (a count moved, a test exercised the path, a metric beat noise), not just that a
+failure was avoided. `evidence` is the proof, filled at sign-off:
+each item pairs a durable artifact (a quoted and linked log, a table, a metric) with a short read of
+it, not a bare claim. `verify`, when present, is the deterministic first stage of the sign-off.
+- Subtasks are any checkbox without a `goal:` prefix, under `- tasks:` (`[/]` in progress, `[-]`
+cancelled). The agent ticks them, appends to `## Log`, and sets a goal `[/]` when it starts it. Only
+`CompleteGoal` writes `[x]`. Several goals can be active at once.
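
The checkbox rules in the new README text can be illustrated with a small keyword-anchored line classifier. This is a sketch under stated assumptions, not the parser in `plan-file.ts`: `parseCheckboxLine`, `ParsedLine`, and the regex are hypothetical, and `~~strike~~` handling is left out.

```typescript
type Status = "open" | "active" | "done" | "cancelled";

// Checkbox mark -> status, as listed in the README: [ ] [/] [x] [-]
const STATUS_BY_MARK: Record<string, Status> = {
  " ": "open",
  "/": "active",
  x: "done",
  "-": "cancelled",
};

interface ParsedLine {
  kind: "goal" | "subtask";
  status: Status;
  text: string;
}

// A number or bullet, a checkbox, then the rest of the line.
// Leading indentation is ignored, so it stays cosmetic.
const CHECKBOX_LINE = /^\s*(?:\d+\.|[-*])\s+\[([ \/x-])\]\s+(.*)$/;

function parseCheckboxLine(line: string): ParsedLine | null {
  const match = CHECKBOX_LINE.exec(line);
  if (!match) return null; // not a checkbox line (e.g. "- discriminator: ...")
  const mark = match[1] ?? " ";
  const rest = (match[2] ?? "").trim();
  const status = STATUS_BY_MARK[mark] ?? "open";
  const goalPrefix = /^goal:\s*/i;
  // The `goal:` keyword, not the nesting depth, separates goals from subtasks.
  if (goalPrefix.test(rest)) {
    return { kind: "goal", status, text: rest.replace(goalPrefix, "") };
  }
  return { kind: "subtask", status, text: rest };
}
```

Since goals are matched by their text, the `text` field of a goal line is what a caller would compare against, and the leading number is discarded.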
-## The sign-off check (`CompleteGoal`)
+## Signing off a goal (`CompleteGoal`)
-`CompleteGoal(goal_id)` is the one blessed completion path. It reads the goal's `evidence:` block
-from goals.md (so the proof is git-tracked and human-reviewable before sign-off, not buried in a tool
-call):
+`CompleteGoal(goal)` (matched by the goal's text) is the only tool that marks a goal done; everything
+else is the agent editing the file. It reads the goal's `evidence:` block from `.pi/goals.md`, so the
+proof stays in the file where you can review it, then:
-1. If the goal has a `verify:` command, it is run. A non-zero exit rejects immediately, with no model
+1. If the goal has a `verify:` command, it runs. A non-zero exit rejects right away, with no model
 call.
-2. Otherwise a read-only `pi` subprocess (the judge) inspects the `evidence:` items against the repo
-and the named failure modes and returns a verdict. It re-derives from the artifacts the evidence
-points at rather than trusting the claim, so the `evidence:` list should name durable artifacts
-(saved logs, committed diffs, files).
-3. On accept, the goal's header checkbox flips to `[x]` and a `## Log` line is written. On reject,
-the goal stays open and the agent is told what is missing.
+2. Otherwise a read-only `pi` subprocess (a fresh `--no-session` context, so it never sees the working
+agent's transcript) inspects the `evidence:` against the repo, the `discriminator`, and the
+`subtle failure mode`, and returns a verdict. It re-derives from the cited artifacts rather than
+trusting the claim, so list real artifacts, not assertions.
+3. On accept, the goal flips to `[x]` and a `## Log` line is written. On reject, the goal stays open
+and the agent is told what is missing. Either way the judge's reasoning comes back in the result.
-The judge defaults to your current model (guaranteed authorized and capable). Set a different one
-with `/goals judge <provider/model>` for an independent cross-family check.
+The judge defaults to your current model (a fresh context, same weights). Point it at another with
+`/goals judge <provider/model>` for an independent cross-family check.
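
The two-stage sign-off described in the new README text can be sketched as a plain function. Everything here is illustrative: `signOff`, `SignOffInput`, and the message wording are hypothetical, and the judge is passed in as an ordinary synchronous function so the flow reads without the real read-only `pi` subprocess.

```typescript
type Verdict = "accept" | "reject";

interface JudgeResult {
  verdict: Verdict;
  reasoning: string;
}

interface SignOffInput {
  verifyExitCode: number | null; // null when the goal has no verify: command
  evidence: string[]; // items from the goal's evidence: block
  judge: (evidence: string[]) => JudgeResult; // stand-in for the read-only subprocess
}

function signOff(subject: string, input: SignOffInput): { accepted: boolean; message: string } {
  if (input.evidence.length === 0) {
    return { accepted: false, message: `Goal "${subject}" has no evidence yet.` };
  }
  // Stage 1: deterministic verify. A non-zero exit rejects with no model call.
  if (input.verifyExitCode !== null && input.verifyExitCode !== 0) {
    return { accepted: false, message: `verify exited ${input.verifyExitCode}; goal stays open.` };
  }
  // Stage 2: the judge inspects the evidence and returns a verdict with reasoning.
  const { verdict, reasoning } = input.judge(input.evidence);
  const head = verdict === "accept" ? `Signed off "${subject}".` : `Not signed off "${subject}".`;
  // The reasoning is appended on accept and on reject, so the verdict is visible.
  const detail = reasoning ? `\n\n--- sign-off judge ---\n${reasoning}` : "";
  return { accepted: verdict === "accept", message: head + detail };
}
```

Running the cheap deterministic stage first means a failing `verify:` never spends a model call, and the judge only sees goals that already pass their own check.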
 ## Prompts
-All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so the process is
-easy to review end to end.
+All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so you can read the
+whole process top to bottom.
 ## Develop
@@ -112,8 +166,9 @@ npm run lint
 ## Not (yet) included
-No autonomous re-prompt loop (an until-done-style loop judge). Autonomy comes from the reminder, not
-a harness. Plan-phase model stickiness is a documented next step.
+- No autonomous re-prompt loop. The reminder nudges the agent within a turn, but the turn still ends
+and hands back to you; nothing auto-re-prompts until the goals are done.
+- The plan and execution phases can't yet run on different, sticky models.
 ## License
+1 -1
@@ -1,7 +1,7 @@
 {
 "name": "@wassname2/pi-goals",
 "version": "0.0.1",
-"description": "One goals.md: set goals in plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
+"description": "One .pi/goals.md: set goals in plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
 "author": "wassname",
 "license": "MIT",
 "type": "module",
+130 -63
@@ -1,7 +1,9 @@
 /**
- * pi-goals — plan mode that sets up goals with evidence, tracked in one goals.md, signed off by a
+ * pi-goals — plan mode that sets up goals with evidence, tracked in one .pi/goals.md, signed off by a
  * read-only subagent check. A successor to pi-lgtm, kept deliberately small (≈ burneikis/pi-plan
- * plus the additions: goals + failure_modes + subtasks, a sign-off check, a widget, a reminder).
+ * plus the additions: goals + a discriminator + a subtle failure mode + subtasks, a sign-off check,
+ * a widget, a reminder). A goal's success test is its discriminator: the observation that tells real
+ * success from the named failure mode.
  *
  * Philosophy (spec D3): the form guides, it does not gate. The agent edits goals.md with its normal
  * Edit tool. The one blessed tool is CompleteGoal, which runs the sign-off check and records it. The
@@ -9,19 +11,30 @@
  * judgement rather than guarding it.
  *
  * Flow:
- * /goals <objective> -> plan mode: agent explores, drafts goals into goals.md (planDrafting guides)
+ * /goals [objective] -> plan mode (conversational): objective is an optional seed; agent explores
+ * read-only, asks, then drafts goals into .pi/goals.md (planDrafting guides)
  * agent_end -> review menu (Ready / Edit / $EDITOR / Cancel); Ready offers compaction
  * execution -> each turn, inject the plan summary (survives compaction) + a reminder;
  * agent works goals, ticks subtasks, appends ## Log, calls CompleteGoal
  * CompleteGoal -> optional deterministic verify, then a read-only oracle judge -> accept
  * flips status:done + logs; reject returns what's missing
  *
- * All model-facing text lives in prompts.tsx, in flow order.
+ * The plan file lives at <cwd>/.pi/goals.md (project-local, gitignored, like pi-tasks), not in the
+ * repo. A fresh /goals draft just replaces it (the "overwrite" staleness rule).
+ *
+ * Plan mode is read-only: the tool_call hook blocks edit/write (except goals.md itself) and mutating
+ * bash while drafting, so code isn't written before the goals are agreed. Read-only bash exploration
+ * stays open (blocklist, not allowlist).
+ *
+ * Not built (FIXME): no plan-vs-exec model switch on accept (plan-model stickiness); noted at its
+ * call site below.
+ *
+ * All model-facing text lives in prompts.ts, in flow order.
  */
 import { spawn, spawnSync } from "node:child_process";
-import { existsSync, readFileSync, writeFileSync } from "node:fs";
-import { basename, join } from "node:path";
+import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
+import { basename, join, resolve } from "node:path";
 import type { ExtensionAPI, ExtensionCommandContext, ExtensionContext } from "@earendil-works/pi-coding-agent";
 import { Type } from "@sinclair/typebox";
 import { counts, findGoal, type Goal, type PlanDoc, parse, recordSignOff, type SignOff } from "./plan-file.js";
@@ -32,6 +45,29 @@ const PLAN_CONTEXT = "pi-goals-context"; // injected plan-mode guidance, strippe
 const STATUS_KEY = "pi-goals";
 const WIDGET_KEY = "pi-goals-widget";
 const READ_ONLY_TOOLS = ["read", "grep", "find", "ls", "bash"];
+// File mutators blocked while drafting goals (read-only plan mode, like narumiruna/pi-plan-mode), so
+// code isn't written before goals are agreed. The one allowed write is goals.md itself (the
+// deliverable). A read-only task (a pure search) can still be explored in plan mode by nature.
+const PLAN_MODE_BLOCKED_TOOLS = ["edit", "write"];
+// bash is dual-use, so block it only when the command looks mutating; read-only exploration (cat, rg,
+// git log, running a script to inspect) stays open. Blocklist, not allowlist: keep exploration
+// frictionless and just stop the obvious mutators. List adapted from narumiruna/pi-plan-mode; the
+// redirect rule catches `> file` / `>> file` / `>| file` but not fd-dups like `2>&1` or `>&2`.
+const MUTATING_BASH_PATTERNS: RegExp[] = [
+/\b(rm|rmdir|mv|cp|mkdir|touch|chmod|chown|chgrp|ln|tee|truncate|dd)\b/i,
+/>\s*[^&\s]/, // redirect to a file (write/append/clobber), excludes 2>&1 and >&2
+/\bnpm\s+(install|uninstall|update|ci|link|publish|version)\b/i,
+/\byarn\s+(add|remove|install|publish|upgrade)\b/i,
+/\bpnpm\s+(add|remove|install|publish|update)\b/i,
+/\bbun\s+(add|remove|install|update|publish)\b/i,
+/\bpip\s+(install|uninstall)\b/i,
+/\buv\s+(add|remove|sync|lock|pip\s+install)\b/i,
+/\bgit\s+(add|commit|push|pull|merge|rebase|reset|checkout|switch|stash|cherry-pick|revert|tag|init|clone)\b/i,
+/\b(sudo|su|kill|pkill|killall|reboot|shutdown)\b/i,
+/\bsystemctl\s+(start|stop|restart|enable|disable)\b/i,
+/\b(vim?|nano|emacs|code|subl)\b/i,
+];
+const PLAN_REL = ".pi/goals.md"; // project-local, gitignored (pi-tasks convention); shown in the widget
 interface PlanState {
 isPlanMode: boolean;
@@ -47,8 +83,14 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 // newSession is only on the command-handler context; agent_end's ctx lacks it. Save it from /goals.
 let savedCmdCtx: ExtensionCommandContext | null = null;
-const planPath = (ctx: ExtensionContext) => join(ctx.cwd, "goals.md");
+const planPath = (ctx: ExtensionContext) => join(ctx.cwd, ".pi", "goals.md");
 const readPlan = (ctx: ExtensionContext): string => (existsSync(planPath(ctx)) ? readFileSync(planPath(ctx), "utf-8") : "");
+// Our programmatic writes (clear, CompleteGoal). The agent creates/edits the file with its own Edit
+// tool; this just makes sure .pi/ exists for our writes.
+const writePlan = (ctx: ExtensionContext, content: string): void => {
+mkdirSync(join(ctx.cwd, ".pi"), { recursive: true });
+writeFileSync(planPath(ctx), content);
+};
 function persist(): void {
 pi.appendEntry<PlanState>(STATE, state);
@@ -57,7 +99,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 function updateWidget(ctx: ExtensionContext): void {
 if (state.isPlanMode) {
 ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("warning", "planning"));
-ctx.ui.setWidget(WIDGET_KEY, ["pi-goals: drafting goals", "Write goals to goals.md, then review."]);
+ctx.ui.setWidget(WIDGET_KEY, ["pi-goals: drafting goals", `Write goals to ${PLAN_REL}, then review.`]);
 return;
 }
 const doc = parse(readPlan(ctx));
@@ -68,17 +110,17 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 }
 const c = counts(doc);
 ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("accent", `${c.done}/${doc.goals.length} goals`));
-ctx.ui.setWidget(WIDGET_KEY, goalWidgetLines(doc));
+ctx.ui.setWidget(WIDGET_KEY, [...goalWidgetLines(doc), ctx.ui.theme.fg("muted", PLAN_REL)]);
 }
 function goalWidgetLines(doc: PlanDoc): string[] {
 const mark: Record<Goal["status"], string> = { done: "✔", active: "▸", open: "◻", cancelled: "✗" };
-const lines = [`Goals: ${doc.objective || "(untitled)"}`];
+const lines = [`Goals: ${doc.title || "(untitled)"}`];
 for (const g of doc.goals) {
 // Show every goal with its status glyph (✔ done, ▸ active, ◻ open, ✗ cancelled) so finished
 // goals read as checked off rather than vanishing. Plans are small, so this stays readable.
 const total = g.subtasks.length;
-const done = g.subtasks.filter((s) => s.done).length;
+const done = g.subtasks.filter((s) => s.status === "done").length;
 lines.push(`${mark[g.status]} ${g.subject}${total ? ` (${done}/${total} tasks)` : ""}`);
 }
 return lines;
@@ -99,24 +141,18 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 setJudge(arg.slice("judge".length).trim(), ctx);
 return;
 }
-// Bare `/goals` enters plan mode by prompting for the objective (the common expectation).
-// If the user cancels with no objective, fall back to showing the current plan.
-let objective = arg;
-if (!objective) {
-objective = (ctx.hasUI ? await ctx.ui.input("Plan mode — what's the objective?", "Describe what you want to plan") : undefined)?.trim() ?? "";
-if (!objective) {
-showPlan(ctx);
-return;
-}
-}
+// Conversational entry (like narumiruna/pi-plan-mode): /goals enters plan mode and starts a
+// dialogue. The objective is an optional seed, not a required arg, so there's no awkward
+// "type your objective" prompt; the agent explores read-only and asks before drafting. A
+// fresh draft just replaces .pi/goals.md (the "overwrite" staleness rule).
+const objective = arg || null;
 state = { ...state, isPlanMode: true, objective };
 persist();
 updateWidget(ctx);
-pi.sendUserMessage(
-`Enter plan mode for this objective: ${objective}\n\nExplore read-only, then write the plan to ${planPath(ctx)}.`,
-{ deliverAs: "followUp" },
-);
+const seed = objective
+? `We're in plan mode. Objective: ${objective}\n\nExplore the repo read-only and ask me anything unclear. When the objective is nailed down, draft (or replace) the goals in ${planPath(ctx)}, then stop for review.`
+: `We're in plan mode. Tell me what you want to plan. Explore read-only and ask questions as needed; when the objective is clear, draft the goals in ${planPath(ctx)} and stop for review.`;
+pi.sendUserMessage(seed, { deliverAs: "followUp" });
 },
 });
@@ -132,23 +168,14 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 return;
 }
 if (ctx.hasUI) {
-const ok = await ctx.ui.select("Clear goals.md? (it stays in git history)", ["Cancel", "Clear goals.md"]);
+const ok = await ctx.ui.select(`Clear ${PLAN_REL}?`, ["Cancel", "Clear goals.md"]);
 if (ok !== "Clear goals.md") return;
 }
-writeFileSync(planPath(ctx), "");
+writePlan(ctx, "");
 state = { ...state, isPlanMode: false, objective: null };
 persist();
 updateWidget(ctx);
-ctx.ui.notify("Cleared goals.md.", "info");
-}
-function showPlan(ctx: ExtensionContext): void {
-const content = readPlan(ctx);
-if (!content.trim()) {
-ctx.ui.notify("No goals yet. Use /goals <objective> to start.", "info");
-return;
-}
-ctx.ui.notify(content, "info");
+ctx.ui.notify(`Cleared ${PLAN_REL}.`, "info");
 }
 // --- review loop (after the agent drafts the plan) --------------------------------------------
@@ -190,6 +217,9 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 }
 async function startExecution(ctx: ExtensionContext): Promise<void> {
+// FIXME(model-switch): the plan phase should be able to run on a sticky plan model and execution
+// on a different one (see README "Not yet included"). newSession can't switch the model yet; wire
+// this when pi exposes a model override on newSession.
 // Offer a clean execution context (D13). newSession lives only on the saved command context.
 let fresh = false;
 if (ctx.hasUI && savedCmdCtx) {
@@ -203,7 +233,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 const planFile = planPath(ctx);
 const planContent = readPlan(ctx); // captured now: ctx is stale after newSession below
 const parentSession = ctx.sessionManager.getSessionFile();
-const startMsg = `Work the goals in ${planFile}. Pick an open goal, mark it active (set its header to [/]), work its subtasks, and when its done_when is met fill the goal's evidence: block then call CompleteGoal with the goal_id. Keep goals.md current as you go.`;
+const startMsg = `Work the goals in ${planFile}. Pick an open goal, mark it active (set its checkbox to [/]), work its subtasks, and when its discriminator is satisfied fill the goal's evidence: block then call CompleteGoal with the goal's desc. Keep goals.md current as you go.`;
 exitPlanMode(ctx);
 if (fresh && savedCmdCtx) {
@@ -223,7 +253,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 }
 return;
 }
-if (doc.objective) pi.setSessionName(`Goals: ${doc.objective}`);
+if (doc.title) pi.setSessionName(`Goals: ${doc.title}`);
 ctx.ui.notify(planContent, "info");
 pi.sendUserMessage(startMsg, { deliverAs: "followUp" });
 }
@@ -234,29 +264,33 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 name: "CompleteGoal",
 label: "Complete goal",
 description:
-"Sign off a goal once its done_when is met. First fill the goal's evidence: block in goals.md " +
-"(a '- ' list pointing at durable artifacts: saved logs, committed diffs, files, not claims), then " +
-"call this with the goal_id. Runs the goal's verify command (if any) then a read-only subagent that " +
-"inspects that evidence against the repo. On accept, the goal is marked done and logged; on reject, " +
-"it stays open and you get what is missing.",
+"Sign off a goal once its discriminator is satisfied. First fill the goal's evidence: block in " +
+"goals.md: a list where each item pairs a durable artifact with a short read of it (a quoted+linked " +
+"log, a table plus how to read it, or a metric plus what it shows; quote the key lines and link the " +
+"rest, not a pasted blob or a bare claim). Then call this with the goal's desc (the text after " +
+"'goal:'). Runs the goal's verify command (if any) then a read-only subagent that inspects that " +
+"evidence against the repo and the discriminator. On accept, the goal is marked done and logged; on " +
+"reject, it stays open and you get what is missing. The subagent's reasoning is returned either way.",
 parameters: Type.Object({
-goal_id: Type.String({ description: "The goal's <!-- id --> from goals.md" }),
+goal: Type.String({ description: "The goal's desc: the exact text after 'goal:' in its line." }),
 }),
 async execute(_id, params, signal, _onUpdate, ctx) {
 const content = readPlan(ctx);
-const goal = findGoal(parse(content), params.goal_id);
-if (!goal) return text(`No goal #${params.goal_id} in goals.md.`, true);
+const goal = findGoal(parse(content), params.goal);
+if (!goal) return text(`No goal "${params.goal}" in goals.md. Use the exact text after "goal:".`, true);
 if (goal.evidence.length === 0) {
-return text(`Goal #${goal.id} has no evidence: block. Add a "- " evidence list to the goal in goals.md (what shows done_when is met, and where to verify it), then call CompleteGoal.`, true);
+return text(`Goal "${goal.subject}" has no evidence yet. Add an evidence: list to the goal in goals.md (artifacts + a short read showing the discriminator is satisfied), then call CompleteGoal.`, true);
 }
 // Decide the outcome (the I/O); recordSignOff applies it to the file (the pure write).
 // Evidence and the artifacts to inspect both come from the goal's evidence: block (single source of truth).
-const outcome = await decideSignOff(goal, goal.evidence.join("\n"), goal.evidence, state.judgeModel, ctx.cwd, signal);
-const res = recordSignOff(content, goal.id, stamp(), outcome);
-if (res.content !== content) writeFileSync(planPath(ctx), res.content);
+const { outcome, reasoning } = await decideSignOff(goal, goal.evidence.join("\n"), goal.evidence, state.judgeModel, ctx.cwd, signal);
+const res = recordSignOff(content, goal.subject, stamp(), outcome);
+if (res.content !== content) writePlan(ctx, res.content);
 updateWidget(ctx);
-return text(res.message, res.isError);
+// Surface the sign-off judge's actual reasoning, not just the verdict, so it's visible (was a gap).
+const detail = reasoning ? `\n\n--- sign-off judge ---\n${reasoning}` : "";
+return text(res.message + detail, res.isError);
 },
 });
@@ -264,6 +298,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
pi.on("before_agent_start", async (_event, ctx) => { pi.on("before_agent_start", async (_event, ctx) => {
if (state.isPlanMode) { if (state.isPlanMode) {
// Read-only is enforced in the tool_call hook below (blocks edit/write while planning).
return { message: { customType: PLAN_CONTEXT, content: `${planDrafting}\n\nWrite the plan to ${planPath(ctx)}.`, display: false } }; return { message: { customType: PLAN_CONTEXT, content: `${planDrafting}\n\nWrite the plan to ${planPath(ctx)}.`, display: false } };
} }
const doc = parse(readPlan(ctx)); const doc = parse(readPlan(ctx));
@@ -272,9 +307,13 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
const active = doc.goals.find((g) => g.status === "active") ?? doc.goals.find((g) => g.status === "open") ?? null; const active = doc.goals.find((g) => g.status === "active") ?? doc.goals.find((g) => g.status === "open") ?? null;
const c = counts(doc); const c = counts(doc);
let body = planInjection({ let body = planInjection({
objective: doc.objective, title: doc.title,
activeGoal: active activeGoal: active
? { subject: active.subject, done_when: active.done_when, openSubtasks: active.subtasks.filter((s) => !s.done).map((s) => s.text) } ? {
subject: active.subject,
discriminator: active.discriminator,
openSubtasks: active.subtasks.filter((s) => s.status !== "done" && s.status !== "cancelled").map((s) => s.text),
}
: null, : null,
lastLogLine: doc.log.at(-1) ?? null, lastLogLine: doc.log.at(-1) ?? null,
counts: { done: c.done, open: c.open + c.active }, counts: { done: c.done, open: c.open + c.active },
@@ -286,6 +325,25 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
return { message: { customType: PLAN_CONTEXT, content: body, display: false } }; return { message: { customType: PLAN_CONTEXT, content: body, display: false } };
}); });
// Enforce read-only planning: block file mutators while in plan mode so code isn't written before
// the goals are agreed. The agent falls back on read/grep/find/ls and read-only bash to explore.
pi.on("tool_call", async (event, ctx) => {
if (!state.isPlanMode) return;
// edit/write: blocked, except writing goals.md itself (the deliverable of plan mode).
if (PLAN_MODE_BLOCKED_TOOLS.includes(event.toolName)) {
const target = (event.input as { path?: string }).path;
if (target && resolve(ctx.cwd, target) === resolve(planPath(ctx))) return;
return { block: true, reason: `Plan mode is read-only: agree the goals in ${PLAN_REL} and choose Ready before writing code (${event.toolName} is blocked while planning; only ${PLAN_REL} may be written).` };
}
// bash: blocked only when the command looks mutating; read-only exploration stays open.
if (event.toolName === "bash") {
const command = (event.input as { command?: string }).command ?? "";
if (MUTATING_BASH_PATTERNS.some((re) => re.test(command))) {
return { block: true, reason: `Plan mode is read-only: this bash command looks like it mutates state, so it's blocked while planning. Explore read-only, agree the goals in ${PLAN_REL}, then choose Ready.\nCommand: ${command}` };
}
}
});
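The bash branch of the hook above depends on `MUTATING_BASH_PATTERNS`, whose contents are not shown in this diff. A minimal sketch of how such a pattern list might behave, assuming a handful of illustrative regexes (the list `EXAMPLE_MUTATING_PATTERNS` and the helper `looksMutating` are hypothetical, not the extension's actual definitions):

```typescript
// Hypothetical pattern list: NOT the extension's real MUTATING_BASH_PATTERNS.
const EXAMPLE_MUTATING_PATTERNS: RegExp[] = [
  /\brm\s/, // file removal
  /\bmv\s/, // move / rename
  /\bgit\s+(commit|push|reset)\b/, // history-changing git commands
  /(^|\s)>>?\s*\S/, // output redirection into a file
];

// Mirrors the hook's check: a command is blocked if any pattern matches.
function looksMutating(command: string): boolean {
  return EXAMPLE_MUTATING_PATTERNS.some((re) => re.test(command));
}

console.log(looksMutating("rm -rf dist"));
console.log(looksMutating("git status"));
```

A regex heuristic like this will have false positives and false negatives. That matches the hook's wording ("looks like it mutates state") and the guide-not-guard stance stated elsewhere in the diff.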
pi.on("agent_end", async (_event, ctx) => { pi.on("agent_end", async (_event, ctx) => {
if (!state.isPlanMode || !ctx.hasUI) return; if (!state.isPlanMode || !ctx.hasUI) return;
const doc = parse(readPlan(ctx)); const doc = parse(readPlan(ctx));
@@ -327,7 +385,8 @@ function stamp(): string {
return new Date().toISOString().slice(0, 16).replace("T", " "); return new Date().toISOString().slice(0, 16).replace("T", " ");
} }
/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge. */ /** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge.
* Returns the outcome plus the judge's (or verify's) reasoning so CompleteGoal can show WHY. */
async function decideSignOff( async function decideSignOff(
goal: Goal, goal: Goal,
evidence: string, evidence: string,
@@ -335,16 +394,20 @@ async function decideSignOff(
judgeModel: string | null, judgeModel: string | null,
cwd: string, cwd: string,
signal: AbortSignal | undefined, signal: AbortSignal | undefined,
): Promise<SignOff> { ): Promise<{ outcome: SignOff; reasoning: string }> {
let verifyResult: { command: string; exitCode: number; outputTail: string } | null = null; let verifyResult: { command: string; exitCode: number; outputTail: string } | null = null;
if (goal.verify) { if (goal.verify) {
verifyResult = runVerify(goal.verify, cwd, signal); verifyResult = runVerify(goal.verify, cwd, signal);
if (verifyResult.exitCode !== 0) { if (verifyResult.exitCode !== 0) {
return { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail }; return {
outcome: { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail },
reasoning: `verify \`${goal.verify}\` exited ${verifyResult.exitCode}:\n${verifyResult.outputTail}`,
};
} }
} }
const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal); const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal);
return verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing }; const outcome: SignOff = verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
return { outcome, reasoning: verdict.reasoning };
} }
/** Run the goal's verify command. It is agent-authored and trusted (single-user machine, guide-not-guard). */ /** Run the goal's verify command. It is agent-authored and trusted (single-user machine, guide-not-guard). */
@@ -372,13 +435,13 @@ async function runJudge(
judgeModel: string | null, judgeModel: string | null,
cwd: string, cwd: string,
signal: AbortSignal | undefined, signal: AbortSignal | undefined,
): Promise<{ accept: boolean; missing: string }> { ): Promise<{ accept: boolean; missing: string; reasoning: string }> {
const task = evidenceJudgeUser({ const task = evidenceJudgeUser({
subject: goal.subject, subject: goal.subject,
done_when: goal.done_when, discriminator: goal.discriminator,
failure_modes: goal.failure_modes,
verify: goal.verify ?? null, verify: goal.verify ?? null,
verifyResult, verifyResult,
failure_modes: goal.failure_modes,
evidence, evidence,
paths, paths,
}); });
@@ -403,5 +466,9 @@ async function runJudge(
const accept = /accept/i.test(verdictLine); const accept = /accept/i.test(verdictLine);
const missingMatch = clean.match(/missing\s*:\s*([\s\S]*)$/i); const missingMatch = clean.match(/missing\s*:\s*([\s\S]*)$/i);
const missing = accept ? "" : (missingMatch?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason"); const missing = accept ? "" : (missingMatch?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason");
return { accept, missing }; // The judge's own words (inspection + verdict), so CompleteGoal can show them. The verdict is at the
// end, so keep the tail when it's long.
const trimmed = clean.trim();
const reasoning = trimmed.length > 1800 ? `...\n${trimmed.slice(-1800)}` : trimmed;
return { accept, missing, reasoning };
} }
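The tail-trimming at the end of `runJudge` can be isolated as a small helper. The sketch below assumes the same 1800-character limit as the diff; the name `keepTail` is hypothetical and used only for illustration:

```typescript
// Hypothetical helper mirroring the trimming in runJudge: the verdict sits at the
// end of the judge's output, so a long transcript keeps its tail, not its head.
function keepTail(text: string, max: number = 1800): string {
  const trimmed = text.trim();
  return trimmed.length > max ? `...\n${trimmed.slice(-max)}` : trimmed;
}

const short = keepTail("  verdict: accept  ");
const long = keepTail("x".repeat(2000) + "\nverdict: reject");
console.log(short);
console.log(long.length);
```

Keeping the tail means the verdict line and any `missing:` block survive truncation, while the earlier inspection notes are the part that gets cut.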
+132 -96
@@ -2,82 +2,104 @@
* plan-file.ts — read goals.md, and the two writes CompleteGoal needs. That is all. * plan-file.ts — read goals.md, and the two writes CompleteGoal needs. That is all.
* *
* Pure module, no pi deps, so it unit-tests without a runtime. The file is the canonical store and * Pure module, no pi deps, so it unit-tests without a runtime. The file is the canonical store and
* the agent edits it with its normal Edit tool (create goals, tick subtasks, append log), guided by * the agent edits it with its normal Edit tool (create goals, tick subtasks, fill evidence), guided
* the format in prompts.ts and the reminder -- the form guides, it does not gate (spec D3). So this * by the format in prompts.ts and the reminder -- the form guides, it does not gate. The only
* module does NOT render or create goals; the format's single source of truth is the planDrafting * programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to record an accepted
* prompt. The only programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to * sign-off; both touch one line so the diff stays readable.
* record an accepted sign-off; both touch one line so the git diff stays readable.
* *
* A goal's state lives in a checkbox on its header (single source of truth, renders natively): * Format (markdown, checkbox-first, made to be skim-reviewed by a human):
* [ ] open [/] active (in progress) [x] done [-] cancelled
* Only CompleteGoal writes [x]; the agent sets [/] when it starts a goal.
* *
* Format: * # <plan title>
* *
* # Goals: <objective> * <context: the user's ask, preferences, decisions>
* *
* ## Goal: [ ] <subject> * ## Goals
* <!-- id: <slug> -->
* done_when: <one falsifiable check>
* verify: <shell command, optional>
* - [ ] <subtask>
* *
* failure_modes: * 1. [ ] goal: <desc> <- state in the checkbox: [ ] open [/] active [x] done [-] cancelled
* - <pre-mortem item> * - discriminator: <positive observation that the goal succeeded, that no failure below could fake>
* evidence: * - subtle failure mode: <a way this looks done but isn't>
* - <proof the done_when is met; filled at completion, read by CompleteGoal> * - verify: <optional shell command that exits 0 only when the discriminator passes>
* - tasks:
* 1. [x] <subtask> <- a subtask is any checkbox WITHOUT a "goal:" prefix
* 2. [/] <subtask>
* 3. [-] <subtask> <- [-] or ~~[ ]~~ both read as cancelled
* - evidence: <- empty at planning; filled at sign-off, read by CompleteGoal
* - > <artifact path / link / metric, plus a short read of it>
* 2. [ ] goal: <desc>
*
* # Future work / out of scope
* *
* ## Log * ## Log
* - <verbatim append-only line> * - <verbatim append-only line>
*
* A goal/subtask's state lives in its checkbox (single source of truth, renders natively). Goals are
* matched by their <desc> (the text after "goal:"); the list number is human-facing only. Only
* CompleteGoal writes a goal's [x]; the agent sets [/] when it starts one.
*/ */
export type GoalStatus = "open" | "active" | "done" | "cancelled"; export type GoalStatus = "open" | "active" | "done" | "cancelled";
export interface Subtask { export interface Subtask {
text: string; text: string;
done: boolean; status: GoalStatus;
} }
export interface Goal { export interface Goal {
id: string; /** The text after "goal:" in the header line; the handle CompleteGoal matches on. */
subject: string; subject: string;
status: GoalStatus; status: GoalStatus;
done_when: string; /** Positive observation(s) that the goal succeeded AND that no failure mode could fake. The success test. Written at planning. */
verify?: string; discriminator: string[];
/** Pre-mortem: ways a "done" could be wrong. Written at planning. */ /** Subtle ways a "done" could be wrong (look-like-success failures). Written at planning. */
failure_modes: string[]; failure_modes: string[];
/** Proof the done_when is met, pointing at durable artifacts. Written at completion; read by CompleteGoal. */ /** Optional command that exits 0 only when the discriminator passes (the cheap deterministic gate). */
verify?: string;
/** Proof the discriminator passed, pointing at durable artifacts. Written at completion; read by CompleteGoal. */
evidence: string[]; evidence: string[];
subtasks: Subtask[]; subtasks: Subtask[];
} }
export interface PlanDoc { export interface PlanDoc {
objective: string; title: string;
goals: Goal[]; goals: Goal[];
/** Verbatim ## Log lines, including the leading "- ". */ /** Verbatim ## Log lines, including the leading "- ". */
log: string[]; log: string[];
} }
// Goal header carries the state checkbox: `## Goal: [x] subject`. The checkbox is optional so a const TITLE = /^#\s+(.+?)\s*$/; // the first single-# H1
// header written without one parses as open (group 1 undefined -> " "). const GOALS_HEADER = /^##\s+Goals\s*$/i;
const GOAL_HEADER = /^##\s+Goal:\s*(?:\[([ xX/-])\]\s+)?(.*)$/;
const ANY_HEADER = /^#{1,6}\s/;
const LOG_HEADER = /^##\s+Log\s*$/i; const LOG_HEADER = /^##\s+Log\s*$/i;
const ID_COMMENT = /^<!--\s*id:\s*(.+?)\s*-->$/; const ANY_HEADER = /^#{1,6}\s/;
const CHECKBOX = /^- \[([ xX])\]\s+(.*)$/; // A goal: a numbered or bulleted checkbox item whose text begins "goal:".
const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i;
// A section marker bullet under a goal (the trailing colon is optional, e.g. "- tasks").
const KEY_LINE = /^\s*[-*]\s*(discriminator|subtle failure modes?|failure_modes?|verify|tasks?|evidence)\s*:?\s*(.*)$/i;
// Any list item (numbered or bulleted); used for subtasks and for list items inside the sections.
const LIST_ITEM = /^\s*(?:\d+\.|[-*])\s+(.*)$/;
// A checkbox inside a list-item body (subtask). A leading/trailing ~~ marks it cancelled.
const CHECKBOX_BODY = /^(~~)?\s*\[([ xX/-])\]\s*(.*)$/;
const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" }; const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" };
const STATUS_TO_CHAR: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" }; const STATUS_TO_CHAR: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };
function normalizeKey(raw: string): "discriminator" | "failure_modes" | "verify" | "tasks" | "evidence" {
const k = raw.toLowerCase();
if (k.startsWith("discriminator")) return "discriminator";
if (k.startsWith("verify")) return "verify";
if (k.startsWith("task")) return "tasks";
if (k.startsWith("evidence")) return "evidence";
return "failure_modes"; // "subtle failure mode(s)" / "failure_mode(s)"
}
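To show how `KEY_LINE` and `normalizeKey` work together, here is a small self-contained sketch. The regex and the key mapping are copied from the diff above; the wrapper `readKeyLine` and the alias `SectionKey` are hypothetical names added for illustration:

```typescript
type SectionKey = "discriminator" | "failure_modes" | "verify" | "tasks" | "evidence";

// Copied from the diff: a section marker bullet under a goal (trailing colon optional).
const KEY_LINE = /^\s*[-*]\s*(discriminator|subtle failure modes?|failure_modes?|verify|tasks?|evidence)\s*:?\s*(.*)$/i;

function normalizeKey(raw: string): SectionKey {
  const k = raw.toLowerCase();
  if (k.startsWith("discriminator")) return "discriminator";
  if (k.startsWith("verify")) return "verify";
  if (k.startsWith("task")) return "tasks";
  if (k.startsWith("evidence")) return "evidence";
  return "failure_modes"; // "subtle failure mode(s)" / "failure_mode(s)"
}

// Hypothetical wrapper: returns the normalized key and any inline value, or null.
function readKeyLine(line: string): { key: SectionKey; value: string } | null {
  const m = KEY_LINE.exec(line);
  if (!m) return null;
  return { key: normalizeKey(m[1] ?? ""), value: (m[2] ?? "").trim() };
}

console.log(readKeyLine("   - subtle failure mode: empty output"));
console.log(readKeyLine("- tasks"));
```

Because the match is keyword-anchored, the indentation of the bullet does not matter, which is what the commit message means by "indentation cosmetic".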
export function parse(text: string): PlanDoc { export function parse(text: string): PlanDoc {
const lines = text.split("\n"); const lines = text.split("\n");
let objective = ""; let title = "";
const goals: Goal[] = []; const goals: Goal[] = [];
const log: string[] = []; const log: string[] = [];
let cur: Goal | null = null; let cur: Goal | null = null;
// While inside a `failure_modes:`/`evidence:` block, points at the list the "- " items append to. let curList: string[] | null = null; // the discriminator/failure_modes/evidence list "- " items append to
let curList: string[] | null = null; let inGoals = false;
let inLog = false; let inLog = false;
const flush = () => { const flush = () => {
@@ -87,30 +109,27 @@ export function parse(text: string): PlanDoc {
}; };
for (const line of lines) { for (const line of lines) {
const objMatch = /^#\s+Goals:\s*(.*)$/.exec(line); const tM = TITLE.exec(line);
if (objMatch) { if (tM && !title && !GOALS_HEADER.test(line) && !LOG_HEADER.test(line)) {
objective = objMatch[1].trim(); title = tM[1].trim();
continue; continue;
} }
if (GOALS_HEADER.test(line)) {
const goalMatch = GOAL_HEADER.exec(line);
if (goalMatch) {
flush(); flush();
inGoals = true;
inLog = false; inLog = false;
const status = CHAR_TO_STATUS[(goalMatch[1] ?? " ").toLowerCase()] ?? "open";
cur = { id: "", subject: goalMatch[2].trim(), status, done_when: "", failure_modes: [], evidence: [], subtasks: [] };
continue; continue;
} }
if (LOG_HEADER.test(line)) { if (LOG_HEADER.test(line)) {
flush(); flush();
inGoals = false;
inLog = true; inLog = true;
continue; continue;
} }
// Any other header (e.g. "# Future work") ends the goals / log section.
// Any other header ends the current goal / log section.
if (ANY_HEADER.test(line)) { if (ANY_HEADER.test(line)) {
flush(); flush();
inGoals = false;
inLog = false; inLog = false;
continue; continue;
} }
@@ -119,50 +138,70 @@ export function parse(text: string): PlanDoc {
if (/^\s*-\s+/.test(line)) log.push(line); if (/^\s*-\s+/.test(line)) log.push(line);
continue; continue;
} }
if (!inGoals) continue; // title + context prose between the title and ## Goals
const goalM = GOAL_ITEM.exec(line);
if (goalM) {
flush();
cur = {
subject: goalM[2].trim(),
status: CHAR_TO_STATUS[goalM[1].toLowerCase()] ?? "open",
discriminator: [],
failure_modes: [],
evidence: [],
subtasks: [],
};
continue;
}
if (!cur) continue; if (!cur) continue;
const idMatch = ID_COMMENT.exec(line.trim()); const keyM = KEY_LINE.exec(line);
if (idMatch) { if (keyM) {
cur.id = idMatch[1]; const key = normalizeKey(keyM[1]);
const inlineVal = keyM[2].trim();
if (key === "verify") {
cur.verify = inlineVal || undefined;
curList = null;
} else if (key === "tasks") {
curList = null; // subtasks are identified by being a checkbox; this marker is cosmetic
} else {
curList = cur[key]; // discriminator | failure_modes | evidence
if (inlineVal) curList.push(inlineVal);
}
continue; continue;
} }
// A checkbox (column 0) is a subtask; checked first so it is never read as a list item. const listM = LIST_ITEM.exec(line);
const checkbox = CHECKBOX.exec(line); if (listM) {
if (checkbox) { const body = listM[1];
curList = null; const cb = CHECKBOX_BODY.exec(body);
cur.subtasks.push({ done: checkbox[1].toLowerCase() === "x", text: checkbox[2].trim() }); if (cb) {
continue; // A checkbox without a "goal:" prefix is a subtask of the current goal.
} const cancelled = cb[1] === "~~" || body.includes("~~");
const status = cancelled ? "cancelled" : (CHAR_TO_STATUS[cb[2].toLowerCase()] ?? "open");
const kv = /^(done_when|verify|failure_modes|evidence)\s*:\s*(.*)$/.exec(line); cur.subtasks.push({ text: cb[3].replace(/~~/g, "").trim(), status });
if (kv) { curList = null;
const [, key, value] = kv;
if (key === "done_when") cur.done_when = value.trim();
else if (key === "verify") cur.verify = value.trim() || undefined;
// failure_modes/evidence open a "- " block; done_when/verify close any open one.
curList = key === "failure_modes" ? cur.failure_modes : key === "evidence" ? cur.evidence : null;
continue;
}
// Indented "- " items under failure_modes:/evidence: (a column-0 checkbox already returned above).
if (curList) {
const item = /^\s*-\s+(.*)$/.exec(line);
if (item) {
curList.push(item[1].trim());
continue; continue;
} }
if (line.trim() !== "") curList = null; // A plain "- " / "> " item belongs to the current section (discriminator/failure/evidence).
if (curList) curList.push(body.trim());
continue;
}
// A non-empty, non-"- " line continues the current item, so multi-line evidence (a block quote
// of a log, a table, an interpretation line) stays attached to its item. Blank lines are skipped.
if (curList && line.trim() !== "" && curList.length > 0) {
curList[curList.length - 1] += `\n${line.trim()}`;
} }
} }
flush(); flush();
return { objective, goals, log }; return { title, goals, log };
} }
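The subtask branch of `parse` treats either a `[-]` checkbox or a `~~` strike as cancelled. A minimal sketch of that rule in isolation, using the `CHECKBOX_BODY` regex and status map from the diff (the function name `readSubtask` is hypothetical):

```typescript
type GoalStatus = "open" | "active" | "done" | "cancelled";

// Copied from the diff: a checkbox inside a list-item body; a leading ~~ marks it cancelled.
const CHECKBOX_BODY = /^(~~)?\s*\[([ xX/-])\]\s*(.*)$/;
const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" };

// Hypothetical helper: takes a list-item body (the text after "1." or "-").
function readSubtask(body: string): { text: string; status: GoalStatus } | null {
  const cb = CHECKBOX_BODY.exec(body);
  if (!cb) return null;
  // A strike anywhere in the body wins over whatever the checkbox character says.
  const cancelled = cb[1] === "~~" || body.includes("~~");
  const status: GoalStatus = cancelled ? "cancelled" : (CHAR_TO_STATUS[(cb[2] ?? " ").toLowerCase()] ?? "open");
  return { text: (cb[3] ?? "").replace(/~~/g, "").trim(), status };
}

console.log(readSubtask("~~[ ] drop the cache~~"));
console.log(readSubtask("[/] wire the hook"));
```

The strike markers are removed from the stored text, so a cancelled subtask reads the same in the widget whichever of the two notations the agent used.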
export function findGoal(doc: PlanDoc, id: string): Goal | undefined { export function findGoal(doc: PlanDoc, subject: string): Goal | undefined {
return doc.goals.find((g) => g.id === id); const want = subject.trim();
return doc.goals.find((g) => g.subject === want);
} }
export function counts(doc: PlanDoc): { done: number; open: number; active: number } { export function counts(doc: PlanDoc): { done: number; open: number; active: number } {
@@ -175,21 +214,18 @@ export function counts(doc: PlanDoc): { done: number; open: number; active: numb
return c; return c;
} }
/** Flip a goal's header checkbox in place (the one write CompleteGoal needs). Normalizes a header that /** Flip a goal's checkbox in place, matched by its subject (the one write CompleteGoal needs). */
* lacks a checkbox by inserting one. */ export function setGoalStatus(text: string, subject: string, status: GoalStatus): string {
export function setGoalStatus(text: string, id: string, status: GoalStatus): string {
const lines = text.split("\n"); const lines = text.split("\n");
const idIdx = lines.findIndex((l) => ID_COMMENT.exec(l.trim())?.[1] === id); const want = subject.trim();
if (idIdx === -1) throw new Error(`Goal #${id} not found`); for (let i = 0; i < lines.length; i++) {
// The header sits just above the id comment; scan upward for it. const m = GOAL_ITEM.exec(lines[i]);
for (let i = idIdx; i >= 0; i--) { if (m && m[2].trim() === want) {
const m = GOAL_HEADER.exec(lines[i]); lines[i] = lines[i].replace(/\[[ xX/-]\]/, `[${STATUS_TO_CHAR[status]}]`);
if (m) {
lines[i] = `## Goal: [${STATUS_TO_CHAR[status]}] ${m[2].trim()}`;
return lines.join("\n"); return lines.join("\n");
} }
} }
throw new Error(`Goal #${id} has no ## Goal: header`); throw new Error(`Goal "${subject}" not found`);
} }
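The one-line write that `setGoalStatus` performs can be illustrated on a tiny plan. This is a simplified sketch, assuming the `GOAL_ITEM` regex from the diff; `markDone` and the sample plan text are hypothetical:

```typescript
// Copied from the diff: a numbered or bulleted checkbox item whose text begins "goal:".
const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i;

// Hypothetical simplification of setGoalStatus: flip the matching goal to [x].
function markDone(text: string, subject: string): string {
  const lines = text.split("\n");
  const want = subject.trim();
  const i = lines.findIndex((l) => {
    const m = GOAL_ITEM.exec(l);
    return m !== null && (m[2] ?? "").trim() === want;
  });
  if (i === -1) throw new Error(`Goal "${subject}" not found`);
  // Only the first checkbox on the goal line is rewritten; every other line is untouched.
  lines[i] = (lines[i] ?? "").replace(/\[[ xX/-]\]/, "[x]");
  return lines.join("\n");
}

const before = [
  "## Goals",
  "",
  "1. [/] goal: parser handles strike-through",
  "   - tasks:",
  "     1. [x] add regex",
].join("\n");
const after = markDone(before, "parser handles strike-through");
console.log(after);
```

Matching on the exact subject text means that rewording a goal after planning changes its handle, so the text passed to CompleteGoal has to match the file as it currently reads.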
/** /**
@@ -201,28 +237,28 @@ export type SignOff =
| { kind: "rejected"; missing: string } | { kind: "rejected"; missing: string }
| { kind: "accepted" }; | { kind: "accepted" };
/** Apply a sign-off outcome to goals.md text: accept flips the header checkbox to [x] + logs; reject only logs. Pure. */ /** Apply a sign-off outcome to goals.md text: accept flips the goal checkbox to [x] + logs; reject only logs. Pure. */
export function recordSignOff( export function recordSignOff(
text: string, text: string,
goalId: string, subject: string,
when: string, when: string,
outcome: SignOff, outcome: SignOff,
): { content: string; message: string; isError: boolean } { ): { content: string; message: string; isError: boolean } {
const goal = findGoal(parse(text), goalId); const goal = findGoal(parse(text), subject);
if (!goal) return { content: text, message: `No goal #${goalId} in goals.md.`, isError: true }; if (!goal) return { content: text, message: `No goal "${subject}" in goals.md.`, isError: true };
if (outcome.kind === "verify_failed") { if (outcome.kind === "verify_failed") {
const content = appendLog(text, `${when} reject #${goalId}: verify exit ${outcome.exitCode}`); const content = appendLog(text, `${when} reject "${subject}": verify exit ${outcome.exitCode}`);
return { content, message: `Sign-off rejected: verify failed (exit ${outcome.exitCode}).\n${outcome.outputTail}`, isError: true }; return { content, message: `Sign-off rejected: verify failed (exit ${outcome.exitCode}).\n${outcome.outputTail}`, isError: true };
} }
if (outcome.kind === "rejected") { if (outcome.kind === "rejected") {
const oneLine = outcome.missing.replace(/\s+/g, " ").trim().slice(0, 200); const oneLine = outcome.missing.replace(/\s+/g, " ").trim().slice(0, 200);
const content = appendLog(text, `${when} reject #${goalId}: ${oneLine}`); const content = appendLog(text, `${when} reject "${subject}": ${oneLine}`);
return { content, message: `Sign-off rejected. Missing:\n${outcome.missing}`, isError: true }; return { content, message: `Sign-off rejected. Missing:\n${outcome.missing}`, isError: true };
} }
const flipped = setGoalStatus(text, goalId, "done"); const flipped = setGoalStatus(text, subject, "done");
const content = appendLog(flipped, `${when} signed off #${goalId}: ${goal.subject} (oracle accept)`); const content = appendLog(flipped, `${when} signed off "${subject}" (judge accept)`);
return { content, message: `Signed off #${goalId}: ${goal.subject}. Marked done in goals.md.`, isError: false }; return { content, message: `Signed off "${subject}". Marked done in goals.md.`, isError: false };
} }
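For a rejection, `recordSignOff` collapses the judge's `missing` text into one log line. A minimal sketch of that formatting, following the new-side code in the diff (the helper name `rejectLogLine` and the sample values are hypothetical):

```typescript
// Hypothetical helper mirroring the "rejected" branch of recordSignOff:
// collapse whitespace, cap at 200 characters, and key the line by the goal's subject.
function rejectLogLine(when: string, subject: string, missing: string): string {
  const oneLine = missing.replace(/\s+/g, " ").trim().slice(0, 200);
  return `${when} reject "${subject}": ${oneLine}`;
}

const line = rejectLogLine("2026-06-16 03:45", "ship parser", "no test output\n  attached");
console.log(line);
```

The full multi-line `missing` text still reaches the agent through the tool result; only the `## Log` entry is shortened so the log stays one line per event.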
/** Append one verbatim line to ## Log (creating the section if absent). The other CompleteGoal write. */ /** Append one verbatim line to ## Log (creating the section if absent). The other CompleteGoal write. */
+107 -80
@@ -8,7 +8,7 @@
* trapping it. Bypasses stay visible in the git diff and the widget. * trapping it. Bypasses stay visible in the git diff and the widget.
* *
* Flow: * Flow:
* SETUP (plan mode) 1. planDrafting — strong/sticky model drafts goals * SETUP (plan mode) 1. planDrafting — drafts goals (read-only phase)
* EXEC, each turn start 2. planInjection — "here is your plan, where you are" * EXEC, each turn start 2. planInjection — "here is your plan, where you are"
* EXEC, periodic 3. reminder — the typed nudge that drives upkeep + autonomy * EXEC, periodic 3. reminder — the typed nudge that drives upkeep + autonomy
* EXEC, loop continue 4. continuation — keep going toward the active goal * EXEC, loop continue 4. continuation — keep going toward the active goal
@@ -22,61 +22,82 @@
* NOT YET WIRED: 4 continuation and 5 loopJudge define the autonomous re-prompt loop, which is * NOT YET WIRED: 4 continuation and 5 loopJudge define the autonomous re-prompt loop, which is
* intentionally not built in v1 (an until-done-style loop was judged too complex). They stay here so * intentionally not built in v1 (an until-done-style loop was judged too complex). They stay here so
* the full intended flow is reviewable; wire them if/when the loop is added. * the full intended flow is reviewable; wire them if/when the loop is added.
*
* The goal's test is the DISCRIMINATOR: the concrete observation that tells real success from the
* named subtle failure mode. It replaces a vague "done_when". Evidence is empty at planning and
* filled at sign-off (you don't always know the exact artifacts up front; the judge checks them then).
*/ */
/* ───────────────────────────────────────────────────────────────────────── /* ─────────────────────────────────────────────────────────────────────────
* 1. planDrafting — SETUP, plan mode * 1. planDrafting — SETUP, plan mode
* *
* System guidance for the plan-phase agent. Runs on the plan model (may differ * System guidance for the plan-phase agent. This phase is read-only (edit/write
* from the execution model; the choice is sticky — see oracle.json-style config). * and mutating bash are blocked by a tool hook): explore, then draft goals into
* This phase is read-only: explore, then draft goals into goals.md. No code yet. * goals.md. The fields here are the whole "elicitation"; the human reviews this
* The field requirements here are the whole "elicitation" — get them agreed up * output before any execution.
* front, because the human reviews this output before any execution.
* ──────────────────────────────────────────────────────────────────────── */ * ──────────────────────────────────────────────────────────────────────── */
export const planDrafting = `\ export const planDrafting = `\
You are in plan mode. Explore the repository read-only, then draft goals into goals.md. You are in plan mode. The objective may arrive through conversation, not as one up-front command.
Do not write or run code in this phase. Produce a plan the human will review and approve. Explore the repository read-only first, then ask: resolve discoverable facts by looking them up, and
only ask the human when the answer is a genuine intent or preference choice that exploration can't
settle. Don't write goals that branch on something you could just check. Do not write or run code in
this phase (edit and write are blocked, and so is mutating bash). If the ask is itself read-only
(e.g. research, a search, a report), explore enough to scope it, but leave the actual deliverable for
after the human approves the plan. When the objective is clear, draft goals into goals.md and stop
for review. Produce a plan the human will review and approve.
Right-size it, don't force structure that isn't there: Right-size it, don't force structure that isn't there:
- Default to ONE goal. Add another only when it's a genuinely separate checkpoint you'd want - Default to ONE goal. Add another only when it's a genuinely separate checkpoint you'd want signed
signed off on its own (its own done_when that can pass or fail independently). A long list of off on its own (it can pass or fail independently). Most objectives are 1-2 goals.
near-identical goals should be one goal with subtasks. Most objectives are 1-2 goals. - Subtasks are the steps inside a goal. Add them when a goal has 3+ distinct steps; skip them for a
- Subtasks are the steps inside a goal. Add them when a goal has 3+ distinct steps; skip them for single-action goal. Don't pad with trivial steps.
a single-action goal. Don't pad with trivial steps. - Don't invent goals to look thorough. When in doubt, merge.
- Don't invent phases to look thorough. When in doubt, merge.
Write the whole file in this shape: Write the whole file in this shape (markdown checkboxes, made to be skim-reviewed):
# Goals: <the objective> # <short plan title>
## Goal: [ ] <one short imperative line> <context: restate the user's ask, their stated preferences, and any decisions you've agreed on>
<!-- id: <kebab-case-slug, unique> -->
done_when: <one falsifiable check; what is true on disk when this is done>
verify: <optional shell command that exits 0 only when done_when holds; omit if not testable>
- [ ] <subtask>
- [ ] <subtask>
failure_modes: ## Goals
- <a sneaky way this could look done but isn't; terse, optional>
evidence:
- <leave empty now; fill at sign-off with proof the done_when is met (durable artifacts)>
Keep it lean: 1. [ ] goal: <one short imperative line>
- The goal's state is the checkbox in its header: [ ] open, [/] active, [x] done, [-] cancelled. - subtle failure mode: <a way this could look done but isn't>
Leave it [ ] at planning. Every goal needs its <!-- id --> line; CompleteGoal finds goals by it. - discriminator: <the concrete observation that tells real success from that failure>
- The subtask checklist comes right under the goal; failure_modes and the (empty) evidence block - verify: <optional shell command that exits 0 only when the discriminator passes; omit if not testable>
sit at the end, after a blank line. Don't let the dash-lists run together. - tasks:
- evidence stays empty at planning. You fill it when the goal is actually done, just before calling 1. [ ] <subtask>
CompleteGoal, with a "- " list pointing at real artifacts (files, saved logs, committed diffs). 2. [ ] <subtask>
- done_when is ONE concrete, checkable condition, not a paragraph, no "if wrong" clause. - evidence:
The symptom of failure goes in failure_modes, not here. - <leave empty now; filled at sign-off>
- done_when names a real artifact: a file, a test result, a committed diff, a program's output. 2. [ ] goal: <...>
Never write it about goals.md's own checkbox or ## Log: CompleteGoal writes those when it accepts,
so a done_when about them is circular and the sign-off can never pass. # Future work / out of scope
- failure_modes: 0-2 terse items, only the non-obvious ways a "done" could be wrong (a
pre-mortem). If you add a verify command, one mode can be "verify passes on a gamed file". - <anything deliberately not in these goals>
- subtasks: a short checklist of the real steps; omit them if the goal is a single action.
- Prefer a verify command when success is a test/build/threshold. A green check beats prose. ## Log
Keep it lean and legible:
- A goal is a checkbox line beginning "goal:"; its state is the checkbox ([ ] open, [/] active, [x]
done, [-] cancelled). Leave goals [ ] at planning. The number is just for the human to reference.
- subtle failure mode + discriminator are the heart of this. List the ways a "done" could look
achieved but not be (empty/zero-count output, a silently-errored step, a gamed test, a flat/no-op
result that dodged every trap and still showed nothing; these are examples, find the ones that fit).
- The discriminator is the POSITIVE observation that the goal actually succeeded AND that none of
those failure modes could have produced. It must show success happened -- the count moved the right
way, the test really exercised the path, the metric beat noise -- not merely that a failure was
ruled out: avoiding every failure mode is necessary, not sufficient. Name the success signal first,
then check it isn't something a failure mode could fake. Keep it terse.
- The discriminator is the success test, written now, in place of a vague "done": make it a concrete,
checkable observation about a real artifact (a file, a test result, a committed diff, a metric), not
about goals.md's own checkbox.
- subtasks: any checkbox WITHOUT a "goal:" prefix, under "- tasks:". Use [/] for in progress and [-]
for cancelled/impossible.
- verify: prefer one when the discriminator is a test, build, threshold, or metric: a green check or
a printed number beats prose. Omit it otherwise.
- evidence stays empty at planning. You don't always know the exact artifacts up front, and that's
fine: you fill evidence at sign-off, and a fresh read-only judge checks it then.
When the goals are drafted, present them and stop for review. Do not begin execution.`; When the goals are drafted, present them and stop for review. Do not begin execution.`;
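Reviewer note: the checkbox convention in the drafting prompt above ([ ] open, [/] active, [x] done, [-] cancelled, with goals marked by a "goal:" prefix) could be read with one keyword-anchored regex. This is a minimal sketch, not the parser in src/plan-file.js; the names `GoalLine`, `CHECKBOX`, and `parseGoalLine` are hypothetical.

```typescript
// Hypothetical sketch: names here are illustrative, not the extension's API.
type Status = "open" | "active" | "done" | "cancelled";

const CHECKBOX: Record<string, Status> = {
  " ": "open",
  "/": "active",
  "x": "done",
  "-": "cancelled",
};

interface GoalLine {
  number: number;
  status: Status;
  subject: string;
}

// Keyword-anchored: a goal is a numbered checkbox line whose text begins "goal:".
// Leading whitespace is skipped, since indentation is cosmetic in this format.
const GOAL_RE = /^\s*(\d+)\.\s*\[([ \/x-])\]\s*goal:\s*(.+?)\s*$/;

function parseGoalLine(line: string): GoalLine | null {
  const m = GOAL_RE.exec(line);
  if (!m) return null; // a subtask, a field line, or prose
  return { number: Number(m[1]), status: CHECKBOX[m[2]], subject: m[3] };
}
```

A numbered checkbox without the "goal:" prefix returns `null` here, which is what lets subtasks share the checkbox syntax without being mistaken for goals.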
@@ -85,25 +106,26 @@ When the goals are drafted, present them and stop for review. Do not begin execu
  *
  * A late user-role message, NOT a system-prompt mutation (keeps the prefix cache
  * valid). Built from the parsed plan. MUST be byte-identical when nothing changed:
- * fixed field order, no volatile timestamps in the body. Pass only the active
- * goal + its open subtasks + the last log line not the whole file.
+ * fixed field order, no volatile timestamps. Pass only the active goal + its open
+ * subtasks + the last log line, not the whole file.
  * ──────────────────────────────────────────────────────────────────────── */
 export function planInjection(p: {
-  objective: string;
-  activeGoal: { subject: string; done_when: string; openSubtasks: string[] } | null;
+  title: string;
+  activeGoal: { subject: string; discriminator: string[]; openSubtasks: string[] } | null;
   lastLogLine: string | null;
   counts: { done: number; open: number };
 }): string {
   if (!p.activeGoal) {
-    return `Goals (goals.md): ${p.objective}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal (set its header to [/]) or run /goals.`;
+    return `Goals (goals.md): ${p.title}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal (set its checkbox to [/]) or run /goals.`;
   }
   const subtasks = p.activeGoal.openSubtasks.length
     ? p.activeGoal.openSubtasks.map((s) => ` - [ ] ${s}`).join("\n")
     : " (no open subtasks)";
+  const disc = p.activeGoal.discriminator.length ? p.activeGoal.discriminator.join("; ") : "(none set)";
   return `\
-Goals (goals.md): ${p.objective}
+Goals (goals.md): ${p.title}
 Active goal: ${p.activeGoal.subject}
-done_when: ${p.activeGoal.done_when}
+discriminator (the success test): ${disc}
 Open subtasks:
 ${subtasks}
 Last log: ${p.lastLogLine ?? "(none yet)"}
@@ -114,20 +136,20 @@ Progress: ${p.counts.done} done, ${p.counts.open} open.`;
  * 3. reminder — EXEC, periodic system-reminder
  *
  * The typed nudge. This is both the housekeeping and the autonomy engine — it is
- * what makes the process get followed without a hard gate. Fires after N
- * file-modifying turns since the last goals.md update while a goal is active.
- * Keep the wording stable so it doesn't thrash the cache.
+ * what makes the process get followed without a hard gate. Fires after a turn that
+ * left goals.md untouched while a goal is active. Keep the wording stable so it
+ * doesn't thrash the cache.
  * ──────────────────────────────────────────────────────────────────────── */
 export const reminder = `\
 <system-reminder>
 Keep goals.md current as you work:
-- tasks: tick the subtasks you've finished; add any new ones you've discovered.
+- tasks: tick the subtasks you've finished ([/] for in progress); add any you've discovered.
 - log: append ONE short line to ## Log (append, don't rewrite earlier lines).
-- goal: when the active goal's done_when is met, fill its evidence: block in goals.md (a "- " list
-  pointing at durable artifacts), then call CompleteGoal with the goal_id. Don't tick the goal's
-  header [x] by hand; CompleteGoal reads the evidence, runs the check, and writes [x].
-- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely
-  blocked; if blocked, say what's blocking and why.
+- goal: when the active goal's discriminator is satisfied, fill its evidence: block in goals.md (a
+  list pointing at durable artifacts), then call CompleteGoal with the goal's desc. Don't tick the
+  goal [x] by hand; CompleteGoal reads the evidence, runs the check, and writes [x].
+- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely blocked;
+  if blocked, say what's blocking it.
 </system-reminder>`;

 /* ─────────────────────────────────────────────────────────────────────────
@@ -137,9 +159,9 @@ Keep goals.md current as you work:
  * continue. Does not mutate the system prompt, so the cache holds.
  * ──────────────────────────────────────────────────────────────────────── */
 export const continuation = `\
-Continue toward the active goal in goals.md. If it now meets its done_when, fill the goal's
-evidence: block (durable artifacts: saved logs, committed diffs, files, not just claims) and then
-call CompleteGoal with the goal_id. If you're blocked, state what's blocking it.`;
+Continue toward the active goal in goals.md. If its discriminator is now satisfied, fill the goal's
+evidence: block (durable artifacts, e.g. saved logs, committed diffs, files, not just claims) and
+then call CompleteGoal with the goal's desc. If you're blocked, state what's blocking it.`;

 /* ─────────────────────────────────────────────────────────────────────────
  * 5. loopJudge — EXEC, runs after each turn to decide continue / pause
@@ -154,12 +176,12 @@ You decide whether an autonomous coding agent should keep working or pause for t
 Be conservative: only pause when the work is plainly finished or plainly blocked. When in
 doubt, continue. You are not verifying correctness; a later read-only judge does that.
 Reply with ONLY a JSON object, no other text: {"done": boolean, "reason": "<one sentence>"}.
-Set done=true only if the agent's last message shows the active goal's done_when is met, or
-the agent says it is blocked and needs the human.`;
+Set done=true only if the agent's last message shows the active goal's discriminator is satisfied,
+or the agent says it is blocked and needs the human.`;

-export function loopJudgeUser(p: { activeGoalDoneWhen: string; lastResponse: string }): string {
+export function loopJudgeUser(p: { discriminator: string; lastResponse: string }): string {
   return `\
-Active goal done_when: ${p.activeGoalDoneWhen}
+Active goal discriminator (the success test): ${p.discriminator}

 Agent's last message:
 """
@@ -172,22 +194,26 @@ ${p.lastResponse}
 /* ─────────────────────────────────────────────────────────────────────────
  * 6. evidenceJudge — SIGN-OFF, the one rigorous check
  *
- * Runs inside CompleteGoal, on the read-only oracle subprocess (fresh context,
- * strongest reasoning on the chosen provider; override to a different vendor for
- * high-stakes goals). It re-derives from the repo rather than trusting the
- * agent's transcription, and it judges whether a verify command actually tests
- * the criterion or could pass while a named failure mode holds (gaming).
+ * Runs inside CompleteGoal, on a read-only pi subprocess (fresh context via
+ * --no-session, so it never sees the working agent's transcript; override to a
+ * different vendor for an independent cross-family check). It re-derives from the
+ * repo rather than trusting the agent's transcription, and judges whether the
+ * evidence satisfies the discriminator and rules out the named failure mode.
  *
  * The transport gives it read/grep/find/ls. The prompt below imposes the verdict
- * contract — the oracle returns prose by default, so parse the VERDICT line.
+ * contract — the subprocess returns prose by default, so parse the VERDICT line.
  * ──────────────────────────────────────────────────────────────────────── */
 export const evidenceJudgeSystem = `\
 You are a read-only reviewer signing off a coding goal. Do not trust claims; verify.
 Use read/grep/find/ls to inspect the repository and the cited artifacts yourself. Re-read the
 files, logs, and diffs the evidence points to; if something it asserts isn't on disk, you can't
-confirm it. If a verify command was run, judge whether it genuinely tests the criterion or
-could pass while one of the listed failure modes still holds; a tautological or skipped test
-is a reject. Check each failure mode is actually ruled out, not just unmentioned.
+confirm it. Judge whether the evidence shows the goal POSITIVELY succeeded -- the discriminator's
+success signal is actually present, not just that the failure modes were dodged. Avoiding every
+failure mode is necessary but not sufficient: a run can rule out each trap and still have produced
+nothing, so reject "no problems found" that lacks the positive result. Then check the named subtle
+failure modes are genuinely ruled out, not just unmentioned. If a verify command was run,
+judge whether it really tests the discriminator or could pass while the failure mode still holds; a
+tautological or skipped test is a reject.

 Finish with exactly these two lines and nothing after:
 VERDICT: accept | reject
@@ -195,10 +221,10 @@ missing: <empty if accept; otherwise a short list of what's needed before this c
 export function evidenceJudgeUser(p: {
   subject: string;
-  done_when: string;
+  discriminator: string[];
+  failure_modes: string[];
   verify: string | null;
   verifyResult: { command: string; exitCode: number; outputTail: string } | null;
-  failure_modes: string[];
   evidence: string;
   paths: string[];
 }): string {
@@ -207,9 +233,10 @@ export function evidenceJudgeUser(p: {
     : "verify command: none (no deterministic check for this goal)";
   return `\
 Goal: ${p.subject}
-done_when: ${p.done_when}
-failure_modes:
-${p.failure_modes.map((f) => ` - ${f}`).join("\n")}
+discriminator (must be satisfied):
+${p.discriminator.map((d) => ` - ${d}`).join("\n") || " (none stated, note this)"}
+subtle failure modes (must be ruled out):
+${p.failure_modes.map((f) => ` - ${f}`).join("\n") || " (none stated)"}

 ${verifyBlock}
@@ -219,5 +246,5 @@ ${p.evidence}
 Artifacts it points to (inspect these):
 ${p.paths.map((x) => ` - ${x}`).join("\n") || " (none listed, note this)"}

-Verify the goal against its done_when. Then give your VERDICT.`;
+Verify the evidence satisfies the discriminator and rules out the failure modes. Then give your VERDICT.`;
 }
+75 -90
@@ -1,27 +1,30 @@
import { describe, expect, it } from "vitest"; import { describe, expect, it } from "vitest";
import { appendLog, counts, findGoal, parse, recordSignOff, setGoalStatus } from "../src/plan-file.js"; import { appendLog, counts, findGoal, parse, recordSignOff, setGoalStatus } from "../src/plan-file.js";
const SAMPLE = `# Goals: ship the cache layer const SAMPLE = `# papers audit
## Goal: [/] Implement cache layer Clean up steering/ metadata and kill empty dirs. Keep it read-only until I approve.
<!-- id: cache-layer-1 -->
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q
failure_modes:
- cache silently bypassed (hit-rate ~0, latency ok by luck)
- bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy
- [ ] load test
evidence:
- load-test.log shows p95=41ms
- hit-rate 0.93 in load-test.log
## Goal: [ ] Document the API ## Goals
<!-- id: document-the-api-1 -->
done_when: every public fn has a docstring; else sphinx warns 1. [/] goal: Implement cache layer
failure_modes: - discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0)
- docstrings exist but are stale - subtle failure mode: cache silently bypassed, latency ok by luck
- verify: pytest tests/cache -q
- tasks:
1. [x] wire cache client
2. [/] eviction policy
3. ~~[ ]~~ distributed cache, out of scope
- evidence:
- > load-test.log: p95=41ms
- > hit-rate 0.93 (not bypassed)
2. [ ] goal: Document the API
- discriminator: every public fn has a docstring; sphinx warns on none
- subtle failure mode: docstrings exist but are stale
# Future work / out of scope
- distributed cache
## Log ## Log
- 2026-06-15 14:02 cache client wired; eviction next - 2026-06-15 14:02 cache client wired; eviction next
@@ -49,92 +52,74 @@ function lineDelta(a: string, b: string): { added: number; removed: number } {
 describe("parse", () => {
   const doc = parse(SAMPLE);

-  it("reads the objective and both goals", () => {
-    expect(doc.objective).toBe("ship the cache layer");
-    expect(doc.goals.map((g) => g.id)).toEqual(["cache-layer-1", "document-the-api-1"]);
+  it("reads the title and both goals (matched by subject)", () => {
+    expect(doc.title).toBe("papers audit");
+    expect(doc.goals.map((g) => g.subject)).toEqual(["Implement cache layer", "Document the API"]);
   });

-  it("reads goal fields, with status from the header checkbox", () => {
-    const g = findGoal(doc, "cache-layer-1");
-    expect(g?.subject).toBe("Implement cache layer");
-    expect(g?.status).toBe("active"); // from the [/] in the header
-    expect(g?.done_when).toBe("p95 < 50ms on bench-X. If wrong: timeouts in load-test.log");
+  it("reads goal status from the checkbox", () => {
+    expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // [/]
+    expect(findGoal(doc, "Document the API")?.status).toBe("open"); // [ ]
+  });
+
+  it("reads discriminator, subtle failure mode, and verify as separate fields", () => {
+    const g = findGoal(doc, "Implement cache layer");
+    expect(g?.discriminator).toEqual(["hit-rate > 0.8 in load-test.log (a bypass reads ~0)"]);
+    expect(g?.failure_modes).toEqual(["cache silently bypassed, latency ok by luck"]);
     expect(g?.verify).toBe("pytest tests/cache -q");
-    expect(findGoal(doc, "document-the-api-1")?.status).toBe("open"); // from [ ]
   });

-  it("separates failure_modes from subtasks", () => {
-    const g = findGoal(doc, "cache-layer-1");
-    expect(g?.failure_modes).toHaveLength(2);
-    expect(g?.failure_modes[0]).toContain("cache silently bypassed");
+  it("reads subtasks with their checkbox state, strikethrough as cancelled", () => {
+    const g = findGoal(doc, "Implement cache layer");
     expect(g?.subtasks).toEqual([
-      { text: "wire cache client", done: true },
-      { text: "eviction policy", done: false },
-      { text: "load test", done: false },
+      { text: "wire cache client", status: "done" },
+      { text: "eviction policy", status: "active" },
+      { text: "distributed cache, out of scope", status: "cancelled" },
     ]);
   });

-  it("reads the evidence block, separate from failure_modes and subtasks", () => {
-    const g = findGoal(doc, "cache-layer-1");
-    expect(g?.evidence).toEqual(["load-test.log shows p95=41ms", "hit-rate 0.93 in load-test.log"]);
-    expect(g?.failure_modes).toHaveLength(2); // unchanged by the evidence block that follows the subtasks
-    const g2 = findGoal(doc, "document-the-api-1");
-    expect(g2?.evidence).toEqual([]); // a goal with no evidence block parses to []
+  it("reads the evidence block separate from the other lists", () => {
+    const g = findGoal(doc, "Implement cache layer");
+    expect(g?.evidence).toEqual(["> load-test.log: p95=41ms", "> hit-rate 0.93 (not bypassed)"]);
+    expect(findGoal(doc, "Document the API")?.evidence).toEqual([]); // a goal with no evidence parses to []
+  });
+
+  it("keeps a multi-line evidence item together (quote + interpretation)", () => {
+    const doc2 = parse(
+      `# x\n\n## Goals\n\n1. [ ] goal: G\n - discriminator: report has non-zero counts\n - evidence:\n - > report.txt: counts 52 -> 4\n remaining 4 = index + 3 notes\n almost certain the discriminator passes\n - > second item, single line\n`,
+    );
+    expect(findGoal(doc2, "G")?.evidence).toEqual([
+      "> report.txt: counts 52 -> 4\nremaining 4 = index + 3 notes\nalmost certain the discriminator passes",
+      "> second item, single line",
+    ]);
   });

   it("reads the log verbatim and counts by status", () => {
     expect(doc.log).toEqual(["- 2026-06-15 14:02 cache client wired; eviction next"]);
     expect(counts(doc)).toEqual({ done: 0, open: 1, active: 1 });
   });
-});

-describe("failure_modes vs subtask disambiguation", () => {
-  it("a column-0 checkbox right after failure_modes: is a SUBTASK", () => {
-    const doc = parse(
-      `# Goals: x\n\n## Goal: [ ] G\n<!-- id: g-1 -->\ndone_when: z\nfailure_modes:\n- [ ] first subtask\n- [x] second subtask\n`,
-    );
-    const g = findGoal(doc, "g-1");
-    expect(g?.failure_modes).toEqual([]);
-    expect(g?.subtasks).toEqual([
-      { text: "first subtask", done: false },
-      { text: "second subtask", done: true },
-    ]);
-  });
-
-  it("an indented checkbox-shaped item inside failure_modes is a FAILURE MODE", () => {
-    const doc = parse(
-      `# Goals: x\n\n## Goal: [ ] G\n<!-- id: g-2 -->\ndone_when: z\nfailure_modes:\n - [ ] prose that looks like a checkbox\n- [ ] real subtask\n`,
-    );
-    const g = findGoal(doc, "g-2");
-    expect(g?.failure_modes).toEqual(["[ ] prose that looks like a checkbox"]);
-    expect(g?.subtasks).toEqual([{ text: "real subtask", done: false }]);
-  });
-
-  it("a goal with no failure_modes keeps its subtasks", () => {
-    const doc = parse(`# Goals: x\n\n## Goal: [ ] G\n<!-- id: g-3 -->\ndone_when: z\n- [ ] only subtask\n`);
-    const g = findGoal(doc, "g-3");
-    expect(g?.failure_modes).toEqual([]);
-    expect(g?.subtasks).toEqual([{ text: "only subtask", done: false }]);
+  it("ignores the Future work section, does not read it as goals or log", () => {
+    expect(doc.goals).toHaveLength(2);
+    expect(doc.log).toHaveLength(1);
   });
 });

 describe("the two CompleteGoal writes (minimal diff)", () => {
   it("setGoalStatus replaces exactly one line, scoped to the right goal", () => {
-    const next = setGoalStatus(SAMPLE, "cache-layer-1", "done");
+    const next = setGoalStatus(SAMPLE, "Implement cache layer", "done");
     expect(lineDelta(SAMPLE, next)).toEqual({ added: 1, removed: 1 });
-    expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("done");
-    expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("open"); // untouched
+    expect(findGoal(parse(next), "Implement cache layer")?.status).toBe("done");
+    expect(findGoal(parse(next), "Document the API")?.status).toBe("open"); // untouched
   });

-  it("setGoalStatus targets the second goal without touching the first", () => {
-    const next = setGoalStatus(SAMPLE, "document-the-api-1", "active");
-    expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("active");
-    expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("active");
+  it("setGoalStatus keeps the number and goal: prefix, flips only the checkbox", () => {
+    expect(setGoalStatus(SAMPLE, "Implement cache layer", "done")).toContain("1. [x] goal: Implement cache layer");
+    expect(setGoalStatus(SAMPLE, "Document the API", "cancelled")).toContain("2. [-] goal: Document the API");
   });

-  it("setGoalStatus writes the checkbox char into the header line", () => {
-    expect(setGoalStatus(SAMPLE, "cache-layer-1", "done")).toContain("## Goal: [x] Implement cache layer");
-    expect(setGoalStatus(SAMPLE, "document-the-api-1", "cancelled")).toContain("## Goal: [-] Document the API");
+  it("setGoalStatus throws on an unknown subject", () => {
+    expect(() => setGoalStatus(SAMPLE, "no such goal", "done")).toThrow();
   });

   it("appendLog adds exactly one line under ## Log", () => {
@@ -147,7 +132,7 @@ describe("the two CompleteGoal writes (minimal diff)", () => {
   });

   it("appendLog creates the section when absent", () => {
-    const noLog = "# Goals: x\n\n## Goal: [ ] y\n<!-- id: y-1 -->\ndone_when: z\n";
+    const noLog = "# x\n\n## Goals\n\n1. [ ] goal: y\n - discriminator: z\n";
     expect(parse(appendLog(noLog, "first entry")).log).toEqual(["- first entry"]);
   });
 });
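Reviewer note: the setGoalStatus tests above pin down three behaviours: exactly one line changes, the number and "goal:" prefix survive, and an unknown subject throws. A minimal sketch that would satisfy those three properties follows. It is not the implementation in src/plan-file.js; `setGoalStatusSketch`, `GoalStatus`, and `MARK` are hypothetical names, and it assumes LF line endings.

```typescript
// Hypothetical sketch of a one-line status flip, matched by goal subject.
type GoalStatus = "open" | "active" | "done" | "cancelled";

const MARK: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };

// Groups: (number + "[") checkbox-char ("] goal: ") (subject)
const GOAL_LINE = /^(\s*\d+\.\s*\[)[ \/x-](\]\s*goal:\s*)(.*)$/;

function setGoalStatusSketch(content: string, subject: string, status: GoalStatus): string {
  const lines = content.split("\n");
  const idx = lines.findIndex((l) => {
    const m = GOAL_LINE.exec(l);
    return m !== null && m[3].trim() === subject;
  });
  if (idx === -1) throw new Error(`no goal with subject "${subject}"`);
  // Rewrite only the checkbox character; everything else on the line is kept.
  lines[idx] = lines[idx].replace(
    GOAL_LINE,
    (_all: string, pre: string, mid: string, rest: string) => `${pre}${MARK[status]}${mid}${rest}`,
  );
  return lines.join("\n");
}
```

Matching by subject means two goals with identical text would be ambiguous; this sketch simply takes the first match.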
@@ -156,30 +141,30 @@ describe("recordSignOff (CompleteGoal's pure record logic)", () => {
   const WHEN = "2026-06-15 16:00";

   it("accept flips status:done and logs a sign-off line", () => {
-    const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "accepted" });
+    const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "accepted" });
     expect(r.isError).toBe(false);
     const doc = parse(r.content);
-    expect(findGoal(doc, "cache-layer-1")?.status).toBe("done");
-    expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off #cache-layer-1: Implement cache layer (oracle accept)`);
+    expect(findGoal(doc, "Implement cache layer")?.status).toBe("done");
+    expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off "Implement cache layer" (judge accept)`);
   });

   it("verify_failed only logs a reject line, status stays active", () => {
-    const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
+    const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
     expect(r.isError).toBe(true);
     const doc = parse(r.content);
-    expect(findGoal(doc, "cache-layer-1")?.status).toBe("active"); // NOT marked done
-    expect(doc.log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: verify exit 1`);
+    expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // NOT marked done
+    expect(doc.log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": verify exit 1`);
   });

   it("rejected logs the (one-lined) missing reason, status stays", () => {
-    const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
+    const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
     expect(r.isError).toBe(true);
-    expect(findGoal(parse(r.content), "cache-layer-1")?.status).toBe("active");
-    expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: no saved bench log`);
+    expect(findGoal(parse(r.content), "Implement cache layer")?.status).toBe("active");
+    expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": no saved bench log`);
   });

   it("unknown goal returns an error and does not touch the file", () => {
-    const r = recordSignOff(SAMPLE, "nope-1", WHEN, { kind: "accepted" });
+    const r = recordSignOff(SAMPLE, "nope", WHEN, { kind: "accepted" });
     expect(r.isError).toBe(true);
     expect(r.content).toBe(SAMPLE);
   });
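Reviewer note: the three log-line shapes asserted above (accept, verify failure, reject with a one-lined reason) can be produced by a single formatter over the outcome union. The sketch below only reproduces the strings the tests expect; `signOffLogLine` and `Outcome` are hypothetical names, and how recordSignOff actually builds its lines is not shown in this diff.

```typescript
// Hypothetical sketch: formats the ## Log line for each sign-off outcome.
type Outcome =
  | { kind: "accepted" }
  | { kind: "verify_failed"; exitCode: number; outputTail: string }
  | { kind: "rejected"; missing: string };

function signOffLogLine(when: string, subject: string, o: Outcome): string {
  switch (o.kind) {
    case "accepted":
      return `- ${when} signed off "${subject}" (judge accept)`;
    case "verify_failed":
      // Only the exit code goes in the log; outputTail is left to the tool result.
      return `- ${when} reject "${subject}": verify exit ${o.exitCode}`;
    case "rejected":
      // Collapse newlines and runs of whitespace so the reason stays on one line.
      return `- ${when} reject "${subject}": ${o.missing.replace(/\s+/g, " ").trim()}`;
  }
}
```

Keeping each entry on one line is what lets appendLog stay a one-line diff even when the judge's `missing:` text spans several lines.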