Judge: read-only + bash (no edit/write), renderCall/renderResult, streaming progress

- Judge gets read, bash, grep, find, ls but edit+write are blocked via --exclude-tools
- Added renderCall: shows goal name while running
- Added renderResult: shows accept/reject icon, model, duration, collapsed/expanded view
- Wired onUpdate through decideSignOff -> runJudge so the TUI shows progress while judging
- Added SignOffDetails type for structured metadata
- Added 120s timeout on judge subprocess
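
For reference, the duration and collapsed-view handling in renderResult comes down to two small pure expressions. The sketch below mirrors them as they appear in the diff; the helper names `formatDuration` and `collapseBody` and the constant `COLLAPSED_LIMIT` are hypothetical, introduced only for illustration.

```typescript
// Hypothetical helper names; the expressions mirror renderResult in this commit.
const COLLAPSED_LIMIT = 500; // characters of judge output shown while the result is collapsed

// Sub-second durations print as milliseconds, anything longer as seconds with one decimal.
function formatDuration(durationMs: number): string {
	return durationMs < 1000 ? `${durationMs}ms` : `${(durationMs / 1000).toFixed(1)}s`;
}

// Collapsed view: keep the first `limit` characters and mark any truncation with "...".
function collapseBody(body: string, limit: number = COLLAPSED_LIMIT): string {
	return body.length > limit ? `${body.slice(0, limit)}...` : body;
}

console.log(formatDuration(850)); // 850ms
console.log(formatDuration(1234)); // 1.2s
console.log(collapseBody("x".repeat(600)).length); // 503
```

A judge run that hits the 120s timeout would show as `120.0s` under this formatting.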
FIXME: judge side-effect clones pollute user workspace
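
On the FIXME: the judge keeps bash, so a command such as `git clone` can still write into the workspace even with edit and write excluded. The diff in this commit defines a plan-mode blocklist (`MUTATING_BASH_PATTERNS`) that already names `git clone`. The sketch below shows how such a check classifies commands, using three of those patterns. `isMutatingBash` is a hypothetical name, and as far as the visible diff shows, this commit does not apply the check to the judge.

```typescript
// Three patterns copied from MUTATING_BASH_PATTERNS in this commit (the full list is longer).
const PATTERNS: RegExp[] = [
	/\b(rm|rmdir|mv|cp|mkdir|touch|chmod|chown|chgrp|ln|tee|truncate|dd)\b/i,
	/>\s*[^&\s]/, // redirect to a file; fd-dups like 2>&1 do not match
	/\bgit\s+(add|commit|push|pull|merge|rebase|reset|checkout|switch|stash|cherry-pick|revert|tag|init|clone)\b/i,
];

// Hypothetical helper: true when any blocklist pattern matches the command string.
function isMutatingBash(command: string): boolean {
	return PATTERNS.some((pattern) => pattern.test(command));
}

console.log(isMutatingBash("git clone https://example.com/repo.git")); // true
console.log(isMutatingBash("git log --oneline")); // false
console.log(isMutatingBash("pytest tests/cache -q 2>&1")); // false
```

Because this is a blocklist, it only stops the obvious mutators: a script run through bash can still write files without matching any pattern.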
2026-06-27 16:46:16 +08:00 · 2026-06-17 18:21:45 +08:00 · 2026-06-17 18:16:32 +08:00 · 2026-06-17 18:09:03 +08:00 · 2026-06-16 11:50:12 +08:00 · 2026-06-16 11:45:08 +08:00
9 changed files with 803 additions and 412 deletions
@@ -1,5 +1,6 @@
 node_modules/
 dist/
 *.log
+.pi/
 docs/reviews/raw.jsonl
 docs/reviews/err.txt
@@ -1,99 +1,153 @@
-# pi-plan
+# pi-goals

-A [pi](https://github.com/badlogic/pi-mono) extension for plan-driven, goal-tracked work in one
-`plan.md`. Set up goals (with evidence and failure modes) in plan mode, work them, and sign a goal
-off only when a read-only subagent has checked the evidence.
+Plan mode for agreeing on goals before any code gets written. Each goal names the subtle failure mode
+that could fake a "done" and the discriminator that tells real success from it, plus subtasks and the
+evidence checked at sign-off. It lives in one markdown file. A widget keeps the goals in front of you
+through compaction, a reminder nudges the agent to keep the file current, and a goal is signed off
+only after a read-only subagent checks its evidence.

-Successor to [pi-lgtm](https://github.com/wassname/pi-lgtm), kept deliberately small: about
-[burneikis/pi-plan](https://github.com/burneikis/pi-plan) plus the additions, goals with evidence,
-a sign-off check, a widget, and a reminder.
-
-The form guides; it does not gate. The agent edits `plan.md` with its normal Edit tool. The one
-blessed tool is `CompleteGoal`, which runs the sign-off check and records the result. The reminder,
-the injected plan summary, and git/widget visibility carry the process. It trusts the agent's
-judgement rather than guarding it.
+Like [pi-milestones](https://github.com/Neuron-Mr-White/UniPi/tree/main/packages/milestone) and
+[burneikis/pi-plan](https://github.com/burneikis/pi-plan), it guides rather than guards: a form and a
+process the agent follows. [pi-lgtm](https://github.com/wassname/pi-lgtm) was my earlier, more complex
+attempt.

 ## Install

 ```bash
-pi install npm:@wassname2/pi-plan
+pi install npm:@wassname2/pi-goals
 ```

-Or run without installing:
+Or run it without installing:

 ```bash
-pi -e npm:@wassname2/pi-plan
+pi -e npm:@wassname2/pi-goals
 ```

 ## Use

 ```
-/plan add CSV export to the report view
+/goals CSV export for the report view
 ```

-1. Plan. The agent explores read-only and writes goals into `plan.md` (see format below).
-2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel.
-   On Ready you choose whether to keep the current context or start fresh and compacted.
-3. Work. Each turn the active goal is injected (so it survives compaction) and a reminder nudges
-   the agent to keep `plan.md` current and work autonomously. When a goal's `done_when` is met the
-   agent calls `CompleteGoal`, which runs `verify` and a read-only judge and, on accept, marks it
-   done and logs it.
+`/goals` enters plan mode and starts a conversation; the description is an optional seed, so plain
+`/goals` works too. From there:

-Other commands: `/plan` (print the plan), `/plan clear` (empty `plan.md`, history kept in git),
-`/plan judge <model-ref>` (use a specific model for the sign-off judge; default is your current
-model).
+1. Plan. The agent explores read-only, asks about anything unclear, and writes the goals into
+   `.pi/goals.md`.
+2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel. On
+   Ready you choose whether to keep the current context or start fresh and compacted.
+3. Work. Each turn the active goal is injected so it survives compaction, and a reminder nudges the
+   agent to keep `goals.md` current and keep going. When a goal's discriminator is satisfied the agent
+   calls `CompleteGoal`, which runs `verify` and a read-only judge and, on accept, marks it done and logs it.

-## plan.md format
+Other commands: `/goals clear` empties `.pi/goals.md`; `/goals judge <model-ref>` picks a specific
+model for the sign-off judge (the default is your current model).

-One file holds the objective, the goals, and a short append-only log.
+## Example
+
+```
+/goals audit the papers dir metadata and clean up empty dirs
+```
+
+The agent explores read-only, drafts the goal with a subtle failure mode and the discriminator that
+beats it, and stops for review:

 ```markdown
-# Plan: ship the cache layer
+## Goals

-## Goal: Implement cache layer
-<!-- id: cache-layer-1 -->
-status: active
-done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
-verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
-failure_modes:
-  - cache silently bypassed (hit-rate ~0, latency ok by luck)
-  - bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy
+1. [ ] goal: Audit steering/ metadata and remove empty dirs
+  - subtle failure mode: report written but counts are zero (resolver errored silently)
+  - discriminator: report shows the XXXX count before/after AND a non-zero rename count
+  - tasks:
+    1. [ ] dry-run the metadata resolve
+    2. [ ] remove the empty _artifacts dirs
+    3. [ ] write the report
+  - evidence:
+    - <empty until sign-off>
+```
+
+You choose Ready. The agent works the subtasks, fills `evidence` (each item an artifact plus a short
+read of it), and calls `CompleteGoal`:
+
+```markdown
+  - evidence:
+    - > scripts/metadata_report.txt: XXXX 52 -> 4, 146 empty _artifacts removed
+    - > 48 files renamed; almost certainly done, the silent-resolver failure mode is ruled out
+```
+
+A fresh read-only subagent re-checks the evidence against the repo and the discriminator, then
+returns its verdict and reasoning:
+
+```
+Signed off "Audit steering/ metadata and remove empty dirs". Marked done in goals.md.
+
+--- sign-off judge ---
+metadata_report.txt present; counts 52 -> 4 confirmed; rename log shows 48 renamed (not zero).
+VERDICT: accept
+```
+
+## The goals.md format
+
+One project-local file, `<cwd>/.pi/goals.md` (gitignored), holds the title, a context block, the
+goals, and a short append-only log. A fresh `/goals` draft replaces it.
+
+```markdown
+# ship the cache layer
+
+Latency target came from the SLO review; keep the existing client API.
+
+## Goals
+
+1. [/] goal: Implement cache layer
+  - subtle failure mode: cache silently bypassed, latency ok by luck
+  - discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0)
+  - verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
+  - tasks:
+    1. [x] wire cache client
+    2. [/] eviction policy
+  - evidence:
+    - > load-test.log: p95=41ms, hit-rate 0.93 (not bypassed)
+
+# Future work / out of scope
+
+- distributed cache

 ## Log
 - 2026-06-15 14:02  cache client wired; eviction next
 ```

- A goal is a `## Goal:` header with an `<!-- id -->`, a `status:`
-  (`open` | `active` | `done` | `cancelled`), a falsifiable `done_when:` (what you expect, and the
-  symptom if it is NOT met), an optional `verify:` shell command, a `failure_modes:` pre-mortem
-  list, and `- [ ]` subtasks.
- `done_when` names the evidence that distinguishes real success from a subtle failure. `verify`,
-  when present, is the deterministic first stage of the sign-off check.
- The agent ticks subtasks, appends to `## Log`, and sets `status` as it works. Multiple goals may
-  be `active`.
+- A goal is a numbered checkbox line beginning `goal:`; the checkbox carries its state (`[ ]` open,
+  `[/]` active, `[x]` done, `[-]` cancelled). Goals are matched by their text, so the number is just
+  for you to reference.
+- The `discriminator` is the success test, written while planning: a positive observation that the
+  goal succeeded, one that none of the `subtle failure mode`s could fake (a count moved, a test
+  exercised the path, a metric beat noise), not merely that a failure was avoided. `evidence` is the
+  proof, filled at sign-off: each item pairs a durable artifact (a quoted and linked log, a table, a
+  metric) with a short read of it. `verify`, when present, is the deterministic first stage.
+- Subtasks are any checkbox without a `goal:` prefix, under `- tasks:`. The agent ticks them, appends
+  to `## Log`, and sets a goal `[/]` when it starts it; only `CompleteGoal` writes `[x]`. Several
+  goals can be active at once.

-## The sign-off check (`CompleteGoal`)
+## Signing off a goal (`CompleteGoal`)

-`CompleteGoal(goal_id, evidence, paths?)` is the one blessed completion path:
+`CompleteGoal(goal)` (matched by the goal's text) is the only tool that marks a goal done; everything
+else is the agent editing the file. It reads the goal's `evidence:` block from `.pi/goals.md`, then:

-1. If the goal has a `verify:` command, it is run. A non-zero exit rejects immediately, with no model
-   call.
-2. Otherwise a read-only `pi` subprocess (the judge) inspects the evidence against the repo and the
-   named failure modes and returns a verdict. It re-derives from the artifacts you point it at
-   rather than trusting the claim, so point `evidence`/`paths` at durable artifacts (saved logs,
-   committed diffs, files).
-3. On accept, the goal's `status` flips to `done` and a `## Log` line is written. On reject, the
-   goal stays open and the agent is told what is missing.
+1. If the goal has a `verify:` command, it runs. A non-zero exit rejects right away, no model call.
+2. If `verify` passes or is absent, a read-only `pi` subprocess (a fresh `--no-session` context, so it
+   never sees the working agent's transcript) inspects the `evidence:` against the repo, the
+   `discriminator`, and the `subtle failure mode`. It re-derives from the cited artifacts rather than
+   trusting the claim, so list real artifacts, not assertions.
+3. On accept, the goal flips to `[x]` and a `## Log` line is written. On reject, it stays open and the
+   agent is told what is missing. Either way the judge's reasoning comes back in the result.

-The judge defaults to your current model (guaranteed authorized and capable). Set a different one
-with `/plan judge <provider/model>` for an independent cross-family check.
+The judge defaults to your current model (a fresh context, same weights). Point it at another with
+`/goals judge <provider/model>` for an independent cross-family check.

 ## Prompts

-All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so the process is
-easy to review end to end.
+All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so you can read the
+whole process top to bottom.

 ## Develop

@@ -106,8 +160,9 @@ npm run lint

 ## Not (yet) included

-No autonomous re-prompt loop (an until-done-style loop judge). Autonomy comes from the reminder, not
-a harness. Plan-phase model stickiness is a documented next step.
+- No autonomous re-prompt loop. The reminder nudges the agent within a turn, but the turn still ends
+  and hands back to you; nothing auto-re-prompts until the goals are done.
+- The plan and execution phases can't yet run on different, sticky models.

 ## License

@@ -1,4 +1,4 @@
-Code review against spec `docs/spec/2026-06-15_pi-plan.md`.
+Code review against spec `docs/spec/2026-06-15_pi-goals.md`.

 ---

@@ -1,4 +1,4 @@
-# pi-plan — design spec
+# pi-goals — design spec

 Working title. A pi extension: set up goals (with subtasks and evidence) through plan mode, work them autonomously, and sign a goal off only when a check passes. One markdown file holds everything. The form guides a process; it does not police one. Successor to `pi-lgtm`, deliberately smaller.

@@ -1,13 +1,13 @@
 {
-  "name": "@wassname2/pi-plan",
+  "name": "@wassname2/pi-goals",
  "version": "0.0.1",
-  "description": "One plan.md: set goals via plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
+  "description": "One .pi/goals.md: set goals in plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
  "author": "wassname",
  "license": "MIT",
  "type": "module",
  "repository": {
    "type": "git",
-    "url": "https://github.com/wassname/pi-plan.git"
+    "url": "https://github.com/wassname/pi-goals.git"
  },
  "keywords": [
    "pi-package",
@@ -1,37 +1,87 @@
 /**
- * pi-plan — plan mode that sets up goals with evidence, tracked in one plan.md, signed off by a
+ * pi-goals — plan mode that sets up goals with evidence, tracked in one .pi/goals.md, signed off by a
 * read-only subagent check. A successor to pi-lgtm, kept deliberately small (≈ burneikis/pi-plan
- * plus the additions: goals + failure_modes + subtasks, a sign-off check, a widget, a reminder).
+ * plus the additions: goals + a discriminator + a subtle failure mode + subtasks, a sign-off check,
+ * a widget, a reminder). A goal's success test is its discriminator: the observation that tells real
+ * success from the named failure mode.
 *
- * Philosophy (spec D3): the form guides, it does not gate. The agent edits plan.md with its normal
+ * Philosophy (spec D3): the form guides, it does not gate. The agent edits goals.md with its normal
 * Edit tool. The one blessed tool is CompleteGoal, which runs the sign-off check and records it. The
 * reminder + the injected plan + git/widget visibility carry the process; we trust the agent's
 * judgement rather than guarding it.
 *
 * Flow:
- *   /plan <objective>  -> plan mode: agent explores, drafts goals into plan.md (planDrafting guides)
+ *   /goals [objective] -> plan mode (conversational): objective is an optional seed; agent explores
+ *                         read-only, asks, then drafts goals into .pi/goals.md (planDrafting guides)
 *   agent_end          -> review menu (Ready / Edit / $EDITOR / Cancel); Ready offers compaction
 *   execution          -> each turn, inject the plan summary (survives compaction) + a reminder;
 *                         agent works goals, ticks subtasks, appends ## Log, calls CompleteGoal
 *   CompleteGoal       -> optional deterministic verify, then a read-only oracle judge -> accept
 *                         flips status:done + logs; reject returns what's missing
 *
- * All model-facing text lives in prompts.tsx, in flow order.
+ * The plan file lives at <cwd>/.pi/goals.md (project-local, gitignored, like pi-tasks), so it is not
+ * tracked in the repo. A fresh /goals draft just replaces it (the "overwrite" staleness rule).
+ *
+ * Plan mode is read-only: the tool_call hook blocks edit/write (except goals.md itself) and mutating
+ * bash while drafting, so code isn't written before the goals are agreed. Read-only bash exploration
+ * stays open (blocklist, not allowlist).
+ *
+ * Not built (FIXME): no plan-vs-exec model switch on accept (plan-model stickiness); noted at its
+ * call site below.
+ *
+ * All model-facing text lives in prompts.ts, in flow order.
 */

 import { spawn, spawnSync } from "node:child_process";
-import { existsSync, readFileSync, writeFileSync } from "node:fs";
-import { basename, join } from "node:path";
+import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
+import { basename, join, resolve } from "node:path";
 import type { ExtensionAPI, ExtensionCommandContext, ExtensionContext } from "@earendil-works/pi-coding-agent";
+import { getMarkdownTheme } from "@earendil-works/pi-coding-agent";
+import { Container, Markdown, Spacer, Text } from "@earendil-works/pi-tui";
 import { Type } from "@sinclair/typebox";
 import { counts, findGoal, type Goal, type PlanDoc, parse, recordSignOff, type SignOff } from "./plan-file.js";
-import { evidenceJudgeSystem, evidenceJudgeUser, planDrafting, planInjection, reminder } from "./prompts.js";
+import {
+	completeGoalDescription,
+	completeGoalParamDescription,
+	evidenceJudgeSystem,
+	evidenceJudgeUser,
+	planDrafting,
+	planInjection,
+	reminder,
+} from "./prompts.js";

-const STATE = "pi-plan-state";
-const PLAN_CONTEXT = "pi-plan-context"; // injected plan-mode guidance, stripped from history later
-const STATUS_KEY = "pi-plan";
-const WIDGET_KEY = "pi-plan-widget";
-const READ_ONLY_TOOLS = ["read", "grep", "find", "ls", "bash"];
+const STATE = "pi-goals-state";
+const PLAN_CONTEXT = "pi-goals-context"; // injected plan-mode guidance, stripped from history later
+const STATUS_KEY = "pi-goals";
+const WIDGET_KEY = "pi-goals-widget";
+// Tools the sign-off judge gets: read-only inspection + bash (for git log, cat, running scripts to
+// inspect). File mutators (edit, write) are blocked so the judge cannot modify anything.
+// Names match pi's internal tool registry (grep→ffgrep, find→fffind, etc.).
+const JUDGE_TOOLS = ["read", "bash", "grep", "find", "ls"];
+const JUDGE_BLOCKED_TOOLS = ["edit", "write"];
+// File mutators blocked while drafting goals (read-only plan mode, like narumiruna/pi-plan-mode), so
+// code isn't written before goals are agreed. The one allowed write is goals.md itself (the
+// deliverable). A purely read-only task (e.g. a search) still works in plan mode, since it needs no writes.
+const PLAN_MODE_BLOCKED_TOOLS = ["edit", "write"];
+// bash is dual-use, so block it only when the command looks mutating; read-only exploration (cat, rg,
+// git log, running a script to inspect) stays open. Blocklist, not allowlist: keep exploration
+// frictionless and just stop the obvious mutators. List adapted from narumiruna/pi-plan-mode; the
+// redirect rule catches `> file` / `>> file` / `>| file` but not fd-dups like `2>&1` or `>&2`.
+const MUTATING_BASH_PATTERNS: RegExp[] = [
+	/\b(rm|rmdir|mv|cp|mkdir|touch|chmod|chown|chgrp|ln|tee|truncate|dd)\b/i,
+	/>\s*[^&\s]/, // redirect to a file (write/append/clobber), excludes 2>&1 and >&2
+	/\bnpm\s+(install|uninstall|update|ci|link|publish|version)\b/i,
+	/\byarn\s+(add|remove|install|publish|upgrade)\b/i,
+	/\bpnpm\s+(add|remove|install|publish|update)\b/i,
+	/\bbun\s+(add|remove|install|update|publish)\b/i,
+	/\bpip\s+(install|uninstall)\b/i,
+	/\buv\s+(add|remove|sync|lock|pip\s+install)\b/i,
+	/\bgit\s+(add|commit|push|pull|merge|rebase|reset|checkout|switch|stash|cherry-pick|revert|tag|init|clone)\b/i,
+	/\b(sudo|su|kill|pkill|killall|reboot|shutdown)\b/i,
+	/\bsystemctl\s+(start|stop|restart|enable|disable)\b/i,
+	/\b(vim?|nano|emacs|code|subl)\b/i,
+];
+const PLAN_REL = ".pi/goals.md"; // project-local, gitignored (pi-tasks convention); shown in the widget

 interface PlanState {
 	isPlanMode: boolean;
@@ -40,15 +90,21 @@ interface PlanState {
 	judgeModel: string | null;
 }

-export default function piPlanExtension(pi: ExtensionAPI): void {
+export default function piGoalsExtension(pi: ExtensionAPI): void {
 	let state: PlanState = { isPlanMode: false, objective: null, judgeModel: null };
-	// Reminder cadence: fire when an active goal exists but plan.md was not touched since last turn.
+	// Reminder cadence: fire when an active goal exists but goals.md was not touched since last turn.
 	let lastInjectedPlan = "";
-	// newSession is only on the command-handler context; agent_end's ctx lacks it. Save it from /plan.
+	// newSession is only on the command-handler context; agent_end's ctx lacks it. Save it from /goals.
 	let savedCmdCtx: ExtensionCommandContext | null = null;

-	const planPath = (ctx: ExtensionContext) => join(ctx.cwd, "plan.md");
+	const planPath = (ctx: ExtensionContext) => join(ctx.cwd, ".pi", "goals.md");
 	const readPlan = (ctx: ExtensionContext): string => (existsSync(planPath(ctx)) ? readFileSync(planPath(ctx), "utf-8") : "");
+	// Our programmatic writes (clear, CompleteGoal). The agent creates/edits the file with its own Edit
+	// tool; this just makes sure .pi/ exists for our writes.
+	const writePlan = (ctx: ExtensionContext, content: string): void => {
+		mkdirSync(join(ctx.cwd, ".pi"), { recursive: true });
+		writeFileSync(planPath(ctx), content);
+	};

 	function persist(): void {
 		pi.appendEntry<PlanState>(STATE, state);
@@ -57,7 +113,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 	function updateWidget(ctx: ExtensionContext): void {
 		if (state.isPlanMode) {
 			ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("warning", "planning"));
-			ctx.ui.setWidget(WIDGET_KEY, ["pi-plan: drafting goals", "Write goals to plan.md, then review."]);
+			ctx.ui.setWidget(WIDGET_KEY, ["pi-goals: drafting goals", `Write goals to ${PLAN_REL}, then review.`]);
 			return;
 		}
 		const doc = parse(readPlan(ctx));
@@ -68,26 +124,26 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 		}
 		const c = counts(doc);
 		ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("accent", `◷ ${c.done}/${doc.goals.length} goals`));
-		ctx.ui.setWidget(WIDGET_KEY, goalWidgetLines(doc));
+		ctx.ui.setWidget(WIDGET_KEY, [...goalWidgetLines(doc), ctx.ui.theme.fg("muted", PLAN_REL)]);
 	}

 	function goalWidgetLines(doc: PlanDoc): string[] {
 		const mark: Record<Goal["status"], string> = { done: "✔", active: "▸", open: "◻", cancelled: "✗" };
-		const lines = [`Plan: ${doc.objective || "(untitled)"}`];
+		const lines = [`Goals: ${doc.title || "(untitled)"}`];
 		for (const g of doc.goals) {
-			if (g.status === "done") continue; // hide finished goals; they stay in the file
-			const open = g.subtasks.filter((s) => !s.done).length;
-			lines.push(`${mark[g.status]} ${g.subject}${open ? ` (${open} todo)` : ""}`);
+			// Show every goal with its status glyph (✔ done, ▸ active, ◻ open, ✗ cancelled) so finished
+			// goals read as checked off rather than vanishing. Plans are small, so this stays readable.
+			const total = g.subtasks.length;
+			const done = g.subtasks.filter((s) => s.status === "done").length;
+			lines.push(`${mark[g.status]} ${g.subject}${total ? ` (${done}/${total} tasks)` : ""}`);
 		}
-		const c = counts(doc);
-		if (c.done) lines.push(`(${c.done} done, hidden)`);
 		return lines;
 	}

 	// --- plan mode: setup -------------------------------------------------------------------------

-	pi.registerCommand("plan", {
-		description: "Plan mode: set up goals (with evidence) in plan.md, then work them. /plan <objective>",
+	pi.registerCommand("goals", {
+		description: "Plan mode: set up goals (with evidence) in .pi/goals.md, then work them. /goals [objective]",
 		handler: async (args, ctx) => {
 			savedCmdCtx = ctx; // ctx here is an ExtensionCommandContext (has newSession); keep it for later
 			const arg = args.trim();
@@ -99,18 +155,18 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 				setJudge(arg.slice("judge".length).trim(), ctx);
 				return;
 			}
-			if (!arg) {
-				showPlan(ctx);
-				return;
-			}
-
-			state = { ...state, isPlanMode: true, objective: arg };
+			// Conversational entry (like narumiruna/pi-plan-mode): /goals enters plan mode and starts a
+			// dialogue. The objective is an optional seed, not a required arg, so there's no awkward
+			// "type your objective" prompt; the agent explores read-only and asks before drafting. A
+			// fresh draft just replaces .pi/goals.md (the "overwrite" staleness rule).
+			const objective = arg || null;
+			state = { ...state, isPlanMode: true, objective };
 			persist();
 			updateWidget(ctx);
-			pi.sendUserMessage(
-				`Enter plan mode for this objective: ${arg}\n\nExplore read-only, then write the plan to ${planPath(ctx)}.`,
-				{ deliverAs: "followUp" },
-			);
+			const seed = objective
+				? `We're in plan mode. Objective: ${objective}\n\nExplore the repo read-only and ask me anything unclear. When the objective is nailed down, draft (or replace) the goals in ${planPath(ctx)}, then stop for review.`
+				: `We're in plan mode. Tell me what you want to plan. Explore read-only and ask questions as needed; when the objective is clear, draft the goals in ${planPath(ctx)} and stop for review.`;
+			pi.sendUserMessage(seed, { deliverAs: "followUp" });
 		},
 	});

@@ -122,27 +178,18 @@ export default function piPlanExtension(pi: ExtensionAPI): void {

 	async function clearPlan(ctx: ExtensionContext): Promise<void> {
 		if (!existsSync(planPath(ctx))) {
-			ctx.ui.notify("No plan.md to clear.", "info");
+			ctx.ui.notify("No goals.md to clear.", "info");
 			return;
 		}
 		if (ctx.hasUI) {
-			const ok = await ctx.ui.select("Clear plan.md? (it stays in git history)", ["Cancel", "Clear plan.md"]);
-			if (ok !== "Clear plan.md") return;
+			const ok = await ctx.ui.select(`Clear ${PLAN_REL}?`, ["Cancel", "Clear goals.md"]);
+			if (ok !== "Clear goals.md") return;
 		}
-		writeFileSync(planPath(ctx), "");
+		writePlan(ctx, "");
 		state = { ...state, isPlanMode: false, objective: null };
 		persist();
 		updateWidget(ctx);
-		ctx.ui.notify("Cleared plan.md.", "info");
-	}
-
-	function showPlan(ctx: ExtensionContext): void {
-		const content = readPlan(ctx);
-		if (!content.trim()) {
-			ctx.ui.notify("No plan yet. Use /plan <objective> to start.", "info");
-			return;
-		}
-		ctx.ui.notify(content, "info");
+		ctx.ui.notify(`Cleared ${PLAN_REL}.`, "info");
 	}

 	// --- review loop (after the agent drafts the plan) --------------------------------------------
@@ -150,7 +197,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 	async function reviewLoop(ctx: ExtensionContext): Promise<void> {
 		while (true) {
 			const doc = parse(readPlan(ctx));
-			const choice = await ctx.ui.select(`Plan: ${doc.goals.length} goal(s). What next?`, [
+			const choice = await ctx.ui.select(`Goals: ${doc.goals.length} goal(s). What next?`, [
 				"Ready — start working the plan",
 				"Edit — ask the agent to revise",
 				"Open in $EDITOR",
@@ -158,14 +205,14 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 			]);
 			if (!choice || choice.startsWith("Cancel")) {
 				exitPlanMode(ctx);
-				ctx.ui.notify("Left plan mode. plan.md kept.", "info");
+				ctx.ui.notify("Left plan mode. goals.md kept.", "info");
 				return;
 			}
 			if (choice.startsWith("Ready")) return startExecution(ctx);
 			if (choice.startsWith("Edit")) {
 				const changes = await ctx.ui.editor("What should change about the plan?", "");
 				if (changes?.trim()) {
-					pi.sendUserMessage(`Revise the plan at ${planPath(ctx)} with these changes, same format:\n\n${changes.trim()}`);
+					pi.sendUserMessage(`Revise the plan at ${planPath(ctx)} with these changes, same format:\n\n${changes.trim()}`, { deliverAs: "followUp" });
 					return; // agent_end re-opens the review loop
 				}
 				continue;
@@ -184,6 +231,9 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 	}

 	async function startExecution(ctx: ExtensionContext): Promise<void> {
+		// FIXME(model-switch): the plan phase should be able to run on a sticky plan model and execution
+		// on a different one (see README "Not yet included"). newSession can't switch the model yet; wire
+		// this when pi exposes a model override on newSession.
 		// Offer a clean execution context (D13). newSession lives only on the saved command context.
 		let fresh = false;
 		if (ctx.hasUI && savedCmdCtx) {
@@ -193,49 +243,114 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 			]);
 			fresh = choice?.startsWith("A fresh") ?? false;
 		}
-		exitPlanMode(ctx);
 		const doc = parse(readPlan(ctx));
-		if (doc.objective) pi.setSessionName(`Plan: ${doc.objective}`);
+		const planFile = planPath(ctx);
+		const planContent = readPlan(ctx); // captured now: ctx is stale after newSession below
+		const parentSession = ctx.sessionManager.getSessionFile();
+		const startMsg = `Work the goals in ${planFile}. Pick an open goal, mark it active (set its checkbox to [/]), work its subtasks, and when its discriminator is satisfied fill the goal's evidence: block, then call CompleteGoal with the goal's exact text (the part after "goal:"). Keep goals.md current as you go.`;
+		exitPlanMode(ctx);

 		if (fresh && savedCmdCtx) {
-			const result = await savedCmdCtx.newSession({ parentSession: ctx.sessionManager.getSessionFile() });
+			// After newSession, `ctx`/`pi` bound to the old session are stale; do post-swap work
+			// through the ReplacedSessionContext passed to withSession (see runner.assertActive).
+			const result = await savedCmdCtx.newSession({
+				parentSession,
+				withSession: async (sessionCtx) => {
+					// pi.* and the outer ctx are invalidated by newSession; use the fresh sessionCtx only.
+					// (No setSessionName here: it lives on pi/the outer ctx, both stale now. Cosmetic, skip it.)
+					sessionCtx.ui.notify(planContent, "info");
+					await sessionCtx.sendUserMessage(startMsg, { deliverAs: "followUp" });
+				},
+			});
 			if (result.cancelled) {
-				ctx.ui.notify("Execution cancelled.", "warning");
 				return;
 			}
+			return;
 		}
-		pi.sendUserMessage(
-			`Work the plan in ${planPath(ctx)}. Pick an open goal, set it active, work its subtasks, and when its done_when is met call CompleteGoal with the evidence. Keep plan.md current as you go.`,
-			{ deliverAs: "followUp" },
-		);
+		if (doc.title) pi.setSessionName(`Goals: ${doc.title}`);
+		ctx.ui.notify(planContent, "info");
+		pi.sendUserMessage(startMsg, { deliverAs: "followUp" });
 	}

 	// --- the one blessed tool: CompleteGoal -------------------------------------------------------

 	pi.registerTool({
 		name: "CompleteGoal",
-		label: "Complete goal",
-		description:
-			"Sign off a goal once its done_when is met. Runs the goal's verify command (if any) then a " +
-			"read-only subagent that inspects your evidence against the repo. On accept, the goal is marked " +
-			"done and logged; on reject, it stays open and you get what is missing. Point evidence at durable " +
-			"artifacts (saved logs, committed diffs, files), not claims.",
+		label: "Goal signoff",
+		description: completeGoalDescription,
 		parameters: Type.Object({
-			goal_id: Type.String({ description: "The goal's <!-- id --> from plan.md" }),
-			evidence: Type.String({ description: "What shows the done_when is met, and where to verify it" }),
-			paths: Type.Optional(Type.Array(Type.String(), { description: "Durable artifacts the judge should inspect" })),
+			goal: Type.String({ description: completeGoalParamDescription }),
 		}),
-		async execute(_id, params, signal, _onUpdate, ctx) {
+		async execute(_id, params, signal, onUpdate, ctx) {
 			const content = readPlan(ctx);
-			const goal = findGoal(parse(content), params.goal_id);
-			if (!goal) return text(`No goal #${params.goal_id} in plan.md.`, true);
+			const goal = findGoal(parse(content), params.goal);
+			if (!goal) return text(`No goal "${params.goal}" in goals.md. Use the exact text after "goal:".`, true);
+			if (goal.evidence.length === 0) {
+				return text(`Goal "${goal.subject}" has no evidence yet. Add an evidence: list to the goal in goals.md (artifacts + a short read showing the discriminator is satisfied), then call CompleteGoal.`, true);
+			}

-			// Decide the outcome (the I/O); recordSignOff applies it to the file (the pure write).
-			const outcome = await decideSignOff(goal, params.evidence, params.paths ?? [], state.judgeModel, ctx.cwd, signal);
-			const res = recordSignOff(content, goal.id, stamp(), outcome);
-			if (res.content !== content) writeFileSync(planPath(ctx), res.content);
+			const handleUpdate = (partial: { content: Array<{ type: "text"; text: string }>; details: SignOffDetails }) => {
+				onUpdate?.(partial);
+			};
+
+			const { outcome, reasoning, durationMs } = await decideSignOff(goal, goal.evidence.join("\n"), goal.evidence, state.judgeModel, ctx.cwd, signal, handleUpdate);
+			const res = recordSignOff(content, goal.subject, stamp(), outcome);
+			if (res.content !== content) writePlan(ctx, res.content);
 			updateWidget(ctx);
-			return text(res.message, res.isError);
+			const detail = reasoning ? `\n\n--- sign-off judge ---\n${reasoning}` : "";
+			const outcomeLabel = outcome.kind === "accepted" ? "accepted" : outcome.kind === "verify_failed" ? "verify_failed" : "rejected";
+			const details: SignOffDetails = {
+				goal: goal.subject,
+				outcome: outcomeLabel,
+				durationMs,
+				verifyCommand: goal.verify ?? undefined,
+				verifyExitCode: outcome.kind === "verify_failed" ? outcome.exitCode : undefined,
+				judgeModel: state.judgeModel ?? undefined,
+				reasoning,
+				isError: res.isError,
+			};
+			return textWithDetails(res.message + detail, details, res.isError);
+		},
+
+		renderCall(args, theme) {
+			const goalText = args.goal.length > 80 ? `${args.goal.slice(0, 80)}...` : args.goal;
+			return new Text(
+				`${theme.fg("toolTitle", theme.bold("goal signoff "))}${theme.fg("dim", goalText)}`,
+				0, 0,
+			);
+		},
+
+		renderResult(result, { expanded }, theme) {
+			const details = result.details as SignOffDetails | undefined;
+			const body = result.content[0]?.type === "text" ? result.content[0].text : "(no output)";
+			if (!details || details.outcome === "running") return new Text(body, 0, 0);
+
+			const icon = details.outcome === "accepted" ? theme.fg("success", "✔") : theme.fg("error", "✗");
+			const outcomeText = details.outcome === "accepted" ? "accepted" : details.outcome === "verify_failed" ? `verify failed (exit ${details.verifyExitCode})` : "rejected";
+			const header = `${icon} ${theme.fg("toolTitle", theme.bold("goal signoff "))}${theme.fg("accent", outcomeText)}`;
+			const duration = details.durationMs < 1000 ? `${details.durationMs}ms` : `${(details.durationMs / 1000).toFixed(1)}s`;
+			const sub = [details.judgeModel, duration].filter(Boolean).join(" · ");
+
+			if (!expanded) {
+				let text = header;
+				if (sub) text += `\n${theme.fg("dim", sub)}`;
+				text += `\n\n${theme.fg("toolOutput", body.slice(0, 500))}`;
+				if (body.length > 500) text += theme.fg("dim", "...");
+				text += `\n${theme.fg("muted", "(Ctrl+O to expand)")}`;
+				return new Text(text, 0, 0);
+			}
+
+			const container = new Container();
+			container.addChild(new Text(header, 0, 0));
+			if (sub) container.addChild(new Text(theme.fg("dim", sub), 0, 0));
+			if (details.verifyCommand) {
+				container.addChild(new Spacer(1));
+				container.addChild(new Text(theme.fg("muted", `verify: ${details.verifyCommand}`), 0, 0));
+			}
+			container.addChild(new Spacer(1));
+			container.addChild(new Text(theme.fg("muted", "Judge"), 0, 0));
+			container.addChild(new Markdown(body.trim(), 0, 0, getMarkdownTheme()));
+			return container;
 		},
 	});

@@ -243,6 +358,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {

 	pi.on("before_agent_start", async (_event, ctx) => {
 		if (state.isPlanMode) {
+			// Read-only is enforced in the tool_call hook below (blocks edit/write while planning).
 			return { message: { customType: PLAN_CONTEXT, content: `${planDrafting}\n\nWrite the plan to ${planPath(ctx)}.`, display: false } };
 		}
 		const doc = parse(readPlan(ctx));
@@ -251,25 +367,48 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
 		const active = doc.goals.find((g) => g.status === "active") ?? doc.goals.find((g) => g.status === "open") ?? null;
 		const c = counts(doc);
 		let body = planInjection({
-			objective: doc.objective,
+			title: doc.title,
 			activeGoal: active
-				? { subject: active.subject, done_when: active.done_when, openSubtasks: active.subtasks.filter((s) => !s.done).map((s) => s.text) }
+				? {
+						subject: active.subject,
+						discriminator: active.discriminator,
+						openSubtasks: active.subtasks.filter((s) => s.status !== "done" && s.status !== "cancelled").map((s) => s.text),
+					}
 				: null,
 			lastLogLine: doc.log.at(-1) ?? null,
 			counts: { done: c.done, open: c.open + c.active },
 		});
-		// Reminder fires when there is an active goal but plan.md was untouched since the last turn.
+		// Reminder fires when there is an active goal but goals.md was untouched since the last turn.
 		const planNow = readPlan(ctx);
 		if (active && planNow === lastInjectedPlan) body += `\n\n${reminder}`;
 		lastInjectedPlan = planNow;
 		return { message: { customType: PLAN_CONTEXT, content: body, display: false } };
 	});

+	// Enforce read-only planning: block file mutators while in plan mode so code isn't written before
+	// the goals are agreed. The agent falls back to read/grep/find/ls and read-only bash to explore.
+	pi.on("tool_call", async (event, ctx) => {
+		if (!state.isPlanMode) return;
+		// edit/write: blocked, except writing goals.md itself (the deliverable of plan mode).
+		if (PLAN_MODE_BLOCKED_TOOLS.includes(event.toolName)) {
+			const target = (event.input as { path?: string }).path;
+			if (target && resolve(ctx.cwd, target) === resolve(planPath(ctx))) return;
+			return { block: true, reason: `Plan mode is read-only: agree the goals in ${PLAN_REL} and choose Ready before writing code (${event.toolName} is blocked while planning; only ${PLAN_REL} may be written).` };
+		}
+		// bash: blocked only when the command looks mutating; read-only exploration stays open.
+		if (event.toolName === "bash") {
+			const command = (event.input as { command?: string }).command ?? "";
+			if (MUTATING_BASH_PATTERNS.some((re) => re.test(command))) {
+				return { block: true, reason: `Plan mode is read-only: this bash command looks like it mutates state, so it's blocked while planning. Explore read-only, agree the goals in ${PLAN_REL}, then choose Ready.\nCommand: ${command}` };
+			}
+		}
+	});
+
 	pi.on("agent_end", async (_event, ctx) => {
 		if (!state.isPlanMode || !ctx.hasUI) return;
 		const doc = parse(readPlan(ctx));
 		if (doc.goals.length === 0) {
-			ctx.ui.notify("No goals found in plan.md yet — ask the agent to draft them.", "warning");
+			ctx.ui.notify("No goals found in goals.md yet — ask the agent to draft them.", "warning");
 			return;
 		}
 		await reviewLoop(ctx);
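The `tool_call` hook above blocks bash only when the command "looks mutating". A minimal sketch of that gate follows. The contents of `MUTATING_BASH_PATTERNS` are not shown in this diff, so the patterns below, along with the names `EXAMPLE_MUTATING_PATTERNS` and `gateBash`, are assumptions made for illustration rather than the extension's actual list.

```typescript
// Hypothetical patterns for illustration only; the repository's
// MUTATING_BASH_PATTERNS is not visible in this diff.
const EXAMPLE_MUTATING_PATTERNS: RegExp[] = [
	/\b(rm|mv|cp|mkdir|touch|chmod|chown)\b/, // file-system mutators
	/\bgit\s+(add|commit|push|reset|checkout|merge|rebase)\b/, // git commands that change state
	/(^|[^<>])>{1,2}(?!&)/, // output redirection such as "> file" or ">> file"
];

type GateResult = { block: true; reason: string } | undefined;

// Mirrors the shape of the hook: do nothing outside plan mode, block when any
// pattern matches, otherwise let read-only exploration through.
function gateBash(command: string, isPlanMode: boolean): GateResult {
	if (!isPlanMode) return undefined;
	if (EXAMPLE_MUTATING_PATTERNS.some((re) => re.test(command))) {
		return { block: true, reason: `Plan mode is read-only. Command: ${command}` };
	}
	return undefined;
}
```

A regex gate like this is a guide, not a sandbox: it can miss mutations hidden in scripts and can flag harmless commands that merely mention a blocked word, which fits the "form guides, it does not gate" stance stated elsewhere in the diff.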
@@ -298,15 +437,33 @@ export default function piPlanExtension(pi: ExtensionAPI): void {

 // --- helpers (module scope; pure enough to keep out of the closure) -------------------------------

+/** Structured details returned by CompleteGoal so renderCall/renderResult can show metadata. */
+interface SignOffDetails {
+	goal: string;
+	outcome: "accepted" | "rejected" | "verify_failed" | "running";
+	phase?: string; // "verifying" | "spawning" | "judging" — while running
+	durationMs: number;
+	verifyCommand?: string;
+	verifyExitCode?: number;
+	judgeModel?: string;
+	reasoning: string;
+	isError?: boolean;
+}
+
 function text(s: string, isError = false) {
 	return { content: [{ type: "text" as const, text: s }], details: { isError }, isError };
 }

+function textWithDetails(s: string, details: SignOffDetails, isError = false) {
+	return { content: [{ type: "text" as const, text: s }], details, isError };
+}
+
 function stamp(): string {
 	return new Date().toISOString().slice(0, 16).replace("T", " ");
 }

-/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge. */
+/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge.
+ *  Returns the outcome plus the judge's (or verify's) reasoning so CompleteGoal can show WHY. */
 async function decideSignOff(
 	goal: Goal,
 	evidence: string,
@@ -314,16 +471,30 @@ async function decideSignOff(
 	judgeModel: string | null,
 	cwd: string,
 	signal: AbortSignal | undefined,
-): Promise<SignOff> {
+	onUpdate?: (partial: { content: Array<{ type: "text"; text: string }>; details: SignOffDetails }) => void,
+): Promise<{ outcome: SignOff; reasoning: string; durationMs: number }> {
+	const startedAt = Date.now();
+	const emit = (phase: string, text: string) => {
+		onUpdate?.({
+			content: [{ type: "text" as const, text }],
+			details: { goal: goal.subject, outcome: "running", phase, durationMs: Date.now() - startedAt, verifyCommand: goal.verify ?? undefined, judgeModel: judgeModel ?? undefined, reasoning: "" },
+		});
+	};
 	let verifyResult: { command: string; exitCode: number; outputTail: string } | null = null;
 	if (goal.verify) {
+		emit("verifying", `Running verify: ${goal.verify}`);
 		verifyResult = runVerify(goal.verify, cwd, signal);
 		if (verifyResult.exitCode !== 0) {
-			return { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail };
+			return {
+				outcome: { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail },
+				reasoning: `verify \`${goal.verify}\` exited ${verifyResult.exitCode}:\n${verifyResult.outputTail}`,
+				durationMs: Date.now() - startedAt,
+			};
 		}
 	}
-	const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal);
-	return verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
+	const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal, onUpdate);
+	const outcome: SignOff = verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
+	return { outcome, reasoning: verdict.reasoning, durationMs: Date.now() - startedAt };
 }

 /** Run the goal's verify command. It is agent-authored and trusted (single-user machine, guide-not-guard). */
@@ -351,33 +522,73 @@ async function runJudge(
 	judgeModel: string | null,
 	cwd: string,
 	signal: AbortSignal | undefined,
-): Promise<{ accept: boolean; missing: string }> {
+	onUpdate?: (partial: { content: Array<{ type: "text"; text: string }>; details: SignOffDetails }) => void,
+): Promise<{ accept: boolean; missing: string; reasoning: string; durationMs: number }> {
+	const startedAt = Date.now();
+	const emit = (phase: string, text: string) => {
+		onUpdate?.({
+			content: [{ type: "text" as const, text }],
+			details: { goal: goal.subject, outcome: "running", phase, durationMs: Date.now() - startedAt, verifyCommand: goal.verify ?? undefined, judgeModel: judgeModel ?? undefined, reasoning: "" },
+		});
+	};
 	const task = evidenceJudgeUser({
 		subject: goal.subject,
-		done_when: goal.done_when,
+		discriminator: goal.discriminator,
+		failure_modes: goal.failure_modes,
 		verify: goal.verify ?? null,
 		verifyResult,
-		failure_modes: goal.failure_modes,
 		evidence,
 		paths,
 	});
-	const args = ["-p", "--no-session", "--tools", READ_ONLY_TOOLS.join(","), "--append-system-prompt", evidenceJudgeSystem];
+	const args = ["-p", "--no-session", "--tools", JUDGE_TOOLS.join(","), "--exclude-tools", JUDGE_BLOCKED_TOOLS.join(","), "--append-system-prompt", evidenceJudgeSystem];
 	if (judgeModel) args.push("--model", judgeModel);
 	args.push(task);

+	emit("spawning", `Spawning read-only judge for: ${goal.subject}`);
 	const inv = getPiInvocation(args);
+	// FIXME(side-effect): pi -p --no-session clones the repo into the PARENT of cwd (so alongside
+	// the working dir), leaving a stale directory. The judge should run in a temp dir or inside the
+	// existing repo checkout so it doesn't pollute the user's workspace.
+	const JUDGE_TIMEOUT_MS = 120_000;
 	const output = await new Promise<string>((resolve) => {
+		let settled = false;
+		const timer = setTimeout(() => {
+			if (!settled) {
+				settled = true;
+				proc.kill();
+				resolve(`VERDICT: reject\nmissing: judge timed out after ${JUDGE_TIMEOUT_MS / 1000}s`);
+			}
+		}, JUDGE_TIMEOUT_MS);
 		const proc = spawn(inv.command, inv.args, { cwd, shell: false, stdio: ["ignore", "pipe", "pipe"], signal });
 		let out = "";
 		proc.stdout.on("data", (d) => (out += d));
 		proc.stderr.on("data", (d) => (out += d));
-		proc.on("close", () => resolve(out));
-		proc.on("error", (e) => resolve(`VERDICT: reject\nmissing: judge subprocess failed: ${e.message}`));
+		proc.on("close", () => {
+			if (!settled) {
+				settled = true;
+				clearTimeout(timer);
+				resolve(out);
+			}
+		});
+		proc.on("error", (e) => {
+			if (!settled) {
+				settled = true;
+				clearTimeout(timer);
+				resolve(`VERDICT: reject\nmissing: judge subprocess failed: ${e.message}`);
+			}
+		});
 	});

-	const verdictLine = output.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
+	// The subprocess emits ANSI/CSI control codes in -p mode; strip them so they don't leak into `missing`.
+	const clean = output.replace(/\u001b\[[0-9;?]*[ -/]*[@-~]/g, "");
+
+	const verdictLine = clean.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
 	const accept = /accept/i.test(verdictLine);
-	const missingMatch = output.match(/missing\s*:\s*([\s\S]*)$/i);
-	const missing = accept ? "" : (missingMatch?.[1].trim() || output.trim().slice(-500) || "judge gave no reason");
-	return { accept, missing };
+	const missingMatch = clean.match(/missing\s*:\s*([\s\S]*)$/i);
+	const missing = accept ? "" : (missingMatch?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason");
+	// The judge's own words (inspection + verdict), so CompleteGoal can show them. The verdict is at the
+	// end, so keep the tail when it's long.
+	const trimmed = clean.trim();
+	const reasoning = trimmed.length > 1800 ? `...\n${trimmed.slice(-1800)}` : trimmed;
+	return { accept, missing, reasoning, durationMs: Date.now() - startedAt };
 }
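The tail of `runJudge` turns the judge's raw transcript into a verdict. The sketch below isolates that parsing so it can be exercised without spawning a subprocess. The helper name `parseVerdict` and the `Verdict` interface are hypothetical, but the regular expressions are copied from the diff above.

```typescript
// Hypothetical standalone helper; it mirrors the parsing at the end of runJudge.
interface Verdict {
	accept: boolean;
	missing: string;
}

// CSI escape sequences (colours, cursor moves) emitted by the subprocess.
const ANSI_CSI = /\u001b\[[0-9;?]*[ -/]*[@-~]/g;

function parseVerdict(output: string): Verdict {
	const clean = output.replace(ANSI_CSI, "");
	// First line that looks like "VERDICT: ..."; empty string when there is none.
	const verdictLine = clean.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
	const accept = /accept/i.test(verdictLine);
	if (accept) return { accept, missing: "" };
	// Everything after the first "missing:" is the reason; otherwise fall back
	// to the transcript tail, then to a fixed message.
	const m = clean.match(/missing\s*:\s*([\s\S]*)$/i);
	const missing = m?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason";
	return { accept, missing };
}
```

One thing this makes visible: `/accept/i` matches anywhere on the verdict line, so a line such as `VERDICT: reject, cannot accept` would be read as an accept. Anchoring the test to the token directly after the colon would be stricter, if the judge prompt permits free text on that line.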
@@ -1,100 +1,135 @@
 /**
- * plan-file.ts — read plan.md, and the two writes CompleteGoal needs. That is all.
+ * plan-file.ts — read goals.md, and the two writes CompleteGoal needs. That is all.
 *
 * Pure module, no pi deps, so it unit-tests without a runtime. The file is the canonical store and
- * the agent edits it with its normal Edit tool (create goals, tick subtasks, append log), guided by
- * the format in prompts.tsx and the reminder -- the form guides, it does not gate (spec D3). So this
- * module does NOT render or create goals; the format's single source of truth is the planDrafting
- * prompt. The only programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to
- * record an accepted sign-off; both touch one line so the git diff stays readable.
+ * the agent edits it with its normal Edit tool (create goals, tick subtasks, fill evidence), guided
+ * by the format in prompts.ts and the reminder -- the form guides, it does not gate. The only
+ * programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to record an accepted
+ * sign-off; both touch one line so the diff stays readable.
 *
- * Format (spec §4):
+ * Format (markdown, checkbox-first, made to be skim-reviewed by a human):
 *
- *   # Plan: <objective>
+ *   # <plan title>
 *
- *   ## Goal: <subject>
- *   <!-- id: <slug> -->
- *   status: open | active | done | cancelled
- *   done_when: <falsifiable check; plus the symptom if NOT met>
- *   verify: <shell command, optional>
- *   failure_modes:
- *     - <pre-mortem item>
- *   - [ ] <subtask>
+ *   <context: the user's ask, preferences, decisions>
+ *
+ *   ## Goals
+ *
+ *   1. [ ] goal: <desc>            <- state in the checkbox: [ ] open  [/] active  [x] done  [-] cancelled
+ *     - discriminator: <positive observation that the goal succeeded, that no failure below could fake>
+ *     - subtle failure mode: <a way this looks done but isn't>
+ *     - verify: <optional shell command that exits 0 only when the discriminator passes>
+ *     - tasks:
+ *       1. [x] <subtask>           <- a subtask is any checkbox WITHOUT a "goal:" prefix
+ *       2. [/] <subtask>
+ *       3. [-] <subtask>           <- [-] or ~~[ ]~~ both read as cancelled
+ *     - evidence:                  <- empty at planning; filled at sign-off, read by CompleteGoal
+ *       - > <artifact path / link / metric, plus a short read of it>
+ *   2. [ ] goal: <desc>
+ *
+ *   # Future work / out of scope
 *
 *   ## Log
 *   - <verbatim append-only line>
+ *
+ * A goal/subtask's state lives in its checkbox (single source of truth, renders natively). Goals are
+ * matched by their <desc> (the text after "goal:"); the list number is human-facing only. Only
+ * CompleteGoal writes a goal's [x]; the agent sets [/] when it starts one.
 */

 export type GoalStatus = "open" | "active" | "done" | "cancelled";

 export interface Subtask {
 	text: string;
-	done: boolean;
+	status: GoalStatus;
 }

 export interface Goal {
-	id: string;
+	/** The text after "goal:" in the header line; the handle CompleteGoal matches on. */
 	subject: string;
 	status: GoalStatus;
-	done_when: string;
-	verify?: string;
+	/** Positive observation(s) that the goal succeeded AND that no failure mode could fake. The success test. Written at planning. */
+	discriminator: string[];
+	/** Subtle ways a "done" could be wrong (look-like-success failures). Written at planning. */
 	failure_modes: string[];
+	/** Optional command that exits 0 only when the discriminator passes (the cheap deterministic gate). */
+	verify?: string;
+	/** Proof the discriminator passed, pointing at durable artifacts. Written at completion; read by CompleteGoal. */
+	evidence: string[];
 	subtasks: Subtask[];
 }

 export interface PlanDoc {
-	objective: string;
+	title: string;
 	goals: Goal[];
 	/** Verbatim ## Log lines, including the leading "- ". */
 	log: string[];
 }

-const GOAL_HEADER = /^##\s+Goal:\s*(.*)$/;
-const ANY_HEADER = /^#{1,6}\s/;
+const TITLE = /^#\s+(.+?)\s*$/; // the first single-# H1
+const GOALS_HEADER = /^##\s+Goals\s*$/i;
 const LOG_HEADER = /^##\s+Log\s*$/i;
-const ID_COMMENT = /^<!--\s*id:\s*(.+?)\s*-->$/;
-const CHECKBOX = /^- \[([ xX])\]\s+(.*)$/;
+const ANY_HEADER = /^#{1,6}\s/;
+// A goal: a numbered or bulleted checkbox item whose text begins "goal:".
+const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i;
+// A section marker bullet under a goal (the trailing colon is optional, e.g. "- tasks").
+const KEY_LINE = /^\s*[-*]\s*(discriminator|subtle failure modes?|failure_modes?|verify|tasks?|evidence)\s*:?\s*(.*)$/i;
+// Any list item (numbered or bulleted); used for subtasks and for list items inside the sections.
+const LIST_ITEM = /^\s*(?:\d+\.|[-*])\s+(.*)$/;
+// A checkbox inside a list-item body (subtask). A leading/trailing ~~ marks it cancelled.
+const CHECKBOX_BODY = /^(~~)?\s*\[([ xX/-])\]\s*(.*)$/;
+
+const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" };
+const STATUS_TO_CHAR: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };
+
+function normalizeKey(raw: string): "discriminator" | "failure_modes" | "verify" | "tasks" | "evidence" {
+	const k = raw.toLowerCase();
+	if (k.startsWith("discriminator")) return "discriminator";
+	if (k.startsWith("verify")) return "verify";
+	if (k.startsWith("task")) return "tasks";
+	if (k.startsWith("evidence")) return "evidence";
+	return "failure_modes"; // "subtle failure mode(s)" / "failure_mode(s)"
+}

 export function parse(text: string): PlanDoc {
 	const lines = text.split("\n");
-	let objective = "";
+	let title = "";
 	const goals: Goal[] = [];
 	const log: string[] = [];

 	let cur: Goal | null = null;
-	let inFailureModes = false;
+	let curList: string[] | null = null; // the discriminator/failure_modes/evidence list "- " items append to
+	let inGoals = false;
 	let inLog = false;

 	const flush = () => {
 		if (cur) goals.push(cur);
 		cur = null;
-		inFailureModes = false;
+		curList = null;
 	};

 	for (const line of lines) {
-		const objMatch = /^#\s+Plan:\s*(.*)$/.exec(line);
-		if (objMatch) {
-			objective = objMatch[1].trim();
+		const tM = TITLE.exec(line);
+		if (tM && !title && !GOALS_HEADER.test(line) && !LOG_HEADER.test(line)) {
+			title = tM[1].trim();
 			continue;
 		}
-
-		const goalMatch = GOAL_HEADER.exec(line);
-		if (goalMatch) {
+		if (GOALS_HEADER.test(line)) {
 			flush();
+			inGoals = true;
 			inLog = false;
-			cur = { id: "", subject: goalMatch[1].trim(), status: "open", done_when: "", failure_modes: [], subtasks: [] };
 			continue;
 		}
-
 		if (LOG_HEADER.test(line)) {
 			flush();
+			inGoals = false;
 			inLog = true;
 			continue;
 		}
-
-		// Any other header ends the current goal / log section.
+		// Any other header (e.g. "# Future work") ends the goals / log section.
 		if (ANY_HEADER.test(line)) {
 			flush();
+			inGoals = false;
 			inLog = false;
 			continue;
 		}
@@ -103,50 +138,70 @@ export function parse(text: string): PlanDoc {
 			if (/^\s*-\s+/.test(line)) log.push(line);
 			continue;
 		}
+		if (!inGoals) continue; // title + context prose between the title and ## Goals

+		const goalM = GOAL_ITEM.exec(line);
+		if (goalM) {
+			flush();
+			cur = {
+				subject: goalM[2].trim(),
+				status: CHAR_TO_STATUS[goalM[1].toLowerCase()] ?? "open",
+				discriminator: [],
+				failure_modes: [],
+				evidence: [],
+				subtasks: [],
+			};
+			continue;
+		}
 		if (!cur) continue;

-		const idMatch = ID_COMMENT.exec(line.trim());
-		if (idMatch) {
-			cur.id = idMatch[1];
+		const keyM = KEY_LINE.exec(line);
+		if (keyM) {
+			const key = normalizeKey(keyM[1]);
+			const inlineVal = keyM[2].trim();
+			if (key === "verify") {
+				cur.verify = inlineVal || undefined;
+				curList = null;
+			} else if (key === "tasks") {
+				curList = null; // subtasks are identified by being a checkbox; this marker is cosmetic
+			} else {
+				curList = cur[key]; // discriminator | failure_modes | evidence
+				if (inlineVal) curList.push(inlineVal);
+			}
 			continue;
 		}

-		// A checkbox (column 0) is a subtask; checked first so it is never read as a failure mode.
-		const checkbox = CHECKBOX.exec(line);
-		if (checkbox) {
-			inFailureModes = false;
-			cur.subtasks.push({ done: checkbox[1].toLowerCase() === "x", text: checkbox[2].trim() });
-			continue;
-		}
-
-		const kv = /^(status|done_when|verify|failure_modes)\s*:\s*(.*)$/.exec(line);
-		if (kv) {
-			const [, key, value] = kv;
-			if (key === "status") cur.status = value.trim() as GoalStatus;
-			else if (key === "done_when") cur.done_when = value.trim();
-			else if (key === "verify") cur.verify = value.trim() || undefined;
-			else if (key === "failure_modes") inFailureModes = true;
-			continue;
-		}
-
-		// Indented "- " items under failure_modes: (a column-0 checkbox already returned above).
-		if (inFailureModes) {
-			const fm = /^\s*-\s+(.*)$/.exec(line);
-			if (fm) {
-				cur.failure_modes.push(fm[1].trim());
+		const listM = LIST_ITEM.exec(line);
+		if (listM) {
+			const body = listM[1];
+			const cb = CHECKBOX_BODY.exec(body);
+			if (cb) {
+				// A checkbox without a "goal:" prefix is a subtask of the current goal.
+				const cancelled = cb[1] === "~~" || body.includes("~~");
+				const status = cancelled ? "cancelled" : (CHAR_TO_STATUS[cb[2].toLowerCase()] ?? "open");
+				cur.subtasks.push({ text: cb[3].replace(/~~/g, "").trim(), status });
+				curList = null;
 				continue;
 			}
-			if (line.trim() !== "") inFailureModes = false;
+			// A plain "- " item (including a "- > quoted" evidence line) belongs to the current section (discriminator/failure/evidence).
+			if (curList) curList.push(body.trim());
+			continue;
+		}
+
+		// A non-empty, non-"- " line continues the current item, so multi-line evidence (a block quote
+		// of a log, a table, an interpretation line) stays attached to its item. Blank lines are skipped.
+		if (curList && line.trim() !== "" && curList.length > 0) {
+			curList[curList.length - 1] += `\n${line.trim()}`;
 		}
 	}
 	flush();

-	return { objective, goals, log };
+	return { title, goals, log };
 }

-export function findGoal(doc: PlanDoc, id: string): Goal | undefined {
-	return doc.goals.find((g) => g.id === id);
+export function findGoal(doc: PlanDoc, subject: string): Goal | undefined {
+	const want = subject.trim();
+	return doc.goals.find((g) => g.subject === want);
 }

 export function counts(doc: PlanDoc): { done: number; open: number; active: number } {
@@ -159,20 +214,18 @@ export function counts(doc: PlanDoc): { done: number; open: number; active: numb
 	return c;
 }

-/** Flip a goal's `status:` line in place (the one write CompleteGoal needs). */
-export function setGoalStatus(text: string, id: string, status: GoalStatus): string {
+/** Flip a goal's checkbox in place, matched by its subject (the one write CompleteGoal needs). */
+export function setGoalStatus(text: string, subject: string, status: GoalStatus): string {
 	const lines = text.split("\n");
-	let i = lines.findIndex((l) => ID_COMMENT.test(l.trim()) && ID_COMMENT.exec(l.trim())?.[1] === id);
-	if (i === -1) throw new Error(`Goal #${id} not found`);
-	for (; i < lines.length; i++) {
-		if (i > 0 && ANY_HEADER.test(lines[i]) && !GOAL_HEADER.test(lines[i]) && !LOG_HEADER.test(lines[i])) break;
-		const kv = /^(status\s*:\s*)(.*)$/.exec(lines[i]);
-		if (kv) {
-			lines[i] = `${kv[1]}${status}`;
+	const want = subject.trim();
+	for (let i = 0; i < lines.length; i++) {
+		const m = GOAL_ITEM.exec(lines[i]);
+		if (m && m[2].trim() === want) {
+			lines[i] = lines[i].replace(/\[[ xX/-]\]/, `[${STATUS_TO_CHAR[status]}]`);
 			return lines.join("\n");
 		}
 	}
-	throw new Error(`Goal #${id} has no status: line`);
+	throw new Error(`Goal "${subject}" not found`);
 }

 /**
@@ -184,28 +237,28 @@ export type SignOff =
 	| { kind: "rejected"; missing: string }
 	| { kind: "accepted" };

-/** Apply a sign-off outcome to plan.md text: accept flips status + logs; reject only logs. Pure. */
+/** Apply a sign-off outcome to goals.md text: accept flips the goal checkbox to [x] + logs; reject only logs. Pure. */
 export function recordSignOff(
 	text: string,
-	goalId: string,
+	subject: string,
 	when: string,
 	outcome: SignOff,
 ): { content: string; message: string; isError: boolean } {
-	const goal = findGoal(parse(text), goalId);
-	if (!goal) return { content: text, message: `No goal #${goalId} in plan.md.`, isError: true };
+	const goal = findGoal(parse(text), subject);
+	if (!goal) return { content: text, message: `No goal "${subject}" in goals.md.`, isError: true };

 	if (outcome.kind === "verify_failed") {
-		const content = appendLog(text, `${when} reject #${goalId}: verify exit ${outcome.exitCode}`);
+		const content = appendLog(text, `${when} reject "${subject}": verify exit ${outcome.exitCode}`);
 		return { content, message: `Sign-off rejected: verify failed (exit ${outcome.exitCode}).\n${outcome.outputTail}`, isError: true };
 	}
 	if (outcome.kind === "rejected") {
 		const oneLine = outcome.missing.replace(/\s+/g, " ").trim().slice(0, 200);
-		const content = appendLog(text, `${when} reject #${goalId}: ${oneLine}`);
+		const content = appendLog(text, `${when} reject "${subject}": ${oneLine}`);
 		return { content, message: `Sign-off rejected. Missing:\n${outcome.missing}`, isError: true };
 	}
-	const flipped = setGoalStatus(text, goalId, "done");
-	const content = appendLog(flipped, `${when} signed off #${goalId}: ${goal.subject} (oracle accept)`);
-	return { content, message: `Signed off #${goalId}: ${goal.subject}. Marked done in plan.md.`, isError: false };
+	const flipped = setGoalStatus(text, subject, "done");
+	const content = appendLog(flipped, `${when} signed off "${subject}" (judge accept)`);
+	return { content, message: `Signed off "${subject}". Marked done in goals.md.`, isError: false };
 }

 /** Append one verbatim line to ## Log (creating the section if absent). The other CompleteGoal write. */
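A short usage sketch of the checkbox flip that `setGoalStatus` performs. The regex and status map are re-declared from the diff so the block runs standalone, while `flipGoal` and the sample plan text are hypothetical stand-ins. It shows that only the line carrying the `goal:` prefix is rewritten and subtask checkboxes are left alone.

```typescript
type GoalStatus = "open" | "active" | "done" | "cancelled";

// Copied from the diff: a numbered or bulleted checkbox whose text begins "goal:".
const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i;
const STATUS_TO_CHAR: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };

// Hypothetical stand-in for setGoalStatus, same logic.
function flipGoal(text: string, subject: string, status: GoalStatus): string {
	const want = subject.trim();
	const lines = text.split("\n");
	for (let i = 0; i < lines.length; i++) {
		const m = GOAL_ITEM.exec(lines[i]);
		if (m && m[2].trim() === want) {
			// Replace only the first checkbox on the matched goal line.
			lines[i] = lines[i].replace(/\[[ xX/-]\]/, `[${STATUS_TO_CHAR[status]}]`);
			return lines.join("\n");
		}
	}
	throw new Error(`Goal "${subject}" not found`);
}

// Sample goals.md fragment (made up for this example).
const before = [
	"## Goals",
	"",
	"1. [/] goal: ship parser",
	"  - tasks:",
	"    1. [ ] write tests",
].join("\n");

const after = flipGoal(before, "ship parser", "done");
```

Because goals are matched by exact subject text, renaming a goal mid-run changes its handle, which is why the `CompleteGoal` error message asks for the exact text after `goal:`.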
@@ -1,91 +1,134 @@
 /**
- * pi-plan — all model-facing text, in flow order.
+ * pi-goals — all model-facing text, in flow order.
 *
 * Philosophy: the form guides a process; it does not police one. The agent can
- * edit plan.md freely. These prompts + the plan.md structure make the right path
+ * edit goals.md freely. These prompts + the goals.md structure make the right path
 * the easy path. The only step that is genuinely rigorous is the evidence judge
- * (6), and even that is reached by guiding the agent to call CompleteGoal, not by
+ * (7), and even that is reached by guiding the agent to call CompleteGoal, not by
 * trapping it. Bypasses stay visible in the git diff and the widget.
 *
- * Flow:
- *   SETUP (plan mode)     1. planDrafting        — strong/sticky model drafts goals
+ * Flow (this file is ordered the way the agent meets each text, so it reads as one pass):
+ *   SETUP (plan mode)     1. planDrafting        — drafts goals (read-only phase)
 *   EXEC, each turn start 2. planInjection       — "here is your plan, where you are"
 *   EXEC, periodic        3. reminder            — the typed nudge that drives upkeep + autonomy
 *   EXEC, loop continue   4. continuation        — keep going toward the active goal
 *   EXEC, after each turn 5. loopJudge           — continue / pause (cheap, foolable, ok)
- *   SIGN-OFF              6. evidenceJudge        — read-only verify (rigorous; the one real check)
+ *   SIGN-OFF, agent-side  6. completeGoalTool    — the CompleteGoal tool desc + param the agent reads
+ *   SIGN-OFF, judge-side  7. evidenceJudge       — read-only verify (rigorous; the one real check)
 *
- * Read top to bottom to see the whole process. 5 and 6 are kept adjacent on
- * purpose: the cheap-foolable vs must-not-be-fooled contrast is the design.
+ * Read top to bottom to see the whole process. 5 and 7 embody the design contrast:
+ * the cheap-foolable loop gate vs the must-not-be-fooled sign-off.
 *
- * WIRED in index.ts: 1 planDrafting, 2 planInjection, 3 reminder, 6 evidenceJudge.
+ * WIRED in index.ts: 1 planDrafting, 2 planInjection, 3 reminder, 6 completeGoalTool, 7 evidenceJudge.
 * NOT YET WIRED: 4 continuation and 5 loopJudge define the autonomous re-prompt loop, which is
 * intentionally not built in v1 (an until-done-style loop was judged too complex). They stay here so
 * the full intended flow is reviewable; wire them if/when the loop is added.
+ *
+ * The goal's test is the DISCRIMINATOR: the concrete observation that tells real success from the
+ * named subtle failure mode. It replaces a vague "done_when". Evidence is empty at planning and
+ * filled at sign-off (you don't always know the exact artifacts up front; the judge checks them then).
 */

 /* ─────────────────────────────────────────────────────────────────────────
 * 1. planDrafting  —  SETUP, plan mode
 *
- * System guidance for the plan-phase agent. Runs on the plan model (may differ
- * from the execution model; the choice is sticky — see oracle.json-style config).
- * This phase is read-only: explore, then draft goals into plan.md. No code yet.
- * The field requirements here are the whole "elicitation" — get them agreed up
- * front, because the human reviews this output before any execution.
+ * System guidance for the plan-phase agent. This phase is read-only (edit/write
+ * and mutating bash are blocked by a tool hook): explore, then draft goals into
+ * goals.md. The fields here are the whole "elicitation"; the human reviews this
+ * output before any execution.
 * ──────────────────────────────────────────────────────────────────────── */
 export const planDrafting = `\
-You are in plan mode. Explore the repository read-only, then draft a plan into plan.md.
-Do not write or run code in this phase. Produce goals the human will review and approve.
+You are in plan mode. The objective may arrive through conversation, not as one up-front command.
+Explore the repository read-only first, then ask: resolve discoverable facts by looking them up, and
+only ask the human when the answer is a genuine intent or preference choice that exploration can't
+settle. Don't write goals that branch on something you could just check. Do not write or run code in
+this phase (edit and write are blocked, and so is mutating bash). If the ask is itself read-only
+(e.g. research, a search, a report), explore enough to scope it, but leave the actual deliverable for
+after the human approves the plan. When the objective is clear, draft goals into goals.md and stop
+for review. Produce a plan the human will review and approve.

-Write each goal in this shape:
+Right-size it, don't force structure that isn't there:
+- Default to ONE goal. Add another only when it's a genuinely separate checkpoint you'd want signed
+  off on its own (it can pass or fail independently). Most objectives are 1-2 goals.
+- Subtasks are the steps inside a goal. Add them when a goal has 3+ distinct steps; skip them for a
+  single-action goal. Don't pad with trivial steps.
+- Don't invent goals to look thorough. When in doubt, merge.

-## Goal: <one short imperative line>
-status: open
-done_when: <a falsifiable check, plus the symptom you'd see if it's NOT met>
-verify: <a shell command that exits 0 only when the goal is met — include this whenever
-         success is expressible as tests/lint/build/a threshold; omit it otherwise>
-failure_modes:
-  - <a concrete way this could look done but isn't>
-  - <another>
-  - <if verify exists: "verify passes on a trivial or gamed test">
- [ ] <first subtask>
- [ ] <next subtask>
+Write the whole file in this shape (markdown checkboxes, made to be skim-reviewed):

-Rules for a good plan:
- Keep goals small enough that done_when is checkable in one sitting.
- done_when must be falsifiable. "Works well" is not a criterion; "p95 < 50ms on bench-X,
-  else timeouts in load-test.log" is.
- failure_modes are a pre-mortem: the cheap, specific ways a later "done" could be wrong.
-  This is the highest-value part — it shapes what evidence you'll collect.
- Prefer a verify command. A green deterministic check is worth more than a paragraph of
-  description, and it's the first thing checked at sign-off.
+# <short plan title>

-When the plan is drafted, present it and stop for review. Do not begin execution.`;
+<context: restate the user's ask, their stated preferences, and any decisions you've agreed on>
+
+## Goals
+
+1. [ ] goal: <one short imperative line>
+  - subtle failure mode: <a way this could look done but isn't>
+  - discriminator: <the concrete observation that tells real success from that failure>
+  - verify: <optional shell command that exits 0 only when the discriminator passes; omit if not testable>
+  - tasks:
+    1. [ ] <subtask>
+    2. [ ] <subtask>
+  - evidence:
+    - <leave empty now; filled at sign-off>
+2. [ ] goal: <...>
+
+# Future work / out of scope
+
+- <anything deliberately not in these goals>
+
+## Log
+
+Keep it lean and legible:
+- A goal is a checkbox line beginning "goal:"; its state is the checkbox ([ ] open, [/] active, [x]
+  done, [-] cancelled). Leave goals [ ] at planning. The number is just for the human to reference.
+- subtle failure mode + discriminator are the heart of this. List the ways a "done" could look
+  achieved but not be (empty/zero-count output, a silently-errored step, a gamed test, a flat/no-op
+  result that dodged every trap and still showed nothing; these are examples, find the ones that fit).
+- The discriminator is the POSITIVE observation that the goal actually succeeded AND that none of
+  those failure modes could have produced. It must show success happened -- the count moved the right
+  way, the test really exercised the path, the metric beat noise -- not merely that a failure was
+  ruled out: avoiding every failure mode is necessary, not sufficient. Name the success signal first,
+  then check it isn't something a failure mode could fake. Keep it terse.
+- The discriminator is the success test, written now, in place of a vague "done": make it a concrete,
+  checkable observation about a real artifact (a file, a test result, a committed diff, a metric), not
+  about goals.md's own checkbox.
+- subtasks: any checkbox WITHOUT a "goal:" prefix, under "- tasks:". Use [/] for in progress and [-]
+  for cancelled/impossible.
+- verify: prefer one when the discriminator is a test, build, threshold, or metric: a green check or
+  a printed number beats prose. Omit it otherwise.
+- evidence stays empty at planning. You don't always know the exact artifacts up front, and that's
+  fine: you fill evidence at sign-off, and a fresh read-only judge checks it then.
+
+When the goals are drafted, present them and stop for review. Do not begin execution.`;

 /* ─────────────────────────────────────────────────────────────────────────
 * 2. planInjection  —  EXEC, injected at each agent start (and after compaction)
 *
 * A late user-role message, NOT a system-prompt mutation (keeps the prefix cache
 * valid). Built from the parsed plan. MUST be byte-identical when nothing changed:
- * fixed field order, no volatile timestamps in the body. Pass only the active
- * goal + its open subtasks + the last log line — not the whole file.
+ * fixed field order, no volatile timestamps. Pass only the active goal + its open
+ * subtasks + the last log line, not the whole file.
 * ──────────────────────────────────────────────────────────────────────── */
 export function planInjection(p: {
-  objective: string;
-  activeGoal: { subject: string; done_when: string; openSubtasks: string[] } | null;
+  title: string;
+  activeGoal: { subject: string; discriminator: string[]; openSubtasks: string[] } | null;
  lastLogLine: string | null;
  counts: { done: number; open: number };
 }): string {
  if (!p.activeGoal) {
-    return `Plan (plan.md): ${p.objective}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal or run /plan.`;
+    // FIXME(heading): user wants the heading to show ".pi/goals.md: <title>" so the filename is explicit
+    // even in the injection. It used to say "Goals (goals.md):"; the strings below now use ".pi/goals.md:",
+    // so this may already be resolved. Confirm and drop the FIXME.
+    return `.pi/goals.md: ${p.title}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal (set its checkbox to [/]) or run /goals.`;
  }
  const subtasks = p.activeGoal.openSubtasks.length
    ? p.activeGoal.openSubtasks.map((s) => `  - [ ] ${s}`).join("\n")
    : "  (no open subtasks)";
+  const disc = p.activeGoal.discriminator.length ? p.activeGoal.discriminator.join("; ") : "(none set)";
  return `\
-Plan (plan.md): ${p.objective}
+.pi/goals.md: ${p.title}
 Active goal: ${p.activeGoal.subject}
-done_when: ${p.activeGoal.done_when}
+discriminator (the success test): ${disc}
 Open subtasks:
 ${subtasks}
 Last log: ${p.lastLogLine ?? "(none yet)"}
@@ -96,19 +139,20 @@ Progress: ${p.counts.done} done, ${p.counts.open} open.`;
 * 3. reminder  —  EXEC, periodic system-reminder
 *
 * The typed nudge. This is both the housekeeping and the autonomy engine — it is
- * what makes the process get followed without a hard gate. Fires after N
- * file-modifying turns since the last plan.md update while a goal is active.
- * Keep the wording stable so it doesn't thrash the cache.
+ * what makes the process get followed without a hard gate. Fires after a turn that
+ * left goals.md untouched while a goal is active. Keep the wording stable so it
+ * doesn't thrash the cache.
 * ──────────────────────────────────────────────────────────────────────── */
 export const reminder = `\
 <system-reminder>
-Keep plan.md current as you work:
- tasks: tick the subtasks you've finished; add any new ones you've discovered.
- log: append ONE short line to ## Log (append — don't rewrite earlier lines).
- goal: if the active goal's evidence is in, sign it off by calling CompleteGoal with that
-  evidence. Don't edit status to done by hand — CompleteGoal runs the check and records it.
- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely
-  blocked; if blocked, say what's blocking and why.
+Keep goals.md current as you work:
+- tasks: tick the subtasks you've finished ([/] for in progress); add any you've discovered.
+- log: append ONE short line to ## Log (append, don't rewrite earlier lines).
+- goal: when the active goal's discriminator is satisfied, fill its "evidence:" block in goals.md (a
+  list pointing at durable artifacts), then call CompleteGoal with the goal's desc. Don't tick the
+  goal [x] by hand; CompleteGoal reads the evidence, runs the check, and writes [x].
+- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely blocked;
+  if blocked, say what's blocking it.
 </system-reminder>`;

 /* ─────────────────────────────────────────────────────────────────────────
@@ -118,9 +162,9 @@ Keep plan.md current as you work:
 * continue. Does not mutate the system prompt, so the cache holds.
 * ──────────────────────────────────────────────────────────────────────── */
 export const continuation = `\
-Continue toward the active goal in plan.md. If it now meets its done_when, call CompleteGoal
-with your evidence (point to durable artifacts — saved logs, committed diffs, files — not just
-claims). If you're blocked, state what's blocking it.`;
+Continue toward the active goal in goals.md. If its discriminator is now satisfied, fill the goal's
+"evidence:" block (durable artifacts, e.g. saved logs, committed diffs, files, not just claims) and
+then call CompleteGoal with the goal's desc. If you're blocked, state what's blocking it.`;

 /* ─────────────────────────────────────────────────────────────────────────
 * 5. loopJudge  —  EXEC, runs after each turn to decide continue / pause
@@ -133,14 +177,14 @@ claims). If you're blocked, state what's blocking it.`;
 export const loopJudgeSystem = `\
 You decide whether an autonomous coding agent should keep working or pause for the human.
 Be conservative: only pause when the work is plainly finished or plainly blocked. When in
-doubt, continue. You are not verifying correctness — a later read-only judge does that.
+doubt, continue. You are not verifying correctness; a later read-only judge does that.
 Reply with ONLY a JSON object, no other text: {"done": boolean, "reason": "<one sentence>"}.
-Set done=true only if the agent's last message shows the active goal's done_when is met, or
-the agent says it is blocked and needs the human.`;
+Set done=true only if the agent's last message shows the active goal's discriminator is satisfied,
+or the agent says it is blocked and needs the human.`;

-export function loopJudgeUser(p: { activeGoalDoneWhen: string; lastResponse: string }): string {
+export function loopJudgeUser(p: { discriminator: string; lastResponse: string }): string {
  return `\
-Active goal done_when: ${p.activeGoalDoneWhen}
+Active goal discriminator (the success test): ${p.discriminator}

 Agent's last message:
 """
@@ -151,24 +195,50 @@ ${p.lastResponse}
 }

 /* ─────────────────────────────────────────────────────────────────────────
- * 6. evidenceJudge  —  SIGN-OFF, the one rigorous check
+ * 6. completeGoalTool  —  SIGN-OFF, agent-side
 *
- * Runs inside CompleteGoal, on the read-only oracle subprocess (fresh context,
- * strongest reasoning on the chosen provider; override to a different vendor for
- * high-stakes goals). It re-derives from the repo rather than trusting the
- * agent's transcription, and it judges whether a verify command actually tests
- * the criterion or could pass while a named failure mode holds (gaming).
+ * The description + param the agent reads on the one blessed tool, CompleteGoal.
+ * This is where the agent meets the sign-off: it fills evidence and calls the
+ * tool, which then runs verify + the judge (7). Kept here with the rest of the
+ * model-facing text so the whole process reads top to bottom.
+ * ──────────────────────────────────────────────────────────────────────── */
+export const completeGoalDescription =
+	"Sign off a goal once its discriminator is satisfied. First fill the goal's 'evidence:' block in " +
+	"goals.md: a list where each item pairs a durable artifact with a short read of it (a quoted+linked " +
+	"log, a table plus how to read it, or a metric plus what it shows; quote the key lines and link the " +
+	"rest, not a pasted blob or a bare claim). The read must show the success POSITIVELY happened (the " +
+	"result is present, the count moved the right way, the metric beat noise), not just that a failure " +
+	"was avoided; ruling out the failure modes is necessary but not sufficient. Then call this with the " +
+	"goal's desc (the text after 'goal:'). Runs the goal's verify command (if any) then a read-only " +
+	"subagent that inspects that evidence against the repo and the discriminator. On accept, the goal is " +
+	"marked done and logged; on reject, it stays open and you get what is missing. The subagent's " +
+	"reasoning is returned either way.";
+
+export const completeGoalParamDescription = "The goal's desc: the exact text after 'goal:' in its line.";
+
+/* ─────────────────────────────────────────────────────────────────────────
+ * 7. evidenceJudge  —  SIGN-OFF, judge-side; the one rigorous check
+ *
+ * Runs inside CompleteGoal, on a read-only pi subprocess (fresh context via
+ * --no-session, so it never sees the working agent's transcript; override to a
+ * different vendor for an independent cross-family check). It re-derives from the
+ * repo rather than trusting the agent's transcription, and judges whether the
+ * evidence satisfies the discriminator and rules out the named failure mode.
 *
 * The transport gives it read/grep/find/ls. The prompt below imposes the verdict
- * contract — the oracle returns prose by default, so parse the VERDICT line.
+ * contract — the subprocess returns prose by default, so parse the VERDICT line.
 * ──────────────────────────────────────────────────────────────────────── */
 export const evidenceJudgeSystem = `\
-You are a read-only reviewer signing off a coding goal. Do not trust claims — verify.
+You are a read-only reviewer signing off a coding goal. Do not trust claims; verify.
 Use read/grep/find/ls to inspect the repository and the cited artifacts yourself. Re-read the
 files, logs, and diffs the evidence points to; if something it asserts isn't on disk, you can't
-confirm it. If a verify command was run, judge whether it genuinely tests the criterion or
-could pass while one of the listed failure modes still holds — a tautological or skipped test
-is a reject. Check each failure mode is actually ruled out, not just unmentioned.
+confirm it. Judge whether the evidence shows the goal POSITIVELY succeeded -- the discriminator's
+success signal is actually present, not just that the failure modes were dodged. Avoiding every
+failure mode is necessary but not sufficient: a run can rule out each trap and still have produced
+nothing, so reject "no problems found" that lacks the positive result. Then check the named subtle
+failure modes are genuinely ruled out, not just unmentioned. If a verify command was run,
+judge whether it really tests the discriminator or could pass while the failure mode still holds; a
+tautological or skipped test is a reject.

 Finish with exactly these two lines and nothing after:
 VERDICT: accept | reject
@@ -176,10 +246,10 @@ missing: <empty if accept; otherwise a short list of what's needed before this c

 export function evidenceJudgeUser(p: {
  subject: string;
-  done_when: string;
+  discriminator: string[];
+  failure_modes: string[];
  verify: string | null;
  verifyResult: { command: string; exitCode: number; outputTail: string } | null;
-  failure_modes: string[];
  evidence: string;
  paths: string[];
 }): string {
@@ -188,9 +258,10 @@ export function evidenceJudgeUser(p: {
    : "verify command: none (no deterministic check for this goal)";
  return `\
 Goal: ${p.subject}
-done_when: ${p.done_when}
-failure_modes:
-${p.failure_modes.map((f) => `  - ${f}`).join("\n")}
+discriminator (must be satisfied):
+${p.discriminator.map((d) => `  - ${d}`).join("\n") || "  (none stated, note this)"}
+subtle failure modes (must be ruled out):
+${p.failure_modes.map((f) => `  - ${f}`).join("\n") || "  (none stated)"}

 ${verifyBlock}

@@ -198,7 +269,7 @@ Agent's evidence:
 ${p.evidence}

 Artifacts it points to (inspect these):
-${p.paths.map((x) => `  - ${x}`).join("\n") || "  (none listed — note this)"}
+${p.paths.map((x) => `  - ${x}`).join("\n") || "  (none listed, note this)"}

-Verify the goal against its done_when. Then give your VERDICT.`;
+Verify the evidence satisfies the discriminator and rules out the failure modes. Then give your VERDICT.`;
 }
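
Review note: the evidenceJudge comment says the subprocess returns prose, so the caller has to parse the VERDICT line. A minimal sketch of that step, assuming the reply ends with a `VERDICT:` line followed by a `missing:` line as the prompt requests. `parseVerdict` and the `Verdict` type are hypothetical names for illustration, not taken from the repository; the `accepted` / `rejected` kinds only mirror the outcome shapes used in the tests.

```typescript
// Hypothetical helper (not from the repo): turn the judge's prose reply into a verdict.
type Verdict =
  | { kind: "accepted" }
  | { kind: "rejected"; missing: string }
  | { kind: "unparseable" };

function parseVerdict(output: string): Verdict {
  const lines = output.trimEnd().split(/\r?\n/);
  // Scan from the end so an earlier mention of "VERDICT" in the reasoning is ignored.
  for (let i = lines.length - 1; i >= 0; i--) {
    const m = /^\s*VERDICT:\s*(accept|reject)\s*$/i.exec(lines[i]);
    if (!m) continue;
    if (m[1].toLowerCase() === "accept") return { kind: "accepted" };
    // On reject, take the first "missing:" line after the verdict, if any.
    const missingLine = lines.slice(i + 1).find((l) => /^\s*missing:/i.test(l)) ?? "";
    const missing = missingLine.replace(/^\s*missing:\s*/i, "").trim();
    return { kind: "rejected", missing: missing || "(judge gave no reason)" };
  }
  // No well-formed verdict line: let the caller decide (this sketch does not guess).
  return { kind: "unparseable" };
}
```

Treating a missing or malformed verdict as its own case, rather than as an accept, keeps a truncated or timed-out judge reply from signing a goal off by accident.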
@@ -1,26 +1,30 @@
 import { describe, expect, it } from "vitest";
 import { appendLog, counts, findGoal, parse, recordSignOff, setGoalStatus } from "../src/plan-file.js";

-const SAMPLE = `# Plan: ship the cache layer
+const SAMPLE = `# papers audit

-## Goal: Implement cache layer
-<!-- id: cache-layer-1 -->
-status: active
-done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
-verify: pytest tests/cache -q
-failure_modes:
-  - cache silently bypassed (hit-rate ~0, latency ok by luck)
-  - bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy
- [ ] load test
+Clean up steering/ metadata and kill empty dirs. Keep it read-only until I approve.

-## Goal: Document the API
-<!-- id: document-the-api-1 -->
-status: open
-done_when: every public fn has a docstring; else sphinx warns
-failure_modes:
-  - docstrings exist but are stale
+## Goals
+
+1. [/] goal: Implement cache layer
+  - discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0)
+  - subtle failure mode: cache silently bypassed, latency ok by luck
+  - verify: pytest tests/cache -q
+  - tasks:
+    1. [x] wire cache client
+    2. [/] eviction policy
+    3. ~~[ ]~~ distributed cache, out of scope
+  - evidence:
+    - > load-test.log: p95=41ms
+    - > hit-rate 0.93 (not bypassed)
+2. [ ] goal: Document the API
+  - discriminator: every public fn has a docstring; sphinx warns on none
+  - subtle failure mode: docstrings exist but are stale
+
+# Future work / out of scope
+
+- distributed cache

 ## Log
 - 2026-06-15 14:02  cache client wired; eviction next
@@ -48,27 +52,45 @@ function lineDelta(a: string, b: string): { added: number; removed: number } {
 describe("parse", () => {
 	const doc = parse(SAMPLE);

-	it("reads the objective and both goals", () => {
-		expect(doc.objective).toBe("ship the cache layer");
-		expect(doc.goals.map((g) => g.id)).toEqual(["cache-layer-1", "document-the-api-1"]);
+	it("reads the title and both goals (matched by subject)", () => {
+		expect(doc.title).toBe("papers audit");
+		expect(doc.goals.map((g) => g.subject)).toEqual(["Implement cache layer", "Document the API"]);
 	});

-	it("reads goal fields", () => {
-		const g = findGoal(doc, "cache-layer-1");
-		expect(g?.subject).toBe("Implement cache layer");
-		expect(g?.status).toBe("active");
-		expect(g?.done_when).toBe("p95 < 50ms on bench-X. If wrong: timeouts in load-test.log");
+	it("reads goal status from the checkbox", () => {
+		expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // [/]
+		expect(findGoal(doc, "Document the API")?.status).toBe("open"); // [ ]
+	});
+
+	it("reads discriminator, subtle failure mode, and verify as separate fields", () => {
+		const g = findGoal(doc, "Implement cache layer");
+		expect(g?.discriminator).toEqual(["hit-rate > 0.8 in load-test.log (a bypass reads ~0)"]);
+		expect(g?.failure_modes).toEqual(["cache silently bypassed, latency ok by luck"]);
 		expect(g?.verify).toBe("pytest tests/cache -q");
 	});

-	it("separates failure_modes from subtasks", () => {
-		const g = findGoal(doc, "cache-layer-1");
-		expect(g?.failure_modes).toHaveLength(2);
-		expect(g?.failure_modes[0]).toContain("cache silently bypassed");
+	it("reads subtasks with their checkbox state, strikethrough as cancelled", () => {
+		const g = findGoal(doc, "Implement cache layer");
 		expect(g?.subtasks).toEqual([
-			{ text: "wire cache client", done: true },
-			{ text: "eviction policy", done: false },
-			{ text: "load test", done: false },
+			{ text: "wire cache client", status: "done" },
+			{ text: "eviction policy", status: "active" },
+			{ text: "distributed cache, out of scope", status: "cancelled" },
+		]);
+	});
+
+	it("reads the evidence block separate from the other lists", () => {
+		const g = findGoal(doc, "Implement cache layer");
+		expect(g?.evidence).toEqual(["> load-test.log: p95=41ms", "> hit-rate 0.93 (not bypassed)"]);
+		expect(findGoal(doc, "Document the API")?.evidence).toEqual([]); // a goal with no evidence parses to []
+	});
+
+	it("keeps a multi-line evidence item together (quote + interpretation)", () => {
+		const doc2 = parse(
+			`# x\n\n## Goals\n\n1. [ ] goal: G\n  - discriminator: report has non-zero counts\n  - evidence:\n    - > report.txt: counts 52 -> 4\n      remaining 4 = index + 3 notes\n      almost certain the discriminator passes\n    - > second item, single line\n`,
+		);
+		expect(findGoal(doc2, "G")?.evidence).toEqual([
+			"> report.txt: counts 52 -> 4\nremaining 4 = index + 3 notes\nalmost certain the discriminator passes",
+			"> second item, single line",
 		]);
 	});

@@ -76,50 +98,28 @@ describe("parse", () => {
 		expect(doc.log).toEqual(["- 2026-06-15 14:02  cache client wired; eviction next"]);
 		expect(counts(doc)).toEqual({ done: 0, open: 1, active: 1 });
 	});
-});

-describe("failure_modes vs subtask disambiguation", () => {
-	it("a column-0 checkbox right after failure_modes: is a SUBTASK", () => {
-		const doc = parse(
-			`# Plan: x\n\n## Goal: G\n<!-- id: g-1 -->\nstatus: open\ndone_when: z\nfailure_modes:\n- [ ] first subtask\n- [x] second subtask\n`,
-		);
-		const g = findGoal(doc, "g-1");
-		expect(g?.failure_modes).toEqual([]);
-		expect(g?.subtasks).toEqual([
-			{ text: "first subtask", done: false },
-			{ text: "second subtask", done: true },
-		]);
-	});
-
-	it("an indented checkbox-shaped item inside failure_modes is a FAILURE MODE", () => {
-		const doc = parse(
-			`# Plan: x\n\n## Goal: G\n<!-- id: g-2 -->\nstatus: open\ndone_when: z\nfailure_modes:\n  - [ ] prose that looks like a checkbox\n- [ ] real subtask\n`,
-		);
-		const g = findGoal(doc, "g-2");
-		expect(g?.failure_modes).toEqual(["[ ] prose that looks like a checkbox"]);
-		expect(g?.subtasks).toEqual([{ text: "real subtask", done: false }]);
-	});
-
-	it("a goal with no failure_modes keeps its subtasks", () => {
-		const doc = parse(`# Plan: x\n\n## Goal: G\n<!-- id: g-3 -->\nstatus: open\ndone_when: z\n- [ ] only subtask\n`);
-		const g = findGoal(doc, "g-3");
-		expect(g?.failure_modes).toEqual([]);
-		expect(g?.subtasks).toEqual([{ text: "only subtask", done: false }]);
+	it("ignores the Future work section, does not read it as goals or log", () => {
+		expect(doc.goals).toHaveLength(2);
+		expect(doc.log).toHaveLength(1);
 	});
 });

 describe("the two CompleteGoal writes (minimal diff)", () => {
 	it("setGoalStatus replaces exactly one line, scoped to the right goal", () => {
-		const next = setGoalStatus(SAMPLE, "cache-layer-1", "done");
+		const next = setGoalStatus(SAMPLE, "Implement cache layer", "done");
 		expect(lineDelta(SAMPLE, next)).toEqual({ added: 1, removed: 1 });
-		expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("done");
-		expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("open"); // untouched
+		expect(findGoal(parse(next), "Implement cache layer")?.status).toBe("done");
+		expect(findGoal(parse(next), "Document the API")?.status).toBe("open"); // untouched
 	});

-	it("setGoalStatus targets the second goal without touching the first", () => {
-		const next = setGoalStatus(SAMPLE, "document-the-api-1", "active");
-		expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("active");
-		expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("active");
+	it("setGoalStatus keeps the number and goal: prefix, flips only the checkbox", () => {
+		expect(setGoalStatus(SAMPLE, "Implement cache layer", "done")).toContain("1. [x] goal: Implement cache layer");
+		expect(setGoalStatus(SAMPLE, "Document the API", "cancelled")).toContain("2. [-] goal: Document the API");
+	});
+
+	it("setGoalStatus throws on an unknown subject", () => {
+		expect(() => setGoalStatus(SAMPLE, "no such goal", "done")).toThrow();
 	});

 	it("appendLog adds exactly one line under ## Log", () => {
@@ -132,7 +132,7 @@ describe("the two CompleteGoal writes (minimal diff)", () => {
 	});

 	it("appendLog creates the section when absent", () => {
-		const noLog = "# Plan: x\n\n## Goal: y\n<!-- id: y-1 -->\nstatus: open\ndone_when: z\n";
+		const noLog = "# x\n\n## Goals\n\n1. [ ] goal: y\n  - discriminator: z\n";
 		expect(parse(appendLog(noLog, "first entry")).log).toEqual(["- first entry"]);
 	});
 });
@@ -141,30 +141,30 @@ describe("recordSignOff (CompleteGoal's pure record logic)", () => {
 	const WHEN = "2026-06-15 16:00";

 	it("accept flips status:done and logs a sign-off line", () => {
-		const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "accepted" });
+		const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "accepted" });
 		expect(r.isError).toBe(false);
 		const doc = parse(r.content);
-		expect(findGoal(doc, "cache-layer-1")?.status).toBe("done");
-		expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off #cache-layer-1: Implement cache layer (oracle accept)`);
+		expect(findGoal(doc, "Implement cache layer")?.status).toBe("done");
+		expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off "Implement cache layer" (judge accept)`);
 	});

 	it("verify_failed only logs a reject line, status stays active", () => {
-		const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
+		const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
 		expect(r.isError).toBe(true);
 		const doc = parse(r.content);
-		expect(findGoal(doc, "cache-layer-1")?.status).toBe("active"); // NOT marked done
-		expect(doc.log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: verify exit 1`);
+		expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // NOT marked done
+		expect(doc.log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": verify exit 1`);
 	});

 	it("rejected logs the (one-lined) missing reason, status stays", () => {
-		const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
+		const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
 		expect(r.isError).toBe(true);
-		expect(findGoal(parse(r.content), "cache-layer-1")?.status).toBe("active");
-		expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: no saved bench log`);
+		expect(findGoal(parse(r.content), "Implement cache layer")?.status).toBe("active");
+		expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": no saved bench log`);
 	});

 	it("unknown goal returns an error and does not touch the file", () => {
-		const r = recordSignOff(SAMPLE, "nope-1", WHEN, { kind: "accepted" });
+		const r = recordSignOff(SAMPLE, "nope", WHEN, { kind: "accepted" });
 		expect(r.isError).toBe(true);
 		expect(r.content).toBe(SAMPLE);
 	});
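
Review note: the tests above pin down a one-line write: the checkbox flips, the number and `goal:` prefix stay, and an unknown subject throws. A minimal sketch of a function with that behaviour, assuming goal lines look like `1. [/] goal: <subject>`. `flipGoalCheckbox` is a hypothetical name and this is not the repository's `setGoalStatus` implementation, only an illustration of the technique.

```typescript
// Hypothetical sketch (not the repo's setGoalStatus): flip one goal's checkbox in place.
type GoalStatus = "open" | "active" | "done" | "cancelled";

// Checkbox mark for each status, as listed in the drafting prompt.
const MARK: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };

function flipGoalCheckbox(content: string, subject: string, status: GoalStatus): string {
  const lines = content.split("\n");
  // Groups: 1 = number and "[", 2 = current mark, 3 = "] goal: ", 4 = subject text.
  const goalLine = /^(\s*\d+\.\s*\[)([ \/x-])(\]\s*goal:\s*)(.*)$/;
  const idx = lines.findIndex((l) => goalLine.exec(l)?.[4].trim() === subject);
  if (idx === -1) throw new Error(`no goal with subject: ${subject}`);
  // Rewrite only the mark; every other character on the line is preserved.
  lines[idx] = lines[idx].replace(
    goalLine,
    (_m, pre, _mark, mid, rest) => `${pre}${MARK[status]}${mid}${rest}`,
  );
  return lines.join("\n");
}
```

Because only one line changes, a diff of the file before and after shows exactly one removed and one added line, which is what the `lineDelta` assertion checks.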
Author	SHA1	Message	Date
wassname	e0470a0c6d	Judge: read-only + bash (no edit/write), renderCall/renderResult, streaming progress - Judge gets read, bash, grep, find, ls but edit+write are blocked via --exclude-tools - Added renderCall: shows goal name while running - Added renderResult: shows accept/reject icon, model, duration, collapsed/expanded view - Wired onUpdate through decideSignOff -> runJudge so the TUI shows progress while judging - Added SignOffDetails type for structured metadata - Added 120s timeout on judge subprocess	2026-06-17 18:21:45 +08:00
wassname	39c83994fa	FIXME: judge side-effect clones pollute user workspace pi -p --no-session clones the repo into the parent of cwd, leaving a stale directory that the NEXT judge then finds and rejects the goal over. Needs a temp-dir fix or in-repo inspection.	2026-06-17 18:16:32 +08:00
wassname	489f9b8c35	Clean pi-plan references, add judge timeout, fix heading format - Rename spec doc to 2026-06-15_pi-goals.md, update title - Update review.md spec reference - Rename piPlanExtension -> piGoalsExtension in src/index.ts - Add 120s timeout to judge subprocess (was unbounded, caused hang) - Change planInjection heading from 'Goals (goals.md):' to '.pi/goals.md:' - Add FIXMEs for tool label, progress visibility, heading format	2026-06-17 18:09:03 +08:00
wassname	0a1503dc04	pi-goals: move CompleteGoal desc into prompts.ts; trim README The tool description and param doc are model-facing, so they belong in prompts.ts with the rest. Add them as step 6 (completeGoalTool) and renumber the evidence judge to 7; prompts.ts is now ordered the way the agent meets each text, so it reads as one pass. The moved desc also carries the positive-success framing: evidence must show the success happened, not just that a failure was avoided. README trimmed (saying less, voice unchanged): tighter intro and comparison, less prose around the examples and sign-off steps. Humanizer lint clean. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-16 11:50:12 +08:00
wassname	838c42d7bd	pi-goals: discriminator/failure-mode format + visible sign-off judge Replace done_when with a discriminator + subtle-failure-mode pair as the heart of each goal. The discriminator is the POSITIVE success observation that no failure mode could fake, not just failure-avoidance: a run can dodge every trap and still produce nothing. Carried through planDrafting, the sign-off judge, README, and the parser doc. Format migration: flat numbered markdown goals (`1. [/] goal: ...`), keyword-anchored parsing (indentation cosmetic), goals matched by text, subtask states [ ]/[/]/[x]/[-] plus ~~strike~~. Evidence empty at planning, filled at sign-off, multi-line supported. CompleteGoal now returns the judge's reasoning under a `--- sign-off judge ---` block (was just "Signed off"), so the verdict is visible. Plan mode is read-only: edit/write (except goals.md) and mutating bash are blocked by a tool hook. 17 parser tests, typecheck + biome clean. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-16 11:45:08 +08:00
wassname	a65c822bf9	pi-plan -> pi-goals: rename package, command, and file to goals.md Distinguishes this from the other pi-plan extensions by foregrounding what's different (goals tracked to verified completion). Mechanical rename only, no behavior change: - package @wassname2/pi-plan -> @wassname2/pi-goals (+ repo url) - plan.md -> goals.md (the canonical file) - command /plan -> /goals - file H1 marker "# Plan:" -> "# Goals:", widget/session labels likewise - internal state keys pi-plan-* -> pi-goals-* Internal source filename (plan-file.ts) and identifiers (planDrafting, PlanDoc, setGoalStatus) keep "plan"; they're not user-visible. External burneikis/pi-plan references are left intact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-16 05:53:22 +08:00
wassname	bb00314932	pi-plan: checkbox-in-header goal state + evidence block + widget/judge fixes Goal state moves from a `status:` line into a checkbox on the goal header (single source of truth, renders natively): [ ] open, [/] active, [x] done, [-] cancelled. Only CompleteGoal writes [x]; the agent sets [/] when starting. The GoalStatus enum and all consumers (widget, injection, counts) are unchanged. Evidence becomes a goal field, not an ephemeral tool argument: an `evidence:` block the agent fills before sign-off, read by CompleteGoal from the file (git-tracked, reviewable). The tool is now CompleteGoal(goal_id) only. Also: - format reorder: subtasks under the goal; failure_modes + evidence as separated trailing blocks (no abutting dash-lists) - widget: (done/total tasks), and done goals show checked instead of hiding - drafting prompt: guard against a circular done_when (one that points at the file's own checkbox/log, which the sign-off writes, so it can never pass) - drafting template now includes the H1 and the <!-- id --> line CompleteGoal needs to locate a goal - strip ANSI/CSI control codes from the judge subprocess output Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-16 05:49:22 +08:00
wassname	f2f9e6a1b9	pi-plan: finish stale-pi crash fix, print plan on start, todo->task - The Ready->fresh-context crash was a stale pi.* call inside withSession. Prior commit moved sendUserMessage to sessionCtx but left pi.setSessionName inside withSession (also stale -> crash). Drop it (cosmetic) and use only sessionCtx in the swap window. - Print plan.md on execution start (both fresh and in-place) so the user sees what's being worked on after a context switch. Plan text captured before newSession since ctx goes stale. - Widget: "(N todo)" -> "(N task[s])" Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-15 20:49:57 +08:00
wassname	3134adf203	pi-plan: fix crash on Ready->fresh-context; drop em-dashes in prompts - startExecution: inside withSession, send via the ReplacedSessionContext (sessionCtx.sendUserMessage) and set the session name there. The old code used the global pi.* handle bound to the replaced session, which is stale after newSession (runner.assertActive) -> crash on the "fresh, compacted context" choice. - prompts: replace em-dashes in model-facing strings with commas/ semicolons/periods (humanizer pass; comments left as-is) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-15 20:32:30 +08:00
wassname	861b2ea157	pi-plan: right-size plans (fewer goals), lean done_when/failure_modes The drafting prompt over-decomposed: one goal per item, long run-on done_when (criterion + failure symptom in one line), and 3 mandatory failure_modes. Plans came out verbose and hard to read. - planDrafting: default to ONE goal; add another only for a genuinely separate checkpoint; near-identical items become subtasks. Subtasks only for 3+ step goals. Don't invent phases. (granularity heuristic adapted from tintinweb/pi-tasks when-to/when-not guidance) - done_when: one falsifiable check, no embedded "if wrong" clause (the failure symptom belongs in failure_modes) - failure_modes: 0-2 terse items, optional - Sync the stale done_when wording in README and plan-file.ts comment Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-15 20:28:02 +08:00
wassname	158e04f4ac	pi-plan: fix corrupted index.ts, queue revise msg, bare /plan prompts - Restore exitPlanMode closing brace + CompleteGoal tool registration opening that an earlier edit dropped (parse error at 224) - Edit-revise path now sends with deliverAs:"followUp" so it doesn't throw "Agent is already processing" mid-stream - Bare /plan now prompts for an objective and enters plan mode instead of only showing the current plan Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-15 20:23:55 +08:00