Compare commits

11 Commits

Author SHA1 Message Date
wassname e0470a0c6d Judge: read-only + bash (no edit/write), renderCall/renderResult, streaming progress
- Judge gets read, bash, grep, find, ls but edit+write are blocked via --exclude-tools
- Added renderCall: shows goal name while running
- Added renderResult: shows accept/reject icon, model, duration, collapsed/expanded view
- Wired onUpdate through decideSignOff -> runJudge so the TUI shows progress while judging
- Added SignOffDetails type for structured metadata
- Added 120s timeout on judge subprocess
2026-06-17 18:21:45 +08:00
wassname 39c83994fa FIXME: judge side-effect clones pollute user workspace
pi -p --no-session clones the repo into the parent of cwd, leaving a stale
directory that the NEXT judge then finds and rejects the goal over. Needs a
temp-dir fix or in-repo inspection.
2026-06-17 18:16:32 +08:00
wassname 489f9b8c35 Clean pi-plan references, add judge timeout, fix heading format
- Rename spec doc to 2026-06-15_pi-goals.md, update title
- Update review.md spec reference
- Rename piPlanExtension -> piGoalsExtension in src/index.ts
- Add 120s timeout to judge subprocess (was unbounded, caused hang)
- Change planInjection heading from 'Goals (goals.md):' to '.pi/goals.md:'
- Add FIXMEs for tool label, progress visibility, heading format
2026-06-17 18:09:03 +08:00
wassname 0a1503dc04 pi-goals: move CompleteGoal desc into prompts.ts; trim README
The tool description and param doc are model-facing, so they belong in
prompts.ts with the rest. Add them as step 6 (completeGoalTool) and
renumber the evidence judge to 7; prompts.ts is now ordered the way the
agent meets each text, so it reads as one pass.

The moved desc also carries the positive-success framing: evidence must
show the success happened, not just that a failure was avoided.

README trimmed (saying less, voice unchanged): tighter intro and
comparison, less prose around the examples and sign-off steps. Humanizer
lint clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-16 11:50:12 +08:00
wassname 838c42d7bd pi-goals: discriminator/failure-mode format + visible sign-off judge
Replace done_when with a discriminator + subtle-failure-mode pair as the
heart of each goal. The discriminator is the POSITIVE success observation
that no failure mode could fake, not just failure-avoidance: a run can
dodge every trap and still produce nothing. Carried through planDrafting,
the sign-off judge, README, and the parser doc.

Format migration: flat numbered markdown goals (`1. [/] goal: ...`),
keyword-anchored parsing (indentation cosmetic), goals matched by text,
subtask states [ ]/[/]/[x]/[-] plus ~~strike~~. Evidence empty at
planning, filled at sign-off, multi-line supported.

CompleteGoal now returns the judge's reasoning under a
`--- sign-off judge ---` block (was just "Signed off"), so the verdict is
visible. Plan mode is read-only: edit/write (except goals.md) and
mutating bash are blocked by a tool hook.

17 parser tests, typecheck + biome clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-16 11:45:08 +08:00
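The read-only plan mode described in this commit can be pictured with a small sketch. It assumes the tool hook tests each bash command against a blocklist of regular expressions, which is what the `MUTATING_BASH_PATTERNS` list in the `src/index.ts` diff suggests. The function name `looksMutating` and the three-pattern subset are illustrative choices, not the extension's actual hook.

```typescript
// Hypothetical sketch: a small subset of a mutating-command blocklist.
const MUTATING_PATTERNS: RegExp[] = [
  /\b(rm|rmdir|mv|cp|mkdir|touch|chmod|tee)\b/i, // file-system mutators
  />\s*[^&\s]/, // redirect to a file; fd-dups like `2>&1` and `>&2` do not match
  /\bgit\s+(add|commit|push|reset|checkout)\b/i, // history and worktree mutators
];

// Returns true when any pattern matches, i.e. the command would be blocked in plan mode.
function looksMutating(command: string): boolean {
  return MUTATING_PATTERNS.some((pattern) => pattern.test(command));
}

console.log(looksMutating("git log --oneline 2>&1")); // false
console.log(looksMutating("echo hi > out.txt")); // true
```

A blocklist like this errs toward letting exploration through: anything it does not recognise is allowed, so it stops the obvious mutators rather than guaranteeing a read-only shell.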
wassname a65c822bf9 pi-plan -> pi-goals: rename package, command, and file to goals.md
Distinguishes this from the other pi-plan extensions by foregrounding what's
different (goals tracked to verified completion). Mechanical rename only, no
behavior change:
- package @wassname2/pi-plan -> @wassname2/pi-goals (+ repo url)
- plan.md -> goals.md (the canonical file)
- command /plan -> /goals
- file H1 marker "# Plan:" -> "# Goals:", widget/session labels likewise
- internal state keys pi-plan-* -> pi-goals-*

Internal source filename (plan-file.ts) and identifiers (planDrafting, PlanDoc,
setGoalStatus) keep "plan"; they're not user-visible. External burneikis/pi-plan
references are left intact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-16 05:53:22 +08:00
wassname bb00314932 pi-plan: checkbox-in-header goal state + evidence block + widget/judge fixes
Goal state moves from a `status:` line into a checkbox on the goal header
(single source of truth, renders natively): [ ] open, [/] active, [x] done,
[-] cancelled. Only CompleteGoal writes [x]; the agent sets [/] when starting.
The GoalStatus enum and all consumers (widget, injection, counts) are unchanged.

Evidence becomes a goal field, not an ephemeral tool argument: an `evidence:`
block the agent fills before sign-off, read by CompleteGoal from the file
(git-tracked, reviewable). The tool is now CompleteGoal(goal_id) only.

Also:
- format reorder: subtasks under the goal; failure_modes + evidence as
  separated trailing blocks (no abutting dash-lists)
- widget: (done/total tasks), and done goals show checked instead of hiding
- drafting prompt: guard against a circular done_when (one that points at the
  file's own checkbox/log, which the sign-off writes, so it can never pass)
- drafting template now includes the H1 and the <!-- id --> line CompleteGoal
  needs to locate a goal
- strip ANSI/CSI control codes from the judge subprocess output

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-16 05:49:22 +08:00
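The last bullet above (stripping ANSI/CSI control codes from the judge output) could look roughly like the sketch below. It assumes the judge's stdout arrives as one string and that only CSI sequences (colours, cursor moves) need removing. `stripCsi` is a hypothetical helper name, not necessarily what the extension uses.

```typescript
// A CSI sequence is ESC [ followed by parameter bytes (0x30-0x3F),
// intermediate bytes (0x20-0x2F), and one final byte (0x40-0x7E).
const CSI_PATTERN = /\x1b\[[0-?]*[ -\/]*[@-~]/g;

// Hypothetical helper: remove CSI sequences so the verdict text can be read and matched as plain text.
function stripCsi(text: string): string {
  return text.replace(CSI_PATTERN, "");
}

console.log(stripCsi("\x1b[32mVERDICT: accept\x1b[0m")); // "VERDICT: accept"
```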
wassname f2f9e6a1b9 pi-plan: finish stale-pi crash fix, print plan on start, todo->task
- The Ready->fresh-context crash was a stale pi.* call inside
  withSession. Prior commit moved sendUserMessage to sessionCtx but
  left pi.setSessionName inside withSession (also stale -> crash).
  Drop it (cosmetic) and use only sessionCtx in the swap window.
- Print plan.md on execution start (both fresh and in-place) so the
  user sees what's being worked on after a context switch. Plan text
  captured before newSession since ctx goes stale.
- Widget: "(N todo)" -> "(N task[s])"

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-15 20:49:57 +08:00
wassname 3134adf203 pi-plan: fix crash on Ready->fresh-context; drop em-dashes in prompts
- startExecution: inside withSession, send via the ReplacedSessionContext
  (sessionCtx.sendUserMessage) and set the session name there. The old
  code used the global pi.* handle bound to the replaced session, which
  is stale after newSession (runner.assertActive) -> crash on the
  "fresh, compacted context" choice.
- prompts: replace em-dashes in model-facing strings with commas/
  semicolons/periods (humanizer pass; comments left as-is)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-15 20:32:30 +08:00
wassname 861b2ea157 pi-plan: right-size plans (fewer goals), lean done_when/failure_modes
The drafting prompt over-decomposed: one goal per item, long run-on
done_when (criterion + failure symptom in one line), and 3 mandatory
failure_modes. Plans came out verbose and hard to read.

- planDrafting: default to ONE goal; add another only for a genuinely
  separate checkpoint; near-identical items become subtasks. Subtasks
  only for 3+ step goals. Don't invent phases. (granularity heuristic
  adapted from tintinweb/pi-tasks when-to/when-not guidance)
- done_when: one falsifiable check, no embedded "if wrong" clause (the
  failure symptom belongs in failure_modes)
- failure_modes: 0-2 terse items, optional
- Sync the stale done_when wording in README and plan-file.ts comment

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-15 20:28:02 +08:00
wassname 158e04f4ac pi-plan: fix corrupted index.ts, queue revise msg, bare /plan prompts
- Restore exitPlanMode closing brace + CompleteGoal tool registration
  opening that an earlier edit dropped (parse error at 224)
- Edit-revise path now sends with deliverAs:"followUp" so it doesn't
  throw "Agent is already processing" mid-stream
- Bare /plan now prompts for an objective and enters plan mode instead
  of only showing the current plan

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-15 20:23:55 +08:00
9 changed files with 803 additions and 412 deletions
+1
@@ -1,5 +1,6 @@
node_modules/
dist/
*.log
.pi/
docs/reviews/raw.jsonl
docs/reviews/err.txt
+118 -63
@@ -1,99 +1,153 @@
# pi-plan
# pi-goals
A [pi](https://github.com/badlogic/pi-mono) extension for plan-driven, goal-tracked work in one
`plan.md`. Set up goals (with evidence and failure modes) in plan mode, work them, and sign a goal
off only when a read-only subagent has checked the evidence.
Plan mode for agreeing on goals before any code gets written. Each goal names the subtle failure mode
that could fake a "done" and the discriminator that tells real success from it, plus subtasks and the
evidence checked at sign-off. It lives in one markdown file. A widget keeps the goals in front of you
through compaction, a reminder nudges the agent to keep the file current, and a goal is signed off
only after a read-only subagent checks its evidence.
Successor to [pi-lgtm](https://github.com/wassname/pi-lgtm), kept deliberately small: about
[burneikis/pi-plan](https://github.com/burneikis/pi-plan) plus the additions, goals with evidence,
a sign-off check, a widget, and a reminder.
The form guides; it does not gate. The agent edits `plan.md` with its normal Edit tool. The one
blessed tool is `CompleteGoal`, which runs the sign-off check and records the result. The reminder,
the injected plan summary, and git/widget visibility carry the process. It trusts the agent's
judgement rather than guarding it.
Like [pi-milestones](https://github.com/Neuron-Mr-White/UniPi/tree/main/packages/milestone) and
[burneikis/pi-plan](https://github.com/burneikis/pi-plan), it guides rather than guards: a form and a
process the agent follows. [pi-lgtm](https://github.com/wassname/pi-lgtm) was my earlier, more complex
attempt.
## Install
```bash
pi install npm:@wassname2/pi-plan
pi install npm:@wassname2/pi-goals
```
Or run without installing:
Or run it without installing:
```bash
pi -e npm:@wassname2/pi-plan
pi -e npm:@wassname2/pi-goals
```
## Use
```
/plan add CSV export to the report view
/goals CSV export for the report view
```
1. Plan. The agent explores read-only and writes goals into `plan.md` (see format below).
2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel.
On Ready you choose whether to keep the current context or start fresh and compacted.
3. Work. Each turn the active goal is injected (so it survives compaction) and a reminder nudges
the agent to keep `plan.md` current and work autonomously. When a goal's `done_when` is met the
agent calls `CompleteGoal`, which runs `verify` and a read-only judge and, on accept, marks it
done and logs it.
`/goals` enters plan mode and starts a conversation; the description is an optional seed, so plain
`/goals` works too. From there:
Other commands: `/plan` (print the plan), `/plan clear` (empty `plan.md`, history kept in git),
`/plan judge <model-ref>` (use a specific model for the sign-off judge; default is your current
model).
1. Plan. The agent explores read-only, asks about anything unclear, and writes the goals into
`.pi/goals.md`.
2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel. On
Ready you choose whether to keep the current context or start fresh and compacted.
3. Work. Each turn the active goal is injected so it survives compaction, and a reminder nudges the
agent to keep `goals.md` current and keep going. When a goal's discriminator is satisfied the agent
calls `CompleteGoal`, which runs `verify` and a read-only judge, then marks the goal done and logs it.
## plan.md format
Other commands: `/goals clear` empties `.pi/goals.md`; `/goals judge <model-ref>` picks a specific
model for the sign-off judge (the default is your current model).
One file holds the objective, the goals, and a short append-only log.
## Example
```
/goals audit the papers dir metadata and clean up empty dirs
```
The agent explores read-only, drafts the goal with a subtle failure mode and the discriminator that
beats it, and stops for review:
```markdown
# Plan: ship the cache layer
## Goals
## Goal: Implement cache layer
<!-- id: cache-layer-1 -->
status: active
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
failure_modes:
- cache silently bypassed (hit-rate ~0, latency ok by luck)
- bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy
1. [ ] goal: Audit steering/ metadata and remove empty dirs
- subtle failure mode: report written but counts are zero (resolver errored silently)
- discriminator: report shows the XXXX count before/after AND a non-zero rename count
- tasks:
1. [ ] dry-run the metadata resolve
2. [ ] remove the empty _artifacts dirs
3. [ ] write the report
- evidence:
- <empty until sign-off>
```
You choose Ready. The agent works the subtasks, fills `evidence` (each item an artifact plus a short
read of it), and calls `CompleteGoal`:
```markdown
- evidence:
- > scripts/metadata_report.txt: XXXX 52 -> 4, 146 empty _artifacts removed
- > 48 files renamed; almost certain done, the silent-resolver failure mode is ruled out
```
A fresh read-only subagent re-checks the evidence against the repo and the discriminator, then
returns its verdict and reasoning:
```
Signed off "Audit steering/ metadata and remove empty dirs". Marked done in goals.md.
--- sign-off judge ---
metadata_report.txt present; counts 52 -> 4 confirmed; rename log shows 48 renamed (not zero).
VERDICT: accept
```
## The goals.md format
One project-local file, `<cwd>/.pi/goals.md` (gitignored), holds the title, a context block, the
goals, and a short append-only log. A fresh `/goals` draft replaces it.
```markdown
# ship the cache layer
Latency target came from the SLO review; keep the existing client API.
## Goals
1. [/] goal: Implement cache layer
- subtle failure mode: cache silently bypassed, latency ok by luck
- discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0)
- verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
- tasks:
1. [x] wire cache client
2. [/] eviction policy
- evidence:
- > load-test.log: p95=41ms, hit-rate 0.93 (not bypassed)
# Future work / out of scope
- distributed cache
## Log
- 2026-06-15 14:02 cache client wired; eviction next
```
- A goal is a `## Goal:` header with an `<!-- id -->`, a `status:`
(`open` | `active` | `done` | `cancelled`), a falsifiable `done_when:` (what you expect, and the
symptom if it is NOT met), an optional `verify:` shell command, a `failure_modes:` pre-mortem
list, and `- [ ]` subtasks.
- `done_when` names the evidence that distinguishes real success from a subtle failure. `verify`,
when present, is the deterministic first stage of the sign-off check.
- The agent ticks subtasks, appends to `## Log`, and sets `status` as it works. Multiple goals may
be `active`.
- A goal is a numbered checkbox line beginning `goal:`; the checkbox carries its state (`[ ]` open,
`[/]` active, `[x]` done, `[-]` cancelled). Goals are matched by their text, so the number is just
for you to reference.
- The `discriminator` is the success test, written while planning: the positive observation that the
goal succeeded and that none of the `subtle failure mode`s could fake (a count moved, a test
exercised the path, a metric beat noise), not just that a failure was avoided. `evidence` is the
proof, filled at sign-off: each item pairs a durable artifact (a quoted and linked log, a table, a
metric) with a short read of it. `verify`, when present, is the deterministic first stage.
- Subtasks are any checkbox without a `goal:` prefix, under `- tasks:`. The agent ticks them, appends
to `## Log`, and sets a goal `[/]` when it starts it; only `CompleteGoal` writes `[x]`. Several
goals can be active at once.
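The checkbox rules in the bullets above can be sketched as a single-line parser. This is a minimal illustration, assuming one regular expression per line and ignoring `~~strike~~` handling. `parseCheckboxLine` and its types are hypothetical names, not the parser in `plan-file.ts`.

```typescript
type Status = "open" | "active" | "done" | "cancelled";

interface ParsedLine {
  kind: "goal" | "task";
  status: Status;
  text: string;
}

const STATUS_BY_MARK: Record<string, Status> = {
  " ": "open",
  "/": "active",
  x: "done",
  "-": "cancelled",
};

// Leading whitespace is ignored (indentation is cosmetic); the `goal:` keyword decides the kind.
const CHECKBOX_LINE = /^\s*(?:\d+\.|[-*])\s+\[([ \/xX-])\]\s+(goal:\s*)?(.*)$/;

// Hypothetical helper: returns null for lines that are not checkbox lines (e.g. `- evidence:`).
function parseCheckboxLine(line: string): ParsedLine | null {
  const match = CHECKBOX_LINE.exec(line);
  if (!match) return null;
  const status = STATUS_BY_MARK[match[1].toLowerCase()];
  return { kind: match[2] ? "goal" : "task", status, text: match[3].trim() };
}

console.log(parseCheckboxLine("1. [/] goal: Implement cache layer"));
console.log(parseCheckboxLine("   2. [x] wire cache client"));
```

Because goals are matched by text rather than by number, a caller would compare the returned `text` against the goal name passed to `CompleteGoal`.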
## The sign-off check (`CompleteGoal`)
## Signing off a goal (`CompleteGoal`)
`CompleteGoal(goal_id, evidence, paths?)` is the one blessed completion path:
`CompleteGoal(goal)` (matched by the goal's text) is the only tool that marks a goal done; everything
else is the agent editing the file. It reads the goal's `evidence:` block from `.pi/goals.md`, then:
1. If the goal has a `verify:` command, it is run. A non-zero exit rejects immediately, with no model
call.
2. Otherwise a read-only `pi` subprocess (the judge) inspects the evidence against the repo and the
named failure modes and returns a verdict. It re-derives from the artifacts you point it at
rather than trusting the claim, so point `evidence`/`paths` at durable artifacts (saved logs,
committed diffs, files).
3. On accept, the goal's `status` flips to `done` and a `## Log` line is written. On reject, the
goal stays open and the agent is told what is missing.
1. If the goal has a `verify:` command, it runs. A non-zero exit rejects right away, no model call.
2. Otherwise a read-only `pi` subprocess (a fresh `--no-session` context, so it never sees the working
agent's transcript) inspects the `evidence:` against the repo, the `discriminator`, and the
`subtle failure mode`. It re-derives from the cited artifacts rather than trusting the claim, so
list real artifacts, not assertions.
3. On accept, the goal flips to `[x]` and a `## Log` line is written. On reject, it stays open and the
agent is told what is missing. Either way the judge's reasoning comes back in the result.
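The three steps above amount to a two-stage check: a deterministic `verify` gate, then the judge. The sketch below shows only that control flow. The `signOff` function and the injected `runVerify` and `runJudge` callbacks are hypothetical stand-ins for the shell command and the `pi` subprocess, not the extension's real signatures.

```typescript
interface Verdict {
  accepted: boolean;
  reasoning: string;
}

interface SignOffInput {
  // Returns the exit code of the goal's `verify:` command; null when the goal has none.
  runVerify: (() => number) | null;
  // Stand-in for the read-only judge subprocess.
  runJudge: () => Verdict;
}

// Hypothetical sketch of the decision order: verify first, judge only if verify passes or is absent.
function signOff(input: SignOffInput): Verdict {
  if (input.runVerify) {
    const exitCode = input.runVerify();
    if (exitCode !== 0) {
      // Deterministic stage failed: reject without any model call.
      return { accepted: false, reasoning: `verify exited with code ${exitCode}` };
    }
  }
  return input.runJudge();
}

const result = signOff({
  runVerify: () => 1,
  runJudge: () => ({ accepted: true, reasoning: "evidence matches the discriminator" }),
});
console.log(result.accepted); // false
```

Putting the cheap, deterministic check first means a failing test suite never costs a model call, and the judge only sees goals that already pass their own `verify`.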
The judge defaults to your current model (guaranteed authorized and capable). Set a different one
with `/plan judge <provider/model>` for an independent cross-family check.
The judge defaults to your current model (a fresh context, same weights). Point it at another with
`/goals judge <provider/model>` for an independent cross-family check.
## Prompts
All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so the process is
easy to review end to end.
All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so you can read the
whole process top to bottom.
## Develop
@@ -106,8 +160,9 @@ npm run lint
## Not (yet) included
No autonomous re-prompt loop (an until-done-style loop judge). Autonomy comes from the reminder, not
a harness. Plan-phase model stickiness is a documented next step.
- No autonomous re-prompt loop. The reminder nudges the agent within a turn, but the turn still ends
and hands back to you; nothing auto-re-prompts until the goals are done.
- The plan and execution phases can't yet run on different, sticky models.
## License
+1 -1
@@ -1,4 +1,4 @@
Code review against spec `docs/spec/2026-06-15_pi-plan.md`.
Code review against spec `docs/spec/2026-06-15_pi-goals.md`.
---
@@ -1,4 +1,4 @@
# pi-plan — design spec
# pi-goals — design spec
Working title. A pi extension: set up goals (with subtasks and evidence) through plan mode, work them autonomously, and sign a goal off only when a check passes. One markdown file holds everything. The form guides a process; it does not police one. Successor to `pi-lgtm`, deliberately smaller.
+3 -3
@@ -1,13 +1,13 @@
{
"name": "@wassname2/pi-plan",
"name": "@wassname2/pi-goals",
"version": "0.0.1",
"description": "One plan.md: set goals via plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
"description": "One .pi/goals.md: set goals in plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
"author": "wassname",
"license": "MIT",
"type": "module",
"repository": {
"type": "git",
"url": "https://github.com/wassname/pi-plan.git"
"url": "https://github.com/wassname/pi-goals.git"
},
"keywords": [
"pi-package",
+309 -98
@@ -1,37 +1,87 @@
/**
* pi-plan — plan mode that sets up goals with evidence, tracked in one plan.md, signed off by a
* pi-goals — plan mode that sets up goals with evidence, tracked in one .pi/goals.md, signed off by a
* read-only subagent check. A successor to pi-lgtm, kept deliberately small (≈ burneikis/pi-plan
* plus the additions: goals + failure_modes + subtasks, a sign-off check, a widget, a reminder).
* plus the additions: goals + a discriminator + a subtle failure mode + subtasks, a sign-off check,
* a widget, a reminder). A goal's success test is its discriminator: the observation that tells real
* success from the named failure mode.
*
* Philosophy (spec D3): the form guides, it does not gate. The agent edits plan.md with its normal
* Philosophy (spec D3): the form guides, it does not gate. The agent edits goals.md with its normal
* Edit tool. The one blessed tool is CompleteGoal, which runs the sign-off check and records it. The
* reminder + the injected plan + git/widget visibility carry the process; we trust the agent's
* judgement rather than guarding it.
*
* Flow:
* /plan <objective> -> plan mode: agent explores, drafts goals into plan.md (planDrafting guides)
* /goals [objective] -> plan mode (conversational): objective is an optional seed; agent explores
* read-only, asks, then drafts goals into .pi/goals.md (planDrafting guides)
* agent_end -> review menu (Ready / Edit / $EDITOR / Cancel); Ready offers compaction
* execution -> each turn, inject the plan summary (survives compaction) + a reminder;
* agent works goals, ticks subtasks, appends ## Log, calls CompleteGoal
* CompleteGoal -> optional deterministic verify, then a read-only oracle judge -> accept
* flips status:done + logs; reject returns what's missing
*
* All model-facing text lives in prompts.tsx, in flow order.
* The plan file lives at <cwd>/.pi/goals.md (project-local, gitignored, like pi-tasks), not in the
* repo. A fresh /goals draft just replaces it (the "overwrite" staleness rule).
*
* Plan mode is read-only: the tool_call hook blocks edit/write (except goals.md itself) and mutating
* bash while drafting, so code isn't written before the goals are agreed. Read-only bash exploration
* stays open (blocklist, not allowlist).
*
* Not built (FIXME): no plan-vs-exec model switch on accept (plan-model stickiness); noted at its
* call site below.
*
* All model-facing text lives in prompts.ts, in flow order.
*/
import { spawn, spawnSync } from "node:child_process";
import { existsSync, readFileSync, writeFileSync } from "node:fs";
import { basename, join } from "node:path";
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { basename, join, resolve } from "node:path";
import type { ExtensionAPI, ExtensionCommandContext, ExtensionContext } from "@earendil-works/pi-coding-agent";
import { getMarkdownTheme } from "@earendil-works/pi-coding-agent";
import { Container, Markdown, Spacer, Text } from "@earendil-works/pi-tui";
import { Type } from "@sinclair/typebox";
import { counts, findGoal, type Goal, type PlanDoc, parse, recordSignOff, type SignOff } from "./plan-file.js";
import { evidenceJudgeSystem, evidenceJudgeUser, planDrafting, planInjection, reminder } from "./prompts.js";
import {
completeGoalDescription,
completeGoalParamDescription,
evidenceJudgeSystem,
evidenceJudgeUser,
planDrafting,
planInjection,
reminder,
} from "./prompts.js";
const STATE = "pi-plan-state";
const PLAN_CONTEXT = "pi-plan-context"; // injected plan-mode guidance, stripped from history later
const STATUS_KEY = "pi-plan";
const WIDGET_KEY = "pi-plan-widget";
const READ_ONLY_TOOLS = ["read", "grep", "find", "ls", "bash"];
const STATE = "pi-goals-state";
const PLAN_CONTEXT = "pi-goals-context"; // injected plan-mode guidance, stripped from history later
const STATUS_KEY = "pi-goals";
const WIDGET_KEY = "pi-goals-widget";
// Tools the sign-off judge gets: read-only inspection + bash (for git log, cat, running scripts to
// inspect). File mutators (edit, write) are blocked so the judge cannot modify anything.
// Names match pi's internal tool registry (grep→ffgrep, find→fffind, etc.).
const JUDGE_TOOLS = ["read", "bash", "grep", "find", "ls"];
const JUDGE_BLOCKED_TOOLS = ["edit", "write"];
// File mutators blocked while drafting goals (read-only plan mode, like narumiruna/pi-plan-mode), so
// code isn't written before goals are agreed. The one allowed write is goals.md itself (the
// deliverable). A read-only task (a pure search) can still be explored in plan mode by nature.
const PLAN_MODE_BLOCKED_TOOLS = ["edit", "write"];
// bash is dual-use, so block it only when the command looks mutating; read-only exploration (cat, rg,
// git log, running a script to inspect) stays open. Blocklist, not allowlist: keep exploration
// frictionless and just stop the obvious mutators. List adapted from narumiruna/pi-plan-mode; the
// redirect rule catches `> file` / `>> file` / `>| file` but not fd-dups like `2>&1` or `>&2`.
const MUTATING_BASH_PATTERNS: RegExp[] = [
/\b(rm|rmdir|mv|cp|mkdir|touch|chmod|chown|chgrp|ln|tee|truncate|dd)\b/i,
/>\s*[^&\s]/, // redirect to a file (write/append/clobber), excludes 2>&1 and >&2
/\bnpm\s+(install|uninstall|update|ci|link|publish|version)\b/i,
/\byarn\s+(add|remove|install|publish|upgrade)\b/i,
/\bpnpm\s+(add|remove|install|publish|update)\b/i,
/\bbun\s+(add|remove|install|update|publish)\b/i,
/\bpip\s+(install|uninstall)\b/i,
/\buv\s+(add|remove|sync|lock|pip\s+install)\b/i,
/\bgit\s+(add|commit|push|pull|merge|rebase|reset|checkout|switch|stash|cherry-pick|revert|tag|init|clone)\b/i,
/\b(sudo|su|kill|pkill|killall|reboot|shutdown)\b/i,
/\bsystemctl\s+(start|stop|restart|enable|disable)\b/i,
/\b(vim?|nano|emacs|code|subl)\b/i,
];
const PLAN_REL = ".pi/goals.md"; // project-local, gitignored (pi-tasks convention); shown in the widget
interface PlanState {
isPlanMode: boolean;
@@ -40,15 +90,21 @@ interface PlanState {
judgeModel: string | null;
}
export default function piPlanExtension(pi: ExtensionAPI): void {
export default function piGoalsExtension(pi: ExtensionAPI): void {
let state: PlanState = { isPlanMode: false, objective: null, judgeModel: null };
// Reminder cadence: fire when an active goal exists but plan.md was not touched since last turn.
// Reminder cadence: fire when an active goal exists but goals.md was not touched since last turn.
let lastInjectedPlan = "";
// newSession is only on the command-handler context; agent_end's ctx lacks it. Save it from /plan.
// newSession is only on the command-handler context; agent_end's ctx lacks it. Save it from /goals.
let savedCmdCtx: ExtensionCommandContext | null = null;
const planPath = (ctx: ExtensionContext) => join(ctx.cwd, "plan.md");
const planPath = (ctx: ExtensionContext) => join(ctx.cwd, ".pi", "goals.md");
const readPlan = (ctx: ExtensionContext): string => (existsSync(planPath(ctx)) ? readFileSync(planPath(ctx), "utf-8") : "");
// Our programmatic writes (clear, CompleteGoal). The agent creates/edits the file with its own Edit
// tool; this just makes sure .pi/ exists for our writes.
const writePlan = (ctx: ExtensionContext, content: string): void => {
mkdirSync(join(ctx.cwd, ".pi"), { recursive: true });
writeFileSync(planPath(ctx), content);
};
function persist(): void {
pi.appendEntry<PlanState>(STATE, state);
@@ -57,7 +113,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
function updateWidget(ctx: ExtensionContext): void {
if (state.isPlanMode) {
ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("warning", "planning"));
ctx.ui.setWidget(WIDGET_KEY, ["pi-plan: drafting goals", "Write goals to plan.md, then review."]);
ctx.ui.setWidget(WIDGET_KEY, ["pi-goals: drafting goals", `Write goals to ${PLAN_REL}, then review.`]);
return;
}
const doc = parse(readPlan(ctx));
@@ -68,26 +124,26 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
}
const c = counts(doc);
ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("accent", `${c.done}/${doc.goals.length} goals`));
ctx.ui.setWidget(WIDGET_KEY, goalWidgetLines(doc));
ctx.ui.setWidget(WIDGET_KEY, [...goalWidgetLines(doc), ctx.ui.theme.fg("muted", PLAN_REL)]);
}
function goalWidgetLines(doc: PlanDoc): string[] {
const mark: Record<Goal["status"], string> = { done: "✔", active: "▸", open: "◻", cancelled: "✗" };
const lines = [`Plan: ${doc.objective || "(untitled)"}`];
const lines = [`Goals: ${doc.title || "(untitled)"}`];
for (const g of doc.goals) {
if (g.status === "done") continue; // hide finished goals; they stay in the file
const open = g.subtasks.filter((s) => !s.done).length;
lines.push(`${mark[g.status]} ${g.subject}${open ? ` (${open} todo)` : ""}`);
// Show every goal with its status glyph (✔ done, ▸ active, ◻ open, ✗ cancelled) so finished
// goals read as checked off rather than vanishing. Plans are small, so this stays readable.
const total = g.subtasks.length;
const done = g.subtasks.filter((s) => s.status === "done").length;
lines.push(`${mark[g.status]} ${g.subject}${total ? ` (${done}/${total} tasks)` : ""}`);
}
const c = counts(doc);
if (c.done) lines.push(`(${c.done} done, hidden)`);
return lines;
}
// --- plan mode: setup -------------------------------------------------------------------------
pi.registerCommand("plan", {
description: "Plan mode: set up goals (with evidence) in plan.md, then work them. /plan <objective>",
pi.registerCommand("goals", {
description: "Plan mode: set up goals (with evidence) in goals.md, then work them. /goals [objective]",
handler: async (args, ctx) => {
savedCmdCtx = ctx; // ctx here is an ExtensionCommandContext (has newSession); keep it for later
const arg = args.trim();
@@ -99,18 +155,18 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
setJudge(arg.slice("judge".length).trim(), ctx);
return;
}
if (!arg) {
showPlan(ctx);
return;
}
state = { ...state, isPlanMode: true, objective: arg };
// Conversational entry (like narumiruna/pi-plan-mode): /goals enters plan mode and starts a
// dialogue. The objective is an optional seed, not a required arg, so there's no awkward
// "type your objective" prompt; the agent explores read-only and asks before drafting. A
// fresh draft just replaces .pi/goals.md (the "overwrite" staleness rule).
const objective = arg || null;
state = { ...state, isPlanMode: true, objective };
persist();
updateWidget(ctx);
pi.sendUserMessage(
`Enter plan mode for this objective: ${arg}\n\nExplore read-only, then write the plan to ${planPath(ctx)}.`,
{ deliverAs: "followUp" },
);
const seed = objective
? `We're in plan mode. Objective: ${objective}\n\nExplore the repo read-only and ask me anything unclear. When the objective is nailed down, draft (or replace) the goals in ${planPath(ctx)}, then stop for review.`
: `We're in plan mode. Tell me what you want to plan. Explore read-only and ask questions as needed; when the objective is clear, draft the goals in ${planPath(ctx)} and stop for review.`;
pi.sendUserMessage(seed, { deliverAs: "followUp" });
},
});
@@ -122,27 +178,18 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
async function clearPlan(ctx: ExtensionContext): Promise<void> {
if (!existsSync(planPath(ctx))) {
ctx.ui.notify("No plan.md to clear.", "info");
ctx.ui.notify("No goals.md to clear.", "info");
return;
}
if (ctx.hasUI) {
const ok = await ctx.ui.select("Clear plan.md? (it stays in git history)", ["Cancel", "Clear plan.md"]);
if (ok !== "Clear plan.md") return;
const ok = await ctx.ui.select(`Clear ${PLAN_REL}?`, ["Cancel", "Clear goals.md"]);
if (ok !== "Clear goals.md") return;
}
writeFileSync(planPath(ctx), "");
writePlan(ctx, "");
state = { ...state, isPlanMode: false, objective: null };
persist();
updateWidget(ctx);
ctx.ui.notify("Cleared plan.md.", "info");
}
function showPlan(ctx: ExtensionContext): void {
const content = readPlan(ctx);
if (!content.trim()) {
ctx.ui.notify("No plan yet. Use /plan <objective> to start.", "info");
return;
}
ctx.ui.notify(content, "info");
ctx.ui.notify(`Cleared ${PLAN_REL}.`, "info");
}
// --- review loop (after the agent drafts the plan) --------------------------------------------
@@ -150,7 +197,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
async function reviewLoop(ctx: ExtensionContext): Promise<void> {
while (true) {
const doc = parse(readPlan(ctx));
const choice = await ctx.ui.select(`Plan: ${doc.goals.length} goal(s). What next?`, [
const choice = await ctx.ui.select(`Goals: ${doc.goals.length} goal(s). What next?`, [
"Ready — start working the plan",
"Edit — ask the agent to revise",
"Open in $EDITOR",
@@ -158,14 +205,14 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
]);
if (!choice || choice.startsWith("Cancel")) {
exitPlanMode(ctx);
ctx.ui.notify("Left plan mode. plan.md kept.", "info");
ctx.ui.notify("Left plan mode. goals.md kept.", "info");
return;
}
if (choice.startsWith("Ready")) return startExecution(ctx);
if (choice.startsWith("Edit")) {
const changes = await ctx.ui.editor("What should change about the plan?", "");
if (changes?.trim()) {
pi.sendUserMessage(`Revise the plan at ${planPath(ctx)} with these changes, same format:\n\n${changes.trim()}`);
pi.sendUserMessage(`Revise the plan at ${planPath(ctx)} with these changes, same format:\n\n${changes.trim()}`, { deliverAs: "followUp" });
return; // agent_end re-opens the review loop
}
continue;
@@ -184,6 +231,9 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
}
async function startExecution(ctx: ExtensionContext): Promise<void> {
// FIXME(model-switch): the plan phase should be able to run on a sticky plan model and execution
// on a different one (see README "Not yet included"). newSession can't switch the model yet; wire
// this when pi exposes a model override on newSession.
// Offer a clean execution context (D13). newSession lives only on the saved command context.
let fresh = false;
if (ctx.hasUI && savedCmdCtx) {
@@ -193,49 +243,114 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
]);
fresh = choice?.startsWith("A fresh") ?? false;
}
exitPlanMode(ctx);
const doc = parse(readPlan(ctx));
if (doc.objective) pi.setSessionName(`Plan: ${doc.objective}`);
const planFile = planPath(ctx);
const planContent = readPlan(ctx); // captured now: ctx is stale after newSession below
const parentSession = ctx.sessionManager.getSessionFile();
const startMsg = `Work the goals in ${planFile}. Pick an open goal, mark it active (set its checkbox to [/]), work its subtasks, and when its discriminator is satisfied fill the goal's evidence: block then call CompleteGoal with the goal's desc. Keep goals.md current as you go.`;
exitPlanMode(ctx);
if (fresh && savedCmdCtx) {
const result = await savedCmdCtx.newSession({ parentSession: ctx.sessionManager.getSessionFile() });
// After newSession, `ctx`/`pi` bound to the old session are stale; do post-swap work
// through the ReplacedSessionContext passed to withSession (see runner.assertActive).
const result = await savedCmdCtx.newSession({
parentSession,
withSession: async (sessionCtx) => {
// pi.* and the outer ctx are invalidated by newSession; use the fresh sessionCtx only.
// (No setSessionName here: it lives on pi/the outer ctx, both stale now. Cosmetic, skip it.)
sessionCtx.ui.notify(planContent, "info");
await sessionCtx.sendUserMessage(startMsg, { deliverAs: "followUp" });
},
});
if (result.cancelled) {
ctx.ui.notify("Execution cancelled.", "warning");
return;
}
return;
}
pi.sendUserMessage(
`Work the plan in ${planPath(ctx)}. Pick an open goal, set it active, work its subtasks, and when its done_when is met call CompleteGoal with the evidence. Keep plan.md current as you go.`,
{ deliverAs: "followUp" },
);
if (doc.title) pi.setSessionName(`Goals: ${doc.title}`);
ctx.ui.notify(planContent, "info");
pi.sendUserMessage(startMsg, { deliverAs: "followUp" });
}
// --- the one blessed tool: CompleteGoal -------------------------------------------------------
pi.registerTool({
name: "CompleteGoal",
label: "Complete goal",
description:
"Sign off a goal once its done_when is met. Runs the goal's verify command (if any) then a " +
"read-only subagent that inspects your evidence against the repo. On accept, the goal is marked " +
"done and logged; on reject, it stays open and you get what is missing. Point evidence at durable " +
"artifacts (saved logs, committed diffs, files), not claims.",
label: "Goal signoff",
description: completeGoalDescription,
parameters: Type.Object({
goal_id: Type.String({ description: "The goal's <!-- id --> from plan.md" }),
evidence: Type.String({ description: "What shows the done_when is met, and where to verify it" }),
paths: Type.Optional(Type.Array(Type.String(), { description: "Durable artifacts the judge should inspect" })),
goal: Type.String({ description: completeGoalParamDescription }),
}),
async execute(_id, params, signal, _onUpdate, ctx) {
async execute(_id, params, signal, onUpdate, ctx) {
const content = readPlan(ctx);
const goal = findGoal(parse(content), params.goal_id);
if (!goal) return text(`No goal #${params.goal_id} in plan.md.`, true);
const goal = findGoal(parse(content), params.goal);
if (!goal) return text(`No goal "${params.goal}" in goals.md. Use the exact text after "goal:".`, true);
if (goal.evidence.length === 0) {
return text(`Goal "${goal.subject}" has no evidence yet. Add an evidence: list to the goal in goals.md (artifacts + a short read showing the discriminator is satisfied), then call CompleteGoal.`, true);
}
// Decide the outcome (the I/O); recordSignOff applies it to the file (the pure write).
const outcome = await decideSignOff(goal, params.evidence, params.paths ?? [], state.judgeModel, ctx.cwd, signal);
const res = recordSignOff(content, goal.id, stamp(), outcome);
if (res.content !== content) writeFileSync(planPath(ctx), res.content);
const handleUpdate = (partial: { content: Array<{ type: "text"; text: string }>; details: SignOffDetails }) => {
onUpdate?.(partial);
};
const { outcome, reasoning, durationMs } = await decideSignOff(goal, goal.evidence.join("\n"), goal.evidence, state.judgeModel, ctx.cwd, signal, handleUpdate);
const res = recordSignOff(content, goal.subject, stamp(), outcome);
if (res.content !== content) writePlan(ctx, res.content);
updateWidget(ctx);
return text(res.message, res.isError);
const detail = reasoning ? `\n\n--- sign-off judge ---\n${reasoning}` : "";
const outcomeLabel = outcome.kind === "accepted" ? "accepted" : outcome.kind === "verify_failed" ? "verify_failed" : "rejected";
const details: SignOffDetails = {
goal: goal.subject,
outcome: outcomeLabel,
durationMs,
verifyCommand: goal.verify ?? undefined,
verifyExitCode: outcome.kind === "verify_failed" ? outcome.exitCode : undefined,
judgeModel: state.judgeModel ?? undefined,
reasoning,
isError: res.isError,
};
return textWithDetails(res.message + detail, details, res.isError);
},
renderCall(args, theme) {
const goalText = args.goal.length > 80 ? `${args.goal.slice(0, 80)}...` : args.goal;
return new Text(
`${theme.fg("toolTitle", theme.bold("goal signoff "))}${theme.fg("dim", goalText)}`,
0, 0,
);
},
renderResult(result, { expanded }, theme) {
const details = result.details as SignOffDetails | undefined;
const body = result.content[0]?.type === "text" ? result.content[0].text : "(no output)";
if (!details || details.outcome === "running") return new Text(body, 0, 0);
const icon = details.outcome === "accepted" ? theme.fg("success", "✔") : theme.fg("error", "✗");
const outcomeText = details.outcome === "accepted" ? "accepted" : details.outcome === "verify_failed" ? `verify failed (exit ${details.verifyExitCode})` : "rejected";
const header = `${icon} ${theme.fg("toolTitle", theme.bold("goal signoff "))}${theme.fg("accent", outcomeText)}`;
const duration = details.durationMs < 1000 ? `${details.durationMs}ms` : `${(details.durationMs / 1000).toFixed(1)}s`;
const sub = [details.judgeModel, duration].filter(Boolean).join(" · ");
if (!expanded) {
let text = header;
if (sub) text += `\n${theme.fg("dim", sub)}`;
text += `\n\n${theme.fg("toolOutput", body.slice(0, 500))}`;
if (body.length > 500) text += theme.fg("dim", "...");
text += `\n${theme.fg("muted", "(Ctrl+O to expand)")}`;
return new Text(text, 0, 0);
}
const container = new Container();
container.addChild(new Text(header, 0, 0));
if (sub) container.addChild(new Text(theme.fg("dim", sub), 0, 0));
if (details.verifyCommand) {
container.addChild(new Spacer(1));
container.addChild(new Text(theme.fg("muted", `verify: ${details.verifyCommand}`), 0, 0));
}
container.addChild(new Spacer(1));
container.addChild(new Text(theme.fg("muted", "Judge"), 0, 0));
container.addChild(new Markdown(body.trim(), 0, 0, getMarkdownTheme()));
return container;
},
});
@@ -243,6 +358,7 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
pi.on("before_agent_start", async (_event, ctx) => {
if (state.isPlanMode) {
// Read-only is enforced in the tool_call hook below (blocks edit/write while planning).
return { message: { customType: PLAN_CONTEXT, content: `${planDrafting}\n\nWrite the plan to ${planPath(ctx)}.`, display: false } };
}
const doc = parse(readPlan(ctx));
@@ -251,25 +367,48 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
const active = doc.goals.find((g) => g.status === "active") ?? doc.goals.find((g) => g.status === "open") ?? null;
const c = counts(doc);
let body = planInjection({
objective: doc.objective,
title: doc.title,
activeGoal: active
? { subject: active.subject, done_when: active.done_when, openSubtasks: active.subtasks.filter((s) => !s.done).map((s) => s.text) }
? {
subject: active.subject,
discriminator: active.discriminator,
openSubtasks: active.subtasks.filter((s) => s.status !== "done" && s.status !== "cancelled").map((s) => s.text),
}
: null,
lastLogLine: doc.log.at(-1) ?? null,
counts: { done: c.done, open: c.open + c.active },
});
// Reminder fires when there is an active goal but plan.md was untouched since the last turn.
// Reminder fires when there is an active goal but goals.md was untouched since the last turn.
const planNow = readPlan(ctx);
if (active && planNow === lastInjectedPlan) body += `\n\n${reminder}`;
lastInjectedPlan = planNow;
return { message: { customType: PLAN_CONTEXT, content: body, display: false } };
});
// Enforce read-only planning: block file mutators while in plan mode so code isn't written before
// the goals are agreed. The agent falls back to read/grep/find/ls and read-only bash to explore.
pi.on("tool_call", async (event, ctx) => {
if (!state.isPlanMode) return;
// edit/write: blocked, except writing goals.md itself (the deliverable of plan mode).
if (PLAN_MODE_BLOCKED_TOOLS.includes(event.toolName)) {
const target = (event.input as { path?: string }).path;
if (target && resolve(ctx.cwd, target) === resolve(planPath(ctx))) return;
return { block: true, reason: `Plan mode is read-only: agree the goals in ${PLAN_REL} and choose Ready before writing code (${event.toolName} is blocked while planning; only ${PLAN_REL} may be written).` };
}
// bash: blocked only when the command looks mutating; read-only exploration stays open.
if (event.toolName === "bash") {
const command = (event.input as { command?: string }).command ?? "";
if (MUTATING_BASH_PATTERNS.some((re) => re.test(command))) {
return { block: true, reason: `Plan mode is read-only: this bash command looks like it mutates state, so it's blocked while planning. Explore read-only, agree the goals in ${PLAN_REL}, then choose Ready.\nCommand: ${command}` };
}
}
});
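The read-only guard in the `tool_call` hook above can be sketched in isolation. This is a minimal sketch, not the extension's code: `guardToolCall`, the `ToolCall` shape, and the two example patterns are hypothetical, since the real `PLAN_MODE_BLOCKED_TOOLS` and `MUTATING_BASH_PATTERNS` are not shown in this diff. It also assumes paths are already absolute, whereas the hook resolves both against `ctx.cwd` first.

```typescript
// Hypothetical stand-ins; the real lists live elsewhere in the extension.
const BLOCKED_TOOLS: string[] = ["edit", "write"];
const MUTATING_PATTERNS: RegExp[] = [/\brm\s/, /\bgit\s+(commit|push|reset)\b/];

interface ToolCall {
  toolName: string;
  input: { path?: string; command?: string };
}

/** Returns a block reason, or null when the call may proceed. */
function guardToolCall(call: ToolCall, planFile: string): string | null {
  if (BLOCKED_TOOLS.includes(call.toolName)) {
    // Writing the goals file itself is the one allowed mutation while planning.
    // Assumption: both paths are already absolute, so a string compare is enough.
    if (call.input.path && call.input.path === planFile) return null;
    return `${call.toolName} is blocked while planning`;
  }
  if (call.toolName === "bash") {
    const command = call.input.command ?? "";
    // Heuristic only: a pattern match means "looks mutating", not "is mutating".
    if (MUTATING_PATTERNS.some((re) => re.test(command))) return "bash command looks mutating";
  }
  return null;
}
```

The pattern list is a heuristic, which matches the "guide, not guard" stance stated elsewhere in the source: a command that mutates state without matching a pattern would still pass.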
pi.on("agent_end", async (_event, ctx) => {
if (!state.isPlanMode || !ctx.hasUI) return;
const doc = parse(readPlan(ctx));
if (doc.goals.length === 0) {
ctx.ui.notify("No goals found in plan.md yet — ask the agent to draft them.", "warning");
ctx.ui.notify("No goals found in goals.md yet — ask the agent to draft them.", "warning");
return;
}
await reviewLoop(ctx);
@@ -298,15 +437,33 @@ export default function piPlanExtension(pi: ExtensionAPI): void {
// --- helpers (module scope; pure enough to keep out of the closure) -------------------------------
/** Structured details returned by CompleteGoal so renderCall/renderResult can show metadata. */
interface SignOffDetails {
goal: string;
outcome: "accepted" | "rejected" | "verify_failed" | "running";
phase?: string; // "verifying" | "spawning" | "judging" — while running
durationMs: number;
verifyCommand?: string;
verifyExitCode?: number;
judgeModel?: string;
reasoning: string;
isError?: boolean;
}
function text(s: string, isError = false) {
return { content: [{ type: "text" as const, text: s }], details: { isError }, isError };
}
function textWithDetails(s: string, details: SignOffDetails, isError = false) {
return { content: [{ type: "text" as const, text: s }], details, isError };
}
function stamp(): string {
return new Date().toISOString().slice(0, 16).replace("T", " ");
}
/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge. */
/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge.
* Returns the outcome plus the judge's (or verify's) reasoning so CompleteGoal can show WHY. */
async function decideSignOff(
goal: Goal,
evidence: string,
@@ -314,16 +471,30 @@ async function decideSignOff(
judgeModel: string | null,
cwd: string,
signal: AbortSignal | undefined,
): Promise<SignOff> {
onUpdate?: (partial: { content: Array<{ type: "text"; text: string }>; details: SignOffDetails }) => void,
): Promise<{ outcome: SignOff; reasoning: string; durationMs: number }> {
const startedAt = Date.now();
const emit = (phase: string, text: string) => {
onUpdate?.({
content: [{ type: "text" as const, text }],
details: { goal: goal.subject, outcome: "running", phase, durationMs: Date.now() - startedAt, verifyCommand: goal.verify ?? undefined, judgeModel: judgeModel ?? undefined, reasoning: "" },
});
};
let verifyResult: { command: string; exitCode: number; outputTail: string } | null = null;
if (goal.verify) {
emit("verifying", `Running verify: ${goal.verify}`);
verifyResult = runVerify(goal.verify, cwd, signal);
if (verifyResult.exitCode !== 0) {
return { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail };
return {
outcome: { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail },
reasoning: `verify \`${goal.verify}\` exited ${verifyResult.exitCode}:\n${verifyResult.outputTail}`,
durationMs: Date.now() - startedAt,
};
}
}
const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal);
return verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal, onUpdate);
const outcome: SignOff = verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
return { outcome, reasoning: verdict.reasoning, durationMs: Date.now() - startedAt }; // total time, including verify
}
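The ordering in `decideSignOff` (cheap deterministic gate first, model call only if it passes) can be shown with a simplified synchronous sketch. The name `decide` and the injected callbacks are hypothetical. The real function is async, reports progress through `onUpdate`, and carries `outputTail`, reasoning, and timing, all of which are left out here.

```typescript
// Simplified outcome type; the diff's SignOff also carries outputTail on verify_failed.
type SignOffSketch =
  | { kind: "verify_failed"; exitCode: number }
  | { kind: "rejected"; missing: string }
  | { kind: "accepted" };

interface Verdict {
  accept: boolean;
  missing: string;
}

/**
 * runVerify: returns the verify command's exit code, or null when the goal has no verify.
 * runJudge: the expensive check; only called when verify is absent or exits 0.
 */
function decide(runVerify: (() => number) | null, runJudge: () => Verdict): SignOffSketch {
  if (runVerify) {
    const exitCode = runVerify();
    if (exitCode !== 0) return { kind: "verify_failed", exitCode };
  }
  const verdict = runJudge();
  return verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
}
```

A failing verify short-circuits before the judge runs, so a goal whose command exits non-zero never costs a model call.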
/** Run the goal's verify command. It is agent-authored and trusted (single-user machine, guide-not-guard). */
@@ -351,33 +522,73 @@ async function runJudge(
judgeModel: string | null,
cwd: string,
signal: AbortSignal | undefined,
): Promise<{ accept: boolean; missing: string }> {
onUpdate?: (partial: { content: Array<{ type: "text"; text: string }>; details: SignOffDetails }) => void,
): Promise<{ accept: boolean; missing: string; reasoning: string; durationMs: number }> {
const startedAt = Date.now();
const emit = (phase: string, text: string) => {
onUpdate?.({
content: [{ type: "text" as const, text }],
details: { goal: goal.subject, outcome: "running", phase, durationMs: Date.now() - startedAt, verifyCommand: goal.verify ?? undefined, judgeModel: judgeModel ?? undefined, reasoning: "" },
});
};
const task = evidenceJudgeUser({
subject: goal.subject,
done_when: goal.done_when,
discriminator: goal.discriminator,
failure_modes: goal.failure_modes,
verify: goal.verify ?? null,
verifyResult,
failure_modes: goal.failure_modes,
evidence,
paths,
});
const args = ["-p", "--no-session", "--tools", READ_ONLY_TOOLS.join(","), "--append-system-prompt", evidenceJudgeSystem];
const args = ["-p", "--no-session", "--tools", JUDGE_TOOLS.join(","), "--exclude-tools", JUDGE_BLOCKED_TOOLS.join(","), "--append-system-prompt", evidenceJudgeSystem];
if (judgeModel) args.push("--model", judgeModel);
args.push(task);
emit("spawning", `Spawning read-only judge for: ${goal.subject}`);
const inv = getPiInvocation(args);
// FIXME(side-effect): pi -p --no-session clones the repo into the PARENT of cwd (so alongside
// the working dir), leaving a stale directory. The judge should run in a temp dir or inside the
// existing repo checkout so it doesn't pollute the user's workspace.
const JUDGE_TIMEOUT_MS = 120_000;
const output = await new Promise<string>((resolve) => {
let settled = false;
const timer = setTimeout(() => {
if (!settled) {
settled = true;
proc.kill();
resolve(`VERDICT: reject\nmissing: judge timed out after ${JUDGE_TIMEOUT_MS / 1000}s`);
}
}, JUDGE_TIMEOUT_MS);
const proc = spawn(inv.command, inv.args, { cwd, shell: false, stdio: ["ignore", "pipe", "pipe"], signal });
let out = "";
proc.stdout.on("data", (d) => (out += d));
proc.stderr.on("data", (d) => (out += d));
proc.on("close", () => resolve(out));
proc.on("error", (e) => resolve(`VERDICT: reject\nmissing: judge subprocess failed: ${e.message}`));
proc.on("close", () => {
if (!settled) {
settled = true;
clearTimeout(timer);
resolve(out);
}
});
proc.on("error", (e) => {
if (!settled) {
settled = true;
clearTimeout(timer);
resolve(`VERDICT: reject\nmissing: judge subprocess failed: ${e.message}`);
}
});
});
const verdictLine = output.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
// The subprocess emits ANSI/CSI control codes in -p mode; strip them so they don't leak into `missing`.
const clean = output.replace(/\u001b\[[0-9;?]*[ -/]*[@-~]/g, "");
const verdictLine = clean.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
const accept = /accept/i.test(verdictLine);
const missingMatch = output.match(/missing\s*:\s*([\s\S]*)$/i);
const missing = accept ? "" : (missingMatch?.[1].trim() || output.trim().slice(-500) || "judge gave no reason");
return { accept, missing };
const missingMatch = clean.match(/missing\s*:\s*([\s\S]*)$/i);
const missing = accept ? "" : (missingMatch?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason");
// The judge's own words (inspection + verdict), so CompleteGoal can show them. The verdict is at the
// end, so keep the tail when it's long.
const trimmed = clean.trim();
const reasoning = trimmed.length > 1800 ? `...\n${trimmed.slice(-1800)}` : trimmed;
return { accept, missing, reasoning, durationMs: Date.now() - startedAt };
}
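The verdict parsing at the end of `runJudge` can be exercised on its own. In this minimal sketch the regexes are copied from the diff above, while the wrapper name `parseVerdict` and the sample strings are hypothetical and the 1800-character reasoning tail is omitted.

```typescript
// Matches ANSI CSI sequences such as "\u001b[32m" (copied from the diff above).
const ANSI_CSI = /\u001b\[[0-9;?]*[ -/]*[@-~]/g;

function parseVerdict(raw: string): { accept: boolean; missing: string } {
  // Strip control codes first so they cannot leak into the verdict line or `missing`.
  const clean = raw.replace(ANSI_CSI, "");
  const verdictLine = clean.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
  const accept = /accept/i.test(verdictLine);
  const missingMatch = clean.match(/missing\s*:\s*([\s\S]*)$/i);
  // Fallback chain: explicit "missing:" text, else the output tail, else a fixed note.
  const missing = accept ? "" : (missingMatch?.[1].trim() || clean.trim().slice(-500) || "judge gave no reason");
  return { accept, missing };
}
```

With this logic, output that has no `VERDICT:` line counts as a reject, so a crashed or truncated judge cannot accept a goal by accident.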
+140 -87
@@ -1,100 +1,135 @@
/**
* plan-file.ts — read plan.md, and the two writes CompleteGoal needs. That is all.
* plan-file.ts — read goals.md, and the two writes CompleteGoal needs. That is all.
*
* Pure module, no pi deps, so it unit-tests without a runtime. The file is the canonical store and
* the agent edits it with its normal Edit tool (create goals, tick subtasks, append log), guided by
* the format in prompts.tsx and the reminder -- the form guides, it does not gate (spec D3). So this
* module does NOT render or create goals; the format's single source of truth is the planDrafting
* prompt. The only programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to
* record an accepted sign-off; both touch one line so the git diff stays readable.
* the agent edits it with its normal Edit tool (create goals, tick subtasks, fill evidence), guided
* by the format in prompts.ts and the reminder -- the form guides, it does not gate. The only
* programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to record an accepted
* sign-off; both touch one line so the diff stays readable.
*
* Format (spec §4):
* Format (markdown, checkbox-first, made to be skim-reviewed by a human):
*
* # Plan: <objective>
* # <plan title>
*
* ## Goal: <subject>
* <!-- id: <slug> -->
* status: open | active | done | cancelled
* done_when: <falsifiable check; plus the symptom if NOT met>
* verify: <shell command, optional>
* failure_modes:
* - <pre-mortem item>
* - [ ] <subtask>
* <context: the user's ask, preferences, decisions>
*
* ## Goals
*
* 1. [ ] goal: <desc> <- state in the checkbox: [ ] open [/] active [x] done [-] cancelled
* - discriminator: <positive observation that the goal succeeded, that no failure below could fake>
* - subtle failure mode: <a way this looks done but isn't>
* - verify: <optional shell command that exits 0 only when the discriminator passes>
* - tasks:
* 1. [x] <subtask> <- a subtask is any checkbox WITHOUT a "goal:" prefix
* 2. [/] <subtask>
* 3. [-] <subtask> <- [-] or ~~[ ]~~ both read as cancelled
* - evidence: <- empty at planning; filled at sign-off, read by CompleteGoal
* - > <artifact path / link / metric, plus a short read of it>
* 2. [ ] goal: <desc>
*
* # Future work / out of scope
*
* ## Log
* - <verbatim append-only line>
*
* A goal/subtask's state lives in its checkbox (single source of truth, renders natively). Goals are
* matched by their <desc> (the text after "goal:"); the list number is human-facing only. Only
* CompleteGoal writes a goal's [x]; the agent sets [/] when it starts one.
*/
export type GoalStatus = "open" | "active" | "done" | "cancelled";
export interface Subtask {
text: string;
done: boolean;
status: GoalStatus;
}
export interface Goal {
id: string;
/** The text after "goal:" in the header line; the handle CompleteGoal matches on. */
subject: string;
status: GoalStatus;
done_when: string;
verify?: string;
/** Positive observation(s) that the goal succeeded AND that no failure mode could fake. The success test. Written at planning. */
discriminator: string[];
/** Subtle ways a "done" could be wrong (look-like-success failures). Written at planning. */
failure_modes: string[];
/** Optional command that exits 0 only when the discriminator passes (the cheap deterministic gate). */
verify?: string;
/** Proof the discriminator passed, pointing at durable artifacts. Written at completion; read by CompleteGoal. */
evidence: string[];
subtasks: Subtask[];
}
export interface PlanDoc {
objective: string;
title: string;
goals: Goal[];
/** Verbatim ## Log lines, including the leading "- ". */
log: string[];
}
const GOAL_HEADER = /^##\s+Goal:\s*(.*)$/;
const ANY_HEADER = /^#{1,6}\s/;
const TITLE = /^#\s+(.+?)\s*$/; // the first single-# H1
const GOALS_HEADER = /^##\s+Goals\s*$/i;
const LOG_HEADER = /^##\s+Log\s*$/i;
const ID_COMMENT = /^<!--\s*id:\s*(.+?)\s*-->$/;
const CHECKBOX = /^- \[([ xX])\]\s+(.*)$/;
const ANY_HEADER = /^#{1,6}\s/;
// A goal: a numbered or bulleted checkbox item whose text begins "goal:".
const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i;
// A section marker bullet under a goal (the trailing colon is optional, e.g. "- tasks").
const KEY_LINE = /^\s*[-*]\s*(discriminator|subtle failure modes?|failure_modes?|verify|tasks?|evidence)\s*:?\s*(.*)$/i;
// Any list item (numbered or bulleted); used for subtasks and for list items inside the sections.
const LIST_ITEM = /^\s*(?:\d+\.|[-*])\s+(.*)$/;
// A checkbox inside a list-item body (subtask). A leading/trailing ~~ marks it cancelled.
const CHECKBOX_BODY = /^(~~)?\s*\[([ xX/-])\]\s*(.*)$/;
const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" };
const STATUS_TO_CHAR: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };
function normalizeKey(raw: string): "discriminator" | "failure_modes" | "verify" | "tasks" | "evidence" {
const k = raw.toLowerCase();
if (k.startsWith("discriminator")) return "discriminator";
if (k.startsWith("verify")) return "verify";
if (k.startsWith("task")) return "tasks";
if (k.startsWith("evidence")) return "evidence";
return "failure_modes"; // "subtle failure mode(s)" / "failure_mode(s)"
}
export function parse(text: string): PlanDoc {
const lines = text.split("\n");
let objective = "";
let title = "";
const goals: Goal[] = [];
const log: string[] = [];
let cur: Goal | null = null;
let inFailureModes = false;
let curList: string[] | null = null; // the discriminator/failure_modes/evidence list "- " items append to
let inGoals = false;
let inLog = false;
const flush = () => {
if (cur) goals.push(cur);
cur = null;
inFailureModes = false;
curList = null;
};
for (const line of lines) {
const objMatch = /^#\s+Plan:\s*(.*)$/.exec(line);
if (objMatch) {
objective = objMatch[1].trim();
const tM = TITLE.exec(line);
if (tM && !title && !GOALS_HEADER.test(line) && !LOG_HEADER.test(line)) {
title = tM[1].trim();
continue;
}
const goalMatch = GOAL_HEADER.exec(line);
if (goalMatch) {
if (GOALS_HEADER.test(line)) {
flush();
inGoals = true;
inLog = false;
cur = { id: "", subject: goalMatch[1].trim(), status: "open", done_when: "", failure_modes: [], subtasks: [] };
continue;
}
if (LOG_HEADER.test(line)) {
flush();
inGoals = false;
inLog = true;
continue;
}
// Any other header ends the current goal / log section.
// Any other header (e.g. "# Future work") ends the goals / log section.
if (ANY_HEADER.test(line)) {
flush();
inGoals = false;
inLog = false;
continue;
}
@@ -103,50 +138,70 @@ export function parse(text: string): PlanDoc {
if (/^\s*-\s+/.test(line)) log.push(line);
continue;
}
if (!inGoals) continue; // title + context prose between the title and ## Goals
const goalM = GOAL_ITEM.exec(line);
if (goalM) {
flush();
cur = {
subject: goalM[2].trim(),
status: CHAR_TO_STATUS[goalM[1].toLowerCase()] ?? "open",
discriminator: [],
failure_modes: [],
evidence: [],
subtasks: [],
};
continue;
}
if (!cur) continue;
const idMatch = ID_COMMENT.exec(line.trim());
if (idMatch) {
cur.id = idMatch[1];
const keyM = KEY_LINE.exec(line);
if (keyM) {
const key = normalizeKey(keyM[1]);
const inlineVal = keyM[2].trim();
if (key === "verify") {
cur.verify = inlineVal || undefined;
curList = null;
} else if (key === "tasks") {
curList = null; // subtasks are identified by being a checkbox; this marker is cosmetic
} else {
curList = cur[key]; // discriminator | failure_modes | evidence
if (inlineVal) curList.push(inlineVal);
}
continue;
}
// A checkbox (column 0) is a subtask; checked first so it is never read as a failure mode.
const checkbox = CHECKBOX.exec(line);
if (checkbox) {
inFailureModes = false;
cur.subtasks.push({ done: checkbox[1].toLowerCase() === "x", text: checkbox[2].trim() });
continue;
}
const kv = /^(status|done_when|verify|failure_modes)\s*:\s*(.*)$/.exec(line);
if (kv) {
const [, key, value] = kv;
if (key === "status") cur.status = value.trim() as GoalStatus;
else if (key === "done_when") cur.done_when = value.trim();
else if (key === "verify") cur.verify = value.trim() || undefined;
else if (key === "failure_modes") inFailureModes = true;
continue;
}
// Indented "- " items under failure_modes: (a column-0 checkbox already returned above).
if (inFailureModes) {
const fm = /^\s*-\s+(.*)$/.exec(line);
if (fm) {
cur.failure_modes.push(fm[1].trim());
const listM = LIST_ITEM.exec(line);
if (listM) {
const body = listM[1];
const cb = CHECKBOX_BODY.exec(body);
if (cb) {
// A checkbox without a "goal:" prefix is a subtask of the current goal.
const cancelled = cb[1] === "~~" || body.includes("~~");
const status = cancelled ? "cancelled" : (CHAR_TO_STATUS[cb[2].toLowerCase()] ?? "open");
cur.subtasks.push({ text: cb[3].replace(/~~/g, "").trim(), status });
curList = null;
continue;
}
if (line.trim() !== "") inFailureModes = false;
// A plain "- " / "> " item belongs to the current section (discriminator/failure/evidence).
if (curList) curList.push(body.trim());
continue;
}
// A non-empty, non-"- " line continues the current item, so multi-line evidence (a block quote
// of a log, a table, an interpretation line) stays attached to its item. Blank lines are skipped.
if (curList && line.trim() !== "" && curList.length > 0) {
curList[curList.length - 1] += `\n${line.trim()}`;
}
}
flush();
return { objective, goals, log };
return { title, goals, log };
}
export function findGoal(doc: PlanDoc, id: string): Goal | undefined {
return doc.goals.find((g) => g.id === id);
export function findGoal(doc: PlanDoc, subject: string): Goal | undefined {
const want = subject.trim();
return doc.goals.find((g) => g.subject === want);
}
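The checkbox-first format means a goal's state can be read straight off its header line. A minimal sketch follows: `GOAL_ITEM` and `CHAR_TO_STATUS` are copied from the diff above, while `listGoals` and the sample document are hypothetical and cover only the goal lines, not sections, subtasks, or the log.

```typescript
type GoalStatus = "open" | "active" | "done" | "cancelled";

// A numbered or bulleted checkbox item whose text begins "goal:".
const GOAL_ITEM = /^\s*(?:\d+\.|[-*])\s*\[([ xX/-])\]\s*goal:\s*(.*)$/i;
const CHAR_TO_STATUS: Record<string, GoalStatus> = { " ": "open", "/": "active", x: "done", "-": "cancelled" };

function listGoals(text: string): Array<{ subject: string; status: GoalStatus }> {
  const out: Array<{ subject: string; status: GoalStatus }> = [];
  for (const line of text.split("\n")) {
    const m = GOAL_ITEM.exec(line);
    // Checkboxes without the "goal:" prefix (subtasks) do not match and are skipped.
    if (m) out.push({ subject: m[2].trim(), status: CHAR_TO_STATUS[m[1].toLowerCase()] ?? "open" });
  }
  return out;
}

const sample = [
  "# Ship the parser",
  "",
  "## Goals",
  "",
  "1. [/] goal: parse checkboxes",
  "   - tasks:",
  "     1. [x] write regex",
  "2. [ ] goal: flip status",
].join("\n");

const goalsFound = listGoals(sample);
```

Since goals are matched by the text after `goal:`, two goals with identical descriptions would be indistinguishable to `findGoal`, which returns the first match.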
export function counts(doc: PlanDoc): { done: number; open: number; active: number } {
@@ -159,20 +214,18 @@ export function counts(doc: PlanDoc): { done: number; open: number; active: numb
return c;
}
/** Flip a goal's `status:` line in place (the one write CompleteGoal needs). */
export function setGoalStatus(text: string, id: string, status: GoalStatus): string {
/** Flip a goal's checkbox in place, matched by its subject (the one write CompleteGoal needs). */
export function setGoalStatus(text: string, subject: string, status: GoalStatus): string {
const lines = text.split("\n");
let i = lines.findIndex((l) => ID_COMMENT.test(l.trim()) && ID_COMMENT.exec(l.trim())?.[1] === id);
if (i === -1) throw new Error(`Goal #${id} not found`);
for (; i < lines.length; i++) {
if (i > 0 && ANY_HEADER.test(lines[i]) && !GOAL_HEADER.test(lines[i]) && !LOG_HEADER.test(lines[i])) break;
const kv = /^(status\s*:\s*)(.*)$/.exec(lines[i]);
if (kv) {
lines[i] = `${kv[1]}${status}`;
const want = subject.trim();
for (let i = 0; i < lines.length; i++) {
const m = GOAL_ITEM.exec(lines[i]);
if (m && m[2].trim() === want) {
lines[i] = lines[i].replace(/\[[ xX/-]\]/, `[${STATUS_TO_CHAR[status]}]`);
return lines.join("\n");
}
}
throw new Error(`Goal #${id} has no status: line`);
throw new Error(`Goal "${subject}" not found`);
}
/**
@@ -184,28 +237,28 @@ export type SignOff =
| { kind: "rejected"; missing: string }
| { kind: "accepted" };
/** Apply a sign-off outcome to plan.md text: accept flips status + logs; reject only logs. Pure. */
/** Apply a sign-off outcome to goals.md text: accept flips the goal checkbox to [x] + logs; reject only logs. Pure. */
export function recordSignOff(
text: string,
goalId: string,
subject: string,
when: string,
outcome: SignOff,
): { content: string; message: string; isError: boolean } {
const goal = findGoal(parse(text), goalId);
if (!goal) return { content: text, message: `No goal #${goalId} in plan.md.`, isError: true };
const goal = findGoal(parse(text), subject);
if (!goal) return { content: text, message: `No goal "${subject}" in goals.md.`, isError: true };
if (outcome.kind === "verify_failed") {
const content = appendLog(text, `${when} reject #${goalId}: verify exit ${outcome.exitCode}`);
const content = appendLog(text, `${when} reject "${subject}": verify exit ${outcome.exitCode}`);
return { content, message: `Sign-off rejected: verify failed (exit ${outcome.exitCode}).\n${outcome.outputTail}`, isError: true };
}
if (outcome.kind === "rejected") {
const oneLine = outcome.missing.replace(/\s+/g, " ").trim().slice(0, 200);
const content = appendLog(text, `${when} reject #${goalId}: ${oneLine}`);
const content = appendLog(text, `${when} reject "${subject}": ${oneLine}`);
return { content, message: `Sign-off rejected. Missing:\n${outcome.missing}`, isError: true };
}
const flipped = setGoalStatus(text, goalId, "done");
const content = appendLog(flipped, `${when} signed off #${goalId}: ${goal.subject} (oracle accept)`);
return { content, message: `Signed off #${goalId}: ${goal.subject}. Marked done in plan.md.`, isError: false };
const flipped = setGoalStatus(text, subject, "done");
const content = appendLog(flipped, `${when} signed off "${subject}" (judge accept)`);
return { content, message: `Signed off "${subject}". Marked done in goals.md.`, isError: false };
}
/** Append one verbatim line to ## Log (creating the section if absent). The other CompleteGoal write. */
+151 -80
@@ -1,91 +1,134 @@
/**
* pi-plan — all model-facing text, in flow order.
* pi-goals — all model-facing text, in flow order.
*
* Philosophy: the form guides a process; it does not police one. The agent can
* edit plan.md freely. These prompts + the plan.md structure make the right path
* edit goals.md freely. These prompts + the goals.md structure make the right path
* the easy path. The only step that is genuinely rigorous is the evidence judge
* (6), and even that is reached by guiding the agent to call CompleteGoal, not by
* (7), and even that is reached by guiding the agent to call CompleteGoal, not by
* trapping it. Bypasses stay visible in the git diff and the widget.
*
* Flow:
* SETUP (plan mode) 1. planDrafting — strong/sticky model drafts goals
* Flow (this file is ordered the way the agent meets each text, so it reads as one pass):
* SETUP (plan mode) 1. planDrafting — drafts goals (read-only phase)
* EXEC, each turn start 2. planInjection — "here is your plan, where you are"
* EXEC, periodic 3. reminder — the typed nudge that drives upkeep + autonomy
* EXEC, loop continue 4. continuation — keep going toward the active goal
* EXEC, after each turn 5. loopJudge — continue / pause (cheap, foolable, ok)
* SIGN-OFF 6. evidenceJudge — read-only verify (rigorous; the one real check)
* SIGN-OFF, agent-side 6. completeGoalTool — the CompleteGoal tool desc + param the agent reads
* SIGN-OFF, judge-side 7. evidenceJudge — read-only verify (rigorous; the one real check)
*
* Read top to bottom to see the whole process. 5 and 6 are kept adjacent on
* purpose: the cheap-foolable vs must-not-be-fooled contrast is the design.
* Read top to bottom to see the whole process. 5 and 7 embody the design contrast:
* the cheap-foolable loop gate vs the must-not-be-fooled sign-off.
*
* WIRED in index.ts: 1 planDrafting, 2 planInjection, 3 reminder, 6 evidenceJudge.
* WIRED in index.ts: 1 planDrafting, 2 planInjection, 3 reminder, 6 completeGoalTool, 7 evidenceJudge.
* NOT YET WIRED: 4 continuation and 5 loopJudge define the autonomous re-prompt loop, which is
* intentionally not built in v1 (an until-done-style loop was judged too complex). They stay here so
* the full intended flow is reviewable; wire them if/when the loop is added.
*
* The goal's test is the DISCRIMINATOR: the concrete observation that tells real success from the
* named subtle failure mode. It replaces a vague "done_when". Evidence is empty at planning and
* filled at sign-off (you don't always know the exact artifacts up front; the judge checks them then).
*/
/* ─────────────────────────────────────────────────────────────────────────
* 1. planDrafting — SETUP, plan mode
*
* System guidance for the plan-phase agent. Runs on the plan model (may differ
* from the execution model; the choice is sticky — see oracle.json-style config).
* This phase is read-only: explore, then draft goals into plan.md. No code yet.
* The field requirements here are the whole "elicitation" — get them agreed up
* front, because the human reviews this output before any execution.
* System guidance for the plan-phase agent. This phase is read-only (edit/write
* and mutating bash are blocked by a tool hook): explore, then draft goals into
* goals.md. The fields here are the whole "elicitation"; the human reviews this
* output before any execution.
* ──────────────────────────────────────────────────────────────────────── */
export const planDrafting = `\
You are in plan mode. Explore the repository read-only, then draft a plan into plan.md.
Do not write or run code in this phase. Produce goals the human will review and approve.
You are in plan mode. The objective may arrive through conversation, not as one up-front command.
Explore the repository read-only first, then ask: resolve discoverable facts by looking them up, and
only ask the human when the answer is a genuine intent or preference choice that exploration can't
settle. Don't write goals that branch on something you could just check. Do not write or run code in
this phase (edit and write are blocked, and so is mutating bash). If the ask is itself read-only
(e.g. research, a search, a report), explore enough to scope it, but leave the actual deliverable for
after the human approves the plan. When the objective is clear, draft goals into goals.md and stop
for review. Produce a plan the human will review and approve.
Write each goal in this shape:
Right-size it, don't force structure that isn't there:
- Default to ONE goal. Add another only when it's a genuinely separate checkpoint you'd want signed
off on its own (it can pass or fail independently). Most objectives are 1-2 goals.
- Subtasks are the steps inside a goal. Add them when a goal has 3+ distinct steps; skip them for a
single-action goal. Don't pad with trivial steps.
- Don't invent goals to look thorough. When in doubt, merge.
## Goal: <one short imperative line>
status: open
done_when: <a falsifiable check, plus the symptom you'd see if it's NOT met>
verify: <a shell command that exits 0 only when the goal is met — include this whenever
success is expressible as tests/lint/build/a threshold; omit it otherwise>
failure_modes:
- <a concrete way this could look done but isn't>
- <another>
- <if verify exists: "verify passes on a trivial or gamed test">
- [ ] <first subtask>
- [ ] <next subtask>
Write the whole file in this shape (markdown checkboxes, made to be skim-reviewed):
Rules for a good plan:
- Keep goals small enough that done_when is checkable in one sitting.
- done_when must be falsifiable. "Works well" is not a criterion; "p95 < 50ms on bench-X,
else timeouts in load-test.log" is.
- failure_modes are a pre-mortem: the cheap, specific ways a later "done" could be wrong.
This is the highest-value part — it shapes what evidence you'll collect.
- Prefer a verify command. A green deterministic check is worth more than a paragraph of
description, and it's the first thing checked at sign-off.
# <short plan title>
When the plan is drafted, present it and stop for review. Do not begin execution.`;
<context: restate the user's ask, their stated preferences, and any decisions you've agreed on>
## Goals
1. [ ] goal: <one short imperative line>
- subtle failure mode: <a way this could look done but isn't>
- discriminator: <the concrete observation that tells real success from that failure>
- verify: <optional shell command that exits 0 only when the discriminator passes; omit if not testable>
- tasks:
1. [ ] <subtask>
2. [ ] <subtask>
- evidence:
- <leave empty now; filled at sign-off>
2. [ ] goal: <...>
# Future work / out of scope
- <anything deliberately not in these goals>
## Log
Keep it lean and legible:
- A goal is a checkbox line beginning "goal:"; its state is the checkbox ([ ] open, [/] active, [x]
done, [-] cancelled). Leave goals [ ] at planning. The number is just for the human to reference.
- subtle failure mode + discriminator are the heart of this. List the ways a "done" could look
achieved but not be (empty/zero-count output, a silently-errored step, a gamed test, a flat/no-op
result that dodged every trap and still showed nothing; these are examples, find the ones that fit).
- The discriminator is the POSITIVE observation that the goal actually succeeded AND that none of
those failure modes could have produced. It must show success happened -- the count moved the right
way, the test really exercised the path, the metric beat noise -- not merely that a failure was
ruled out: avoiding every failure mode is necessary, not sufficient. Name the success signal first,
then check it isn't something a failure mode could fake. Keep it terse.
- The discriminator is the success test, written now, in place of a vague "done": make it a concrete,
checkable observation about a real artifact (a file, a test result, a committed diff, a metric), not
about goals.md's own checkbox.
- subtasks: any checkbox WITHOUT a "goal:" prefix, under "- tasks:". Use [/] for in progress and [-]
for cancelled/impossible.
- verify: prefer one when the discriminator is a test, build, threshold, or metric: a green check or
a printed number beats prose. Omit it otherwise.
- evidence stays empty at planning. You don't always know the exact artifacts up front, and that's
fine: you fill evidence at sign-off, and a fresh read-only judge checks it then.
When the goals are drafted, present them and stop for review. Do not begin execution.`;
/* ─────────────────────────────────────────────────────────────────────────
* 2. planInjection — EXEC, injected at each agent start (and after compaction)
*
* A late user-role message, NOT a system-prompt mutation (keeps the prefix cache
* valid). Built from the parsed plan. MUST be byte-identical when nothing changed:
* fixed field order, no volatile timestamps in the body. Pass only the active
* goal + its open subtasks + the last log line not the whole file.
* fixed field order, no volatile timestamps. Pass only the active goal + its open
* subtasks + the last log line, not the whole file.
* ──────────────────────────────────────────────────────────────────────── */
export function planInjection(p: {
objective: string;
activeGoal: { subject: string; done_when: string; openSubtasks: string[] } | null;
title: string;
activeGoal: { subject: string; discriminator: string[]; openSubtasks: string[] } | null;
lastLogLine: string | null;
counts: { done: number; open: number };
}): string {
if (!p.activeGoal) {
return `Plan (plan.md): ${p.objective}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal or run /plan.`;
// FIXME(heading): user wants the heading to show ".pi/goals.md: <title>" so the filename is explicit
// even in the injection. Currently says "Goals (goals.md):" which is close but not the same.
return `.pi/goals.md: ${p.title}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal (set its checkbox to [/]) or run /goals.`;
}
const subtasks = p.activeGoal.openSubtasks.length
? p.activeGoal.openSubtasks.map((s) => ` - [ ] ${s}`).join("\n")
: " (no open subtasks)";
const disc = p.activeGoal.discriminator.length ? p.activeGoal.discriminator.join("; ") : "(none set)";
return `\
Plan (plan.md): ${p.objective}
.pi/goals.md: ${p.title}
Active goal: ${p.activeGoal.subject}
done_when: ${p.activeGoal.done_when}
discriminator (the success test): ${disc}
Open subtasks:
${subtasks}
Last log: ${p.lastLogLine ?? "(none yet)"}
@@ -96,19 +139,20 @@ Progress: ${p.counts.done} done, ${p.counts.open} open.`;
* 3. reminder — EXEC, periodic system-reminder
*
* The typed nudge. This is both the housekeeping and the autonomy engine — it is
* what makes the process get followed without a hard gate. Fires after N
* file-modifying turns since the last plan.md update while a goal is active.
* Keep the wording stable so it doesn't thrash the cache.
* what makes the process get followed without a hard gate. Fires after a turn that
* left goals.md untouched while a goal is active. Keep the wording stable so it
* doesn't thrash the cache.
* ──────────────────────────────────────────────────────────────────────── */
export const reminder = `\
<system-reminder>
Keep plan.md current as you work:
- tasks: tick the subtasks you've finished; add any new ones you've discovered.
- log: append ONE short line to ## Log (append don't rewrite earlier lines).
- goal: if the active goal's evidence is in, sign it off by calling CompleteGoal with that
evidence. Don't edit status to done by hand — CompleteGoal runs the check and records it.
- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely
blocked; if blocked, say what's blocking and why.
Keep goals.md current as you work:
- tasks: tick the subtasks you've finished ([/] for in progress); add any you've discovered.
- log: append ONE short line to ## Log (append, don't rewrite earlier lines).
- goal: when the active goal's discriminator is satisfied, fill its evidence: block in goals.md (a
list pointing at durable artifacts), then call CompleteGoal with the goal's desc. Don't tick the
goal [x] by hand; CompleteGoal reads the evidence, runs the check, and writes [x].
- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely blocked;
if blocked, say what's blocking it.
</system-reminder>`;
/* ─────────────────────────────────────────────────────────────────────────
@@ -118,9 +162,9 @@ Keep plan.md current as you work:
* continue. Does not mutate the system prompt, so the cache holds.
* ──────────────────────────────────────────────────────────────────────── */
export const continuation = `\
Continue toward the active goal in plan.md. If it now meets its done_when, call CompleteGoal
with your evidence (point to durable artifacts saved logs, committed diffs, files not just
claims). If you're blocked, state what's blocking it.`;
Continue toward the active goal in goals.md. If its discriminator is now satisfied, fill the goal's
evidence: block (durable artifacts, e.g. saved logs, committed diffs, files, not just claims) and
then call CompleteGoal with the goal's desc. If you're blocked, state what's blocking it.`;
/* ─────────────────────────────────────────────────────────────────────────
* 5. loopJudge — EXEC, runs after each turn to decide continue / pause
@@ -133,14 +177,14 @@ claims). If you're blocked, state what's blocking it.`;
export const loopJudgeSystem = `\
You decide whether an autonomous coding agent should keep working or pause for the human.
Be conservative: only pause when the work is plainly finished or plainly blocked. When in
doubt, continue. You are not verifying correctness a later read-only judge does that.
doubt, continue. You are not verifying correctness; a later read-only judge does that.
Reply with ONLY a JSON object, no other text: {"done": boolean, "reason": "<one sentence>"}.
Set done=true only if the agent's last message shows the active goal's done_when is met, or
the agent says it is blocked and needs the human.`;
Set done=true only if the agent's last message shows the active goal's discriminator is satisfied,
or the agent says it is blocked and needs the human.`;
export function loopJudgeUser(p: { activeGoalDoneWhen: string; lastResponse: string }): string {
export function loopJudgeUser(p: { discriminator: string; lastResponse: string }): string {
return `\
Active goal done_when: ${p.activeGoalDoneWhen}
Active goal discriminator (the success test): ${p.discriminator}
Agent's last message:
"""
@@ -151,24 +195,50 @@ ${p.lastResponse}
}
/* ─────────────────────────────────────────────────────────────────────────
* 6. evidenceJudge — SIGN-OFF, the one rigorous check
* 6. completeGoalTool — SIGN-OFF, agent-side
*
* Runs inside CompleteGoal, on the read-only oracle subprocess (fresh context,
* strongest reasoning on the chosen provider; override to a different vendor for
* high-stakes goals). It re-derives from the repo rather than trusting the
* agent's transcription, and it judges whether a verify command actually tests
* the criterion or could pass while a named failure mode holds (gaming).
* The description + param the agent reads on the one blessed tool, CompleteGoal.
* This is where the agent meets the sign-off: it fills evidence and calls the
* tool, which then runs verify + the judge (7). Kept here with the rest of the
* model-facing text so the whole process reads top to bottom.
* ──────────────────────────────────────────────────────────────────────── */
export const completeGoalDescription =
"Sign off a goal once its discriminator is satisfied. First fill the goal's evidence: block in " +
"goals.md: a list where each item pairs a durable artifact with a short read of it (a quoted+linked " +
"log, a table plus how to read it, or a metric plus what it shows; quote the key lines and link the " +
"rest, not a pasted blob or a bare claim). The read must show the success POSITIVELY happened (the " +
"result is present, the count moved the right way, the metric beat noise), not just that a failure " +
"was avoided; ruling out the failure modes is necessary but not sufficient. Then call this with the " +
"goal's desc (the text after 'goal:'). Runs the goal's verify command (if any) then a read-only " +
"subagent that inspects that evidence against the repo and the discriminator. On accept, the goal is " +
"marked done and logged; on reject, it stays open and you get what is missing. The subagent's " +
"reasoning is returned either way.";
export const completeGoalParamDescription = "The goal's desc: the exact text after 'goal:' in its line.";
/* ─────────────────────────────────────────────────────────────────────────
* 7. evidenceJudge — SIGN-OFF, judge-side; the one rigorous check
*
* Runs inside CompleteGoal, on a read-only pi subprocess (fresh context via
* --no-session, so it never sees the working agent's transcript; override to a
* different vendor for an independent cross-family check). It re-derives from the
* repo rather than trusting the agent's transcription, and judges whether the
* evidence satisfies the discriminator and rules out the named failure mode.
*
* The transport gives it read/grep/find/ls. The prompt below imposes the verdict
* contract — the oracle returns prose by default, so parse the VERDICT line.
* contract — the subprocess returns prose by default, so parse the VERDICT line.
* ──────────────────────────────────────────────────────────────────────── */
export const evidenceJudgeSystem = `\
You are a read-only reviewer signing off a coding goal. Do not trust claims verify.
You are a read-only reviewer signing off a coding goal. Do not trust claims; verify.
Use read/grep/find/ls to inspect the repository and the cited artifacts yourself. Re-read the
files, logs, and diffs the evidence points to; if something it asserts isn't on disk, you can't
confirm it. If a verify command was run, judge whether it genuinely tests the criterion or
could pass while one of the listed failure modes still holds — a tautological or skipped test
is a reject. Check each failure mode is actually ruled out, not just unmentioned.
confirm it. Judge whether the evidence shows the goal POSITIVELY succeeded -- the discriminator's
success signal is actually present, not just that the failure modes were dodged. Avoiding every
failure mode is necessary but not sufficient: a run can rule out each trap and still have produced
nothing, so reject "no problems found" that lacks the positive result. Then check the named subtle
failure modes are genuinely ruled out, not just unmentioned. If a verify command was run,
judge whether it really tests the discriminator or could pass while the failure mode still holds; a
tautological or skipped test is a reject.
Finish with exactly these two lines and nothing after:
VERDICT: accept | reject
@@ -176,10 +246,10 @@ missing: <empty if accept; otherwise a short list of what's needed before this c
export function evidenceJudgeUser(p: {
subject: string;
done_when: string;
discriminator: string[];
failure_modes: string[];
verify: string | null;
verifyResult: { command: string; exitCode: number; outputTail: string } | null;
failure_modes: string[];
evidence: string;
paths: string[];
}): string {
@@ -188,9 +258,10 @@ export function evidenceJudgeUser(p: {
: "verify command: none (no deterministic check for this goal)";
return `\
Goal: ${p.subject}
done_when: ${p.done_when}
failure_modes:
${p.failure_modes.map((f) => ` - ${f}`).join("\n")}
discriminator (must be satisfied):
${p.discriminator.map((d) => ` - ${d}`).join("\n") || " (none stated, note this)"}
subtle failure modes (must be ruled out):
${p.failure_modes.map((f) => ` - ${f}`).join("\n") || " (none stated)"}
${verifyBlock}
@@ -198,7 +269,7 @@ Agent's evidence:
${p.evidence}
Artifacts it points to (inspect these):
${p.paths.map((x) => ` - ${x}`).join("\n") || " (none listed note this)"}
${p.paths.map((x) => ` - ${x}`).join("\n") || " (none listed, note this)"}
Verify the goal against its done_when. Then give your VERDICT.`;
Verify the evidence satisfies the discriminator and rules out the failure modes. Then give your VERDICT.`;
}
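The evidenceJudge header says the subprocess returns prose, so the caller has to parse the VERDICT line. One way that parsing might look is sketched below. It assumes the reply ends with `VERDICT: accept | reject` followed by a `missing:` line, and it treats a reply without a VERDICT line as a reject (a fail-closed choice made here, not something the diff confirms). `parseVerdictSketch` and the `Verdict` type are hypothetical names; the `accepted`/`rejected` kinds mirror the outcomes used by `recordSignOff`.

```typescript
// Hypothetical sketch of parsing the judge's reply; not the repo's code.
type Verdict = { kind: "accepted" } | { kind: "rejected"; missing: string };

function parseVerdictSketch(reply: string): Verdict {
  const lines = reply.trimEnd().split("\n");

  // Search from the end: the contract puts VERDICT at the bottom, and earlier prose may quote the word.
  let idx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (/^VERDICT:/i.test((lines[i] ?? "").trim())) {
      idx = i;
      break;
    }
  }
  // Assumption: an unparseable reply must never sign a goal off.
  if (idx === -1) return { kind: "rejected", missing: "judge reply had no VERDICT line" };

  const verdict = (lines[idx] ?? "").trim().replace(/^VERDICT:\s*/i, "").toLowerCase();
  if (verdict === "accept") return { kind: "accepted" };

  // Everything after the VERDICT line is the "missing:" text.
  const rest = lines
    .slice(idx + 1)
    .join("\n")
    .replace(/^\s*missing:\s*/i, "")
    .trim();
  return { kind: "rejected", missing: rest || "(judge gave no reason)" };
}
```

Anything other than an exact `accept` falls through to a reject, so a judge that echoes the template line `accept | reject` verbatim cannot sign a goal off by accident.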
+79 -79
@@ -1,26 +1,30 @@
import { describe, expect, it } from "vitest";
import { appendLog, counts, findGoal, parse, recordSignOff, setGoalStatus } from "../src/plan-file.js";
const SAMPLE = `# Plan: ship the cache layer
const SAMPLE = `# papers audit
## Goal: Implement cache layer
<!-- id: cache-layer-1 -->
status: active
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q
failure_modes:
- cache silently bypassed (hit-rate ~0, latency ok by luck)
- bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy
- [ ] load test
Clean up steering/ metadata and kill empty dirs. Keep it read-only until I approve.
## Goal: Document the API
<!-- id: document-the-api-1 -->
status: open
done_when: every public fn has a docstring; else sphinx warns
failure_modes:
- docstrings exist but are stale
## Goals
1. [/] goal: Implement cache layer
- discriminator: hit-rate > 0.8 in load-test.log (a bypass reads ~0)
- subtle failure mode: cache silently bypassed, latency ok by luck
- verify: pytest tests/cache -q
- tasks:
1. [x] wire cache client
2. [/] eviction policy
3. ~~[ ]~~ distributed cache, out of scope
- evidence:
- > load-test.log: p95=41ms
- > hit-rate 0.93 (not bypassed)
2. [ ] goal: Document the API
- discriminator: every public fn has a docstring; sphinx warns on none
- subtle failure mode: docstrings exist but are stale
# Future work / out of scope
- distributed cache
## Log
- 2026-06-15 14:02 cache client wired; eviction next
@@ -48,27 +52,45 @@ function lineDelta(a: string, b: string): { added: number; removed: number } {
describe("parse", () => {
const doc = parse(SAMPLE);
it("reads the objective and both goals", () => {
expect(doc.objective).toBe("ship the cache layer");
expect(doc.goals.map((g) => g.id)).toEqual(["cache-layer-1", "document-the-api-1"]);
it("reads the title and both goals (matched by subject)", () => {
expect(doc.title).toBe("papers audit");
expect(doc.goals.map((g) => g.subject)).toEqual(["Implement cache layer", "Document the API"]);
});
it("reads goal fields", () => {
const g = findGoal(doc, "cache-layer-1");
expect(g?.subject).toBe("Implement cache layer");
expect(g?.status).toBe("active");
expect(g?.done_when).toBe("p95 < 50ms on bench-X. If wrong: timeouts in load-test.log");
it("reads goal status from the checkbox", () => {
expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // [/]
expect(findGoal(doc, "Document the API")?.status).toBe("open"); // [ ]
});
it("reads discriminator, subtle failure mode, and verify as separate fields", () => {
const g = findGoal(doc, "Implement cache layer");
expect(g?.discriminator).toEqual(["hit-rate > 0.8 in load-test.log (a bypass reads ~0)"]);
expect(g?.failure_modes).toEqual(["cache silently bypassed, latency ok by luck"]);
expect(g?.verify).toBe("pytest tests/cache -q");
});
it("separates failure_modes from subtasks", () => {
const g = findGoal(doc, "cache-layer-1");
expect(g?.failure_modes).toHaveLength(2);
expect(g?.failure_modes[0]).toContain("cache silently bypassed");
it("reads subtasks with their checkbox state, strikethrough as cancelled", () => {
const g = findGoal(doc, "Implement cache layer");
expect(g?.subtasks).toEqual([
{ text: "wire cache client", done: true },
{ text: "eviction policy", done: false },
{ text: "load test", done: false },
{ text: "wire cache client", status: "done" },
{ text: "eviction policy", status: "active" },
{ text: "distributed cache, out of scope", status: "cancelled" },
]);
});
it("reads the evidence block separate from the other lists", () => {
const g = findGoal(doc, "Implement cache layer");
expect(g?.evidence).toEqual(["> load-test.log: p95=41ms", "> hit-rate 0.93 (not bypassed)"]);
expect(findGoal(doc, "Document the API")?.evidence).toEqual([]); // a goal with no evidence parses to []
});
it("keeps a multi-line evidence item together (quote + interpretation)", () => {
const doc2 = parse(
`# x\n\n## Goals\n\n1. [ ] goal: G\n - discriminator: report has non-zero counts\n - evidence:\n - > report.txt: counts 52 -> 4\n remaining 4 = index + 3 notes\n almost certain the discriminator passes\n - > second item, single line\n`,
);
expect(findGoal(doc2, "G")?.evidence).toEqual([
"> report.txt: counts 52 -> 4\nremaining 4 = index + 3 notes\nalmost certain the discriminator passes",
"> second item, single line",
]);
});
@@ -76,50 +98,28 @@ describe("parse", () => {
expect(doc.log).toEqual(["- 2026-06-15 14:02 cache client wired; eviction next"]);
expect(counts(doc)).toEqual({ done: 0, open: 1, active: 1 });
});
});
describe("failure_modes vs subtask disambiguation", () => {
it("a column-0 checkbox right after failure_modes: is a SUBTASK", () => {
const doc = parse(
`# Plan: x\n\n## Goal: G\n<!-- id: g-1 -->\nstatus: open\ndone_when: z\nfailure_modes:\n- [ ] first subtask\n- [x] second subtask\n`,
);
const g = findGoal(doc, "g-1");
expect(g?.failure_modes).toEqual([]);
expect(g?.subtasks).toEqual([
{ text: "first subtask", done: false },
{ text: "second subtask", done: true },
]);
});
it("an indented checkbox-shaped item inside failure_modes is a FAILURE MODE", () => {
const doc = parse(
`# Plan: x\n\n## Goal: G\n<!-- id: g-2 -->\nstatus: open\ndone_when: z\nfailure_modes:\n - [ ] prose that looks like a checkbox\n- [ ] real subtask\n`,
);
const g = findGoal(doc, "g-2");
expect(g?.failure_modes).toEqual(["[ ] prose that looks like a checkbox"]);
expect(g?.subtasks).toEqual([{ text: "real subtask", done: false }]);
});
it("a goal with no failure_modes keeps its subtasks", () => {
const doc = parse(`# Plan: x\n\n## Goal: G\n<!-- id: g-3 -->\nstatus: open\ndone_when: z\n- [ ] only subtask\n`);
const g = findGoal(doc, "g-3");
expect(g?.failure_modes).toEqual([]);
expect(g?.subtasks).toEqual([{ text: "only subtask", done: false }]);
it("ignores the Future work section, does not read it as goals or log", () => {
expect(doc.goals).toHaveLength(2);
expect(doc.log).toHaveLength(1);
});
});
describe("the two CompleteGoal writes (minimal diff)", () => {
it("setGoalStatus replaces exactly one line, scoped to the right goal", () => {
const next = setGoalStatus(SAMPLE, "cache-layer-1", "done");
const next = setGoalStatus(SAMPLE, "Implement cache layer", "done");
expect(lineDelta(SAMPLE, next)).toEqual({ added: 1, removed: 1 });
expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("done");
expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("open"); // untouched
expect(findGoal(parse(next), "Implement cache layer")?.status).toBe("done");
expect(findGoal(parse(next), "Document the API")?.status).toBe("open"); // untouched
});
it("setGoalStatus targets the second goal without touching the first", () => {
const next = setGoalStatus(SAMPLE, "document-the-api-1", "active");
expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("active");
expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("active");
it("setGoalStatus keeps the number and goal: prefix, flips only the checkbox", () => {
expect(setGoalStatus(SAMPLE, "Implement cache layer", "done")).toContain("1. [x] goal: Implement cache layer");
expect(setGoalStatus(SAMPLE, "Document the API", "cancelled")).toContain("2. [-] goal: Document the API");
});
it("setGoalStatus throws on an unknown subject", () => {
expect(() => setGoalStatus(SAMPLE, "no such goal", "done")).toThrow();
});
it("appendLog adds exactly one line under ## Log", () => {
@@ -132,7 +132,7 @@ describe("the two CompleteGoal writes (minimal diff)", () => {
});
it("appendLog creates the section when absent", () => {
const noLog = "# Plan: x\n\n## Goal: y\n<!-- id: y-1 -->\nstatus: open\ndone_when: z\n";
const noLog = "# x\n\n## Goals\n\n1. [ ] goal: y\n - discriminator: z\n";
expect(parse(appendLog(noLog, "first entry")).log).toEqual(["- first entry"]);
});
});
@@ -141,30 +141,30 @@ describe("recordSignOff (CompleteGoal's pure record logic)", () => {
const WHEN = "2026-06-15 16:00";
it("accept flips status:done and logs a sign-off line", () => {
const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "accepted" });
const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "accepted" });
expect(r.isError).toBe(false);
const doc = parse(r.content);
expect(findGoal(doc, "cache-layer-1")?.status).toBe("done");
expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off #cache-layer-1: Implement cache layer (oracle accept)`);
expect(findGoal(doc, "Implement cache layer")?.status).toBe("done");
expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off "Implement cache layer" (judge accept)`);
});
it("verify_failed only logs a reject line, status stays active", () => {
const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
expect(r.isError).toBe(true);
const doc = parse(r.content);
expect(findGoal(doc, "cache-layer-1")?.status).toBe("active"); // NOT marked done
expect(doc.log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: verify exit 1`);
expect(findGoal(doc, "Implement cache layer")?.status).toBe("active"); // NOT marked done
expect(doc.log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": verify exit 1`);
});
it("rejected logs the (one-lined) missing reason, status stays", () => {
const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
const r = recordSignOff(SAMPLE, "Implement cache layer", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
expect(r.isError).toBe(true);
expect(findGoal(parse(r.content), "cache-layer-1")?.status).toBe("active");
expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: no saved bench log`);
expect(findGoal(parse(r.content), "Implement cache layer")?.status).toBe("active");
expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject "Implement cache layer": no saved bench log`);
});
it("unknown goal returns an error and does not touch the file", () => {
const r = recordSignOff(SAMPLE, "nope-1", WHEN, { kind: "accepted" });
const r = recordSignOff(SAMPLE, "nope", WHEN, { kind: "accepted" });
expect(r.isError).toBe(true);
expect(r.content).toBe(SAMPLE);
});
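The tests above pin down what `setGoalStatus` must do: replace exactly one line, keep the number and the `goal:` prefix, flip only the checkbox, and throw on an unknown subject. A sketch that would satisfy those expectations follows. It assumes goal lines look like `1. [ ] goal: <subject>` and uses the checkbox marks named in the planDrafting prompt. `setGoalStatusSketch`, `GoalStatus` and `MARK` are hypothetical names, not the repository's implementation.

```typescript
// Hypothetical sketch of the checkbox flip; not the repo's code.
type GoalStatus = "open" | "active" | "done" | "cancelled";

// Checkbox marks as described in the planDrafting prompt: [ ] open, [/] active, [x] done, [-] cancelled.
const MARK: Record<GoalStatus, string> = { open: " ", active: "/", done: "x", cancelled: "-" };

function setGoalStatusSketch(text: string, subject: string, status: GoalStatus): string {
  const lines = text.split("\n");
  // A goal line: optional indent, a number, a checkbox, then "goal: <subject>".
  const goalLine = /^(\s*\d+\.\s+\[)[ \/x-](\]\s+goal:\s*)(.*)$/;

  const i = lines.findIndex((l) => {
    const m = goalLine.exec(l);
    return m !== null && (m[3] ?? "").trim() === subject;
  });
  if (i === -1) throw new Error(`No goal "${subject}"`);

  // Rewrite only the mark inside the brackets; number, prefix and subject are copied through.
  lines[i] = (lines[i] ?? "").replace(
    goalLine,
    (_m: string, pre: string, mid: string, rest: string) => `${pre}${MARK[status]}${mid}${rest}`,
  );
  return lines.join("\n");
}
```

Subtask checkboxes lack the `goal:` prefix, so the pattern does not match them and a subtask with the same text as a goal is left alone.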