mirror of
https://github.com/wassname/pi-goals.git
synced 2026-06-27 16:46:16 +08:00
pi-plan: plan-mode goals + evidence in one plan.md, subagent sign-off
Small, guide-not-gate plan/goal tracker for pi. The agent edits plan.md with its normal Edit tool; CompleteGoal is the one blessed path that runs verify + a read-only judge and records the result. Plan mode drafts goals (done_when + failure_modes + subtasks), a per-turn injection keeps the active goal alive through compaction, and a reminder drives upkeep + autonomy.

- src/plan-file.ts: pure parse + the two writes CompleteGoal needs + recordSignOff
- src/index.ts: plan mode, review menu, injection, reminder, widget, CompleteGoal, oracle spawn
- src/prompts.ts: all model-facing text in flow order
- test/: 15 unit tests (parser, disambiguation, sign-off record logic)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,3 @@
node_modules/
dist/
*.log
@@ -0,0 +1,114 @@
# pi-plan

A [pi](https://github.com/badlogic/pi-mono) extension for plan-driven, goal-tracked work in one
`plan.md`. Set up goals (with evidence and failure modes) in plan mode, work them, and sign a goal
off only when a read-only subagent has checked the evidence.

Successor to [pi-lgtm](https://github.com/wassname/pi-lgtm), kept deliberately small: roughly
[burneikis/pi-plan](https://github.com/burneikis/pi-plan) plus four additions (goals with evidence,
a sign-off check, a widget, and a reminder).

The form guides; it does not gate. The agent edits `plan.md` with its normal Edit tool. The one
blessed tool is `CompleteGoal`, which runs the sign-off check and records the result. The reminder,
the injected plan summary, and git/widget visibility carry the process. It trusts the agent's
judgement rather than guarding it.

## Install

```bash
pi install npm:@wassname2/pi-plan
```

Or run without installing:

```bash
pi -e npm:@wassname2/pi-plan
```

## Use

```
/plan add CSV export to the report view
```

1. Plan. The agent explores read-only and writes goals into `plan.md` (see format below).
2. Review. You get a menu: Ready, Edit (ask the agent to revise), Open in `$EDITOR`, or Cancel.
   On Ready you choose whether to keep the current context or start fresh and compacted.
3. Work. Each turn the active goal is injected (so it survives compaction) and a reminder nudges
   the agent to keep `plan.md` current and work autonomously. When a goal's `done_when` is met the
   agent calls `CompleteGoal`, which runs `verify` and a read-only judge and, on accept, marks it
   done and logs it.

Other commands: `/plan` (print the plan), `/plan clear` (empty `plan.md`, history kept in git),
`/plan judge <model-ref>` (use a specific model for the sign-off judge; default is your current
model).

## plan.md format

One file holds the objective, the goals, and a short append-only log.

```markdown
# Plan: ship the cache layer

## Goal: Implement cache layer
<!-- id: cache-layer-1 -->
status: active
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
failure_modes:
  - cache silently bypassed (hit-rate ~0, latency ok by luck)
  - bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy

## Log
- 2026-06-15 14:02 cache client wired; eviction next
```

- A goal is a `## Goal:` header with an `<!-- id -->`, a `status:`
  (`open` | `active` | `done` | `cancelled`), a falsifiable `done_when:` (what you expect, and the
  symptom if it is NOT met), an optional `verify:` shell command, a `failure_modes:` pre-mortem
  list, and `- [ ]` subtasks.
- `done_when` names the evidence that distinguishes real success from a subtle failure. `verify`,
  when present, is the deterministic first stage of the sign-off check.
- The agent ticks subtasks, appends to `## Log`, and sets `status` as it works. Multiple goals may
  be `active`.

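
The format above is simple enough to read with a plain line scanner. The following is a minimal sketch under stated assumptions, not the parser in `src/plan-file.ts`: the names `parseGoals` and `ParsedGoal` are hypothetical, and it assumes each field sits on its own line as in the example.

```typescript
// Hypothetical sketch of a line scanner for the plan.md goal format.
// Type and function names are illustrative; the real parser may differ.
type Status = "open" | "active" | "done" | "cancelled";

interface Subtask {
  text: string;
  done: boolean;
}

interface ParsedGoal {
  title: string;
  id: string | null;
  status: Status;
  doneWhen: string;
  verify: string | null;
  failureModes: string[];
  subtasks: Subtask[];
}

const STATUSES: readonly string[] = ["open", "active", "done", "cancelled"];

function parseGoals(markdown: string): ParsedGoal[] {
  const goals: ParsedGoal[] = [];
  let current: ParsedGoal | null = null;
  let inFailureModes = false;

  for (const raw of markdown.split("\n")) {
    const line = raw.trim();

    if (line.startsWith("## ")) {
      // Any level-2 heading closes the previous goal; only "## Goal:" opens a new one.
      current = null;
      inFailureModes = false;
      if (line.startsWith("## Goal:")) {
        current = {
          title: line.slice("## Goal:".length).trim(),
          id: null,
          status: "open",
          doneWhen: "",
          verify: null,
          failureModes: [],
          subtasks: [],
        };
        goals.push(current);
      }
      continue;
    }
    if (!current) continue; // objective line, blank lines, and "## Log" entries

    const id = line.match(/^<!--\s*id:\s*(\S+)\s*-->$/);
    const field = line.match(/^(status|done_when|verify|failure_modes):\s*(.*)$/);
    const task = line.match(/^- \[( |x)\] (.+)$/);

    if (id) {
      current.id = id[1] ?? null;
    } else if (field) {
      const key = field[1] ?? "";
      const value = field[2] ?? "";
      inFailureModes = key === "failure_modes";
      if (key === "status" && STATUSES.includes(value)) current.status = value as Status;
      if (key === "done_when") current.doneWhen = value;
      if (key === "verify") current.verify = value;
    } else if (task) {
      // Checkbox lines are subtasks and end the failure_modes list.
      inFailureModes = false;
      current.subtasks.push({ text: task[2] ?? "", done: task[1] === "x" });
    } else if (inFailureModes && line.startsWith("- ")) {
      current.failureModes.push(line.slice(2));
    }
  }
  return goals;
}
```

Checkbox lines are matched before plain bullets, so a `- [ ]` subtask is never mistaken for a failure mode.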
## The sign-off check (`CompleteGoal`)

`CompleteGoal(goal_id, evidence, paths?)` is the one blessed completion path:

1. If the goal has a `verify:` command, it is run. A non-zero exit rejects immediately, with no model
   call.
2. Otherwise (`verify` passed, or the goal has none) a read-only `pi` subprocess (the judge)
   inspects the evidence against the repo and the named failure modes and returns a verdict. It
   re-derives from the artifacts you point it at rather than trusting the claim, so point
   `evidence`/`paths` at durable artifacts (saved logs, committed diffs, files).
3. On accept, the goal's `status` flips to `done` and a `## Log` line is written. On reject, the
   goal stays open and the agent is told what is missing.

The judge defaults to your current model (guaranteed authorized and capable). Set a different one
with `/plan judge <provider/model>` for an independent cross-family check.

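
The control flow of the steps above can be sketched as a small function. This is an illustration under assumptions, not the code in `src/index.ts`: `signOff`, `RunVerify`, and `RunJudge` are hypothetical names, and the verify runner and judge are passed in as plain synchronous functions so the ordering is easy to see (a real judge call would be asynchronous).

```typescript
// Hypothetical sketch of the two-stage sign-off ordering.
interface VerifyResult {
  exitCode: number;
  outputTail: string;
}

interface Verdict {
  accept: boolean;
  missing: string[];
}

interface GoalContract {
  id: string;
  doneWhen: string;
  verify: string | null;
  failureModes: string[];
}

type RunVerify = (command: string) => VerifyResult;
type RunJudge = (goal: GoalContract, evidence: string, verify: VerifyResult | null) => Verdict;

function signOff(goal: GoalContract, evidence: string, runVerify: RunVerify, runJudge: RunJudge): Verdict {
  let verifyResult: VerifyResult | null = null;

  // Stage 1: deterministic. A failing verify rejects without spending a model call.
  if (goal.verify !== null) {
    verifyResult = runVerify(goal.verify);
    if (verifyResult.exitCode !== 0) {
      return {
        accept: false,
        missing: [`verify failed (exit ${verifyResult.exitCode}): ${verifyResult.outputTail}`],
      };
    }
  }

  // Stage 2: the read-only judge sees the contract, the evidence, and the verify result.
  return runJudge(goal, evidence, verifyResult);
}
```

Keeping the judge behind the verify gate is what makes a red test suite free to reject.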
## Prompts

All model-facing text lives in [`src/prompts.ts`](src/prompts.ts), in flow order, so the process is
easy to review end to end.

## Develop

```bash
pi -e ./src/index.ts   # load locally
npm test               # vitest: parser + sign-off record logic
npm run typecheck
npm run lint
```

## Not (yet) included

No autonomous re-prompt loop (an until-done-style loop judge). Autonomy comes from the reminder, not
a harness. Plan-phase model stickiness is a documented next step.

## License

MIT
+23
@@ -0,0 +1,23 @@
{
  "$schema": "https://biomejs.dev/schemas/2.4.8/schema.json",
  "linter": {
    "enabled": true,
    "rules": {
      "recommended": true,
      "style": {
        "recommended": false
      },
      "suspicious": {
        "noExplicitAny": "off",
        "noControlCharactersInRegex": "off",
        "noEmptyInterface": "off"
      }
    }
  },
  "formatter": {
    "enabled": false
  },
  "files": {
    "includes": ["src/**/*.ts", "src/**/*.tsx", "test/**/*.ts"]
  }
}
@@ -0,0 +1,272 @@
# pi-plan — design spec

Working title. A pi extension: set up goals (with subtasks and evidence) through plan mode, work them autonomously, and sign a goal off only when a check passes. One markdown file holds everything. The form guides a process; it does not police one. Successor to `pi-lgtm`, deliberately smaller.

Status: draft for review. Names, defaults, field shapes provisional.

---

## 1. Original ask → this spec

| Ask | Mechanism |
|-----|-----------|
| Set up goals + subtasks + evidence via **plan mode** | §3a — plan mode drafts the goal contract, you approve it |
| **Subagent check** of evidence on sign-off | §5, §9 — oracle inside `CompleteGoal` |
| Goals shown in a **task-list widget** | §7 — `/plan` renders goals + subtask checkboxes |
| Store **all in `plan.md`** | §4 — single file, no sidecar store |
| A **small manus-style append log** | §4 — short `## Log` section inside `plan.md` |
| **Typed reminders** to update tasks | §8a — recurring nudge |
| **Work autonomously** toward goals | §3b, §8a — the loop, driven by the reminder |
| Persist through **compaction**; pi-tasks but simpler | §8 injection; minimal tool surface |

---

## 2. Decisions and preferences

Separates the opinionated forks from the mechanical body (§4 on).

### 2a. Preferences driving the design

- **Guidance over guardrails.** None of the surveyed extensions hard-enforce. The form (plan.md structure) + the reminder + the prompts guide the agent through a process; the one genuinely rigorous step is the sign-off check; git + widget visibility is the backstop. The agent can edit anything — we make the right path the easy path, not the only path.
- **Anti-complexity.** One file, minimal tools, plain-file editing for anything with no cheat incentive.
- **Reward-hacking / honesty focus.** The sign-off check must resist assertion and test-gaming, not just check a box.
- **Cost-sensitivity (single 3090 / metered API).** KV-cache hygiene, judge-once-per-goal, cheap loop judge.
- **Scout mindset.** Make false completion visible rather than paper over it.

### 2b. Decisions

`[decided]` = settled; `[open]` = your call.

| # | Decision | Alternative rejected | Why | Status |
|---|----------|----------------------|-----|--------|
| D1 | **Everything in one `plan.md`** | Separate `.plan/log.jsonl` sidecar | Asked for; simpler, one diff to read | decided |
| D2 | Plan mode **is** the goal-setup-and-agreement phase | Agent-only creation | Approval is where `done_when` + `failure_modes` get agreed before any code | decided |
| D3 | **Guide the process; don't gate it.** The only special path is `CompleteGoal` (the sign-off check) | Pre-tool-use interceptor that blocks `status: done` edits | No surveyed extension enforces at that level; the reminder + form carry it; bypass is visible in git | decided |
| D4 | **Two-stage sign-off check**: deterministic `verify:` then oracle | Oracle only; tests only (Codex) | Tests unfakeable-by-assertion but gameable; oracle catches gaming + non-test criteria | decided |
| D5 | Two **separate** judges: cheap loop + oracle sign-off | One judge for both | Loop judge reads assertions (foolable, ok); sign-off judge reads artifacts | decided |
| D6 | Sign-off judge = oracle subprocess, **copied not depended** | In-process; pi-subagents | Shell-free spawn dodges noclobber/cropping; copying avoids flaky coupling | decided |
| D7 | Contract tamper-check = **git visibility** | Append-only frozen log | All-in-one-file gives up the hard freeze; git diff + guided sign-off are enough for a single user | decided |
| D8 | Completed goals **archived, not deleted** | Auto-clear after idle | A plan is a durable record | decided |
| D9 | **Goals are flexible: multiple may be `active`** | One active goal forced | Operator wants flexibility; the agent picks focus, injection lists the active set | decided |
| D10 | Loop judge default = main model, tiny prompt | Dedicated cheap aux model | Zero setup; switch if cost bites | open |
| D11 | Sign-off judge default = **the session's current model** | Auto-pick "strongest on provider" (oracle-style) | Current model is guaranteed authorized + capable; provider lists hold dead/weak/unauthorized entries. Cross-vendor is a **setting** (§9) | decided |
| D12 | **Plan-phase model is selectable and sticky** | Always the working model | Plan benefits from a stronger reasoner; persist the choice (oracle.json-style). Optionally the oracle drafts the plan (read-only + strong already) | decided |
| D13 | **Offer to compact after plan accepted** | Always fresh session (burneikis); or never | Some runs want a clean execution context, some want to keep it. Make it a post-Ready choice | decided |

### 2c. Cuts (non-goals)

DAG / `blocks` edges. Parallel subagent execution (the flaky part). `findings.md`. Hard pre-tool-use enforcement (D3). Sign-off judge every turn (cost).

---

## 3. Two phases: setup, then execution

### 3a. Setup — plan mode

Goals are created and *agreed* through plan mode (burneikis-style). Stock plan mode; the deltas are the output format and the hand-off.

1. `/plan <objective>` enters plan mode. The agent explores read-only and drafts goals into `plan.md` in the contract format (§4). This phase runs on the **plan-phase model** (selectable + sticky, D12; optionally the read-only oracle drafts it).
2. You review: **Ready** / **Edit** (NL rewrite) / **$EDITOR** (hand-edit) / **Cancel**. The agreement point — you sanity-check `done_when` and `failure_modes` before any code.
3. On **Ready**, offer **compact context? (y/n)** (D13). Yes → execution starts in a cleared context with the approved `plan.md` re-injected. No → execution continues in the same context.

Direct `plan.md` edits remain a quick-add path for a one-off goal.

### 3b. Execution — the loop ↔ check cycle

Multiple goals may be `active`; the agent works whichever it's focused on, in the order it judges best.

1. The session works an `active` goal under an iteration budget (or `/goal` (re)starts the loop on the current plan).
2. Each turn, the **loop judge** reads the agent's last response → continue/pause (fail-open; the **budget is the real backstop**).
3. When the agent judges a goal done, the reminder steers it to call `CompleteGoal` (not hand-tick `status`).
4. `CompleteGoal` runs the **two-stage check**:
   - **reject** → `missing[]` fed back; work continues toward the gap.
   - **accept** → goal marked done; the agent moves to another active/open goal, or the loop stops.

The loop judge can be fooled (reads assertions); worst case is a premature pause, caught by you or the budget. The sign-off check re-derives from artifacts, so it is not fooled cheaply. That asymmetry is the point.

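
The fail-open loop judge with a budget backstop can be sketched as below. This is a minimal illustration under assumptions: `decideLoop` is a hypothetical name, the reply is assumed to be the strict JSON `{done, reason}` described in §10, and the budget is counted in turns.

```typescript
// Hypothetical sketch: turn a loop-judge reply into continue/pause.
type LoopDecision = "continue" | "pause";

function decideLoop(judgeReply: string, turnsUsed: number, budget: number): LoopDecision {
  // The budget is the real backstop: once spent, pause whatever the judge says.
  if (turnsUsed >= budget) return "pause";

  try {
    const parsed: unknown = JSON.parse(judgeReply);
    if (typeof parsed === "object" && parsed !== null && (parsed as { done?: unknown }).done === true) {
      return "pause";
    }
  } catch {
    // Fail-open: an unreadable reply never stops the loop on its own.
  }
  return "continue";
}
```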
---

## 4. The one file: `plan.md`

cwd root, git-tracked. Goals, subtasks, and a short log. The agent maintains all of it through its normal Edit tool — no separate store machinery.

```markdown
# Plan: <one-line objective>

## Goal: Implement cache layer
<!-- id: cache-layer-1 -->
status: active
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
failure_modes:
  - cache silently bypassed (hit-rate ~0, latency ok by luck)
  - bench too small to exercise eviction
  - verify passes on a trivial/gamed test
- [x] wire cache client
- [ ] eviction policy
- [ ] load test

## Goal: ...

## Log
- 2026-06-15 14:02 cache client wired; eviction next
- 2026-06-15 14:31 eviction done; p95 bench reads 47ms (load-test.log)
- 2026-06-15 14:33 cache-layer-1 signed off (verify green, oracle accept)
```

Conventions:

- **Goals carry `status:` and no checkbox; subtasks are `- [ ]`.** `status` ∈ `open | active | done | cancelled`. Multiple goals may be `active` (D9). Subtasks tick freely.
- **`<!-- id -->`** assigned at creation; stable key (survives renaming the subject).
- **`verify:`** (optional) is the deterministic stage-1 command.
- **`failure_modes`** should name "verify could pass while still wrong" whenever a `verify:` exists.
- **`## Log`** is manus-style: append-only **by convention**, one short line per event. The reminder (§8a) enforces appending. Terse — "where it's up to" + error memory, not a transcript.

Parsing: a line scanner suffices for v0. `mdast` + `remark-gfm` only if it bites. Parse for *reading*; for the rare programmatic write (status flip, checkbox reconcile) use exact-line string patching, never a full AST serialize.

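
Exact-line string patching for the status flip might look like the sketch below. It is an illustration, not the write path in the extension: `patchStatus` is a hypothetical name, and it assumes the `status:` line starts at column 0 somewhere after the goal's `<!-- id -->` comment and before the next `## ` heading.

```typescript
// Hypothetical sketch: rewrite exactly one "status:" line and leave every other byte alone.
function patchStatus(markdown: string, goalId: string, newStatus: string): string {
  const lines = markdown.split("\n");
  const idIndex = lines.findIndex((l) => l.trim() === `<!-- id: ${goalId} -->`);
  if (idIndex === -1) throw new Error(`goal ${goalId} not found`);

  for (let i = idIndex + 1; i < lines.length; i++) {
    const line = lines[i] ?? "";
    if (line.startsWith("## ")) break; // reached the next goal or the log
    if (line.startsWith("status:")) {
      lines[i] = `status: ${newStatus}`;
      return lines.join("\n");
    }
  }
  throw new Error(`goal ${goalId} has no status line`);
}
```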
---

## 5. Tools

`CompleteGoal` is the one blessed path (it runs the check and records it). Everything else — create goal, edit plan, tick subtasks, append to log — is plain Edit, guided by the reminder.

### `CompleteGoal(id, evidence, paths[])` — the sign-off check

1. Read `done_when` + `verify` + `failure_modes` for the goal from `plan.md` (git diff is the tamper-check, D7).
2. **Evidence must point to durable artifacts** the read-only judge can inspect (saved logs, committed diffs, files). Ephemeral claims fail stage 2.
3. **Stage 1 — deterministic.** If `verify` exists, run it shell-free, capture exit + output tail. Non-zero → reject immediately, return the tail. No model call spent.
4. **Stage 2 — oracle.** Spawn the read-only judge (D11 default = current model; §9) with the criterion, failure modes, evidence, and verify result; it inspects the repo and checks the verify command was not gamed against the named failure modes.
5. Verdict: **accept** → string-patch `status: done`, append a `## Log` line. **reject** → status stays `active`, append `missing[]` to `## Log`, return `missing`.

### `CancelGoal(id, reason)` — optional

open/active → cancelled is not a sign-off, so it skips the check. A tool only to guarantee a `## Log` line lands.

---

## 6. Guiding sign-off (no hard gate)

Per D3, there is no pre-tool-use interceptor blocking `status: done`. Sign-off is guided, not gated:

- the **reminder** (§8a) tells the agent to complete a goal through `CompleteGoal`, not by hand-editing status;
- `CompleteGoal` is the obvious, blessed path that runs the check and writes the log line;
- the **widget** (§7) can flag a goal whose `status: done` has no corresponding `## Log` sign-off line — visibility, not a block;
- `plan.md` is git-tracked, so any hand-tick shows in the diff.

The agent *can* bypass it. The bet — borne out by how the other extensions actually run — is that a clear form plus a standing reminder makes the blessed path the path taken, and visibility catches the rare bypass.

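
The widget's visibility check for hand-ticked goals can be sketched in a few lines. This is a hedged illustration: `unsignedDoneGoals` is a hypothetical name, and it assumes sign-off log lines contain `<goal-id> signed off`, as in the §4 example; the real log wording may differ.

```typescript
// Hypothetical sketch: list goals marked done that have no matching sign-off line in the log.
interface GoalSummary {
  id: string;
  status: "open" | "active" | "done" | "cancelled";
}

function unsignedDoneGoals(goals: GoalSummary[], logLines: string[]): string[] {
  return goals
    .filter((g) => g.status === "done")
    // Assumed log convention: "<timestamp> <id> signed off (...)".
    .filter((g) => !logLines.some((line) => line.includes(`${g.id} signed off`)))
    .map((g) => g.id);
}
```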
---

## 7. Commands

- `/plan <desc>` — **enter plan mode** (§3a): read-only explore → draft goals → review. Ready offers the compact choice, then starts execution.
- `/plan` (no args) — render the **task-list widget**: each goal with status + its subtask checkboxes + "N done hidden"; flag any `done` goal lacking a sign-off log line; offer archive-completed and cancel-goal.
- `/goal` — (re)start the loop on the current plan.
- `/goal pause | resume | clear | status` — loop controls.
- `/subgoal <text>` — append an acceptance criterion to a goal mid-loop. Optional.
- `/judge model <ref>` — set the sign-off judge model (default: current model; set a cross-vendor ref here for stronger independence, §9).

---

## 8. Hooks / lifecycle

- **`before_agent_start`** — parse `plan.md`; inject a fixed-shape summary (active goals + focus + last log line) as a late **user-role** message. Compaction-persistence.
- **reminder** — §8a.
- **pre-compact** — flush state to `plan.md` before compaction.

(No pre-tool-use gate — D3.)

### 8a. The reminder (typed; what it says)

Fires when a goal is `active` and there have been **N file-modifying turns since the last `plan.md` update**. One `<system-reminder>` covering both task upkeep and goal progress:

- **task** — tick completed subtask checkboxes; add new ones discovered.
- **log** — append **one short line** to `## Log` (append, don't rewrite).
- **goal** — if a goal's evidence is in, **sign it off via `CompleteGoal`** — don't hand-tick `status: done`.
- **autonomy** — keep working toward an active goal; don't stop to ask unless genuinely blocked.

The reminder is both the housekeeping and the autonomy engine and, with no hard gate, the main thing that gets the process followed. Keep the wording stable so it doesn't thrash the cache.

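
The firing rule can be sketched as a small counter. This is an assumption-laden illustration: `ReminderCadence` is a hypothetical name, the threshold N is passed in by the caller, and each turn is assumed to report the relative paths it modified.

```typescript
// Hypothetical sketch: count file-modifying turns since plan.md was last touched.
class ReminderCadence {
  private turnsSincePlanUpdate = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  /** Call once per finished turn with the files that turn modified. */
  recordTurn(modifiedFiles: string[]): void {
    if (modifiedFiles.includes("plan.md")) {
      this.turnsSincePlanUpdate = 0; // the plan is current again
    } else if (modifiedFiles.length > 0) {
      this.turnsSincePlanUpdate += 1; // read-only turns do not count
    }
  }

  shouldRemind(hasActiveGoal: boolean): boolean {
    return hasActiveGoal && this.turnsSincePlanUpdate >= this.threshold;
  }
}
```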
---

## 9. Judges

| | Loop judge | Sign-off judge (stage 2) |
|---|---|---|
| Drives | continue / pause each turn | accept / reject a sign-off |
| Cost | cheap, every turn | costly, once per goal |
| Reads | the agent's last response (~4 KB) | the repo, independently |
| Transport | one small model call (D10) | read-only oracle subprocess |
| On failure | fail-open → continue; **budget** is the backstop | fail-closed → goal stays active |
| Foolable? | yes — asserted "done" passes; bounded by budget | hard: re-reads artifacts + runs `verify` |

### Sign-off judge: model choice (D11)

- **Default: the session's current model.** Guaranteed authorized and capable, because you're already running it. Auto-picking "strongest on provider" (oracle-style) is rejected as the default — those lists carry dead, weak, and unauthorized entries.
- **Most of the value is model-independent.** The read-only judge re-derives from artifacts: does the evidence match the repo, is the `verify` tautological, is each failure mode actually ruled out. Any capable model does that regardless of family.
- **Cross-vendor is the stronger-independence setting** (`/judge model`), for the residual *shared-reasoning-error* class, when you have a known-good alternative. Mirror the oracle's curated provider list for that override menu; don't auto-select from it.

### Transport (oracle pattern, copied)

- **Shell-free spawn.** `spawn(command, argsArray)`, no `shell:true`; capture stdout via pipe and parse. This avoids the noclobber/cropping pain of `pi -p … > out.json` under zsh. ~40 lines.
- **Read-only toolset.** `read / grep / find / ls`, optional non-mutating `bash`. Separate process = fresh context, no anchoring — the independence you reliably get even from the same model.
- **Verdict contract.** Oracle returns prose by default; impose `VERDICT: accept|reject` + `missing:` in the prompt and parse that block.

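
Parsing the verdict contract out of the judge's prose can be sketched as follows. This is an illustration under assumptions: `parseVerdict` is a hypothetical name, and the exact layout (a `VERDICT:` line followed by a `missing:` line with `- ` bullets) is assumed, since the spec only names the two markers. A reply with no verdict line is treated as a reject, matching the fail-closed row in the table above.

```typescript
// Hypothetical sketch: extract "VERDICT: accept|reject" and a "missing:" list from judge output.
interface ParsedVerdict {
  accept: boolean;
  missing: string[];
}

function parseVerdict(reply: string): ParsedVerdict {
  const lines = reply.split("\n").map((l) => l.trim());

  // Take the last VERDICT line so earlier prose that quotes the contract cannot win.
  let verdictIndex = -1;
  for (let i = 0; i < lines.length; i++) {
    if (/^VERDICT:\s*(accept|reject)$/i.test(lines[i] ?? "")) verdictIndex = i;
  }
  if (verdictIndex === -1) {
    // Fail-closed: no parseable verdict means the goal stays active.
    return { accept: false, missing: ["judge reply had no VERDICT line"] };
  }

  const accept = /accept$/i.test(lines[verdictIndex] ?? "");
  const missing: string[] = [];
  const missingIndex = lines.findIndex((l, i) => i > verdictIndex && /^missing:/i.test(l));
  if (missingIndex !== -1) {
    const inline = (lines[missingIndex] ?? "").slice("missing:".length).trim();
    if (inline) missing.push(inline);
    for (let i = missingIndex + 1; i < lines.length; i++) {
      const line = lines[i] ?? "";
      if (!line.startsWith("- ")) break; // the bullet list ends at the first non-bullet line
      missing.push(line.slice(2).trim());
    }
  }
  return { accept, missing };
}
```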
---

## 10. `prompts.tsx`

All model-facing text in one file, in flow order (drafted separately):

1. **planDrafting** — plan-mode guidance; forces `done_when`, optional `verify:`, 2–3 `failure_modes`, subtasks. Human approves it.
2. **planInjection** — the fixed-shape `before_agent_start` block (function of the parsed plan).
3. **reminder** — the typed nudge (§8a).
4. **continuation** — Hermes-style "keep going" user-role message.
5. **loopJudge** — conservative, strict JSON `{done, reason}`.
6. **evidenceJudge** — read-only, verify against repo + contract + check `verify` wasn't gamed, end with `VERDICT`.

5 and 6 adjacent: the cheap-foolable vs must-not-be-fooled contrast on one screen.

---

## 11. KV-cache hygiene

- Inject as a late **user-role** message, never a system-prompt mutation (a long goal then costs the same as the same number of normal turns).
- Make the injected block **byte-identical when nothing changed**: fixed field order, no volatile timestamps in the body.

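
One way to keep the injected block byte-identical is to build it as a pure function of the parsed plan, with a fixed field order and a stable sort. The sketch below is an assumption, not the `planInjection` prompt itself: `buildInjection` and its field labels are hypothetical, and nothing in it reads the clock.

```typescript
// Hypothetical sketch: a deterministic summary block built only from plan content.
interface ActiveGoal {
  id: string;
  title: string;
  doneWhen: string;
}

interface InjectionInput {
  objective: string;
  activeGoals: ActiveGoal[];
  lastLogLine: string;
}

function buildInjection(input: InjectionInput): string {
  // Sort by id so the order in which goals were parsed cannot change the bytes.
  const goals = [...input.activeGoals].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));

  const lines = [`objective: ${input.objective}`, "active_goals:"];
  for (const g of goals) {
    lines.push(`- ${g.id} | ${g.title} | done_when: ${g.doneWhen}`);
  }
  // The last log line changes only when the log does; no generation timestamp is added.
  lines.push(`last_log: ${input.lastLogLine}`);
  return lines.join("\n");
}
```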
---

## 12. Dependencies and what to copy

- **No hard dependency** on `pi-subagents` or the `oracle` extension. Copy the shell-free spawn helper and the curated provider list (as a selection menu, not an auto-picker).
- Markdown: line scanner first; `mdast` + `remark-gfm` only if needed.
- Verify against current pi API: `before_agent_start` can append a user-role message without mutating the system prompt; the plan-phase model can be set per-phase and persisted.

---

## 13. Risks / open questions

- **Same-model sign-off judge → correlated blind spots** (the D11 tradeoff). Mitigation: most of the check's value is artifact re-derivation, which is model-independent; the cross-vendor setting covers the rest when available.
- **No hard gate (D3)** — the agent can hand-tick `status: done` and skip the check. Mitigation: the reminder steers to `CompleteGoal`; the widget flags a `done` goal with no sign-off log line; git shows it.
- **Contract tampering (D7)** — editable `plan.md` means `done_when`/`failure_modes` can be softened pre-sign-off. Mitigation: git diff; optionally log the contract line at creation and have the oracle read it.
- **Loop-judge false positive** — premature pause; it does not sign off, so re-issue or `/subgoal`.
- **`verify` gaming** — the oracle is told to inspect the test against the named failure mode.
- **`## Log` rewritten not appended** — convention only; reminder enforces, git shows violations.
- **Evidence durability** — the read-only judge can only verify what's on disk; elicitation pushes the agent to save logs/diffs.

---

## 14. Build order

Each step independently testable; model calls enter late.

1. `plan.md` format + line parser (incl. `<!-- id -->` and `## Log`) + `/plan` task-list widget. Pure file, no model calls.
2. Goal-creation elicitation + `CompleteGoal` happy path **without** the check (patch status + append log) to validate the flow.
3. Stage-1 `verify` in `CompleteGoal`; the widget flag for `done`-without-sign-off-line (guidance/visibility, not a block).
4. Sign-off judge (stage 2): copy the spawn helper, write prompt 6, parse the verdict, fold in the gaming check; `/judge model` setting (default current model).
5. `before_agent_start` injection (cache-safe) + the reminder (§8a).
6. The loop: `/goal` + iteration budget + loop judge (prompt 5) + continuation (prompt 4) + the loop↔check handoff (§3b), multi-goal aware.
7. Plan mode (§3a): `/plan <desc>` read-only draft → review → compact choice → hand-off. Plan-phase model selection + stickiness (D12). (Until built, create goals by direct `plan.md` edit.)
8. Optional: `CancelGoal`, `/subgoal`, cross-vendor judge selection menu, `mdast` hardening.

`prompts.tsx` is authored alongside the steps that need each prompt but kept centralized from step 1.
@@ -0,0 +1,47 @@
{
  "name": "@wassname2/pi-plan",
  "version": "0.0.1",
  "description": "One plan.md: set goals via plan mode, work them, sign off only when a read-only check passes. Successor to pi-lgtm.",
  "author": "wassname",
  "license": "MIT",
  "type": "module",
  "repository": {
    "type": "git",
    "url": "https://github.com/wassname/pi-plan.git"
  },
  "keywords": [
    "pi-package",
    "pi",
    "pi-extension",
    "plan",
    "goal",
    "proof",
    "uat",
    "evidence",
    "judge"
  ],
  "dependencies": {
    "@earendil-works/pi-coding-agent": "^0.79.0",
    "@earendil-works/pi-tui": "*",
    "@sinclair/typebox": "latest"
  },
  "scripts": {
    "build": "tsc",
    "test": "vitest run",
    "test:watch": "vitest",
    "typecheck": "tsc --noEmit",
    "lint": "biome check src/ test/",
    "lint:fix": "biome check --fix src/ test/"
  },
  "devDependencies": {
    "@types/node": "^20.0.0",
    "typescript": "^5.0.0",
    "@biomejs/biome": "^2.4.8",
    "vitest": "^4.0.18"
  },
  "pi": {
    "extensions": [
      "./src/index.ts"
    ]
  }
}
+380
@@ -0,0 +1,380 @@
/**
 * pi-plan — plan mode that sets up goals with evidence, tracked in one plan.md, signed off by a
 * read-only subagent check. A successor to pi-lgtm, kept deliberately small (≈ burneikis/pi-plan
 * plus the additions: goals + failure_modes + subtasks, a sign-off check, a widget, a reminder).
 *
 * Philosophy (spec D3): the form guides, it does not gate. The agent edits plan.md with its normal
 * Edit tool. The one blessed tool is CompleteGoal, which runs the sign-off check and records it. The
 * reminder + the injected plan + git/widget visibility carry the process; we trust the agent's
 * judgement rather than guarding it.
 *
 * Flow:
 *   /plan <objective>  -> plan mode: agent explores, drafts goals into plan.md (planDrafting guides)
 *   agent_end          -> review menu (Ready / Edit / $EDITOR / Cancel); Ready offers compaction
 *   execution          -> each turn, inject the plan summary (survives compaction) + a reminder;
 *                         agent works goals, ticks subtasks, appends ## Log, calls CompleteGoal
 *   CompleteGoal       -> optional deterministic verify, then a read-only oracle judge -> accept
 *                         flips status:done + logs; reject returns what's missing
 *
 * All model-facing text lives in prompts.ts, in flow order.
 */

import { spawn, spawnSync } from "node:child_process";
import { existsSync, readFileSync, writeFileSync } from "node:fs";
import { basename, join } from "node:path";
import type { ExtensionAPI, ExtensionCommandContext, ExtensionContext } from "@earendil-works/pi-coding-agent";
import { Type } from "@sinclair/typebox";
import { counts, findGoal, type Goal, type PlanDoc, parse, recordSignOff, type SignOff } from "./plan-file.js";
import { evidenceJudgeSystem, evidenceJudgeUser, planDrafting, planInjection, reminder } from "./prompts.js";

const STATE = "pi-plan-state";
const PLAN_CONTEXT = "pi-plan-context"; // injected plan-mode guidance, stripped from history later
const STATUS_KEY = "pi-plan";
const WIDGET_KEY = "pi-plan-widget";
const READ_ONLY_TOOLS = ["read", "grep", "find", "ls", "bash"];

interface PlanState {
	isPlanMode: boolean;
	objective: string | null;
	/** Optional model ref for the sign-off judge; unset => the subprocess uses pi's default model. */
	judgeModel: string | null;
}

export default function piPlanExtension(pi: ExtensionAPI): void {
	let state: PlanState = { isPlanMode: false, objective: null, judgeModel: null };
	// Reminder cadence: fire when an active goal exists but plan.md was not touched since last turn.
	let lastInjectedPlan = "";

	const planPath = (ctx: ExtensionContext) => join(ctx.cwd, "plan.md");
	const readPlan = (ctx: ExtensionContext): string => (existsSync(planPath(ctx)) ? readFileSync(planPath(ctx), "utf-8") : "");

	function persist(): void {
		pi.appendEntry<PlanState>(STATE, state);
	}

	function updateWidget(ctx: ExtensionContext): void {
		if (state.isPlanMode) {
			ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("warning", "planning"));
			ctx.ui.setWidget(WIDGET_KEY, ["pi-plan: drafting goals", "Write goals to plan.md, then review."]);
			return;
		}
		const doc = parse(readPlan(ctx));
		if (doc.goals.length === 0) {
			ctx.ui.setStatus(STATUS_KEY, undefined);
			ctx.ui.setWidget(WIDGET_KEY, undefined);
			return;
		}
		const c = counts(doc);
		ctx.ui.setStatus(STATUS_KEY, ctx.ui.theme.fg("accent", `◷ ${c.done}/${doc.goals.length} goals`));
		ctx.ui.setWidget(WIDGET_KEY, goalWidgetLines(doc));
	}

	function goalWidgetLines(doc: PlanDoc): string[] {
		const mark: Record<Goal["status"], string> = { done: "✔", active: "▸", open: "◻", cancelled: "✗" };
		const lines = [`Plan: ${doc.objective || "(untitled)"}`];
		for (const g of doc.goals) {
			if (g.status === "done") continue; // hide finished goals; they stay in the file
|
||||||
|
const open = g.subtasks.filter((s) => !s.done).length;
|
||||||
|
lines.push(`${mark[g.status]} ${g.subject}${open ? ` (${open} todo)` : ""}`);
|
||||||
|
}
|
||||||
|
const c = counts(doc);
|
||||||
|
if (c.done) lines.push(`(${c.done} done, hidden)`);
|
||||||
|
return lines;
|
||||||
|
}
|
||||||
|
|
||||||
|
// --- plan mode: setup -------------------------------------------------------------------------
|
||||||
|
|
||||||
|
pi.registerCommand("plan", {
|
||||||
|
description: "Plan mode: set up goals (with evidence) in plan.md, then work them. /plan <objective>",
|
||||||
|
handler: async (args, ctx) => {
|
||||||
|
const arg = args.trim();
|
||||||
|
if (arg === "clear") {
|
||||||
|
await clearPlan(ctx);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
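if (arg === "judge" || arg.startsWith("judge ")) { // exact subcommand only, so an objective like "judgement cleanup" still starts a plan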
|
||||||
|
setJudge(arg.slice("judge".length).trim(), ctx);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (!arg) {
|
||||||
|
showPlan(ctx);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
state = { ...state, isPlanMode: true, objective: arg };
|
||||||
|
persist();
|
||||||
|
updateWidget(ctx);
|
||||||
|
pi.sendUserMessage(
|
||||||
|
`Enter plan mode for this objective: ${arg}\n\nExplore read-only, then write the plan to ${planPath(ctx)}.`,
|
||||||
|
{ deliverAs: "followUp" },
|
||||||
|
);
|
||||||
|
},
|
||||||
|
});
|
||||||
|
|
||||||
|
function setJudge(ref: string, ctx: ExtensionContext): void {
|
||||||
|
state = { ...state, judgeModel: ref || null };
|
||||||
|
persist();
|
||||||
|
ctx.ui.notify(ref ? `Sign-off judge model set to ${ref}` : "Sign-off judge reset to the default model", "info");
|
||||||
|
}
|
||||||
|
|
||||||
|
async function clearPlan(ctx: ExtensionContext): Promise<void> {
|
||||||
|
if (!existsSync(planPath(ctx))) {
|
||||||
|
ctx.ui.notify("No plan.md to clear.", "info");
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (ctx.hasUI) {
|
||||||
|
const ok = await ctx.ui.select("Clear plan.md? (it stays in git history)", ["Cancel", "Clear plan.md"]);
|
||||||
|
if (ok !== "Clear plan.md") return;
|
||||||
|
}
|
||||||
|
writeFileSync(planPath(ctx), "");
|
||||||
|
state = { ...state, isPlanMode: false, objective: null };
|
||||||
|
persist();
|
||||||
|
updateWidget(ctx);
|
||||||
|
ctx.ui.notify("Cleared plan.md.", "info");
|
||||||
|
}
|
||||||
|
|
||||||
|
function showPlan(ctx: ExtensionContext): void {
|
||||||
|
const content = readPlan(ctx);
|
||||||
|
if (!content.trim()) {
|
||||||
|
ctx.ui.notify("No plan yet. Use /plan <objective> to start.", "info");
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
ctx.ui.notify(content, "info");
|
||||||
|
}
|
||||||
|
|
||||||
|
// --- review loop (after the agent drafts the plan) --------------------------------------------
|
||||||
|
|
||||||
|
async function reviewLoop(ctx: ExtensionContext, cmdCtx: ExtensionCommandContext): Promise<void> {
|
||||||
|
while (true) {
|
||||||
|
const doc = parse(readPlan(ctx));
|
||||||
|
const choice = await ctx.ui.select(`Plan: ${doc.goals.length} goal(s). What next?`, [
|
||||||
|
"Ready — start working the plan",
|
||||||
|
"Edit — ask the agent to revise",
|
||||||
|
"Open in $EDITOR",
|
||||||
|
"Cancel — leave plan mode",
|
||||||
|
]);
|
||||||
|
if (!choice || choice.startsWith("Cancel")) {
|
||||||
|
exitPlanMode(ctx);
|
||||||
|
ctx.ui.notify("Left plan mode. plan.md kept.", "info");
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (choice.startsWith("Ready")) return startExecution(ctx, cmdCtx);
|
||||||
|
if (choice.startsWith("Edit")) {
|
||||||
|
const changes = await ctx.ui.editor("What should change about the plan?", "");
|
||||||
|
if (changes?.trim()) {
|
||||||
|
pi.sendUserMessage(`Revise the plan at ${planPath(ctx)} with these changes, same format:\n\n${changes.trim()}`);
|
||||||
|
return; // agent_end re-opens the review loop
|
||||||
|
}
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
if (choice.startsWith("Open")) {
|
||||||
|
const editor = process.env.EDITOR || process.env.VISUAL || "vi";
|
||||||
|
spawnSync(editor, [planPath(ctx)], { stdio: "inherit" });
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function exitPlanMode(ctx: ExtensionContext): void {
|
||||||
|
state = { ...state, isPlanMode: false };
|
||||||
|
persist();
|
||||||
|
updateWidget(ctx);
|
||||||
|
}
|
||||||
|
|
||||||
|
async function startExecution(ctx: ExtensionContext, cmdCtx: ExtensionCommandContext): Promise<void> {
|
||||||
|
// Offer a clean execution context (D13): some runs want the fresh handoff, some want to keep it.
|
||||||
|
let fresh = false;
|
||||||
|
if (ctx.hasUI) {
|
||||||
|
const choice = await ctx.ui.select("Start working the plan in...", [
|
||||||
|
"This context (keep history)",
|
||||||
|
"A fresh, compacted context",
|
||||||
|
]);
|
||||||
|
fresh = choice?.startsWith("A fresh") ?? false;
|
||||||
|
}
|
||||||
|
exitPlanMode(ctx);
|
||||||
|
const doc = parse(readPlan(ctx));
|
||||||
|
if (doc.objective) pi.setSessionName(`Plan: ${doc.objective}`);
|
||||||
|
|
||||||
|
if (fresh) {
|
||||||
|
const result = await cmdCtx.newSession({ parentSession: ctx.sessionManager.getSessionFile() });
|
||||||
|
if (result.cancelled) {
|
||||||
|
ctx.ui.notify("Execution cancelled.", "warning");
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
pi.sendUserMessage(
|
||||||
|
`Work the plan in ${planPath(ctx)}. Pick an open goal, set it active, work its subtasks, and when its done_when is met call CompleteGoal with the evidence. Keep plan.md current as you go.`,
|
||||||
|
{ deliverAs: "followUp" },
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
// --- the one blessed tool: CompleteGoal -------------------------------------------------------
|
||||||
|
|
||||||
|
pi.registerTool({
|
||||||
|
name: "CompleteGoal",
|
||||||
|
label: "Complete goal",
|
||||||
|
description:
|
||||||
|
"Sign off a goal once its done_when is met. Runs the goal's verify command (if any) then a " +
|
||||||
|
"read-only subagent that inspects your evidence against the repo. On accept, the goal is marked " +
|
||||||
|
"done and logged; on reject, it stays open and you get what is missing. Point evidence at durable " +
|
||||||
|
"artifacts (saved logs, committed diffs, files), not claims.",
|
||||||
|
parameters: Type.Object({
|
||||||
|
goal_id: Type.String({ description: "The goal's <!-- id --> from plan.md" }),
|
||||||
|
evidence: Type.String({ description: "What shows the done_when is met, and where to verify it" }),
|
||||||
|
paths: Type.Optional(Type.Array(Type.String(), { description: "Durable artifacts the judge should inspect" })),
|
||||||
|
}),
|
||||||
|
async execute(_id, params, signal, _onUpdate, ctx) {
|
||||||
|
const content = readPlan(ctx);
|
||||||
|
const goal = findGoal(parse(content), params.goal_id);
|
||||||
|
if (!goal) return text(`No goal #${params.goal_id} in plan.md.`, true);
|
||||||
|
|
||||||
|
// Decide the outcome (the I/O); recordSignOff applies it to the file (the pure write).
|
||||||
|
const outcome = await decideSignOff(goal, params.evidence, params.paths ?? [], state.judgeModel, ctx.cwd, signal);
|
||||||
|
const res = recordSignOff(content, goal.id, stamp(), outcome);
|
||||||
|
if (res.content !== content) writeFileSync(planPath(ctx), res.content);
|
||||||
|
updateWidget(ctx);
|
||||||
|
return text(res.message, res.isError);
|
||||||
|
},
|
||||||
|
});
|
||||||
|
|
||||||
|
// --- hooks ------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
pi.on("before_agent_start", async (_event, ctx) => {
|
||||||
|
if (state.isPlanMode) {
|
||||||
|
return { message: { customType: PLAN_CONTEXT, content: `${planDrafting}\n\nWrite the plan to ${planPath(ctx)}.`, display: false } };
|
||||||
|
}
|
||||||
|
const doc = parse(readPlan(ctx));
|
||||||
|
if (doc.goals.length === 0) return;
|
||||||
|
|
||||||
|
const active = doc.goals.find((g) => g.status === "active") ?? doc.goals.find((g) => g.status === "open") ?? null;
|
||||||
|
const c = counts(doc);
|
||||||
|
let body = planInjection({
|
||||||
|
objective: doc.objective,
|
||||||
|
activeGoal: active
|
||||||
|
? { subject: active.subject, done_when: active.done_when, openSubtasks: active.subtasks.filter((s) => !s.done).map((s) => s.text) }
|
||||||
|
: null,
|
||||||
|
lastLogLine: doc.log.at(-1) ?? null,
|
||||||
|
counts: { done: c.done, open: c.open + c.active },
|
||||||
|
});
|
||||||
|
// Reminder fires when there is an active goal but plan.md was untouched since the last turn.
|
||||||
|
const planNow = readPlan(ctx);
|
||||||
|
if (active && planNow === lastInjectedPlan) body += `\n\n${reminder}`;
|
||||||
|
lastInjectedPlan = planNow;
|
||||||
|
return { message: { customType: PLAN_CONTEXT, content: body, display: false } };
|
||||||
|
});
|
||||||
|
|
||||||
|
pi.on("agent_end", async (_event, ctx) => {
|
||||||
|
if (!state.isPlanMode || !ctx.hasUI) return;
|
||||||
|
const doc = parse(readPlan(ctx));
|
||||||
|
if (doc.goals.length === 0) {
|
||||||
|
ctx.ui.notify("No goals found in plan.md yet — ask the agent to draft them.", "warning");
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
await reviewLoop(ctx, ctx as ExtensionCommandContext);
|
||||||
|
});
|
||||||
|
|
||||||
|
// Keep only the freshest injected plan summary; strip stale ones so history does not bloat and
|
||||||
|
// the model never sees an out-of-date plan. (The current turn's injection is the one kept.)
|
||||||
|
pi.on("context", async (event) => {
|
||||||
|
const isCtx = (m: unknown) => (m as { customType?: string }).customType === PLAN_CONTEXT;
|
||||||
|
let lastIdx = -1;
|
||||||
|
event.messages.forEach((m, i) => {
|
||||||
|
if (isCtx(m)) lastIdx = i;
|
||||||
|
});
|
||||||
|
return { messages: event.messages.filter((m, i) => !isCtx(m) || i === lastIdx) };
|
||||||
|
});
|
||||||
|
|
||||||
|
pi.on("session_start", async (_event, ctx) => {
|
||||||
|
const last = ctx.sessionManager
|
||||||
|
.getEntries()
|
||||||
|
.filter((e: { type?: string; customType?: string }) => e.type === "custom" && e.customType === STATE)
|
||||||
|
.pop() as { data?: PlanState } | undefined;
|
||||||
|
if (last?.data) state = { ...state, ...last.data };
|
||||||
|
updateWidget(ctx);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// --- helpers (module scope; pure enough to keep out of the closure) -------------------------------
|
||||||
|
|
||||||
|
function text(s: string, isError = false) {
|
||||||
|
return { content: [{ type: "text" as const, text: s }], details: { isError }, isError };
|
||||||
|
}
|
||||||
|
|
||||||
|
function stamp(): string {
|
||||||
|
return new Date().toISOString().slice(0, 16).replace("T", " ");
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Decide a sign-off: deterministic verify first (cheap; skip the model call if it fails), then the judge. */
|
||||||
|
async function decideSignOff(
|
||||||
|
goal: Goal,
|
||||||
|
evidence: string,
|
||||||
|
paths: string[],
|
||||||
|
judgeModel: string | null,
|
||||||
|
cwd: string,
|
||||||
|
signal: AbortSignal | undefined,
|
||||||
|
): Promise<SignOff> {
|
||||||
|
let verifyResult: { command: string; exitCode: number; outputTail: string } | null = null;
|
||||||
|
if (goal.verify) {
|
||||||
|
verifyResult = runVerify(goal.verify, cwd, signal);
|
||||||
|
if (verifyResult.exitCode !== 0) {
|
||||||
|
return { kind: "verify_failed", exitCode: verifyResult.exitCode, outputTail: verifyResult.outputTail };
|
||||||
|
}
|
||||||
|
}
|
||||||
|
const verdict = await runJudge(goal, evidence, paths, verifyResult, judgeModel, cwd, signal);
|
||||||
|
return verdict.accept ? { kind: "accepted" } : { kind: "rejected", missing: verdict.missing };
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Run the goal's verify command. It is agent-authored and trusted (single-user machine, guide-not-guard). */
|
||||||
|
function runVerify(command: string, cwd: string, signal: AbortSignal | undefined): { command: string; exitCode: number; outputTail: string } {
|
||||||
|
const res = spawnSync("sh", ["-c", command], { cwd, encoding: "utf-8", signal, timeout: 600_000 });
|
||||||
|
const out = `${res.stdout ?? ""}${res.stderr ?? ""}`;
|
||||||
|
return { command, exitCode: res.status ?? 1, outputTail: out.split("\n").slice(-30).join("\n") };
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Locate the pi binary the same way the oracle extension does, so spawning works under bun or node. */
|
||||||
|
function getPiInvocation(args: string[]): { command: string; args: string[] } {
|
||||||
|
const script = process.argv[1];
|
||||||
|
if (script && !script.startsWith("/$bunfs/root/") && existsSync(script)) return { command: process.execPath, args: [script, ...args] };
|
||||||
|
const execName = basename(process.execPath).toLowerCase();
|
||||||
|
if (!/^(node|bun)(\.exe)?$/.test(execName)) return { command: process.execPath, args };
|
||||||
|
return { command: "pi", args };
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Stage 2: a read-only pi subprocess inspects the evidence against the repo and returns a verdict. */
|
||||||
|
async function runJudge(
|
||||||
|
goal: Goal,
|
||||||
|
evidence: string,
|
||||||
|
paths: string[],
|
||||||
|
verifyResult: { command: string; exitCode: number; outputTail: string } | null,
|
||||||
|
judgeModel: string | null,
|
||||||
|
cwd: string,
|
||||||
|
signal: AbortSignal | undefined,
|
||||||
|
): Promise<{ accept: boolean; missing: string }> {
|
||||||
|
const task = evidenceJudgeUser({
|
||||||
|
subject: goal.subject,
|
||||||
|
done_when: goal.done_when,
|
||||||
|
verify: goal.verify ?? null,
|
||||||
|
verifyResult,
|
||||||
|
failure_modes: goal.failure_modes,
|
||||||
|
evidence,
|
||||||
|
paths,
|
||||||
|
});
|
||||||
|
const args = ["-p", "--no-session", "--tools", READ_ONLY_TOOLS.join(","), "--append-system-prompt", evidenceJudgeSystem];
|
||||||
|
if (judgeModel) args.push("--model", judgeModel);
|
||||||
|
args.push(task);
|
||||||
|
|
||||||
|
const inv = getPiInvocation(args);
|
||||||
|
const output = await new Promise<string>((resolve) => {
|
||||||
|
const proc = spawn(inv.command, inv.args, { cwd, shell: false, stdio: ["ignore", "pipe", "pipe"], signal });
|
||||||
|
let out = "";
|
||||||
|
proc.stdout.on("data", (d) => (out += d));
|
||||||
|
proc.stderr.on("data", (d) => (out += d));
|
||||||
|
proc.on("close", () => resolve(out));
|
||||||
|
proc.on("error", (e) => resolve(`VERDICT: reject\nmissing: judge subprocess failed: ${e.message}`));
|
||||||
|
});
|
||||||
|
|
||||||
|
const verdictLine = output.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
// Accept only when "accept" directly follows the colon; a bare /accept/ test would also match
// lines such as "VERDICT: reject (cannot accept the claim)".
const accept = /^\s*VERDICT\s*:\W*accept/i.test(verdictLine);
const missingMatch = output.match(/missing\s*:\s*([\s\S]*)$/i);
const missing = accept ? "" : (missingMatch?.[1].trim() || output.trim().slice(-500) || "judge gave no reason");
return { accept, missing };
|
||||||
|
}
|
||||||
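The verdict parsing at the end of `runJudge` can be tried in isolation. The sketch below is an illustration, not code from the repository: `parseVerdict` and `Verdict` are hypothetical names, and it assumes the judge prints a line of the form `VERDICT: accept` or `VERDICT: reject`, optionally followed by `missing: ...`. Anything that is not an explicit accept is treated as a reject, so a crashed or silent judge never signs a goal off.

```typescript
// Hypothetical helper (not part of pi-plan): mirrors the parsing at the end of runJudge.
interface Verdict {
  accept: boolean;
  missing: string;
}

function parseVerdict(output: string): Verdict {
  // Use the first line that looks like "VERDICT: ..."; no such line means no verdict.
  const verdictLine = output.split("\n").find((l) => /^\s*VERDICT\s*:/i.test(l)) ?? "";
  // Assumption: only "accept" directly after the colon counts, so
  // "VERDICT: reject (cannot accept ...)" stays a reject.
  const accept = /^\s*VERDICT\s*:\W*accept/i.test(verdictLine);
  // Everything after the first "missing:" is the judge's explanation.
  const missingMatch = output.match(/missing\s*:\s*([\s\S]*)$/i);
  const missing = accept ? "" : missingMatch?.[1].trim() || output.trim().slice(-500) || "judge gave no reason";
  return { accept, missing };
}

const ok = parseVerdict("Checked the saved logs.\nVERDICT: accept");
const bad = parseVerdict("VERDICT: reject\nmissing: no test output saved");
const silent = parseVerdict("");
console.log(ok, bad, silent);
```

Failing closed fits the design stated in the source: the evidence judge is the one rigorous check, so an unparseable answer should leave the goal open.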
@@ -0,0 +1,225 @@
|
|||||||
|
/**
 * plan-file.ts — read plan.md, and the two writes CompleteGoal needs. That is all.
 *
 * Pure module, no pi deps, so it unit-tests without a runtime. The file is the canonical store and
 * the agent edits it with its normal Edit tool (create goals, tick subtasks, append log), guided by
 * the format in prompts.ts and the reminder -- the form guides, it does not gate (spec D3). So this
 * module does NOT render or create goals; the format's single source of truth is the planDrafting
 * prompt. The only programmatic writers are setGoalStatus + appendLog, used by CompleteGoal to
 * record a sign-off (an accept flips status and logs; a reject only logs); each touches one line
 * so the git diff stays readable.
 *
 * Format (spec §4):
 *
 *   # Plan: <objective>
 *
 *   ## Goal: <subject>
 *   <!-- id: <slug> -->
 *   status: open | active | done | cancelled
 *   done_when: <falsifiable check; plus the symptom if NOT met>
 *   verify: <shell command, optional>
 *   failure_modes:
 *     - <pre-mortem item>
 *   - [ ] <subtask>
 *
 *   ## Log
 *   - <verbatim append-only line>
 */
|
||||||
|
|
||||||
|
export type GoalStatus = "open" | "active" | "done" | "cancelled";
|
||||||
|
|
||||||
|
export interface Subtask {
|
||||||
|
text: string;
|
||||||
|
done: boolean;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface Goal {
|
||||||
|
id: string;
|
||||||
|
subject: string;
|
||||||
|
status: GoalStatus;
|
||||||
|
done_when: string;
|
||||||
|
verify?: string;
|
||||||
|
failure_modes: string[];
|
||||||
|
subtasks: Subtask[];
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface PlanDoc {
|
||||||
|
objective: string;
|
||||||
|
goals: Goal[];
|
||||||
|
/** Verbatim ## Log lines, including the leading "- ". */
|
||||||
|
log: string[];
|
||||||
|
}
|
||||||
|
|
||||||
|
const GOAL_HEADER = /^##\s+Goal:\s*(.*)$/;
|
||||||
|
const ANY_HEADER = /^#{1,6}\s/;
|
||||||
|
const LOG_HEADER = /^##\s+Log\s*$/i;
|
||||||
|
const ID_COMMENT = /^<!--\s*id:\s*(.+?)\s*-->$/;
|
||||||
|
const CHECKBOX = /^- \[([ xX])\]\s+(.*)$/;
|
||||||
|
|
||||||
|
export function parse(text: string): PlanDoc {
|
||||||
|
const lines = text.split("\n");
|
||||||
|
let objective = "";
|
||||||
|
const goals: Goal[] = [];
|
||||||
|
const log: string[] = [];
|
||||||
|
|
||||||
|
let cur: Goal | null = null;
|
||||||
|
let inFailureModes = false;
|
||||||
|
let inLog = false;
|
||||||
|
|
||||||
|
const flush = () => {
|
||||||
|
if (cur) goals.push(cur);
|
||||||
|
cur = null;
|
||||||
|
inFailureModes = false;
|
||||||
|
};
|
||||||
|
|
||||||
|
for (const line of lines) {
|
||||||
|
const objMatch = /^#\s+Plan:\s*(.*)$/.exec(line);
|
||||||
|
if (objMatch) {
|
||||||
|
objective = objMatch[1].trim();
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
const goalMatch = GOAL_HEADER.exec(line);
|
||||||
|
if (goalMatch) {
|
||||||
|
flush();
|
||||||
|
inLog = false;
|
||||||
|
cur = { id: "", subject: goalMatch[1].trim(), status: "open", done_when: "", failure_modes: [], subtasks: [] };
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (LOG_HEADER.test(line)) {
|
||||||
|
flush();
|
||||||
|
inLog = true;
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Any other header ends the current goal / log section.
|
||||||
|
if (ANY_HEADER.test(line)) {
|
||||||
|
flush();
|
||||||
|
inLog = false;
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (inLog) {
|
||||||
|
if (/^\s*-\s+/.test(line)) log.push(line);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!cur) continue;
|
||||||
|
|
||||||
|
const idMatch = ID_COMMENT.exec(line.trim());
|
||||||
|
if (idMatch) {
|
||||||
|
cur.id = idMatch[1];
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
// A checkbox (column 0) is a subtask; checked first so it is never read as a failure mode.
|
||||||
|
const checkbox = CHECKBOX.exec(line);
|
||||||
|
if (checkbox) {
|
||||||
|
inFailureModes = false;
|
||||||
|
cur.subtasks.push({ done: checkbox[1].toLowerCase() === "x", text: checkbox[2].trim() });
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
const kv = /^(status|done_when|verify|failure_modes)\s*:\s*(.*)$/.exec(line);
|
||||||
|
if (kv) {
|
||||||
|
const [, key, value] = kv;
|
||||||
|
if (key === "status") cur.status = value.trim() as GoalStatus;
|
||||||
|
else if (key === "done_when") cur.done_when = value.trim();
|
||||||
|
else if (key === "verify") cur.verify = value.trim() || undefined;
|
||||||
|
else if (key === "failure_modes") inFailureModes = true;
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Indented "- " items under failure_modes: (a column-0 checkbox already returned above).
|
||||||
|
if (inFailureModes) {
|
||||||
|
const fm = /^\s*-\s+(.*)$/.exec(line);
|
||||||
|
if (fm) {
|
||||||
|
cur.failure_modes.push(fm[1].trim());
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
if (line.trim() !== "") inFailureModes = false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
flush();
|
||||||
|
|
||||||
|
return { objective, goals, log };
|
||||||
|
}
|
||||||
|
|
||||||
|
export function findGoal(doc: PlanDoc, id: string): Goal | undefined {
|
||||||
|
return doc.goals.find((g) => g.id === id);
|
||||||
|
}
|
||||||
|
|
||||||
|
export function counts(doc: PlanDoc): { done: number; open: number; active: number } {
|
||||||
|
const c = { done: 0, open: 0, active: 0 };
|
||||||
|
for (const g of doc.goals) {
|
||||||
|
if (g.status === "done") c.done++;
|
||||||
|
else if (g.status === "active") c.active++;
|
||||||
|
else if (g.status === "open") c.open++;
|
||||||
|
}
|
||||||
|
return c;
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Flip a goal's `status:` line in place (the one write CompleteGoal needs). */
|
||||||
|
export function setGoalStatus(text: string, id: string, status: GoalStatus): string {
|
||||||
|
const lines = text.split("\n");
|
||||||
|
let i = lines.findIndex((l) => ID_COMMENT.test(l.trim()) && ID_COMMENT.exec(l.trim())?.[1] === id);
|
||||||
|
if (i === -1) throw new Error(`Goal #${id} not found`);
|
||||||
|
for (; i < lines.length; i++) {
|
||||||
|
if (ANY_HEADER.test(lines[i])) break; // any header after the id comment ends this goal's block, so a goal with no status: line never flips the next goal
|
||||||
|
const kv = /^(status\s*:\s*)(.*)$/.exec(lines[i]);
|
||||||
|
if (kv) {
|
||||||
|
lines[i] = `${kv[1]}${status}`;
|
||||||
|
return lines.join("\n");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
throw new Error(`Goal #${id} has no status: line`);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* The outcome of a sign-off attempt, decided by CompleteGoal (which runs verify + the judge). Kept
|
||||||
|
* separate from the I/O so the record logic below is pure and testable.
|
||||||
|
*/
|
||||||
|
export type SignOff =
|
||||||
|
| { kind: "verify_failed"; exitCode: number; outputTail: string }
|
||||||
|
| { kind: "rejected"; missing: string }
|
||||||
|
| { kind: "accepted" };
|
||||||
|
|
||||||
|
/** Apply a sign-off outcome to plan.md text: accept flips status + logs; reject only logs. Pure. */
|
||||||
|
export function recordSignOff(
|
||||||
|
text: string,
|
||||||
|
goalId: string,
|
||||||
|
when: string,
|
||||||
|
outcome: SignOff,
|
||||||
|
): { content: string; message: string; isError: boolean } {
|
||||||
|
const goal = findGoal(parse(text), goalId);
|
||||||
|
if (!goal) return { content: text, message: `No goal #${goalId} in plan.md.`, isError: true };
|
||||||
|
|
||||||
|
if (outcome.kind === "verify_failed") {
|
||||||
|
const content = appendLog(text, `${when} reject #${goalId}: verify exit ${outcome.exitCode}`);
|
||||||
|
return { content, message: `Sign-off rejected: verify failed (exit ${outcome.exitCode}).\n${outcome.outputTail}`, isError: true };
|
||||||
|
}
|
||||||
|
if (outcome.kind === "rejected") {
|
||||||
|
const oneLine = outcome.missing.replace(/\s+/g, " ").trim().slice(0, 200);
|
||||||
|
const content = appendLog(text, `${when} reject #${goalId}: ${oneLine}`);
|
||||||
|
return { content, message: `Sign-off rejected. Missing:\n${outcome.missing}`, isError: true };
|
||||||
|
}
|
||||||
|
const flipped = setGoalStatus(text, goalId, "done");
|
||||||
|
const content = appendLog(flipped, `${when} signed off #${goalId}: ${goal.subject} (oracle accept)`);
|
||||||
|
return { content, message: `Signed off #${goalId}: ${goal.subject}. Marked done in plan.md.`, isError: false };
|
||||||
|
}
|
||||||
|
|
||||||
|
/** Append one verbatim line to ## Log (creating the section if absent). The other CompleteGoal write. */
|
||||||
|
export function appendLog(text: string, entry: string): string {
|
||||||
|
const lines = text.split("\n");
|
||||||
|
const line = `- ${entry}`;
|
||||||
|
const header = lines.findIndex((l) => LOG_HEADER.test(l));
|
||||||
|
if (header === -1) return `${text.replace(/\n+$/, "")}\n\n## Log\n${line}\n`;
|
||||||
|
|
||||||
|
let insertAt = header + 1;
|
||||||
|
for (let i = header + 1; i < lines.length; i++) {
|
||||||
|
if (ANY_HEADER.test(lines[i])) break;
|
||||||
|
if (/^\s*-\s+/.test(lines[i])) insertAt = i + 1;
|
||||||
|
}
|
||||||
|
lines.splice(insertAt, 0, line);
|
||||||
|
return lines.join("\n");
|
||||||
|
}
|
||||||
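To make the plan.md format and the two CompleteGoal writes concrete, here is a small worked example. It is a sketch under stated assumptions: the sample plan, the goal id `cache-rows`, and the `signOff` helper are invented for illustration, and `signOff` only approximates what `setGoalStatus` followed by `appendLog` does on an accepted sign-off.

```typescript
// A sample plan.md in the documented format. The objective, goal, id, and commands are made up.
const samplePlan = [
  "# Plan: speed up the importer",
  "",
  "## Goal: Cache parsed rows",
  "<!-- id: cache-rows -->",
  "status: active",
  "done_when: bench.log shows p95 < 50ms; else timeouts appear in load-test.log",
  "verify: npm test",
  "failure_modes:",
  "  - verify passes on a trivial or gamed test",
  "- [x] add cache module",
  "- [ ] wire cache into importer",
  "",
  "## Log",
  "- 2026-06-27 08:00 started cache-rows",
  "",
].join("\n");

// Hypothetical stand-in for the status flip + log append of an accepted sign-off.
function signOff(text: string, id: string, when: string): string {
  const lines = text.split("\n");
  const entry = `- ${when} signed off #${id} (oracle accept)`;
  const idAt = lines.findIndex((l) => l.trim() === `<!-- id: ${id} -->`);
  if (idAt === -1) throw new Error(`Goal #${id} not found`);
  // Flip the first status: line after the id comment, stopping at the next header.
  for (let i = idAt + 1; i < lines.length && !/^#{1,6}\s/.test(lines[i]); i++) {
    if (/^status\s*:/.test(lines[i])) {
      lines[i] = "status: done";
      break;
    }
  }
  // Create ## Log if it is absent; otherwise insert after its last "- " entry.
  const logAt = lines.findIndex((l) => /^##\s+Log\s*$/i.test(l));
  if (logAt === -1) return `${lines.join("\n").replace(/\n+$/, "")}\n\n## Log\n${entry}\n`;
  let insertAt = logAt + 1;
  for (let i = logAt + 1; i < lines.length && !/^#{1,6}\s/.test(lines[i]); i++) {
    if (/^\s*-\s+/.test(lines[i])) insertAt = i + 1;
  }
  lines.splice(insertAt, 0, entry);
  return lines.join("\n");
}

const after = signOff(samplePlan, "cache-rows", "2026-06-27 09:30");
console.log(after);
```

Only two lines differ between `samplePlan` and `after` (the `status:` line and one new log entry), which is the property the module header relies on to keep the git diff readable.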
@@ -0,0 +1,199 @@
|
|||||||
|
/**
 * pi-plan — all model-facing text, in flow order.
 *
 * Philosophy: the form guides a process; it does not police one. The agent can
 * edit plan.md freely. These prompts + the plan.md structure make the right path
 * the easy path. The only step that is genuinely rigorous is the evidence judge
 * (6), and even that is reached by guiding the agent to call CompleteGoal, not by
 * trapping it. Bypasses stay visible in the git diff and the widget.
 *
 * Flow:
 *   SETUP (plan mode)       1. planDrafting   — strong/sticky model drafts goals
 *   EXEC, each turn start   2. planInjection  — "here is your plan, where you are"
 *   EXEC, periodic          3. reminder       — the typed nudge that drives upkeep + autonomy
 *   EXEC, loop continue     4. continuation   — keep going toward the active goal
 *   EXEC, after each turn   5. loopJudge      — continue / pause (cheap, foolable, ok)
 *   SIGN-OFF                6. evidenceJudge  — read-only verify (rigorous; the one real check)
 *
 * Read top to bottom to see the whole process. 5 and 6 are kept adjacent on
 * purpose: the cheap-foolable vs must-not-be-fooled contrast is the design.
 */
|
||||||
|
|
||||||
|
/* ─────────────────────────────────────────────────────────────────────────
|
||||||
|
* 1. planDrafting — SETUP, plan mode
|
||||||
|
*
|
||||||
|
* System guidance for the plan-phase agent. Runs on the plan model (may differ
|
||||||
|
* from the execution model; the choice is sticky — see oracle.json-style config).
|
||||||
|
* This phase is read-only: explore, then draft goals into plan.md. No code yet.
|
||||||
|
* The field requirements here are the whole "elicitation" — get them agreed up
|
||||||
|
* front, because the human reviews this output before any execution.
|
||||||
|
* ──────────────────────────────────────────────────────────────────────── */
|
||||||
|
export const planDrafting = `\
|
||||||
|
You are in plan mode. Explore the repository read-only, then draft a plan into plan.md.
|
||||||
|
Do not write or run code in this phase. Produce goals the human will review and approve.
|
||||||
|
|
||||||
|
Write each goal in this shape:
|
||||||
|
|
||||||
|
## Goal: <one short imperative line>
<!-- id: <short-unique-slug> -->
|
||||||
|
status: open
|
||||||
|
done_when: <a falsifiable check, plus the symptom you'd see if it's NOT met>
|
||||||
|
verify: <a shell command that exits 0 only when the goal is met — include this whenever
|
||||||
|
success is expressible as tests/lint/build/a threshold; omit it otherwise>
|
||||||
|
failure_modes:
|
||||||
|
- <a concrete way this could look done but isn't>
|
||||||
|
- <another>
|
||||||
|
- <if verify exists: "verify passes on a trivial or gamed test">
|
||||||
|
- [ ] <first subtask>
|
||||||
|
- [ ] <next subtask>
|
||||||
|
|
||||||
|
Rules for a good plan:
|
||||||
|
- Keep goals small enough that done_when is checkable in one sitting.
|
||||||
|
- done_when must be falsifiable. "Works well" is not a criterion; "p95 < 50ms on bench-X,
|
||||||
|
else timeouts in load-test.log" is.
|
||||||
|
- failure_modes are a pre-mortem: the cheap, specific ways a later "done" could be wrong.
|
||||||
|
This is the highest-value part — it shapes what evidence you'll collect.
|
||||||
|
- Prefer a verify command. A green deterministic check is worth more than a paragraph of
|
||||||
|
description, and it's the first thing checked at sign-off.
|
||||||
|
|
||||||
|
When the plan is drafted, present it and stop for review. Do not begin execution.`;
|
||||||
|
|
||||||
|
/* ─────────────────────────────────────────────────────────────────────────
|
||||||
|
* 2. planInjection — EXEC, injected at each agent start (and after compaction)
|
||||||
|
*
|
||||||
|
* A late user-role message, NOT a system-prompt mutation (keeps the prefix cache
|
||||||
|
* valid). Built from the parsed plan. MUST be byte-identical when nothing changed:
|
||||||
|
* fixed field order, no volatile timestamps in the body. Pass only the active
|
||||||
|
* goal + its open subtasks + the last log line — not the whole file.
|
||||||
|
* ──────────────────────────────────────────────────────────────────────── */
|
||||||
|
export function planInjection(p: {
|
||||||
|
objective: string;
|
||||||
|
activeGoal: { subject: string; done_when: string; openSubtasks: string[] } | null;
|
||||||
|
lastLogLine: string | null;
|
||||||
|
counts: { done: number; open: number };
|
||||||
|
}): string {
|
||||||
|
if (!p.activeGoal) {
|
||||||
|
return `Plan (plan.md): ${p.objective}\nNo active goal. ${p.counts.open} open, ${p.counts.done} done. Pick the next goal or run /plan.`;
|
||||||
|
}
|
||||||
|
const subtasks = p.activeGoal.openSubtasks.length
|
||||||
|
? p.activeGoal.openSubtasks.map((s) => ` - [ ] ${s}`).join("\n")
|
||||||
|
: " (no open subtasks)";
|
||||||
|
return `\
|
||||||
|
Plan (plan.md): ${p.objective}
|
||||||
|
Active goal: ${p.activeGoal.subject}
|
||||||
|
done_when: ${p.activeGoal.done_when}
|
||||||
|
Open subtasks:
|
||||||
|
${subtasks}
|
||||||
|
Last log: ${p.lastLogLine ?? "(none yet)"}
|
||||||
|
Progress: ${p.counts.done} done, ${p.counts.open} open.`;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* ─────────────────────────────────────────────────────────────────────────
 * 3. reminder — EXEC, periodic system-reminder
 *
 * The typed nudge. This is both the housekeeping and the autonomy engine — it is
 * what makes the process get followed without a hard gate. index.ts appends it to
 * the plan injection when a goal is in play and plan.md has not changed since the
 * previous turn. Keep the wording stable so it doesn't thrash the cache.
 * ──────────────────────────────────────────────────────────────────────── */
|
||||||
|
export const reminder = `\
|
||||||
|
<system-reminder>
|
||||||
|
Keep plan.md current as you work:
|
||||||
|
- tasks: tick the subtasks you've finished; add any new ones you've discovered.
|
||||||
|
- log: append ONE short line to ## Log (append — don't rewrite earlier lines).
|
||||||
|
- goal: if the active goal's evidence is in, sign it off by calling CompleteGoal with that
|
||||||
|
evidence. Don't edit status to done by hand — CompleteGoal runs the check and records it.
|
||||||
|
- otherwise: keep working toward the active goal. Don't stop to ask unless you're genuinely
|
||||||
|
blocked; if blocked, say what's blocking and why.
|
||||||
|
</system-reminder>`;
|
||||||
|
|
||||||
|
/* ─────────────────────────────────────────────────────────────────────────
 * 4. continuation — EXEC, the loop's "keep going" turn
 *
 * Hermes-style. A plain user-role message appended when the loop judge (5) says
 * continue. Does not mutate the system prompt, so the cache holds.
 * ──────────────────────────────────────────────────────────────────────── */
export const continuation = `\
Continue toward the active goal in plan.md. If it now meets its done_when, call CompleteGoal
with your evidence (point to durable artifacts — saved logs, committed diffs, files — not just
claims). If you're blocked, state what's blocking it.`;

/* ─────────────────────────────────────────────────────────────────────────
 * 5. loopJudge — EXEC, runs after each turn to decide continue / pause
 *
 * Cheap, conservative, fail-open. Reads only the agent's last response, so it CAN
 * be fooled by an asserted "done" — that's acceptable: its worst case is a
 * premature pause, caught by you or the iteration budget. It does NOT sign goals
 * off; that's the evidence judge's job. Return strict JSON, no prose.
 * ──────────────────────────────────────────────────────────────────────── */
export const loopJudgeSystem = `\
You decide whether an autonomous coding agent should keep working or pause for the human.
Be conservative: only pause when the work is plainly finished or plainly blocked. When in
doubt, continue. You are not verifying correctness — a later read-only judge does that.
Reply with ONLY a JSON object, no other text: {"done": boolean, "reason": "<one sentence>"}.
Set done=true only if the agent's last message shows the active goal's done_when is met, or
the agent says it is blocked and needs the human.`;

export function loopJudgeUser(p: { activeGoalDoneWhen: string; lastResponse: string }): string {
  return `\
Active goal done_when: ${p.activeGoalDoneWhen}

Agent's last message:
"""
${p.lastResponse}
"""

{"done": ?, "reason": ?}`;
}

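The loop judge above is asked for strict JSON and is described as fail-open. A minimal sketch of how a caller might read that reply follows. It assumes that "fail-open" means "keep going when the reply cannot be understood"; `parseLoopJudgeReply` and `LoopVerdict` are hypothetical names that do not appear in the commit.

```typescript
// Hypothetical helper, not part of the commit: read the loop judge's JSON reply.
type LoopVerdict = { done: boolean; reason: string };

function parseLoopJudgeReply(raw: string): LoopVerdict {
  // Assumption: fail-open means "continue" (done=false) on any parse problem.
  const failOpen = (why: string): LoopVerdict => ({ done: false, reason: why });

  // Tolerate stray text around the object by slicing the outermost braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return failOpen("no JSON object in reply");

  try {
    const value: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof value !== "object" || value === null) return failOpen("reply was not an object");
    const obj = value as Record<string, unknown>;
    // Only a real boolean counts; "true" as a string is treated as malformed.
    if (typeof obj.done !== "boolean") return failOpen("missing boolean 'done'");
    return { done: obj.done, reason: typeof obj.reason === "string" ? obj.reason : "" };
  } catch {
    return failOpen("unparseable JSON");
  }
}
```

Treating every malformed reply as "continue" keeps the judge from pausing the loop by accident; the iteration budget mentioned in the comment above is still needed to bound a loop that never reports done.
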
/* ─────────────────────────────────────────────────────────────────────────
 * 6. evidenceJudge — SIGN-OFF, the one rigorous check
 *
 * Runs inside CompleteGoal, on the read-only oracle subprocess (fresh context,
 * strongest reasoning on the chosen provider; override to a different vendor for
 * high-stakes goals). It re-derives from the repo rather than trusting the
 * agent's transcription, and it judges whether a verify command actually tests
 * the criterion or could pass while a named failure mode holds (gaming).
 *
 * The transport gives it read/grep/find/ls. The prompt below imposes the verdict
 * contract — the oracle returns prose by default, so parse the VERDICT line.
 * ──────────────────────────────────────────────────────────────────────── */
export const evidenceJudgeSystem = `\
You are a read-only reviewer signing off a coding goal. Do not trust claims — verify.
Use read/grep/find/ls to inspect the repository and the cited artifacts yourself. Re-read the
files, logs, and diffs the evidence points to; if something it asserts isn't on disk, you can't
confirm it. If a verify command was run, judge whether it genuinely tests the criterion or
could pass while one of the listed failure modes still holds — a tautological or skipped test
is a reject. Check each failure mode is actually ruled out, not just unmentioned.

Finish with exactly these two lines and nothing after:
VERDICT: accept | reject
missing: <empty if accept; otherwise a short list of what's needed before this can be accepted>`;

export function evidenceJudgeUser(p: {
  subject: string;
  done_when: string;
  verify: string | null;
  verifyResult: { command: string; exitCode: number; outputTail: string } | null;
  failure_modes: string[];
  evidence: string;
  paths: string[];
}): string {
  const verifyBlock = p.verify
    ? `verify command: ${p.verify}\nverify result: exit ${p.verifyResult?.exitCode ?? "n/a"}\n${p.verifyResult?.outputTail ?? ""}`
    : "verify command: none (no deterministic check for this goal)";
  return `\
Goal: ${p.subject}
done_when: ${p.done_when}
failure_modes:
${p.failure_modes.map((f) => ` - ${f}`).join("\n")}

${verifyBlock}

Agent's evidence:
${p.evidence}

Artifacts it points to (inspect these):
${p.paths.map((x) => ` - ${x}`).join("\n") || " (none listed — note this)"}

Verify the goal against its done_when. Then give your VERDICT.`;
}
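The evidenceJudge comment says the oracle returns prose, so the caller has to pick out the VERDICT line. The sketch below shows one conservative way to do that. `parseVerdict` is a hypothetical name, and the real extraction in src/index.ts may differ. The `accepted` / `rejected` shapes mirror the outcome objects that the recordSignOff tests pass in. Anything other than a clean `accept` is treated as a reject, which is an assumption rather than documented behaviour.

```typescript
// Hypothetical helper, not part of the commit: extract the oracle's verdict from prose.
type Verdict = { kind: "accepted" } | { kind: "rejected"; missing: string };

function parseVerdict(oracleText: string): Verdict {
  const lines = oracleText.trimEnd().split("\n");

  // Search from the end: the prompt puts VERDICT on the second-to-last line,
  // and earlier prose may quote the word "VERDICT" while reasoning.
  let idx = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (/^\s*VERDICT:/i.test(lines[i])) {
      idx = i;
      break;
    }
  }
  if (idx === -1) return { kind: "rejected", missing: "no VERDICT line in oracle output" };

  const verdict = lines[idx].replace(/^\s*VERDICT:\s*/i, "").trim().toLowerCase();
  if (verdict === "accept") return { kind: "accepted" };

  // Assumption: anything that is not exactly "accept" counts as a reject.
  const missingLine = lines.slice(idx + 1).find((l) => /^\s*missing:/i.test(l)) ?? "";
  const missing = missingLine.replace(/^\s*missing:\s*/i, "").trim();
  return { kind: "rejected", missing: missing || "oracle rejected without stating what is missing" };
}
```

Defaulting to reject when the contract is not followed matches the sign-off's role as the one rigorous check: a malformed reply should never flip a goal to done.
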
@@ -0,0 +1,171 @@
import { describe, expect, it } from "vitest";
import { appendLog, counts, findGoal, parse, recordSignOff, setGoalStatus } from "../src/plan-file.js";

const SAMPLE = `# Plan: ship the cache layer

## Goal: Implement cache layer
<!-- id: cache-layer-1 -->
status: active
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q
failure_modes:
  - cache silently bypassed (hit-rate ~0, latency ok by luck)
  - bench too small to exercise eviction
- [x] wire cache client
- [ ] eviction policy
- [ ] load test

## Goal: Document the API
<!-- id: document-the-api-1 -->
status: open
done_when: every public fn has a docstring; else sphinx warns
failure_modes:
  - docstrings exist but are stale

## Log
- 2026-06-15 14:02 cache client wired; eviction next
`;

/** Multiset line diff: lines b adds vs removes vs a (order-insensitive, so insertions score added:1). */
function lineDelta(a: string, b: string): { added: number; removed: number } {
  const count = (s: string) => {
    const m = new Map<string, number>();
    for (const l of s.split("\n")) m.set(l, (m.get(l) ?? 0) + 1);
    return m;
  };
  const ma = count(a);
  const mb = count(b);
  let added = 0;
  let removed = 0;
  for (const k of new Set([...ma.keys(), ...mb.keys()])) {
    const d = (mb.get(k) ?? 0) - (ma.get(k) ?? 0);
    if (d > 0) added += d;
    else if (d < 0) removed += -d;
  }
  return { added, removed };
}

describe("parse", () => {
  const doc = parse(SAMPLE);

  it("reads the objective and both goals", () => {
    expect(doc.objective).toBe("ship the cache layer");
    expect(doc.goals.map((g) => g.id)).toEqual(["cache-layer-1", "document-the-api-1"]);
  });

  it("reads goal fields", () => {
    const g = findGoal(doc, "cache-layer-1");
    expect(g?.subject).toBe("Implement cache layer");
    expect(g?.status).toBe("active");
    expect(g?.done_when).toBe("p95 < 50ms on bench-X. If wrong: timeouts in load-test.log");
    expect(g?.verify).toBe("pytest tests/cache -q");
  });

  it("separates failure_modes from subtasks", () => {
    const g = findGoal(doc, "cache-layer-1");
    expect(g?.failure_modes).toHaveLength(2);
    expect(g?.failure_modes[0]).toContain("cache silently bypassed");
    expect(g?.subtasks).toEqual([
      { text: "wire cache client", done: true },
      { text: "eviction policy", done: false },
      { text: "load test", done: false },
    ]);
  });

  it("reads the log verbatim and counts by status", () => {
    expect(doc.log).toEqual(["- 2026-06-15 14:02 cache client wired; eviction next"]);
    expect(counts(doc)).toEqual({ done: 0, open: 1, active: 1 });
  });
});

describe("failure_modes vs subtask disambiguation", () => {
  it("a column-0 checkbox right after failure_modes: is a SUBTASK", () => {
    const doc = parse(
      `# Plan: x\n\n## Goal: G\n<!-- id: g-1 -->\nstatus: open\ndone_when: z\nfailure_modes:\n- [ ] first subtask\n- [x] second subtask\n`,
    );
    const g = findGoal(doc, "g-1");
    expect(g?.failure_modes).toEqual([]);
    expect(g?.subtasks).toEqual([
      { text: "first subtask", done: false },
      { text: "second subtask", done: true },
    ]);
  });

  it("an indented checkbox-shaped item inside failure_modes is a FAILURE MODE", () => {
    const doc = parse(
      `# Plan: x\n\n## Goal: G\n<!-- id: g-2 -->\nstatus: open\ndone_when: z\nfailure_modes:\n - [ ] prose that looks like a checkbox\n- [ ] real subtask\n`,
    );
    const g = findGoal(doc, "g-2");
    expect(g?.failure_modes).toEqual(["[ ] prose that looks like a checkbox"]);
    expect(g?.subtasks).toEqual([{ text: "real subtask", done: false }]);
  });

  it("a goal with no failure_modes keeps its subtasks", () => {
    const doc = parse(`# Plan: x\n\n## Goal: G\n<!-- id: g-3 -->\nstatus: open\ndone_when: z\n- [ ] only subtask\n`);
    const g = findGoal(doc, "g-3");
    expect(g?.failure_modes).toEqual([]);
    expect(g?.subtasks).toEqual([{ text: "only subtask", done: false }]);
  });
});

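The three disambiguation tests above pin down a rule: a checkbox at column 0 is always a subtask, while an indented bullet under `failure_modes:` is a failure mode even when it looks like a checkbox. The sketch below is one way to express that rule over the body lines of a single goal. `splitGoalBody` is a hypothetical name used for illustration only; the actual parser in src/plan-file.ts is not shown here and may work differently.

```typescript
// Hypothetical classifier, not the commit's parser: split one goal's body lines.
type Subtask = { text: string; done: boolean };

function splitGoalBody(lines: string[]): { failure_modes: string[]; subtasks: Subtask[] } {
  const failure_modes: string[] = [];
  const subtasks: Subtask[] = [];
  let inFailureModes = false;

  for (const line of lines) {
    if (/^failure_modes:\s*$/.test(line)) {
      inFailureModes = true;
      continue;
    }
    // A checkbox at column 0 is a subtask, even directly after "failure_modes:".
    const checkbox = /^- \[( |x)\] (.*)$/.exec(line);
    if (checkbox) {
      inFailureModes = false;
      subtasks.push({ text: checkbox[2], done: checkbox[1] === "x" });
      continue;
    }
    // An indented bullet inside the failure_modes list is a failure mode,
    // kept verbatim even when its text starts with "[ ]".
    const indented = /^\s+- (.*)$/.exec(line);
    if (inFailureModes && indented) {
      failure_modes.push(indented[1]);
      continue;
    }
    // Any other non-blank line (status:, verify:, a heading) closes the list.
    if (line.trim() !== "") inFailureModes = false;
  }
  return { failure_modes, subtasks };
}
```

Keying on indentation rather than on the `[ ]` shape lets a failure mode contain checkbox-like prose without being miscounted as work to do.
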
describe("the two CompleteGoal writes (minimal diff)", () => {
  it("setGoalStatus replaces exactly one line, scoped to the right goal", () => {
    const next = setGoalStatus(SAMPLE, "cache-layer-1", "done");
    expect(lineDelta(SAMPLE, next)).toEqual({ added: 1, removed: 1 });
    expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("done");
    expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("open"); // untouched
  });

  it("setGoalStatus targets the second goal without touching the first", () => {
    const next = setGoalStatus(SAMPLE, "document-the-api-1", "active");
    expect(findGoal(parse(next), "cache-layer-1")?.status).toBe("active");
    expect(findGoal(parse(next), "document-the-api-1")?.status).toBe("active");
  });

  it("appendLog adds exactly one line under ## Log", () => {
    const next = appendLog(SAMPLE, "2026-06-15 15:00 eviction done");
    expect(lineDelta(SAMPLE, next)).toEqual({ added: 1, removed: 0 });
    expect(parse(next).log).toEqual([
      "- 2026-06-15 14:02 cache client wired; eviction next",
      "- 2026-06-15 15:00 eviction done",
    ]);
  });

  it("appendLog creates the section when absent", () => {
    const noLog = "# Plan: x\n\n## Goal: y\n<!-- id: y-1 -->\nstatus: open\ndone_when: z\n";
    expect(parse(appendLog(noLog, "first entry")).log).toEqual(["- first entry"]);
  });
});

describe("recordSignOff (CompleteGoal's pure record logic)", () => {
  const WHEN = "2026-06-15 16:00";

  it("accept flips status:done and logs a sign-off line", () => {
    const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "accepted" });
    expect(r.isError).toBe(false);
    const doc = parse(r.content);
    expect(findGoal(doc, "cache-layer-1")?.status).toBe("done");
    expect(doc.log.at(-1)).toBe(`- ${WHEN} signed off #cache-layer-1: Implement cache layer (oracle accept)`);
  });

  it("verify_failed only logs a reject line, status stays active", () => {
    const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "verify_failed", exitCode: 1, outputTail: "boom" });
    expect(r.isError).toBe(true);
    const doc = parse(r.content);
    expect(findGoal(doc, "cache-layer-1")?.status).toBe("active"); // NOT marked done
    expect(doc.log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: verify exit 1`);
  });

  it("rejected logs the (one-lined) missing reason, status stays", () => {
    const r = recordSignOff(SAMPLE, "cache-layer-1", WHEN, { kind: "rejected", missing: "no\nsaved\nbench log" });
    expect(r.isError).toBe(true);
    expect(findGoal(parse(r.content), "cache-layer-1")?.status).toBe("active");
    expect(parse(r.content).log.at(-1)).toBe(`- ${WHEN} reject #cache-layer-1: no saved bench log`);
  });

  it("unknown goal returns an error and does not touch the file", () => {
    const r = recordSignOff(SAMPLE, "nope-1", WHEN, { kind: "accepted" });
    expect(r.isError).toBe(true);
    expect(r.content).toBe(SAMPLE);
  });
});
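The recordSignOff tests above fix the exact log line for each outcome, including the collapse of a multi-line `missing` reason onto one line. The following sketch reproduces only that formatting step. `signOffLogLine` and `Outcome` are hypothetical names; the real recordSignOff also updates the goal status and may build the line differently (for example, by leaving the leading `- ` to appendLog).

```typescript
// Hypothetical formatter, not part of the commit: the log line for one sign-off outcome.
type Outcome =
  | { kind: "accepted" }
  | { kind: "verify_failed"; exitCode: number; outputTail: string }
  | { kind: "rejected"; missing: string };

function signOffLogLine(when: string, id: string, subject: string, outcome: Outcome): string {
  if (outcome.kind === "accepted") {
    return `- ${when} signed off #${id}: ${subject} (oracle accept)`;
  }
  if (outcome.kind === "verify_failed") {
    // Only the exit code is logged; the output tail stays out of plan.md.
    return `- ${when} reject #${id}: verify exit ${outcome.exitCode}`;
  }
  // Collapse newlines and runs of whitespace so the entry stays on one log line.
  const oneLine = outcome.missing.replace(/\s+/g, " ").trim();
  return `- ${when} reject #${id}: ${oneLine}`;
}
```

Keeping every entry on one line preserves the append-only log that the tests check with `doc.log.at(-1)`.
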
@@ -0,0 +1,15 @@
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "ES2022",
    "moduleResolution": "bundler",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "jsx": "react-jsx",
    "outDir": "dist",
    "rootDir": "src",
    "declaration": true
  },
  "include": ["src/**/*.ts", "src/**/*.tsx"]
}