mirror of
https://github.com/wassname/pi-plan.git
synced 2026-06-27 17:31:19 +08:00
pi-plan: plan-mode goals + evidence in one plan.md, subagent sign-off

Small, guide-not-gate plan/goal tracker for pi. The agent edits plan.md with its normal Edit tool; CompleteGoal is the one blessed path that runs verify + a read-only judge and records the result. Plan mode drafts goals (done_when + failure_modes + subtasks), a per-turn injection keeps the active goal alive through compaction, and a reminder drives upkeep + autonomy.

- src/plan-file.ts: pure parse + the two writes CompleteGoal needs + recordSignOff
- src/index.ts: plan mode, review menu, injection, reminder, widget, CompleteGoal, oracle spawn
- src/prompts.ts: all model-facing text in flow order
- test/: 15 unit tests (parser, disambiguation, sign-off record logic)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,272 @@
# pi-plan — design spec

Working title. A pi extension: set up goals (with subtasks and evidence) through plan mode, work them autonomously, and sign a goal off only when a check passes. One markdown file holds everything. The form guides a process; it does not police one. Successor to `pi-lgtm`, deliberately smaller.

Status: draft for review. Names, defaults, field shapes provisional.

---

## 1. Original ask → this spec

| Ask | Mechanism |
|-----|-----------|
| Set up goals + subtasks + evidence via **plan mode** | §3a — plan mode drafts the goal contract, you approve it |
| **Subagent check** of evidence on sign-off | §5, §9 — oracle inside `CompleteGoal` |
| Goals shown in a **task-list widget** | §7 — `/plan` renders goals + subtask checkboxes |
| Store **all in `plan.md`** | §4 — single file, no sidecar store |
| A **small manus-style append log** | §4 — short `## Log` section inside `plan.md` |
| **Typed reminders** to update tasks | §8a — recurring nudge |
| **Work autonomously** toward goals | §3b, §8a — the loop, driven by the reminder |
| Persist through **compaction**; pi-tasks but simpler | §8 injection; minimal tool surface |

---

## 2. Decisions and preferences

This section separates the opinionated forks from the mechanical body (§4 onward).

### 2a. Preferences driving the design

- **Guidance over guardrails.** None of the surveyed extensions hard-enforce. The form (plan.md structure) + the reminder + the prompts guide the agent through a process; the one genuinely rigorous step is the sign-off check; git + widget visibility is the backstop. The agent can edit anything — we make the right path the easy path, not the only path.
- **Anti-complexity.** One file, minimal tools, plain-file editing for anything with no cheat incentive.
- **Reward-hacking / honesty focus.** The sign-off check must resist assertion and test-gaming, not just check a box.
- **Cost-sensitivity (single 3090 / metered API).** KV-cache hygiene, judge-once-per-goal, cheap loop judge.
- **Scout mindset.** Make false completion visible rather than paper over it.

### 2b. Decisions

In the Status column, `decided` = settled; `open` = your call.

| # | Decision | Alternative rejected | Why | Status |
|---|----------|----------------------|-----|--------|
| D1 | **Everything in one `plan.md`** | Separate `.plan/log.jsonl` sidecar | Asked for; simpler, one diff to read | decided |
| D2 | Plan mode **is** the goal-setup-and-agreement phase | Agent-only creation | Approval is where `done_when` + `failure_modes` get agreed before any code | decided |
| D3 | **Guide the process; don't gate it.** The only special path is `CompleteGoal` (the sign-off check) | Pre-tool-use interceptor that blocks `status: done` edits | No surveyed extension enforces at that level; the reminder + form carry it; bypass is visible in git | decided |
| D4 | **Two-stage sign-off check**: deterministic `verify:` then oracle | Oracle only; tests only (Codex) | Tests unfakeable-by-assertion but gameable; oracle catches gaming + non-test criteria | decided |
| D5 | Two **separate** judges: cheap loop + oracle sign-off | One judge for both | Loop judge reads assertions (foolable, ok); sign-off judge reads artifacts | decided |
| D6 | Sign-off judge = oracle subprocess, **copied not depended** | In-process; pi-subagents | Shell-free spawn dodges noclobber/cropping; copying avoids flaky coupling | decided |
| D7 | Contract tamper-check = **git visibility** | Append-only frozen log | All-in-one-file gives up the hard freeze; git diff + guided sign-off are enough for a single user | decided |
| D8 | Completed goals **archived, not deleted** | Auto-clear after idle | A plan is a durable record | decided |
| D9 | **Goals are flexible: multiple may be `active`** | One active goal forced | Operator wants flexibility; the agent picks focus, injection lists the active set | decided |
| D10 | Loop judge default = main model, tiny prompt | Dedicated cheap aux model | Zero setup; switch if cost bites | open |
| D11 | Sign-off judge default = **the session's current model** | Auto-pick "strongest on provider" (oracle-style) | Current model is guaranteed authorized + capable; provider lists hold dead/weak/unauthorized entries. Cross-vendor is a **setting** (§9) | decided |
| D12 | **Plan-phase model is selectable and sticky** | Always the working model | Plan benefits from a stronger reasoner; persist the choice (oracle.json-style). Optionally the oracle drafts the plan (read-only + strong already) | decided |
| D13 | **Offer to compact after plan accepted** | Always fresh session (burneikis); or never | Some runs want a clean execution context, some want to keep it. Make it a post-Ready choice | decided |

### 2c. Cuts (non-goals)

DAG / `blocks` edges. Parallel subagent execution (the flaky part). `findings.md`. Hard pre-tool-use enforcement (D3). Sign-off judge every turn (cost).

---

## 3. Two phases: setup, then execution

### 3a. Setup — plan mode

Goals are created and *agreed* through plan mode (burneikis-style). Stock plan mode; the deltas are the output format and the hand-off.

1. `/plan <objective>` enters plan mode. The agent explores read-only and drafts goals into `plan.md` in the contract format (§4). This phase runs on the **plan-phase model** (selectable + sticky, D12; optionally the read-only oracle drafts it).
2. You review: **Ready** / **Edit** (NL rewrite) / **$EDITOR** (hand-edit) / **Cancel**. The agreement point — you sanity-check `done_when` and `failure_modes` before any code.
3. On **Ready**, offer **compact context? (y/n)** (D13). Yes → execution starts in a cleared context with the approved `plan.md` re-injected. No → execution continues in the same context.

Direct `plan.md` edits remain a quick-add path for a one-off goal.

### 3b. Execution — the loop ↔ check cycle

Multiple goals may be `active`; the agent works whichever it's focused on, in the order it judges best.

1. The session works an `active` goal under an iteration budget (or `/goal` (re)starts the loop on the current plan).
2. Each turn, the **loop judge** reads the agent's last response → continue/pause (fail-open; the **budget is the real backstop**).
3. When the agent judges a goal done, the reminder steers it to call `CompleteGoal` (not hand-tick `status`).
4. `CompleteGoal` runs the **two-stage check**:
   - **reject** → `missing[]` fed back; work continues toward the gap.
   - **accept** → goal marked done; the agent moves to another active/open goal, or the loop stops.
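
The cycle above can be sketched as a small driver. This is an illustration under stated assumptions, not the extension's code: `runGoalLoop`, `LoopDeps`, `workOneTurn` and `claimsDone` are hypothetical names, and the agent turn, the loop judge and `CompleteGoal` are passed in as plain functions so the control flow can be read on its own.

```typescript
// Hypothetical shapes for illustration; none of these names come from the pi API.
type LoopDecision = "continue" | "pause";
type Verdict = { accept: true } | { accept: false; missing: string[] };

interface LoopDeps {
  // One agent turn; receives the gaps from the last rejected sign-off, if any.
  workOneTurn: (feedback: string[]) => { response: string; claimsDone: boolean };
  // Cheap judge over the last response; it may throw (timeout, bad output).
  loopJudge: (lastResponse: string) => LoopDecision;
  // The two-stage sign-off check.
  completeGoal: () => Verdict;
}

type LoopOutcome =
  | { kind: "signed-off"; turns: number }
  | { kind: "paused"; turns: number }
  | { kind: "budget-exhausted"; turns: number };

function runGoalLoop(deps: LoopDeps, budget: number): LoopOutcome {
  let feedback: string[] = [];
  for (let turn = 1; turn <= budget; turn++) {
    const { response, claimsDone } = deps.workOneTurn(feedback);
    feedback = [];

    if (claimsDone) {
      const verdict = deps.completeGoal();
      if (verdict.accept) return { kind: "signed-off", turns: turn };
      feedback = verdict.missing; // reject: work continues toward the gap
      continue;
    }

    let decision: LoopDecision;
    try {
      decision = deps.loopJudge(response);
    } catch {
      decision = "continue"; // fail-open: a broken judge never stops the loop
    }
    if (decision === "pause") return { kind: "paused", turns: turn };
  }
  // The budget, not the judge, is what finally bounds the loop.
  return { kind: "budget-exhausted", turns: budget };
}
```

In this sketch a rejected sign-off costs one turn and feeds `missing` into the next one, while a loop judge that throws is treated as "continue", so only the budget or an accepted sign-off ends a run with a misbehaving judge.
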

The loop judge can be fooled (reads assertions); worst case is a premature pause, caught by you or the budget. The sign-off check re-derives from artifacts, so it is not fooled cheaply. That asymmetry is the point.

---

## 4. The one file: `plan.md`

cwd root, git-tracked. Goals, subtasks, and a short log. The agent maintains all of it through its normal Edit tool — no separate store machinery.

```markdown
# Plan: <one-line objective>

## Goal: Implement cache layer
<!-- id: cache-layer-1 -->
status: active
done_when: p95 < 50ms on bench-X. If wrong: timeouts in load-test.log
verify: pytest tests/cache -q && python bench/p95.py --max-ms 50
failure_modes:
- cache silently bypassed (hit-rate ~0, latency ok by luck)
- bench too small to exercise eviction
- verify passes on a trivial/gamed test
- [x] wire cache client
- [ ] eviction policy
- [ ] load test

## Goal: ...

## Log
- 2026-06-15 14:02 cache client wired; eviction next
- 2026-06-15 14:31 eviction done; p95 bench reads 47ms (load-test.log)
- 2026-06-15 14:33 cache-layer-1 signed off (verify green, oracle accept)
```

Conventions:

- **Goals carry `status:` and no checkbox; subtasks are `- [ ]`.** `status` ∈ `open | active | done | cancelled`. Multiple goals may be `active` (D9). Subtasks tick freely.
- **`<!-- id -->`** assigned at creation; stable key (survives renaming the subject).
- **`verify:`** (optional) is the deterministic stage-1 command.
- **`failure_modes`** should name "verify could pass while still wrong" whenever a `verify:` exists.
- **`## Log`** is manus-style: append-only **by convention**, one short line per event. The reminder (§8a) nudges the agent to append; nothing blocks a rewrite. Terse — "where it's up to" + error memory, not a transcript.

Parsing: a line scanner suffices for v0. `mdast` + `remark-gfm` only if it bites. Parse for *reading*; for the rare programmatic write (status flip, checkbox reconcile) use exact-line string patching, never a full AST serialize.
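
A line scanner over the format above might look like the following. It is a minimal sketch, not the contents of `src/plan-file.ts`: `parsePlan`, `patchStatus`, `capture` and the `Goal`/`Plan` shapes are hypothetical names, and it assumes each goal's `<!-- id -->` comment sits above its `status:` line, as in the sample. Lines are trimmed before matching, so indented `failure_modes` items parse the same as flush-left ones. The write path swaps exactly one line and leaves every other line as it was.

```typescript
// Sketch only: field names mirror the sample plan.md; the types are hypothetical.
type Status = "open" | "active" | "done" | "cancelled";

interface Goal {
  title: string;
  id?: string;
  status?: Status;
  doneWhen?: string;
  verify?: string;
  failureModes: string[];
  subtasks: { done: boolean; text: string }[];
}

interface Plan {
  goals: Goal[];
  log: string[];
}

// First capture group of `re` on `line`, or undefined when it does not match.
function capture(re: RegExp, line: string): string | undefined {
  return re.exec(line)?.[1];
}

// Read path: one pass over the lines, no markdown AST.
function parsePlan(text: string): Plan {
  const plan: Plan = { goals: [], log: [] };
  let goal: Goal | undefined;
  let inLog = false;
  let inFailureModes = false;

  for (const raw of text.split("\n")) {
    const line = raw.trim();
    const title = capture(/^## Goal:\s*(.*)$/, line);
    if (title !== undefined) {
      goal = { title, failureModes: [], subtasks: [] };
      plan.goals.push(goal);
      inLog = false;
      inFailureModes = false;
      continue;
    }
    if (line === "## Log") {
      goal = undefined;
      inLog = true;
      continue;
    }
    if (inLog) {
      if (line.startsWith("- ")) plan.log.push(line.slice(2));
      continue;
    }
    if (goal === undefined) continue; // preamble such as "# Plan: ..."

    const box = /^- \[( |x)\] (.*)$/.exec(line);
    const status = capture(/^status:\s*(open|active|done|cancelled)$/, line);
    if (box) {
      goal.subtasks.push({ done: box[1] === "x", text: box[2] ?? "" });
      inFailureModes = false;
    } else if (inFailureModes && line.startsWith("- ")) {
      goal.failureModes.push(line.slice(2));
    } else if (line === "failure_modes:") {
      inFailureModes = true;
    } else if (status !== undefined) {
      goal.status = status as Status;
    } else {
      goal.id = capture(/^<!--\s*id:\s*(\S+)\s*-->$/, line) ?? goal.id;
      goal.doneWhen = capture(/^done_when:\s*(.*)$/, line) ?? goal.doneWhen;
      goal.verify = capture(/^verify:\s*(.*)$/, line) ?? goal.verify;
    }
  }
  return plan;
}

// Write path: replace the one status line that belongs to `id`, nothing else.
function patchStatus(text: string, id: string, next: Status): string {
  const lines = text.split("\n");
  const start = lines.findIndex((l) => l.trim() === `<!-- id: ${id} -->`);
  if (start < 0) throw new Error(`no goal with id ${id}`);
  for (let i = start + 1; i < lines.length; i++) {
    const cur = lines[i] ?? "";
    if (cur.startsWith("## ")) break; // reached the next section
    if (/^status:\s*\S+\s*$/.test(cur)) {
      lines[i] = `status: ${next}`;
      return lines.join("\n");
    }
  }
  throw new Error(`goal ${id} has no status line`);
}
```
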

---

## 5. Tools

`CompleteGoal` is the one blessed path (it runs the check and records it). Everything else — create goal, edit plan, tick subtasks, append to log — is plain Edit, guided by the reminder.

### `CompleteGoal(id, evidence, paths[])` — the sign-off check

1. Read `done_when` + `verify` + `failure_modes` for the goal from `plan.md` (git diff is the tamper-check, D7).
2. **Evidence must point to durable artifacts** the read-only judge can inspect (saved logs, committed diffs, files). Ephemeral claims fail stage 2.
3. **Stage 1 — deterministic.** If `verify` exists, run it shell-free, capture exit + output tail. Non-zero → reject immediately, return the tail. No model call spent.
4. **Stage 2 — oracle.** Spawn the read-only judge (D11 default = current model; §9) with the criterion, failure modes, evidence, and verify result; it inspects the repo and checks the verify command was not gamed against the named failure modes.
5. Verdict: **accept** → string-patch `status: done`, append a `## Log` line. **reject** → status stays `active`, append `missing[]` to `## Log`, return `missing`.
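
The five steps can be sketched as one function with the two expensive parts injected. Everything named here (`Contract`, `CheckDeps`, `runVerify`, `askOracle`, `SignOff`) is hypothetical and only illustrates the ordering: stage 1 short-circuits before any model call, and stage 2 runs only on a green `verify` or when no `verify` exists. How a `verify:` line that chains commands with `&&` is executed without a shell is not specified here, so the sketch takes `runVerify` as a callback and does not guess.

```typescript
// Hypothetical shapes for illustration; not the extension's actual types.
interface Contract {
  id: string;
  doneWhen: string;
  verify: string | undefined;
  failureModes: string[];
}

interface VerifyResult {
  exitCode: number;
  tail: string; // last lines of captured output
}

interface OracleInput {
  contract: Contract;
  evidence: string;
  paths: string[];
  verify: VerifyResult | undefined;
}

interface CheckDeps {
  runVerify: (command: string) => VerifyResult; // assumed to run without a shell
  askOracle: (input: OracleInput) => { accept: boolean; missing: string[] };
}

type SignOff =
  | { verdict: "accept"; logLine: string }
  | { verdict: "reject"; stage: 1 | 2; missing: string[]; logLine: string };

function completeGoal(
  contract: Contract,
  evidence: string,
  paths: string[],
  deps: CheckDeps,
): SignOff {
  // Stage 1: deterministic. A red verify rejects before any model call is spent.
  let verify: VerifyResult | undefined;
  if (contract.verify !== undefined) {
    verify = deps.runVerify(contract.verify);
    if (verify.exitCode !== 0) {
      return {
        verdict: "reject",
        stage: 1,
        missing: [`verify failed: ${verify.tail}`],
        logLine: `${contract.id} rejected (verify red)`,
      };
    }
  }

  // Stage 2: the read-only judge sees the contract, the evidence and the verify result.
  const answer = deps.askOracle({ contract, evidence, paths, verify });
  if (!answer.accept) {
    return {
      verdict: "reject",
      stage: 2,
      missing: answer.missing,
      logLine: `${contract.id} rejected (oracle): ${answer.missing.join("; ")}`,
    };
  }
  const how = verify ? "verify green, oracle accept" : "oracle accept";
  return { verdict: "accept", logLine: `${contract.id} signed off (${how})` };
}
```

The caller would then apply step 5: on `accept`, string-patch `status: done` and append `logLine` to `## Log`; on `reject`, leave the status alone, append the line, and return `missing` to the agent.
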

### `CancelGoal(id, reason)` — optional

open/active → cancelled is not a sign-off, so it skips the check. It is a tool only so that a `## Log` line is guaranteed to land.

---

## 6. Guiding sign-off (no hard gate)

Per D3, there is no pre-tool-use interceptor blocking `status: done`. Sign-off is guided, not gated:

- the **reminder** (§8a) tells the agent to complete a goal through `CompleteGoal`, not by hand-editing status;
- `CompleteGoal` is the obvious, blessed path that runs the check and writes the log line;
- the **widget** (§7) can flag a goal whose `status: done` has no corresponding `## Log` sign-off line — visibility, not a block;
- `plan.md` is git-tracked, so any hand-tick shows in the diff.
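
The widget flag in the list above reduces to a small pure check. This sketch assumes a sign-off log line contains the words `<id> signed off`, which is how the sample log in §4 reads; the spec does not fix that wording, and `GoalRow`, `hasSignOffLine` and `unsignedDoneGoals` are hypothetical names.

```typescript
// Hypothetical row shape: just what the flag needs from a parsed goal.
interface GoalRow {
  id: string;
  status: "open" | "active" | "done" | "cancelled";
}

// True when some log line contains the words "<id> signed off" in sequence.
// Matching whole words avoids treating "x-cache-layer-1" as "cache-layer-1".
function hasSignOffLine(id: string, log: string[]): boolean {
  return log.some((line) => {
    const words = line.split(/\s+/);
    const at = words.indexOf(id);
    return at >= 0 && words[at + 1] === "signed" && words[at + 2] === "off";
  });
}

// Ids of goals marked done with no matching sign-off line: shown, never blocked.
function unsignedDoneGoals(goals: GoalRow[], log: string[]): string[] {
  return goals
    .filter((g) => g.status === "done" && !hasSignOffLine(g.id, log))
    .map((g) => g.id);
}
```
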

The agent *can* bypass it. The bet — borne out by how the other extensions actually run — is that a clear form plus a standing reminder makes the blessed path the path taken, and visibility catches the rare bypass.

---

## 7. Commands

- `/plan <desc>` — **enter plan mode** (§3a): read-only explore → draft goals → review. Ready offers the compact choice, then starts execution.
- `/plan` (no args) — render the **task-list widget**: each goal with status + its subtask checkboxes + "N done hidden"; flag any `done` goal lacking a sign-off log line; offer archive-completed and cancel-goal.
- `/goal` — (re)start the loop on the current plan.
- `/goal pause | resume | clear | status` — loop controls.
- `/subgoal <text>` — append an acceptance criterion to a goal mid-loop. Optional.
- `/judge model <ref>` — set the sign-off judge model (default: current model; set a cross-vendor ref here for stronger independence, §9).

---

## 8. Hooks / lifecycle

- **`before_agent_start`** — parse `plan.md`; inject a fixed-shape summary (active goals + focus + last log line) as a late **user-role** message. This is what carries the plan through compaction.
- **reminder** — §8a.
- **pre-compact** — flush state to `plan.md` before compaction.

(No pre-tool-use gate — D3.)

### 8a. The reminder (typed; what it says)

Fires when a goal is `active` and there have been **N file-modifying turns since the last `plan.md` update**. One `<system-reminder>` covering both task upkeep and goal progress:

- **task** — tick completed subtask checkboxes; add new ones discovered.
- **log** — append **one short line** to `## Log` (append, don't rewrite).
- **goal** — if a goal's evidence is in, **sign it off via `CompleteGoal`** — don't hand-tick `status: done`.
- **autonomy** — keep working toward an active goal; don't stop to ask unless genuinely blocked.
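
The trigger condition can be sketched as a counter over turns. This is an illustration with assumptions labeled: `makeReminder`, `TurnInfo` and `REMINDER_TEXT` are hypothetical, the reminder wording is a placeholder paraphrase of the four items above, paths are assumed to be reported relative to the cwd, and restarting the count after firing (so the nudge recurs every N turns, not every turn) is a guess the spec does not state.

```typescript
// Hypothetical turn record; the real one in pi may look different.
interface TurnInfo {
  modifiedPaths: string[]; // files written during the turn, relative to the cwd
}

// Placeholder wording. Kept as one constant so the injected text never varies.
const REMINDER_TEXT = [
  "<system-reminder>",
  "task: tick completed subtask checkboxes; add new ones discovered.",
  "log: append one short line to ## Log (append, don't rewrite).",
  "goal: if a goal's evidence is in, sign it off via CompleteGoal; don't hand-tick status: done.",
  "autonomy: keep working toward an active goal; don't stop to ask unless genuinely blocked.",
  "</system-reminder>",
].join("\n");

// Returns an observer to call once per finished turn. It yields the reminder
// text when the nudge should fire and undefined otherwise.
function makeReminder(threshold: number, planPath = "plan.md") {
  let turnsSincePlanUpdate = 0;
  return function observe(turn: TurnInfo, hasActiveGoal: boolean): string | undefined {
    if (turn.modifiedPaths.includes(planPath)) {
      turnsSincePlanUpdate = 0; // the plan was just updated; nothing to nag about
      return undefined;
    }
    if (turn.modifiedPaths.length > 0) turnsSincePlanUpdate += 1; // read-only turns do not count
    if (!hasActiveGoal || turnsSincePlanUpdate < threshold) return undefined;
    turnsSincePlanUpdate = 0; // assumption: restart the count so the nudge recurs every N turns
    return REMINDER_TEXT;
  };
}
```

With `makeReminder(3)`, three file-modifying turns without touching `plan.md` produce one reminder; a turn that edits `plan.md` resets the count.
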

The reminder is both the housekeeping and the autonomy engine and, with no hard gate, the main thing that gets the process followed. Keep the wording stable so it doesn't thrash the cache.

---

## 9. Judges

| | Loop judge | Sign-off judge (stage 2) |
|---|---|---|
| Drives | continue / pause each turn | accept / reject a sign-off |
| Cost | cheap, every turn | costly, once per goal |
| Reads | the agent's last response (~4 KB) | the repo, independently |
| Transport | one small model call (D10) | read-only oracle subprocess |
| On failure | fail-open → continue; **budget** is the backstop | fail-closed → goal stays active |
| Foolable? | yes — asserted "done" passes; bounded by budget | hard: re-reads artifacts + runs `verify` |

### Sign-off judge: model choice (D11)

- **Default: the session's current model.** Guaranteed authorized and capable, because you're already running it. Auto-picking "strongest on provider" (oracle-style) is rejected as the default — those lists carry dead, weak, and unauthorized entries.
- **Most of the value is model-independent.** The read-only judge re-derives from artifacts: does the evidence match the repo, is the `verify` tautological, is each failure mode actually ruled out. Any capable model does that regardless of family.
- **Cross-vendor is the stronger-independence setting** (`/judge model`), for the residual *shared-reasoning-error* class, when you have a known-good alternative. Mirror the oracle's curated provider list for that override menu; don't auto-select from it.

### Transport (oracle pattern, copied)

- **Shell-free spawn.** `spawn(command, argsArray)`, no `shell:true`; capture stdout via pipe and parse. This is what avoids the noclobber/cropping pain of `pi -p … > out.json` under zsh. ~40 lines.
- **Read-only toolset.** `read / grep / find / ls`, optional non-mutating `bash`. Separate process = fresh context, no anchoring — the independence you reliably get even from the same model.
- **Verdict contract.** Oracle returns prose by default; impose `VERDICT: accept|reject` + `missing:` in the prompt and parse that block.
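
Parsing the verdict block out of the judge's captured stdout could look like this. The sketch covers only the verdict contract, not the spawn helper. The exact layout is an assumption: it expects a `VERDICT:` line followed by an optional `missing:` line with either inline text or `- ` bullets beneath it, since the text above fixes only the two markers. `parseVerdict` and `ParsedVerdict` are hypothetical names. Output without a verdict is treated as a reject, matching the fail-closed row in the table.

```typescript
interface ParsedVerdict {
  accept: boolean;
  missing: string[];
}

function parseVerdict(stdout: string): ParsedVerdict {
  const lines = stdout.split("\n").map((l) => l.trim());

  // Take the LAST verdict line so a quoted example earlier in the prose cannot win.
  let at = -1;
  for (let i = 0; i < lines.length; i++) {
    if (/^VERDICT:\s*(accept|reject)$/i.test(lines[i] ?? "")) at = i;
  }
  if (at < 0) {
    // Fail closed: no parsable verdict means the goal stays active.
    return { accept: false, missing: ["no VERDICT line in judge output"] };
  }
  const accept = /accept$/i.test(lines[at] ?? "");

  // Collect the missing: block that follows the verdict line.
  const missing: string[] = [];
  let inMissing = false;
  for (const line of lines.slice(at + 1)) {
    if (/^missing:/i.test(line)) {
      inMissing = true;
      const inline = line.slice("missing:".length).trim();
      if (inline !== "") missing.push(inline);
    } else if (inMissing && line.startsWith("- ")) {
      missing.push(line.slice(2));
    } else if (line !== "") {
      inMissing = false; // any other text ends the block
    }
  }
  return { accept, missing };
}
```
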

---

## 10. `prompts.tsx`

All model-facing text in one file, in flow order (drafted separately):

1. **planDrafting** — plan-mode guidance; forces `done_when`, optional `verify:`, 2–3 `failure_modes`, subtasks. Human approves it.
2. **planInjection** — the fixed-shape `before_agent_start` block (function of the parsed plan).
3. **reminder** — the typed nudge (§8a).
4. **continuation** — Hermes-style "keep going" user-role message.
5. **loopJudge** — conservative, strict JSON `{done, reason}`.
6. **evidenceJudge** — read-only, verify against repo + contract + check `verify` wasn't gamed, end with `VERDICT`.
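
For prompt 5, the consuming side of the strict JSON `{done, reason}` reply could be sketched as below. Mapping `done: true` to "pause" is an assumption drawn from the continue/pause wording in §3b, and `parseLoopJudge` is a hypothetical name. Anything that is not exactly the expected shape falls through to "continue", which is the fail-open behaviour from §9.

```typescript
interface LoopJudgeResult {
  action: "continue" | "pause";
  reason: string;
}

function parseLoopJudge(raw: string): LoopJudgeResult {
  try {
    const obj = JSON.parse(raw) as { done?: unknown; reason?: unknown } | null;
    if (
      obj !== null &&
      typeof obj === "object" &&
      typeof obj.done === "boolean" &&
      typeof obj.reason === "string"
    ) {
      // Assumption: "done" means the judge wants the loop to pause for a look.
      return { action: obj.done ? "pause" : "continue", reason: obj.reason };
    }
  } catch {
    // Not JSON at all: fall through to the fail-open default below.
  }
  return { action: "continue", reason: "loop judge reply unparsable; failing open" };
}
```

The sign-off judge takes the opposite default: when its verdict cannot be parsed, the goal stays active.
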

Prompts 5 and 6 sit adjacent so that the cheap-foolable vs must-not-be-fooled contrast fits on one screen.

---

## 11. KV-cache hygiene

- Inject as a late **user-role** message, never a system-prompt mutation (a long goal then costs the same as the same number of normal turns).
- Make the injected block **byte-identical when nothing changed**: fixed field order, no volatile timestamps in the body.
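
One way to get a byte-identical block is to render it as a pure function of the parsed plan, with a fixed field order and no call to the clock. The sketch below is an assumption-heavy illustration: `renderInjection`, the `<plan-state>` wrapper and the field labels are hypothetical, and sorting goals by id is one possible way to pin the order (file order would serve equally well if the parser is deterministic). The last log line is copied from the file as-is, so its timestamp changes only when the log does.

```typescript
// Hypothetical input shape: the subset of the parsed plan the injection shows.
interface ActiveGoal {
  id: string;
  title: string;
  doneWhen: string;
  openSubtasks: string[];
}

interface InjectionInput {
  activeGoals: ActiveGoal[];
  focusId: string | undefined;
  lastLogLine: string | undefined;
}

// Pure function of its input: same plan state in, same bytes out.
function renderInjection(input: InjectionInput): string {
  // Copy before sorting so the caller's array is left untouched.
  const goals = [...input.activeGoals].sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0,
  );
  const out: string[] = ["<plan-state>"];
  for (const g of goals) {
    // Fixed field order per goal; empty values get a fixed placeholder.
    out.push(`goal: ${g.id} | ${g.title}`);
    out.push(`done_when: ${g.doneWhen}`);
    out.push(`open_subtasks: ${g.openSubtasks.join("; ") || "(none)"}`);
  }
  out.push(`focus: ${input.focusId ?? "(unset)"}`);
  out.push(`last_log: ${input.lastLogLine ?? "(empty)"}`);
  out.push("</plan-state>");
  return out.join("\n"); // no Date.now(), no counters, nothing volatile
}
```
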

---

## 12. Dependencies and what to copy

- **No hard dependency** on `pi-subagents` or the `oracle` extension. Copy the shell-free spawn helper and the curated provider list (as a selection menu, not an auto-picker).
- Markdown: line scanner first; `mdast` + `remark-gfm` only if needed.
- Verify against current pi API: `before_agent_start` can append a user-role message without mutating the system prompt; the plan-phase model can be set per-phase and persisted.

---

## 13. Risks / open questions

- **Same-model sign-off judge → correlated blind spots** (the D11 tradeoff). Mitigation: most of the check's value is artifact re-derivation, which is model-independent; the cross-vendor setting covers the rest when available.
- **No hard gate (D3)** — the agent can hand-tick `status: done` and skip the check. Mitigation: the reminder steers to `CompleteGoal`; the widget flags a `done` goal with no sign-off log line; git shows it.
- **Contract tampering (D7)** — editable `plan.md` means `done_when`/`failure_modes` can be softened pre-sign-off. Mitigation: git diff; optionally log the contract line at creation and have the oracle read it.
- **Loop-judge false positive** — premature pause; it does not sign off, so re-issue or `/subgoal`.
- **`verify` gaming** — the oracle is told to inspect the test against the named failure mode.
- **`## Log` rewritten not appended** — convention only; the reminder nudges, git shows violations.
- **Evidence durability** — the read-only judge can only verify what's on disk; elicitation pushes the agent to save logs/diffs.

---

## 14. Build order

Each step independently testable; model calls enter late.

1. `plan.md` format + line parser (incl. `<!-- id -->` and `## Log`) + `/plan` task-list widget. Pure file, no model calls.
2. Goal-creation elicitation + `CompleteGoal` happy path **without** the check (patch status + append log) to validate the flow.
3. Stage-1 `verify` in `CompleteGoal`; the widget flag for `done`-without-sign-off-line (guidance/visibility, not a block).
4. Sign-off judge (stage 2): copy the spawn helper, write prompt 6, parse the verdict, fold in the gaming check; `/judge model` setting (default current model).
5. `before_agent_start` injection (cache-safe) + the reminder (§8a).
6. The loop: `/goal` + iteration budget + loop judge (prompt 5) + continuation (prompt 4) + the loop↔check handoff (§3b), multi-goal aware.
7. Plan mode (§3a): `/plan <desc>` read-only draft → review → compact choice → hand-off. Plan-phase model selection + stickiness (D12). (Until built, create goals by direct `plan.md` edit.)
8. Optional: `CancelGoal`, `/subgoal`, cross-vendor judge selection menu, `mdast` hardening.

`prompts.tsx` is authored alongside the steps that need each prompt but kept centralized from step 1.