wassname/pi-lgtm

mirror of https://github.com/wassname/pi-lgtm.git synced 2026-06-27 15:31:29 +08:00

wassname 927a482d79 rename to pi-proof-tasks and simplify proof log

2026-06-14 11:59:47 +08:00

@wassname2/pi-proof-tasks

Original ask:

I would like a task list where

the top level tasks are goals and proof.

A create_goal that makes the agent think about "what are the most likely, subtle, failure modes and what cheap and easy to review evidence can distinguish between them and goal success".

A submit_proof form where a subagent provides an independent sanity check of the evidence before completing.

wassname

Hermes-style evidence + judge task list for Pi.

A pi extension that adds proof-gated top-level tasks to task tracking. Fork of @tintinweb/pi-tasks with an evidence/review layer inspired by /until-done.

The core idea: subtasks are normal checklist items, but top-level tasks are goals. Agents cannot mark top-level tasks complete directly. They must call TaskClaimDone with auditable evidence, UAT hints, and explicit failure-mode analysis. A fresh judge then accepts or rejects the claim. An accepted review completes the task; a rejected review leaves it open with suggestions.

Humans can use /lgtm to view the proof log and sanity-check the reviewer notes later. /lgtm is intentionally thin: proof viewing lives there, task management stays in /tasks.

Install

pi install npm:@wassname2/pi-proof-tasks

Or for development:

pi -e ./src/index.ts

What is different from pi-tasks

| pi-tasks | pi-proof-tasks |
| --- | --- |
| Agent calls `TaskUpdate { status: "completed" }` on any task | Allowed only for subtasks; top-level tasks reject direct completion |
| No evidence required | `TaskClaimDone` requires evidence, likely/subtle/unknown failures, falsification test, and uncertainty |
| Tasks complete immediately | Top-level tasks complete only after the automatic proof review accepts (reviewer infrastructure failure is fail-open) |
| No done criterion | `done_criterion` required on create: a falsifiable observation |

Stripped: TaskExecute, TaskOutput, TaskStop, process-tracker.ts, subagent RPC, settings menu.

Widget

● 3 tasks (1 done, 1 in progress, 1 open)
  ✔ #1 Design schema
  ✳ #2 Implementing cache layer… (2m 49s · ↑ 4.1k ↓ 1.2k)
  ◻ #3 Load test

Collapsed rows stay simple. Proof details live in TaskGet and /lgtm, not in the widget row itself.

Tools

`TaskCreate`

subject, description, done_criterion (required), progress_label (optional), parentId (optional)

Omit parentId for a proof-gated top-level goal. Set parentId for a directly tickable subtask.

done_criterion must be a falsifiable observation: what you expect to see AND what you would see if it is wrong. Example: "All 125 tests pass. If wrong: type errors in build or failures in task-store.test.ts."
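
As an illustration, the two kinds of TaskCreate call might look like the payloads below. The `TaskCreateInput` type, the `taskKind` helper, the string-typed IDs, and the example subjects are all hypothetical, written only to mirror the field list above; the extension's actual schema may differ.

```typescript
// Hypothetical input shape mirroring the TaskCreate field list above.
interface TaskCreateInput {
  subject: string;
  description: string;
  done_criterion: string; // required: a falsifiable observation
  progress_label?: string;
  parentId?: string; // assumption: task IDs are strings
}

// Omitting parentId creates a proof-gated top-level goal.
const goal: TaskCreateInput = {
  subject: "Implement cache layer",
  description: "Illustrative top-level goal.",
  done_criterion:
    "All 125 tests pass. If wrong: type errors in build or failures in task-store.test.ts.",
};

// Setting parentId creates a directly tickable subtask.
const subtask: TaskCreateInput = {
  subject: "Write eviction test",
  description: "Illustrative subtask of the goal above.",
  done_criterion:
    "The new test runs and passes. If wrong: the runner reports zero tests collected.",
  parentId: "1",
};

// Hypothetical helper: a task's kind follows from whether parentId is set.
function taskKind(input: TaskCreateInput): "goal" | "subtask" {
  return input.parentId === undefined ? "goal" : "subtask";
}
```

Both criteria state the expected observation and the observation that would indicate failure, which is what makes them checkable by a reviewer.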

`TaskList`

Lists all tasks in the same compact one-line style as the widget. Proof details live in TaskGet and /lgtm.

`TaskGet`

Full task details including done_criterion, task kind, completion mode, review state, gate status, evidence packet, review iterations, and evidence history.

`TaskUpdate`

Update status (pending | in_progress | completed | deleted), subject, description, done_criterion, dependencies, metadata, or parentId.

status=completed is allowed for subtasks only. Top-level tasks reject with a message telling the agent to use TaskClaimDone.
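
The rule above can be sketched as a small guard. This is a minimal illustration, not the extension's code: `TaskRef`, `UpdateResult`, `checkStatusUpdate`, and the wording of the rejection message are assumptions.

```typescript
type TaskStatus = "pending" | "in_progress" | "completed" | "deleted";

// Hypothetical minimal view of a task: top-level tasks have no parentId.
interface TaskRef {
  id: string;
  parentId?: string;
}

type UpdateResult = { ok: true } | { ok: false; message: string };

// Reject direct completion of top-level tasks; allow every other status change.
function checkStatusUpdate(task: TaskRef, next: TaskStatus): UpdateResult {
  const isTopLevel = task.parentId === undefined;
  if (next === "completed" && isTopLevel) {
    return {
      ok: false,
      message: `Task #${task.id} is a top-level goal; use TaskClaimDone to complete it.`,
    };
  }
  return { ok: true };
}
```

Returning a message instead of throwing lets the agent read the rejection and switch to TaskClaimDone in the same turn.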

`TaskClaimDone`

The epistemic gate for proof and UAT. Required fields:

| Field | Description |
| --- | --- |
| `taskId` | Top-level task to claim done |
| `evidence` | Exact command output, commit hash, config/seeds, file paths. Verbatim proof, not a summary |
| `failure_likely` | Most likely way this is wrong despite evidence |
| `failure_sneaky` | Subtle/silent failure that looks like success superficially |
| `failure_unknown` | Unknown or untested failure class that could remain |
| `falsification_test` | What you ran and what you got, with literal output |
| `evidence_reasoning` | Why this evidence cheaply distinguishes success from the named failures |
| `verification_hints` | Where to look and what to check, with specific content quoted |
| `remaining_uncertainty` | What is NOT tested, deferred edge cases, known limitations |
| `commands` | Optional structured command records: `{ cmd, exit_code, stdout_path?, stderr_path? }` |
| `evidence_paths` / `falsification_paths` | Optional local artifact paths. Stored as absolute path + sha256 + byte size |
| `supersede_reason` | Optional reason when this replaces older evidence on the same task |

The tool stores a compact canonical proof packet. The automatic reviewer sees that exact packet. Humans later see the same packet via /lgtm.

If the reviewer accepts, the task is completed. If it rejects, the task remains open with missing evidence and suggestions. If the reviewer fails to run, the task still completes and the failure note is stored in the proof log for later inspection.
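
The three outcomes can be summarised as a single mapping. The sketch below is illustrative only: the `ReviewOutcome` and `ClaimResult` shapes and the `resolveClaim` name are assumptions, not the extension's actual types.

```typescript
// Hypothetical outcome of one automatic review attempt.
type ReviewOutcome =
  | { kind: "accepted" }
  | { kind: "rejected"; missingEvidence: string[]; suggestions: string[] }
  | { kind: "reviewer_failed"; error: string };

// Hypothetical result: whether the task completes, plus the proof-log note.
interface ClaimResult {
  completed: boolean;
  logNote: string;
}

function resolveClaim(outcome: ReviewOutcome): ClaimResult {
  switch (outcome.kind) {
    case "accepted":
      return { completed: true, logNote: "review accepted" };
    case "rejected":
      // The task stays open; the agent gets the gaps and suggestions back.
      return {
        completed: false,
        logNote:
          `review rejected; missing: ${outcome.missingEvidence.join("; ")}` +
          `; suggestions: ${outcome.suggestions.join("; ")}`,
      };
    case "reviewer_failed":
      // Fail-open: the task still completes, and the failure is logged.
      return { completed: true, logNote: `reviewer failed (fail-open): ${outcome.error}` };
  }
}
```

Only an explicit rejection blocks completion; a reviewer that cannot run never does.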

`lgtm_supersede`

Explicitly retire the current evidence package without completing the task.

Use this when the claim changed or the prior evidence is stale. The tool archives the current evidence, current robot reviews, and reviewer-failure context into history with your reason. Submit a fresh TaskClaimDone claim to complete the task.

`robot_review_ask`

Attach a fresh-perspective robot review to a task.

Use this from a separate model/subagent when possible. Reviews append as iterations and are advisory. They do not complete tasks; the automatic gate runs through TaskClaimDone or robot_review_run.

`robot_review_run`

Run the automatic robot reviewer against the current task evidence using the current session model.

Default reviewer stage:

pi --mode json -p --no-session --no-tools --no-extensions --model <current-session-model>

The reviewer deliberately reuses the active session model in a fresh Pi process. That keeps model selection simple and avoids choosing a registry-listed judge model that exists but does not have working auth.

The reviewer returns an explicit accepted boolean plus observations, concerns, suggestions, blind spots, missing evidence, and rubric reasons. Rejection keeps the task open. Reviewer infrastructure failure is fail-open: autonomy continues and the failure note is stored in the proof log.
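
A rough sketch of the reviewer stage follows. The argument list is taken from the command above; everything else is an assumption. In particular, `parseVerdict` assumes the reviewer's final text is a single JSON object with an `accepted` field, which the output format of `pi --mode json` may not match, and both function names are hypothetical.

```typescript
// Arguments for the fresh reviewer process, as listed in the command above.
function reviewerArgs(currentSessionModel: string): string[] {
  return [
    "--mode", "json",
    "-p",
    "--no-session",
    "--no-tools",
    "--no-extensions",
    "--model", currentSessionModel,
  ];
}

type Verdict =
  | { kind: "verdict"; accepted: boolean }
  | { kind: "infrastructure_failure"; reason: string };

// Assumption: reviewerText is the reviewer's final message as one JSON object.
// Anything without an explicit boolean `accepted` counts as an infrastructure
// failure (handled fail-open), never as an implicit accept or reject.
function parseVerdict(reviewerText: string): Verdict {
  try {
    const parsed: unknown = JSON.parse(reviewerText);
    if (typeof parsed === "object" && parsed !== null && "accepted" in parsed) {
      const accepted = (parsed as { accepted: unknown }).accepted;
      if (typeof accepted === "boolean") {
        return { kind: "verdict", accepted };
      }
    }
    return { kind: "infrastructure_failure", reason: "no explicit accepted boolean" };
  } catch {
    return { kind: "infrastructure_failure", reason: "reviewer output was not valid JSON" };
  }
}
```

Requiring an explicit boolean keeps a malformed reply from being read as a rejection that would wrongly hold the task open.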

Commands

`/lgtm`

Proof-log viewer. Use /lgtm to pick a task, /lgtm <id> to open specific proof logs, and /lgtm * to open all open proof logs. It does not complete, delete, or clear tasks.

`/tasks`

Interactive task-management menu: view tasks, create task, delete a selected task, clear completed, or clear all.

Task lifecycle

Top-level task:
pending -> in_progress -> TaskClaimDone
                       -> current evidence iteration N 🛠
                       -> robot review iteration(s) 🤖
                       -> completed ✓          if latest robot review accepts
                       -> remains open         if reviewer rejects
                       -> completed            if reviewer infrastructure fails (fail-open, note logged)
                       -> lgtm_supersede or newer TaskClaimDone -> superseded history + fresh current evidence
                       -> deleted

Subtask:
pending -> in_progress -> TaskUpdate(status=completed) -> completed

Storage

Controlled by taskScope in .pi/tasks-config.json:

| Mode | File | Behaviour |
| --- | --- | --- |
| `memory` | none | In-memory, lost on session end |
| `session` (default) | `.pi/tasks/tasks-<sessionId>.json` | Per-session, survives resume |
| `project` | `.pi/tasks/tasks.json` | Shared across all sessions |

Override via env:

PI_TASKS=off          # in-memory (CI)
PI_TASKS=sprint-1     # named shared list at ~/.pi/tasks/sprint-1.json
PI_TASKS=/abs/path    # explicit path
PI_TASKS_DEBUG=1      # trace to stderr
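
One way to read the `PI_TASKS` values above is as a small resolver. This is a sketch under assumptions: `StoreTarget` and `resolvePiTasks` are hypothetical names, absolute paths are detected POSIX-style by a leading `/`, and an unset or empty value is assumed to fall back to `taskScope` from the config file.

```typescript
type StoreTarget =
  | { mode: "config" } // PI_TASKS unset: use taskScope from .pi/tasks-config.json
  | { mode: "memory" }
  | { mode: "file"; path: string };

// value is the raw PI_TASKS string; homeDir stands in for "~".
function resolvePiTasks(value: string | undefined, homeDir: string): StoreTarget {
  if (value === undefined || value === "") return { mode: "config" };
  if (value === "off") return { mode: "memory" };
  // Assumption: a leading "/" marks an explicit absolute path.
  if (value.startsWith("/")) return { mode: "file", path: value };
  // Anything else is treated as a named shared list under ~/.pi/tasks/.
  return { mode: "file", path: `${homeDir}/.pi/tasks/${value}.json` };
}
```

Passing the value and home directory in as arguments, rather than reading the environment inside the function, keeps the mapping easy to unit test.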

Architecture

src/
├── index.ts          # tools + /tasks + /lgtm evidence viewer + widget + event handlers
├── review-badges.ts # Review badge helpers for evidence/review/completion lanes
├── robot-review.ts  # Robot review iteration storage + compatibility helpers
├── types.ts         # Task, TaskStatus types
├── task-store.ts    # File-backed store with CRUD, locking, complete() method
├── auto-clear.ts    # Turn-based auto-clearing of completed tasks
├── tasks-config.ts  # Config persistence -> .pi/tasks-config.json
└── ui/
    └── task-widget.ts  # Widget with status icons and spinner

UI split

/tasks is the management surface.
/lgtm is the proof-log viewer.
TaskClaimDone is the completion gate for top-level tasks.

That split is deliberate. It keeps proof inspection separate from task mutation and stays closer to the simpler pre-fork task UI.

UAT and development

The intended proof mix is one end-to-end functional test, one live UAT in the extension itself, and only a few targeted unit tests for invariants.

npm install
npm run typecheck
npm test
npm run build
npm run lint

License

MIT -- based on tintinweb/pi-tasks (MIT)
