rename to pi-proof-tasks and simplify proof log

2026-06-27 15:46:13 +08:00 · 2026-06-14 10:00:50 +08:00
parent 8f66b85901
commit 927a482d79
15 changed files with 677 additions and 574 deletions
@@ -1,4 +1,4 @@
-# @wassname/pi-lgtm
+# @wassname2/pi-proof-tasks

 Original ask:
 > I would like a task list where
@@ -7,18 +7,18 @@ Original ask:
 > 3) A submit_proof form where a subagent provides an independent sanity check of the evidence before completing
 > 4) - wassname

-Help your agent track goals and aim for human sign off.
+Hermes-style evidence + judge task list for Pi.

-A [pi](https://pi.dev) extension that adds structured human sign-off to task tracking. Fork of [@tintinweb/pi-tasks](https://github.com/tintinweb/pi-tasks) with a minimal LGTM layer.
+A [pi](https://pi.dev) extension that adds proof-gated top-level tasks to task tracking. Fork of [@tintinweb/pi-tasks](https://github.com/tintinweb/pi-tasks) with an evidence/review layer inspired by `/until-done`.

-The core idea: agents cannot mark tasks complete themselves. They must call `lgtm_ask` with auditable evidence and explicit failure-mode analysis, then a human signs off via `/lgtm <id>`.
+The core idea: subtasks are normal checklist items, but top-level tasks are goals. Agents cannot mark top-level tasks complete directly. They must call `TaskClaimDone` with auditable evidence, UAT hints, and explicit failure-mode analysis. A fresh judge then accepts or rejects the claim. An accepted review completes the task; a rejected review leaves it open with suggestions.

-Tasks can also carry a separate fresh-perspective robot review from a subagent or other model family. Robot reviews can iterate: if the latest review says the evidence is incomplete or unconvincing, human sign-off is held back until the agent strengthens the evidence and reruns review.
+Humans can use `/lgtm` to view the proof log and sanity-check the reviewer notes later. `/lgtm` is intentionally thin: proof viewing lives there, task management stays in `/tasks`.

 ## Install

 ```bash
-pi install npm:@wassname2/pi-lgtm
+pi install npm:@wassname2/pi-proof-tasks
 ```

 Or for development:
@@ -32,11 +32,11 @@ pi -e ./src/index.ts

 ## What is different from pi-tasks

-| pi-tasks | pi-lgtm |
+| pi-tasks | pi-proof-tasks |
 |---|---|
-| Agent calls `TaskUpdate { status: "completed" }` | Blocked -- throws error |
-| No evidence required | `lgtm_ask` requires evidence, 2 failure modes, falsification test |
-| Tasks complete immediately | Agent sets `pending_approval`, human runs `/lgtm <id>` |
+| Agent calls `TaskUpdate { status: "completed" }` on any task | Allowed only for subtasks; top-level tasks reject direct completion |
+| No evidence required | `TaskClaimDone` requires evidence, likely/subtle/unknown failures, falsification test, and uncertainty |
+| Tasks complete immediately | Top-level tasks complete only after accepted automatic proof review |
 | No done criterion | `done_criterion` required on create: falsifiable observation |

 Stripped: `TaskExecute`, `TaskOutput`, `TaskStop`, `process-tracker.ts`, subagent RPC, settings menu.
@@ -47,85 +47,71 @@ Stripped: `TaskExecute`, `TaskOutput`, `TaskStop`, `process-tracker.ts`, subagen
 ● 3 tasks (1 done, 1 in progress, 1 open)
  ✔ #1 Design schema
  ✳ #2 Implementing cache layer… (2m 49s · ↑ 4.1k ↓ 1.2k)
-  ◻ #3 Load test 🛠 🤖 👀
+  ◻ #3 Load test
 ```

-Badges:
-
- `🛠` tool evidence attached via `lgtm_ask`
- `🤖` one or more robot review iterations attached
- `👀` pending human sign-off via `/lgtm`
+Collapsed rows stay simple. Proof details live in `TaskGet` and `/lgtm`, not in the widget row itself.

 ## Tools

 ### `TaskCreate`

 ```
-subject, description, done_criterion (required), progress_label (optional)
+subject, description, done_criterion (required), progress_label (optional), parentId (optional)
 ```

-`done_criterion` must be a falsifiable observation: what you expect to see AND what you would see if it is wrong. Example: `"All 92 tests pass. If wrong: type errors in build or failures in task-store.test.ts."`
+Omit `parentId` for a proof-gated top-level goal. Set `parentId` for a directly tickable subtask.
+
+`done_criterion` must be a falsifiable observation: what you expect to see AND what you would see if it is wrong. Example: `"All 125 tests pass. If wrong: type errors in build or failures in task-store.test.ts."`

 ### `TaskList`

-Lists all tasks. `👀` indicates pending sign-off.
+Lists all tasks in the same compact one-line style as the widget. Proof details live in `TaskGet` and `/lgtm`.

 ### `TaskGet`

-Full task details including `done_criterion`, approval state, `completion mode`, `review state`, a one-line gate status such as `ready for human sign-off via /lgtm 5` or `blocked: automatic robot review failed: ...`, and evidence-iteration history.
+Full task details including `done_criterion`, task kind, `completion mode`, `review state`, gate status, evidence packet, review iterations, and evidence history.

 ### `TaskUpdate`

-Update status (`pending | in_progress | deleted`), subject, description, done_criterion, dependencies. Cannot set `completed` -- use `/lgtm`.
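+Update status (`pending | in_progress | completed | deleted`), subject, description, done_criterion, dependencies, or metadata. `parentId` is creation-only: passing it to `TaskUpdate` throws, so a top-level proof goal cannot be downgraded to a subtask.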

-### `lgtm_ask`
+`status=completed` is allowed for subtasks only. Top-level tasks reject with a message telling the agent to use `TaskClaimDone`.

-The epistemic gate. Required fields:
+### `TaskClaimDone`
+
+The epistemic gate for proof and UAT. Required fields:

 | Field | Description |
 |---|---|
-| `taskId` | Task to submit |
-| `evidence` | Exact command run + output, commit hash, config/seeds, file paths. "I ran X and got Y" not "I wrote X". |
+| `taskId` | Top-level task to claim done |
+| `evidence` | Exact command output, commit hash, config/seeds, file paths. Verbatim proof, not a summary |
 | `failure_likely` | Most likely way this is wrong despite evidence |
-| `failure_sneaky` | Perverse/silent failure that looks like success superficially |
-| `falsification_test` | What you ran and what you got, so both you and the human can sanity-check it. Why that result could not occur if a failure mode were real. |
-| `verification_hints` | Where to look and what to check. These still force the agent to think, but weak hints are advisory rather than a hard block when the verbatim evidence already proves the claim. Core evidence still has to pass on its own. |
+| `failure_sneaky` | Subtle/silent failure that looks like success superficially |
+| `failure_unknown` | Unknown or untested failure class that could remain |
+| `falsification_test` | What you ran and what you got, with literal output |
+| `evidence_reasoning` | Why this evidence cheaply distinguishes success from the named failures |
+| `verification_hints` | Where to look and what to check, with specific content quoted |
 | `remaining_uncertainty` | What is NOT tested, deferred edge cases, known limitations |
 | `commands` | Optional structured command records: `{ cmd, exit_code, stdout_path?, stderr_path? }` |
 | `evidence_paths` / `falsification_paths` | Optional local artifact paths. Stored as absolute path + sha256 + byte size |
 | `supersede_reason` | Optional reason when this replaces older evidence on the same task |

-After calling this, the task shows `👀` and is only completable via `/lgtm <id>`. Evidence is stored on the task so the human can review it hours later without scrolling back. Re-submitting evidence archives the prior package into superseded history instead of silently overwriting it.
+The tool stores a compact canonical proof packet. The automatic reviewer sees that exact packet. Humans later see the same packet via `/lgtm`.

-The tool result includes a non-blocking self-check prompt asking whether the evidence directly addresses the `done_criterion` and whether a skeptical reviewer would find it convincing.
-
-`lgtm_ask` always runs the robot-review stage immediately after storing evidence. A robot review that rejects the evidence clears `pending_approval` until the evidence is strengthened and reviewed again. Weak verification hints are advisory if the core verbatim evidence already proves the done criterion. A reviewer crash, auth failure, timeout, or malformed output is recorded as a warning and leaves human sign-off open.
+If the reviewer accepts, the task is completed. If it rejects, the task remains open with missing evidence and suggestions. If the reviewer fails to run, the task still completes and the failure note is stored in the proof log for later inspection.

 ### `lgtm_supersede`

 Explicitly retire the current evidence package without completing the task.

-Use this when the claim changed or the prior evidence is stale. The tool archives the current evidence, current robot reviews, and reviewer-failure context into history with your reason, then closes the human gate until new evidence is submitted.
+Use this when the claim changed or the prior evidence is stale. The tool archives the current evidence, current robot reviews, and reviewer-failure context into history with your reason. Submit a fresh `TaskClaimDone` claim to complete the task.

 ### `robot_review_ask`

 Attach a fresh-perspective robot review to a task.

-Required fields:
-
-| Field | Description |
-|---|---|
-| `taskId` | Task to annotate |
-| `reviewer` | Model/provider/family/class used for the review |
-| `scope` | What the reviewer inspected |
-| `observations` | Concrete observations only. No advice, verdicts, or editorial |
-| `blind_spots` | What the reviewer did not inspect or could not verify |
-| `accepted` | Overall accept/reject decision for whether the task is ready to advance |
-| `evidence_complete` | Whether the supplied evidence actually covers the done criterion |
-| `evidence_convincing` | Whether the supplied evidence would convince a skeptical reviewer |
-| `missing_evidence` | Concrete missing checks or artifacts needed before human sign-off |
-
-Use this from a separate subagent or other model when possible. Reviews append as iterations; the latest one is what gates human sign-off. If stored LGTM evidence already exists, an accepted manual review reopens the human sign-off gate.
+Use this from a separate model/subagent when possible. Reviews append as iterations and are advisory. They do not complete tasks; the automatic gate runs through `TaskClaimDone` or `robot_review_run`.

 ### `robot_review_run`

@@ -137,29 +123,35 @@ Default reviewer stage:
 pi --mode json -p --no-session --no-tools --no-extensions --model <current-session-model>
 ```

-This appends a new robot-review iteration. The reviewer returns an explicit `accepted` boolean as well as detailed observations, blind spots, and missing evidence. If the latest robot review rejects the evidence, `/lgtm` is blocked until stronger evidence is submitted and reviewed again. If the reviewer process fails to run or returns malformed output, the failure is recorded but human sign-off stays open.
+The reviewer deliberately reuses the active session model in a fresh Pi process. That keeps model selection simple and avoids choosing a registry-listed judge model that exists but does not have working auth.
+
+The reviewer returns an explicit `accepted` boolean plus observations, concerns, suggestions, blind spots, missing evidence, and rubric reasons. Rejection keeps the task open. Reviewer infrastructure failure is fail-open: autonomy continues and the failure note is stored in the proof log.

 ## Commands

-### `/lgtm <id>`
+### `/lgtm`

-Human-only sign-off. Shows stored evidence, falsification output, failure modes, review status, and remaining uncertainty in structured sections for review, then asks for confirmation. Without `<id>`, shows a list of pending-approval tasks.
+Proof-log viewer. Use `/lgtm` to pick a task, `/lgtm <id>` to open specific proof logs, and `/lgtm *` to open all open proof logs. It does not complete, delete, or clear tasks.

 ### `/tasks`

-Interactive menu: view tasks, create task, clear completed/all.
+Interactive task-management menu: view tasks, create task, delete a selected task, clear completed, or clear all.

 ## Task lifecycle

-```
-pending -> in_progress -> (lgtm_ask)
-                       -> current evidence iteration N
-                       -> robot review iteration(s) on current evidence 🤖
-                       -> pending_approval 👀   if latest robot review passes, or reviewer infra fails
-                       -> reviewer_rejected
-                       -> lgtm_supersede or newer lgtm_ask -> superseded history + fresh current evidence
-                       -> (/lgtm) -> completed
+```text
+Top-level task:
+pending -> in_progress -> TaskClaimDone
+                       -> current evidence iteration N 🛠
+                       -> robot review iteration(s) 🤖
+                       -> completed ✓          if latest robot review accepts
+                       -> remains open         if reviewer rejects
+                       -> completed            if reviewer infrastructure fails (fail-open, note logged)
+                       -> lgtm_supersede or newer TaskClaimDone -> superseded history + fresh current evidence
                       -> deleted
+
+Subtask:
+pending -> in_progress -> TaskUpdate(status=completed) -> completed
 ```

 ## Storage
@@ -183,26 +175,37 @@ PI_TASKS_DEBUG=1      # trace to stderr

 ## Architecture

-```
+```text
 src/
-├── index.ts        # 8 tools + /tasks + /lgtm commands + widget + event handlers
-├── review-badges.ts # Review badge helpers for tool/robot/human lanes
+├── index.ts          # tools + /tasks + /lgtm evidence viewer + widget + event handlers
+├── review-badges.ts # Review badge helpers for evidence/review/completion lanes
 ├── robot-review.ts  # Robot review iteration storage + compatibility helpers
 ├── types.ts         # Task, TaskStatus types
 ├── task-store.ts    # File-backed store with CRUD, locking, complete() method
 ├── auto-clear.ts    # Turn-based auto-clearing of completed tasks
 ├── tasks-config.ts  # Config persistence -> .pi/tasks-config.json
 └── ui/
-    └── task-widget.ts  # Widget with status icons, spinner, 👀 indicator
+    └── task-widget.ts  # Widget with status icons and spinner
 ```

-## Development
+## UI split
+
+- `/tasks` is the management surface.
+- `/lgtm` is the proof-log viewer.
+- `TaskClaimDone` is the completion gate for top-level tasks.
+
+That split is deliberate. It keeps proof inspection separate from task mutation and stays closer to the simpler pre-fork task UI.
+
+## UAT and development
+
+The intended proof mix is one end-to-end functional test, one live UAT in the extension itself, and only a few targeted unit tests for invariants.

 ```bash
 npm install
 npm run typecheck
-npm test            # 92 tests
+npm test
 npm run build
+npm run lint
 ```

 ## License
@@ -1,11 +1,11 @@
 {
-  "name": "@wassname/pi-lgtm",
+  "name": "@wassname2/pi-proof-tasks",
  "version": "0.4.2",
  "lockfileVersion": 3,
  "requires": true,
  "packages": {
    "": {
-      "name": "@wassname/pi-lgtm",
+      "name": "@wassname2/pi-proof-tasks",
      "version": "0.4.2",
      "license": "MIT",
      "dependencies": {
@@ -1,7 +1,7 @@
 {
-  "name": "@wassname2/pi-lgtm",
+  "name": "@wassname2/pi-proof-tasks",
  "version": "0.4.2",
-  "description": "A pi extension providing goal tracking with structural sign-off and LGTM workflow.",
+  "description": "Hermes-style evidence + judge task list for Pi, with proof-gated top-level completion and UAT logs.",
  "author": "wassname",
  "license": "MIT",
  "repository": {
@@ -16,9 +16,13 @@
    "pi-package",
    "pi",
    "pi-extension",
-    "lgtm",
-    "sign-off",
-    "goal-tracking"
+    "proof",
+    "judge",
+    "uat",
+    "evidence",
+    "task-list",
+    "failure-modes",
+    "hermes-style"
  ],
  "dependencies": {
    "@mariozechner/pi-coding-agent": "^0.62.0",
@@ -1,7 +1,7 @@
 import { getLatestRobotReview, getRobotReviews } from "./robot-review.js";
 import type { Task } from "./types.js";

-const STAGES = ["🛠", "🤖", "👀"] as const;
+const STAGES = ["🛠", "🤖", "✓"] as const;

 function hasCurrentEvidence(task: Task): boolean {
  return typeof task.metadata?.lgtm_evidence === "string" && task.metadata.lgtm_evidence.length > 0;
@@ -11,12 +11,12 @@ function hasEvidenceHistory(task: Task): boolean {
  return Array.isArray(task.metadata?.lgtm_history) && task.metadata.lgtm_history.length > 0;
 }

-/** Pipeline stages: `[🛠·🤖·👀]` fills left-to-right as evidence→review→signoff progresses. */
+/** Pipeline stages: `[🛠·🤖·✓]` fills left-to-right as evidence→review→completed progresses. */
 export function getReviewBadges(task: Task): string {
  const filled = [
    !!task.metadata?.lgtm_evidence,
    getRobotReviews(task).length > 0,
-    task.pending_approval && task.status !== "completed",
+    task.status === "completed",
  ];
  const slots = STAGES.map((emoji, i) => filled[i] ? emoji : "·");
  return `[${slots.join("")}]`;
@@ -25,77 +25,74 @@ export function getReviewBadges(task: Task): string {
 export const REVIEW_BADGES = {
  evidence: STAGES[0],
  robot: STAGES[1],
-  human: STAGES[2],
+  complete: STAGES[2],
  pipeline: STAGES,
 };

-export type DisplayStatus = "awaiting_signoff" | "in_progress" | "pending" | "completed";
+export type DisplayStatus = "in_progress" | "pending" | "completed";

-/** Derived display bucket. `awaiting_signoff` is pending_approval && !completed. */
 export function getDisplayStatus(task: Task): DisplayStatus {
-  if (task.status === "completed") return "completed";
-  if (task.pending_approval) return "awaiting_signoff";
  return task.status;
 }

-export type CompletionMode = "direct" | "lgtm";
+export type CompletionMode = "direct" | "proof";
 export type ReviewState =
-  | "no_evidence"
-  | "evidence_submitted"
+  | "no_claim"
+  | "claim_submitted"
  | "reviewer_failed_to_run"
  | "reviewer_rejected"
-  | "ready_for_human"
+  | "reviewer_accepted"
  | "superseded"
-  | "human_signed_off";
-export type StateTag = "READY" | "ACTIVE" | "PENDING" | "DONE";
+  | "completed";
+export type StateTag = "ACTIVE" | "PENDING" | "DONE";

 export function getCompletionMode(task: Task): CompletionMode {
-  return hasCurrentEvidence(task) || hasEvidenceHistory(task) || getRobotReviews(task).length > 0 || task.pending_approval
-    ? "lgtm"
-    : "direct";
+  return task.parentId ? "direct" : "proof";
 }

 export function getReviewState(task: Task): ReviewState {
-  if (task.status === "completed") return "human_signed_off";
+  if (task.status === "completed") return "completed";
  const latest = getLatestRobotReview(task);
  if (latest && !latest.accepted) return "reviewer_rejected";
-  if (task.pending_approval && hasCurrentEvidence(task)) return "ready_for_human";
+  if (latest?.accepted) return "reviewer_accepted";
  if (typeof task.metadata?.robot_review_last_error === "string") return "reviewer_failed_to_run";
-  if (hasCurrentEvidence(task)) return "evidence_submitted";
+  if (hasCurrentEvidence(task)) return "claim_submitted";
  if (hasEvidenceHistory(task)) return "superseded";
-  return "no_evidence";
+  return "no_claim";
 }

 export function getGateStatus(task: Task): string {
  const state = getReviewState(task);
-  if (state === "human_signed_off") return "human signed off";
-  if (state === "no_evidence") return "no lgtm evidence submitted";
-  if (state === "ready_for_human") {
+  if (task.parentId) {
+    return task.status === "completed" ? "completed directly as subtask" : "subtask: direct completion allowed";
+  }
+  if (task.status === "completed") {
    if (typeof task.metadata?.robot_review_last_error === "string") {
-      return `warning: automatic robot review failed, human sign-off still allowed via /lgtm ${task.id}: ${task.metadata.robot_review_last_error}`;
+      return `completed with reviewer unavailable: ${task.metadata.robot_review_last_error}`;
    }
-    return `ready for human sign-off via /lgtm ${task.id}`;
+    if (getLatestRobotReview(task)?.accepted) return "completed after accepted proof review";
+    return "completed";
  }
+  if (state === "no_claim") return "top-level task requires TaskClaimDone evidence before completion";
+  if (state === "reviewer_accepted") return "review accepted; task should be completed";
  if (state === "reviewer_failed_to_run") {
-    return `blocked: automatic robot review failed: ${task.metadata.robot_review_last_error}`;
+    return `review unavailable; autonomy continues: ${task.metadata.robot_review_last_error}`;
  }
-  if (state === "reviewer_rejected") return "blocked: latest robot review rejected the evidence";
-  if (state === "superseded") return "current evidence superseded, waiting for a new lgtm submission";
-  return "blocked: evidence submitted, robot review still required";
+  if (state === "reviewer_rejected") return "latest proof review rejected the evidence; strengthen the proof and try again";
+  if (state === "superseded") return "current evidence superseded, waiting for a new proof claim";
+  return "proof claim submitted, automatic review still required";
 }

-/** Short uppercase tag for the human ("can I /lgtm this?" at a glance). */
+/** Short uppercase tag for compact task-list display. */
 export function getStateTag(task: Task): StateTag {
  const s = getDisplayStatus(task);
  if (s === "completed") return "DONE";
-  if (s === "awaiting_signoff") return "READY";
  if (s === "in_progress") return "ACTIVE";
  return "PENDING";
 }

 /** Theme colour key for each state tag (only theme colours present in pi-tui are used). */
-export function getStateTagColor(tag: StateTag): "success" | "accent" | "dim" | undefined {
-  if (tag === "READY") return "success";
+export function getStateTagColor(tag: StateTag): "accent" | "dim" | undefined {
  if (tag === "ACTIVE") return "accent";
  if (tag === "DONE") return "dim";
  return undefined; // PENDING — default fg
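
Reviewer note: the gate strings in this file imply a three-way outcome for a top-level proof claim (accepted, rejected, reviewer unavailable). A minimal sketch of that decision follows, assuming the fail-open behaviour described in the README hunk. `ReviewOutcome`, `ClaimResult`, and `resolveClaim` are hypothetical names for illustration and do not appear in this commit.

```typescript
// Hypothetical types; the real handler lives in src/index.ts and is not shown in this diff.
type ReviewOutcome =
  | { kind: "accepted" }
  | { kind: "rejected"; suggestions: string[] }
  | { kind: "reviewer_failed"; error: string };

interface ClaimResult {
  completed: boolean; // whether the top-level task ends up completed
  note: string;       // what would be recorded in the proof log
}

function resolveClaim(outcome: ReviewOutcome): ClaimResult {
  switch (outcome.kind) {
    case "accepted":
      return { completed: true, note: "completed after accepted proof review" };
    case "rejected":
      // The task stays open; suggestions tell the agent what to strengthen.
      return { completed: false, note: `rejected; suggestions: ${outcome.suggestions.join("; ")}` };
    case "reviewer_failed":
      // Fail-open: infrastructure failure does not block completion, but it is logged.
      return { completed: true, note: `completed with reviewer unavailable: ${outcome.error}` };
  }
}
```

Only an explicit rejection keeps the task open in this sketch; a crashed or unauthenticated reviewer completes the task and leaves a note for later human inspection via `/lgtm`.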
@@ -7,6 +7,8 @@ export interface RobotReviewRecord {
  reviewer: string;
  scope: string;
  observations: string[];
+  concerns: string[];
+  suggestions: string[];
  blind_spots: string;
  accepted: boolean;
  evidence_complete: boolean;
@@ -46,6 +48,8 @@ function normalizeReview(value: unknown, index: number): RobotReviewRecord | und
    reviewer,
    scope,
    observations,
+    concerns: toStringArray(review.concerns),
+    suggestions: toStringArray(review.suggestions),
    blind_spots: typeof review.blind_spots === "string" ? review.blind_spots : "not recorded",
    accepted: typeof review.accepted === "boolean"
      ? review.accepted
@@ -69,6 +73,8 @@ function getLegacyRobotReview(task: Task): RobotReviewRecord | undefined {
    reviewer: typeof task.metadata?.robot_review_reviewer === "string" ? task.metadata.robot_review_reviewer : "unknown",
    scope: typeof task.metadata?.robot_review_scope === "string" ? task.metadata.robot_review_scope : "unknown",
    observations,
+    concerns: toStringArray(task.metadata?.robot_review_concerns),
+    suggestions: toStringArray(task.metadata?.robot_review_suggestions),
    blind_spots: typeof task.metadata?.robot_review_blind_spots === "string" ? task.metadata.robot_review_blind_spots : "not recorded",
    accepted: typeof task.metadata?.robot_review_accepted === "boolean"
      ? task.metadata.robot_review_accepted
@@ -101,14 +107,33 @@ export function getLatestRobotReview(task: Task): RobotReviewRecord | undefined
  return reviews.length > 0 ? reviews[reviews.length - 1] : undefined;
 }

-export function shouldOpenHumanSignoffGate(task: Task, reviewAccepted: boolean): boolean {
-  return reviewAccepted && typeof task.metadata?.lgtm_evidence === "string" && task.metadata.lgtm_evidence.length > 0;
+function hasNonEmptyString(value: unknown): boolean {
+  return typeof value === "string" && value.trim().length > 0;
+}
+
+export function hasCompleteProofClaim(task: Task): boolean {
+  const metadata = task.metadata ?? {};
+  return [
+    metadata.lgtm_evidence,
+    metadata.lgtm_failure_likely,
+    metadata.lgtm_failure_sneaky,
+    metadata.lgtm_failure_unknown,
+    metadata.lgtm_falsification_test,
+    metadata.lgtm_evidence_reasoning,
+    metadata.lgtm_remaining_uncertainty,
+  ].every(hasNonEmptyString)
+    && Array.isArray(metadata.lgtm_verification_hints)
+    && metadata.lgtm_verification_hints.some(hasNonEmptyString);
+}
+
+export function shouldCompleteAfterAcceptedReview(task: Task, reviewAccepted: boolean): boolean {
+  return reviewAccepted && hasCompleteProofClaim(task);
 }

 export function relaxAdvisoryVerificationHints(review: Omit<RobotReviewRecord, "iteration">): Omit<RobotReviewRecord, "iteration"> {
  const rubric = review.rubric;
  if (!rubric || review.evidence_complete !== true) return review;
-  const requiredCoreKeys = ["evidence_covers_done_criterion", "falsification_test_runnable", "failure_modes_addressed"];
+  const requiredCoreKeys = ["evidence_covers_done_criterion", "falsification_test_runnable", "failure_modes_addressed", "evidence_distinguishes_success"];
  if (!requiredCoreKeys.every((key) => rubric[key]?.pass === true)) return review;
  const failedKeys = Object.entries(rubric)
    .filter(([, item]) => item.pass !== true)
@@ -122,6 +147,8 @@ export function relaxAdvisoryVerificationHints(review: Omit<RobotReviewRecord, "
      ...review.observations,
      "Verification hints were weak, but treated as advisory because the verbatim evidence already covered the done criterion.",
    ],
+    concerns: review.concerns,
+    suggestions: review.suggestions,
    missing_evidence: review.missing_evidence.filter((item) => item !== "verification_hints_actionable" && !/verification hint/i.test(item)),
  };
 }
@@ -138,6 +165,8 @@ export function appendRobotReviewMetadata(task: Task, review: Omit<RobotReviewRe
    robot_review_reviewer: latest.reviewer,
    robot_review_scope: latest.scope,
    robot_review_observations: latest.observations,
+    robot_review_concerns: latest.concerns,
+    robot_review_suggestions: latest.suggestions,
    robot_review_blind_spots: latest.blind_spots,
    robot_review_accepted: latest.accepted,
    robot_review_evidence_complete: latest.evidence_complete,
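
Reviewer note: the new `concerns` and `suggestions` fields are read through `toStringArray`, whose body is not part of this diff. The sketch below assumes it coerces anything that is not an array of strings into an empty or filtered `string[]`, which would let reviews stored before this commit load without those keys. `toStringArraySketch` and `legacyReview` are hypothetical stand-ins.

```typescript
// Assumed behaviour only: the real toStringArray is not shown in this diff.
function toStringArraySketch(value: unknown): string[] {
  if (!Array.isArray(value)) return [];
  return value.filter((item): item is string => typeof item === "string");
}

// A review record stored before this commit has no concerns/suggestions keys.
const legacyReview: Record<string, unknown> = {
  reviewer: "model-x",
  observations: ["tests ran"],
};

// Missing keys read as undefined and normalise to empty arrays.
const concerns = toStringArraySketch(legacyReview["concerns"]);
const suggestions = toStringArraySketch(legacyReview["suggestions"]);
```

If the real helper behaves differently (for example, by throwing on non-arrays), older task files would need a migration step instead.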
@@ -83,13 +83,14 @@ export class TaskStore {
    finally { releaseLock(this.lockPath); }
  }

-  create(subject: string, description: string, done_criterion: string, progress_label?: string, metadata?: Record<string, any>): Task {
+  create(subject: string, description: string, done_criterion: string, progress_label?: string, metadata?: Record<string, any>, parentId?: string): Task {
    return this.withLock(() => {
+      if (parentId && !this.tasks.has(parentId)) throw new Error(`Parent task #${parentId} not found`);
      const now = Date.now();
      const task: Task = {
        id: String(this.nextId++),
        subject, description, done_criterion,
-        pending_approval: false,
+        parentId,
        status: "pending",
        progress_label,
        metadata: metadata ?? {},
@@ -116,9 +117,9 @@ export class TaskStore {
    subject?: string;
    description?: string;
    done_criterion?: string;
-    pending_approval?: boolean;
    progress_label?: string;
    metadata?: Record<string, any>;
+    parentId?: string | null;
    add_blocks?: string[];
    add_blocked_by?: string[];
  }): { task: Task | undefined; changedFields: string[]; warnings: string[] } {
@@ -129,13 +130,10 @@ export class TaskStore {
      const changedFields: string[] = [];
      const warnings: string[] = [];

-      // Self-completion is allowed for trivial tasks that never escalated to lgtm_ask.
-      // Once a task has stored lgtm evidence, completion must go through /lgtm so the
-      // human gate + robot review can't be skipped.
-      if (fields.status === "completed") {
-        if (task.pending_approval || task.metadata?.lgtm_evidence || (Array.isArray(task.metadata?.lgtm_history) && task.metadata.lgtm_history.length > 0)) {
-          throw new Error(`Use /lgtm ${id} to complete this task — completion_mode=lgtm because evidence was submitted.`);
-        }
+      // Subtasks are normal checklist items. Top-level tasks are goals and need a proof
+      // claim plus automatic review; TaskClaimDone is the only agent path that completes them.
+      if (fields.status === "completed" && !task.parentId) {
+        throw new Error(`Top-level task #${id} requires proof. Use TaskClaimDone with evidence and failure modes; subtasks can be completed directly.`);
      }

      if (fields.status === "deleted") {
@@ -151,7 +149,6 @@ export class TaskStore {
      if (fields.subject !== undefined) { task.subject = fields.subject; changedFields.push("subject"); }
      if (fields.description !== undefined) { task.description = fields.description; changedFields.push("description"); }
      if (fields.done_criterion !== undefined) { task.done_criterion = fields.done_criterion; changedFields.push("done_criterion"); }
-      if (fields.pending_approval !== undefined) { task.pending_approval = fields.pending_approval; changedFields.push("pending_approval"); }
      if (fields.progress_label !== undefined) { task.progress_label = fields.progress_label; changedFields.push("progress_label"); }

      if (fields.metadata !== undefined) {
@@ -162,6 +159,10 @@ export class TaskStore {
        changedFields.push("metadata");
      }

+      if (fields.parentId !== undefined) {
+        throw new Error("parentId is creation-only. Create subtasks with TaskCreate(parentId); do not downgrade top-level proof goals.");
+      }
+
      if (fields.add_blocks?.length) {
        for (const targetId of fields.add_blocks) {
          if (!task.blocks.includes(targetId)) task.blocks.push(targetId);
@@ -191,7 +192,7 @@ export class TaskStore {
    });
  }

-  /** Complete a task. Called only by /lgtm. The human-confirm gate lives in the command layer. */
+  /** Complete a task. Called by accepted proof review or direct subtask completion paths. */
  complete(id: string): Task {
    return this.withLock(() => {
      const task = this.tasks.get(id);
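
Reviewer note: the store changes above reduce to one rule: a task with a `parentId` can be completed directly, and a task without one cannot. A minimal in-memory sketch of that rule follows. `SketchTask`, `SketchStore`, and `completeDirectly` are hypothetical names; the real `TaskStore` is file-backed, locked, and routes this check through `update()`.

```typescript
// Hypothetical stand-in for the completion rule; not the real TaskStore.
interface SketchTask {
  id: string;
  parentId?: string;
  status: "pending" | "in_progress" | "completed";
}

class SketchStore {
  private tasks = new Map<string, SketchTask>();
  private nextId = 1;

  create(parentId?: string): SketchTask {
    // Mirrors the parent-existence check added to create() in this commit.
    if (parentId && !this.tasks.has(parentId)) {
      throw new Error(`Parent task #${parentId} not found`);
    }
    const task: SketchTask = {
      id: String(this.nextId++),
      status: "pending",
      ...(parentId ? { parentId } : {}),
    };
    this.tasks.set(task.id, task);
    return task;
  }

  // Direct completion is allowed for subtasks only; top-level goals need a proof claim.
  completeDirectly(id: string): SketchTask {
    const task = this.tasks.get(id);
    if (!task) throw new Error(`Task #${id} not found`);
    if (!task.parentId) throw new Error(`Top-level task #${id} requires proof`);
    task.status = "completed";
    return task;
  }
}
```

Usage under these assumptions: `const goal = store.create(); const sub = store.create(goal.id);` then `store.completeDirectly(sub.id)` succeeds, while `store.completeDirectly(goal.id)` throws.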
@@ -9,7 +9,7 @@ export interface Task {
  subject: string;
  description: string;
  done_criterion: string;      // required: what "done" looks like
-  pending_approval: boolean;   // set by lgtm_ask, required before /lgtm
+  parentId?: string;           // no parent = top-level goal, requires proof claim to complete
  status: TaskStatus;
  progress_label?: string;
  metadata: Record<string, any>;
@@ -125,12 +125,11 @@ export class TaskWidget {

    if (tasks.length === 0) return [];

-    const counts = { completed: 0, awaiting_signoff: 0, in_progress: 0, pending: 0 };
+    const counts = { completed: 0, in_progress: 0, pending: 0 };
    for (const t of tasks) counts[getDisplayStatus(t)]++;

    const parts: string[] = [];
    if (counts.completed > 0) parts.push(`${counts.completed} done`);
-    if (counts.awaiting_signoff > 0) parts.push(`${counts.awaiting_signoff} awaiting sign-off`);
    if (counts.in_progress > 0) parts.push(`${counts.in_progress} in progress`);
    if (counts.pending > 0) parts.push(`${counts.pending} open`);
    const statusText = `${tasks.length} tasks (${parts.join(", ")})`;
@@ -144,7 +143,7 @@ export class TaskWidget {
      const isActive = this.activeTaskIds.has(task.id) && task.status === "in_progress";
      const reviewSuffix = ` ${getReviewBadges(task)}`;
      const tag = getStateTag(task);
-      // [READY  ] [ACTIVE ] [PENDING] [DONE   ] — pad so columns line up.
+      // [ACTIVE ] [PENDING] [DONE   ] — pad so columns line up.
      const tagColour = getStateTagColor(tag);
      const tagBox = `[${tag.padEnd(7)}]`;
      const tagPrefix = (tagColour ? theme.fg(tagColour, tagBox) : tagBox) + " ";
@@ -14,7 +14,6 @@ describe("auto-clear: on_task_complete mode", () => {

  it("does not clear completed task before REMINDER_INTERVAL turns", () => {
    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -28,7 +27,6 @@ describe("auto-clear: on_task_complete mode", () => {

  it("clears completed task after REMINDER_INTERVAL turns", () => {
    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -42,11 +40,9 @@ describe("auto-clear: on_task_complete mode", () => {
    store.create("Task A", "Desc", "done");
    store.create("Task B", "Desc", "done");

-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

-    store.update("2", { pending_approval: true });
    store.complete("2");
    manager.trackCompletion("2", 3);

@@ -65,7 +61,6 @@ describe("auto-clear: on_task_complete mode", () => {
    store.create("In Progress", "Desc", "done");
    store.create("Completed", "Desc", "done");
    store.update("2", { status: "in_progress" });
-    store.update("3", { pending_approval: true });
    store.complete("3");
    manager.trackCompletion("3", 1);

@@ -78,8 +73,7 @@ describe("auto-clear: on_task_complete mode", () => {
  it("cleans up dependency edges when auto-clearing", () => {
    store.create("Blocker", "Desc", "done");
    store.create("Blocked", "Desc", "done");
-    store.update("1", { addBlocks: ["2"] });
-    store.update("1", { pending_approval: true });
+    store.update("1", { add_blocks: ["2"] });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -90,7 +84,6 @@ describe("auto-clear: on_task_complete mode", () => {

  it("returns true when tasks are cleared", () => {
    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -111,7 +104,6 @@ describe("auto-clear: on_list_complete mode", () => {
  it("does not clear when some tasks are still pending", () => {
    store.create("Done", "Desc", "done");
    store.create("Pending", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -125,9 +117,7 @@ describe("auto-clear: on_list_complete mode", () => {
  it("does not clear immediately when all tasks complete", () => {
    store.create("A", "Desc", "done");
    store.create("B", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
-    store.update("2", { pending_approval: true });
    store.complete("2");
    manager.trackCompletion("2", 1);

@@ -141,9 +131,7 @@ describe("auto-clear: on_list_complete mode", () => {
  it("clears all completed tasks after REMINDER_INTERVAL turns when all are completed", () => {
    store.create("A", "Desc", "done");
    store.create("B", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
-    store.update("2", { pending_approval: true });
    store.complete("2");
    manager.trackCompletion("2", 1);

@@ -153,7 +141,6 @@ describe("auto-clear: on_list_complete mode", () => {

  it("resets countdown when a new task is created before REMINDER_INTERVAL", () => {
    store.create("A", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -170,9 +157,7 @@ describe("auto-clear: on_list_complete mode", () => {
  it("resets countdown when a task goes back to in_progress", () => {
    store.create("A", "Desc", "done");
    store.create("B", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
-    store.update("2", { pending_approval: true });
    store.complete("2");
    manager.trackCompletion("2", 1);

@@ -188,7 +173,6 @@ describe("auto-clear: on_list_complete mode", () => {

  it("returns true when tasks are cleared", () => {
    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -209,9 +193,7 @@ describe("auto-clear: never mode", () => {
  it("never clears completed tasks regardless of turns", () => {
    store.create("A", "Desc", "done");
    store.create("B", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
-    store.update("2", { pending_approval: true });
    store.complete("2");
    manager.trackCompletion("1", 1);
    manager.trackCompletion("2", 1);
@@ -224,7 +206,6 @@ describe("auto-clear: never mode", () => {

  it("trackCompletion is a no-op", () => {
    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -240,7 +221,6 @@ describe("auto-clear: dynamic mode switching", () => {
    const manager = new AutoClearManager(() => store, () => mode);

    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");

    // Track in never mode — no-op
@@ -262,7 +242,6 @@ describe("auto-clear: store getter (session switch)", () => {
    const manager = new AutoClearManager(() => store, () => "on_task_complete");

    store.create("Old task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -284,7 +263,6 @@ describe("auto-clear: store getter (session switch)", () => {
    // Swap to new store with a completed task
    store = new TaskStore();
    store.create("Task in new store", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -299,7 +277,6 @@ describe("auto-clear: reset (new session)", () => {
    const manager = new AutoClearManager(() => store, () => "on_task_complete");

    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -316,7 +293,6 @@ describe("auto-clear: reset (new session)", () => {
    const manager = new AutoClearManager(() => store, () => "on_list_complete");

    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);

@@ -333,7 +309,6 @@ describe("auto-clear: reset (new session)", () => {
    const manager = new AutoClearManager(() => store, () => "on_task_complete");

    store.create("Task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    manager.trackCompletion("1", 1);
    manager.reset();
@@ -8,7 +8,6 @@ function makeTask(overrides: Partial<Task> = {}): Task {
    subject: "Test",
    description: "Desc",
    done_criterion: "done",
-    pending_approval: false,
    status: "pending",
    progress_label: undefined,
    metadata: {},
@@ -25,9 +24,8 @@ describe("getReviewBadges", () => {
    expect(getReviewBadges(makeTask())).toBe("[···]");
  });

-  it("fills tool/robot/human slots independently", () => {
+  it("fills evidence/review/completed slots independently", () => {
    const task = makeTask({
-      pending_approval: true,
      metadata: {
        lgtm_evidence: "npm test",
        robot_reviews: [{
@@ -35,6 +33,8 @@ describe("getReviewBadges", () => {
          reviewer: "opencode",
          scope: "task evidence",
          observations: ["Observed one unchecked edge case"],
+          concerns: ["Evidence does not cover prod traffic."],
+          suggestions: ["Inspect one prod traffic sample."],
          blind_spots: "Did not inspect prod traffic",
          accepted: false,
          evidence_complete: false,
@@ -46,27 +46,26 @@ describe("getReviewBadges", () => {
      },
    });

-    expect(getReviewBadges(task)).toBe("[🛠🤖👀]");
+    expect(getReviewBadges(task)).toBe("[🛠🤖·]");
  });

-  it("hides the human badge once the task is completed", () => {
+  it("fills the completed badge once the task is completed", () => {
    const task = makeTask({
-      pending_approval: true,
      status: "completed",
      metadata: { lgtm_evidence: "ok" },
    });

-    expect(getReviewBadges(task)).toBe("[🛠··]");
+    expect(getReviewBadges(task)).toBe("[🛠·✓]");
  });
 });

 describe("review state helpers", () => {
-  it("reports completion mode as direct before any lgtm evidence", () => {
-    expect(getCompletionMode(makeTask())).toBe("direct");
+  it("reports completion mode as proof for top-level tasks", () => {
+    expect(getCompletionMode(makeTask())).toBe("proof");
  });

-  it("reports completion mode as lgtm after evidence history exists", () => {
-    expect(getCompletionMode(makeTask({ metadata: { lgtm_history: [{ iteration: 1 }] } }))).toBe("lgtm");
+  it("reports completion mode as direct for subtasks", () => {
+    expect(getCompletionMode(makeTask({ parentId: "1" }))).toBe("direct");
  });

  it("reports superseded when only history remains", () => {
@@ -75,30 +74,17 @@ describe("review state helpers", () => {
 });

 describe("getGateStatus", () => {
-  it("reports ready when human sign-off is open", () => {
-    expect(getGateStatus(makeTask({
-      pending_approval: true,
-      metadata: { lgtm_evidence: "ok" },
-    }))).toBe("ready for human sign-off via /lgtm 1");
+  it("reports top-level proof requirement before evidence", () => {
+    expect(getGateStatus(makeTask())).toBe("top-level task requires TaskClaimDone evidence before completion");
  });

-  it("reports blocking reviewer failure when human sign-off is closed", () => {
+  it("reports non-blocking reviewer failure", () => {
    expect(getGateStatus(makeTask({
      metadata: {
        lgtm_evidence: "ok",
        robot_review_last_error: "Unexpected token 'a'",
      },
-    }))).toContain("blocked: automatic robot review failed");
-  });
-
-  it("reports reviewer failure as a warning when human sign-off stays open", () => {
-    expect(getGateStatus(makeTask({
-      pending_approval: true,
-      metadata: {
-        lgtm_evidence: "ok",
-        robot_review_last_error: "Unexpected token 'a'",
-      },
-    }))).toContain("warning: automatic robot review failed");
+    }))).toContain("review unavailable; autonomy continues");
  });

  it("reports rejected robot review when latest review does not accept", () => {
@@ -110,6 +96,8 @@ describe("getGateStatus", () => {
          reviewer: "opencode",
          scope: "task evidence",
          observations: ["Observed missing output"],
+          concerns: ["The current evidence is summary-only."],
+          suggestions: ["Paste the literal output."],
          blind_spots: "none",
          accepted: false,
          evidence_complete: false,
@@ -119,12 +107,11 @@ describe("getGateStatus", () => {
          mode: "manual",
        }],
      },
-    }))).toBe("blocked: latest robot review rejected the evidence");
+    }))).toBe("latest proof review rejected the evidence; strengthen the proof and try again");
  });

  it("keeps rejection higher priority than a later reviewer warning", () => {
    expect(getGateStatus(makeTask({
-      pending_approval: true,
      metadata: {
        lgtm_evidence: "ok",
        robot_review_last_error: "timeout",
@@ -133,6 +120,8 @@ describe("getGateStatus", () => {
          reviewer: "opencode",
          scope: "task evidence",
          observations: ["Observed missing output"],
+          concerns: ["The current evidence is summary-only."],
+          suggestions: ["Paste the literal output."],
          blind_spots: "none",
          accepted: false,
          evidence_complete: false,
@@ -142,7 +131,7 @@ describe("getGateStatus", () => {
          mode: "manual",
        }],
      },
-    }))).toBe("blocked: latest robot review rejected the evidence");
+    }))).toBe("latest proof review rejected the evidence; strengthen the proof and try again");
  });
 });

@@ -155,13 +144,7 @@ describe("getDisplayStatus", () => {
    expect(getDisplayStatus(makeTask({ status: "in_progress" }))).toBe("in_progress");
  });

-  it("returns awaiting_signoff when pending_approval is set", () => {
-    expect(getDisplayStatus(makeTask({ status: "in_progress", pending_approval: true })))
-      .toBe("awaiting_signoff");
-  });
-
-  it("returns completed regardless of pending_approval flag", () => {
-    expect(getDisplayStatus(makeTask({ status: "completed", pending_approval: true })))
-      .toBe("completed");
+  it("returns completed for completed tasks", () => {
+    expect(getDisplayStatus(makeTask({ status: "completed" }))).toBe("completed");
  });
 });
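Reviewer note: the tests in this file pin down the new rule that a task without a `parentId` completes via proof, while a subtask completes directly. A minimal sketch of that rule follows, under stated assumptions: `TaskSketch`, `completionModeSketch`, and `assertDirectCompletionAllowed` are hypothetical names, and the error wording is only modelled on the `toThrow` expectations in the task-store tests of this commit. The real `getCompletionMode` and `TaskStore.update` may be implemented differently.

```typescript
// Hypothetical reduced task shape; the real Task interface carries more fields.
type TaskStatusSketch = "pending" | "in_progress" | "completed";

interface TaskSketch {
  id: string;
  parentId?: string; // no parent = top-level goal
  status: TaskStatusSketch;
}

type CompletionModeSketch = "proof" | "direct";

// Assumed rule inferred from the tests: top-level tasks need a proof claim,
// subtasks can be ticked off directly.
function completionModeSketch(task: TaskSketch): CompletionModeSketch {
  return task.parentId === undefined ? "proof" : "direct";
}

// Assumed guard for a direct status=completed update. The message text is
// illustrative, not copied from the real store.
function assertDirectCompletionAllowed(task: TaskSketch): void {
  if (completionModeSketch(task) === "proof") {
    throw new Error(
      `Top-level task #${task.id} requires proof; submit evidence with TaskClaimDone`,
    );
  }
}
```

Keying the gate on `parentId` alone, rather than on stored evidence or a `pending_approval` flag, means the gate cannot be bypassed by clearing metadata, which appears to be the motivation for removing the flag.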
@@ -15,7 +15,7 @@ describe("robot review runner helpers", () => {
      command: "pi",
      args: ["--mode", "json"],
    });
-    expect(getPiInvocation(["-p"], { PI_LGTM_PI_BIN: "/custom/pi" } as NodeJS.ProcessEnv)).toEqual({
+    expect(getPiInvocation(["-p"], { PI_PROOF_TASKS_PI_BIN: "/custom/pi" } as NodeJS.ProcessEnv)).toEqual({
      command: "/custom/pi",
      args: ["-p"],
    });
@@ -46,8 +46,8 @@ describe("robot review runner helpers", () => {
  });

  it("uses configured timeout or falls back to default", () => {
-    expect(getRobotReviewTimeoutMs({ PI_LGTM_ROBOT_REVIEW_TIMEOUT_MS: "2500" } as NodeJS.ProcessEnv)).toBe(2500);
-    expect(getRobotReviewTimeoutMs({ PI_LGTM_ROBOT_REVIEW_TIMEOUT_MS: "bad" } as NodeJS.ProcessEnv)).toBe(DEFAULT_ROBOT_REVIEW_TIMEOUT_MS);
+    expect(getRobotReviewTimeoutMs({ PI_PROOF_TASKS_ROBOT_REVIEW_TIMEOUT_MS: "2500" } as NodeJS.ProcessEnv)).toBe(2500);
+    expect(getRobotReviewTimeoutMs({ PI_PROOF_TASKS_ROBOT_REVIEW_TIMEOUT_MS: "bad" } as NodeJS.ProcessEnv)).toBe(DEFAULT_ROBOT_REVIEW_TIMEOUT_MS);
  });

  it("formats the current model as the reviewer model ref", () => {
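Reviewer note: the timeout test in this hunk expects `"2500"` to parse to `2500` and `"bad"` to fall back to the default. One way to get that behaviour is sketched below. The function name, the default value, and the choice to reject non-positive numbers are assumptions for illustration; only the environment variable name comes from the diff.

```typescript
// Hypothetical default; the real DEFAULT_ROBOT_REVIEW_TIMEOUT_MS is not shown in this diff.
const DEFAULT_TIMEOUT_MS_SKETCH = 120_000;

// Parse the timeout from an env-like map, falling back to the default when the
// value is missing, non-numeric, or not a positive number (assumed policy).
function timeoutFromEnvSketch(env: Record<string, string | undefined>): number {
  const raw = env["PI_PROOF_TASKS_ROBOT_REVIEW_TIMEOUT_MS"];
  if (raw === undefined) return DEFAULT_TIMEOUT_MS_SKETCH;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0
    ? Math.floor(parsed)
    : DEFAULT_TIMEOUT_MS_SKETCH;
}
```

Because the variable was renamed from the `PI_LGTM_` prefix, a value set under the old name would silently fall through to the default in a parser like this one.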
@@ -2,8 +2,8 @@ import { mkdtempSync, writeFileSync } from "node:fs";
 import { tmpdir } from "node:os";
 import { join } from "node:path";
 import { describe, expect, it } from "vitest";
-import { archiveCurrentEvidence, buildArtifactRecords, getCurrentEvidenceIteration, getEvidenceHistory } from "../src/index.js";
-import { appendRobotReviewMetadata, getLatestRobotReview, getRobotReviews, relaxAdvisoryVerificationHints, shouldOpenHumanSignoffGate } from "../src/robot-review.js";
+import { archiveCurrentEvidence, buildArtifactRecords, buildRobotReviewPrompt, getCurrentEvidenceIteration, getEvidenceHistory, renderEvidencePacket, renderProofLog } from "../src/index.js";
+import { appendRobotReviewMetadata, getLatestRobotReview, getRobotReviews, hasCompleteProofClaim, relaxAdvisoryVerificationHints, shouldCompleteAfterAcceptedReview } from "../src/robot-review.js";
 import type { Task } from "../src/types.js";

 function makeTask(overrides: Partial<Task> = {}): Task {
@@ -12,7 +12,6 @@ function makeTask(overrides: Partial<Task> = {}): Task {
    subject: "Test",
    description: "Desc",
    done_criterion: "done",
-    pending_approval: false,
    status: "pending",
    progress_label: undefined,
    metadata: {},
@@ -25,10 +24,23 @@ function makeTask(overrides: Partial<Task> = {}): Task {
 }

 describe("robot review helpers", () => {
-  it("reopens the human gate when accepted review exists for stored evidence", () => {
-    expect(shouldOpenHumanSignoffGate(makeTask({ metadata: { lgtm_evidence: "literal output" } }), true)).toBe(true);
-    expect(shouldOpenHumanSignoffGate(makeTask({ metadata: { lgtm_evidence: "literal output" } }), false)).toBe(false);
-    expect(shouldOpenHumanSignoffGate(makeTask(), true)).toBe(false);
+  it("completes only after accepted review and complete proof claim", () => {
+    const task = makeTask({
+      metadata: {
+        lgtm_evidence: "literal output",
+        lgtm_failure_likely: "wrong command",
+        lgtm_failure_sneaky: "right output for wrong reason",
+        lgtm_failure_unknown: "untested platform",
+        lgtm_falsification_test: "npm test\npass",
+        lgtm_evidence_reasoning: "the test output rules out the named failures for this scope",
+        lgtm_verification_hints: ["test/robot-review.test.ts shows the expectation"],
+        lgtm_remaining_uncertainty: "does not test prod install",
+      },
+    });
+    expect(hasCompleteProofClaim(task)).toBe(true);
+    expect(shouldCompleteAfterAcceptedReview(task, true)).toBe(true);
+    expect(shouldCompleteAfterAcceptedReview(task, false)).toBe(false);
+    expect(shouldCompleteAfterAcceptedReview(makeTask({ metadata: { lgtm_evidence: "literal output" } }), true)).toBe(false);
  });

  it("reads legacy single-review metadata", () => {
@@ -50,7 +62,7 @@ describe("robot review helpers", () => {
  });

  it("builds artifact records with absolute path and sha256", () => {
-    const dir = mkdtempSync(join(tmpdir(), "pi-lgtm-"));
+    const dir = mkdtempSync(join(tmpdir(), "pi-proof-tasks-"));
    const path = join(dir, "evidence.log");
    writeFileSync(path, "hello\n");

@@ -66,7 +78,9 @@ describe("robot review helpers", () => {
        lgtm_evidence: "literal output",
        lgtm_failure_likely: "wrong seed",
        lgtm_failure_sneaky: "wrong threshold",
+        lgtm_failure_unknown: "untested environment",
        lgtm_falsification_test: "pytest -k check",
+        lgtm_evidence_reasoning: "pytest output distinguishes the expected passing path from the named failures",
        lgtm_verification_hints: ["see line 5"],
        lgtm_remaining_uncertainty: "not load tested",
        lgtm_submitted_at: "2026-06-07T00:00:00.000Z",
@@ -86,6 +100,8 @@ describe("robot review helpers", () => {
      reviewer: "auto",
      scope: "task evidence",
      observations: ["Observed commit, push, and test logs"],
+      concerns: [],
+      suggestions: [],
      blind_spots: "Did not inspect interactive UI",
      accepted: false,
      evidence_complete: true,
@@ -97,6 +113,7 @@ describe("robot review helpers", () => {
        evidence_covers_done_criterion: { reason: "verbatim logs match", pass: true },
        falsification_test_runnable: { reason: "command and output shown", pass: true },
        failure_modes_addressed: { reason: "plausible top risks named", pass: true },
+        evidence_distinguishes_success: { reason: "evidence rules out named failures", pass: true },
        verification_hints_actionable: { reason: "paths are vague", pass: false },
      },
    });
@@ -112,6 +129,8 @@ describe("robot review helpers", () => {
      reviewer: "auto",
      scope: "task evidence",
      observations: ["Observed vague summary only"],
+      concerns: [],
+      suggestions: [],
      blind_spots: "Did not rerun tests",
      accepted: false,
      evidence_complete: true,
@@ -123,6 +142,7 @@ describe("robot review helpers", () => {
        evidence_covers_done_criterion: { reason: "summary only", pass: false },
        falsification_test_runnable: { reason: "command and output shown", pass: true },
        failure_modes_addressed: { reason: "plausible top risks named", pass: true },
+        evidence_distinguishes_success: { reason: "evidence does not rule out summary-only failure", pass: false },
        verification_hints_actionable: { reason: "paths are vague", pass: false },
      },
    });
@@ -131,12 +151,40 @@ describe("robot review helpers", () => {
    expect(review.evidence_convincing).toBe(false);
  });

+  it("renders one compact evidence packet for both human and robot review", () => {
+    const task = makeTask({
+      metadata: {
+        lgtm_evidence: "literal output",
+        lgtm_failure_likely: "wrong seed",
+        lgtm_failure_sneaky: "wrong threshold",
+        lgtm_failure_unknown: "does not test UI rendering",
+        lgtm_falsification_test: "pytest -k check\nPASSED",
+        lgtm_evidence_reasoning: "The passing pytest transcript distinguishes success from wrong-threshold and wrong-seed failures for this test scope.",
+        lgtm_verification_hints: ["test/robot-review.test.ts contains the new guard test"],
+        lgtm_remaining_uncertainty: "not load tested",
+        lgtm_submitted_at: "2026-06-14T00:00:00.000Z",
+        lgtm_commands: [{ cmd: "npm test", exit_code: 0, stdout_path: "/tmp/test.log" }],
+        lgtm_evidence_artifacts: [{ path: "/tmp/test.log", sha256: "abc", bytes: 123 }],
+      },
+    });
+
+    const packet = renderEvidencePacket(task);
+    const prompt = buildRobotReviewPrompt(task);
+    expect(packet).toContain("## Goal");
+    expect(packet).toContain("## Planned evidence / UAT");
+    expect(packet).toContain("## Attempt 1");
+    expect(prompt).toContain(packet);
+    expect(prompt).toContain("does this evidence prove success for the stated goal");
+  });
+
  it("appends robot reviews as iterations", () => {
    const task = makeTask();
    const metadata1 = appendRobotReviewMetadata(task, {
      reviewer: "opencode",
      scope: "task evidence",
      observations: ["Observed missing benchmark output"],
+      concerns: ["The current evidence does not show the claimed speedup."],
+      suggestions: ["Add the benchmark transcript for the claimed speedup."],
      blind_spots: "Did not inspect prod config",
      accepted: false,
      evidence_complete: false,
@@ -150,6 +198,8 @@ describe("robot review helpers", () => {
      reviewer: "opencode",
      scope: "updated task evidence",
      observations: ["Observed benchmark output and test transcript"],
+      concerns: [],
+      suggestions: [],
      blind_spots: "Did not inspect long-run stability",
      accepted: true,
      evidence_complete: true,
@@ -167,5 +217,73 @@ describe("robot review helpers", () => {
    expect(getLatestRobotReview(task2)?.evidence_convincing).toBe(true);
    expect(task2.metadata.robot_review_iteration_count).toBe(2);
  });
+
+  it("renders a simple proof log with judgement and suggestions", () => {
+    const taskWithEvidence = makeTask({
+      metadata: {
+        lgtm_evidence: "npm test\n125 passed",
+        lgtm_failure_likely: "old package name still in README",
+        lgtm_failure_sneaky: "top-level direct completion still slips through",
+        lgtm_failure_unknown: "fresh judge command fails in a real session",
+        lgtm_falsification_test: "npm test\n125 passed",
+        lgtm_evidence_reasoning: "The test transcript and grep distinguish the intended behavior from stale workflow regressions.",
+        lgtm_verification_hints: ["README.md install block shows pi-proof-tasks"],
+        lgtm_remaining_uncertainty: "Did not exercise every model provider.",
+        lgtm_submitted_at: "2026-06-14T00:00:00.000Z",
+      },
+    });
+    const task = makeTask({
+      metadata: {
+        ...taskWithEvidence.metadata,
+        ...appendRobotReviewMetadata(taskWithEvidence, {
+          reviewer: "auto",
+          scope: "proof log",
+          observations: ["Observed the test transcript and renamed package."],
+          concerns: ["The live Pi session path is still untested."],
+          suggestions: ["Run one self-hosted TaskClaimDone UAT."],
+          blind_spots: "Did not inspect external auth state",
+          accepted: false,
+          evidence_complete: true,
+          evidence_convincing: false,
+          missing_evidence: ["self-hosted TaskClaimDone UAT"],
+          submitted_at: "2026-06-14T00:01:00.000Z",
+          mode: "auto",
+        }),
+      },
+    });
+
+    const log = renderProofLog(task);
+    expect(log).toContain("# Task #1: Test");
+    expect(log).toContain("## Goal");
+    expect(log).toContain("## Planned evidence / UAT");
+    expect(log).toContain("## Attempt 1");
+    expect(log).toContain("### Submitted evidence");
+    expect(log).toContain("### Judgement");
+    expect(log).toContain("Refused by auto");
+    expect(log).toContain("Run one self-hosted TaskClaimDone UAT.");
+  });
+
+  it("renders reviewer-unavailable proof logs for fail-open completion notes", () => {
+    const task = makeTask({
+      status: "completed",
+      metadata: {
+        lgtm_evidence: "npm test\n125 passed",
+        lgtm_failure_likely: "old package name still in README",
+        lgtm_failure_sneaky: "top-level direct completion still slips through",
+        lgtm_failure_unknown: "fresh judge command fails in a real session",
+        lgtm_falsification_test: "npm test\n125 passed",
+        lgtm_evidence_reasoning: "The test transcript and grep distinguish the intended behavior from stale workflow regressions.",
+        lgtm_verification_hints: ["README.md install block shows pi-proof-tasks"],
+        lgtm_remaining_uncertainty: "Did not exercise every model provider.",
+        robot_review_last_error: "judge auth failed",
+      },
+    });
+
+    const log = renderProofLog(task);
+    expect(log).toContain("completed with reviewer unavailable");
+    expect(log).toContain("### Judgement");
+    expect(log).toContain("judge auth failed");
+    expect(log).toContain("Autonomy continued without blocking completion.");
+  });
 });

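Reviewer note: the first test in this file implies that an accepted review only completes a task when the proof claim is complete, and that evidence alone is not enough. The sketch below shows one plausible reading. The list of required fields is an assumption taken from the metadata keys used in the tests, and `hasCompleteProofClaimSketch` and `shouldCompleteSketch` are hypothetical stand-ins for the real `hasCompleteProofClaim` and `shouldCompleteAfterAcceptedReview`, which take a `Task` rather than raw metadata.

```typescript
type MetadataSketch = Record<string, unknown>;

// Assumed set of required text fields, based on the keys the tests populate.
const REQUIRED_TEXT_FIELDS_SKETCH = [
  "lgtm_evidence",
  "lgtm_failure_likely",
  "lgtm_failure_sneaky",
  "lgtm_failure_unknown",
  "lgtm_falsification_test",
  "lgtm_evidence_reasoning",
  "lgtm_remaining_uncertainty",
] as const;

function isNonEmptyText(value: unknown): boolean {
  return typeof value === "string" && value.trim().length > 0;
}

// A claim counts as complete when every text field is filled in and at least
// one verification hint is present (assumed policy).
function hasCompleteProofClaimSketch(metadata: MetadataSketch): boolean {
  const hints = metadata["lgtm_verification_hints"];
  const hasHints = Array.isArray(hints) && hints.some((h) => isNonEmptyText(h));
  return hasHints && REQUIRED_TEXT_FIELDS_SKETCH.every((key) => isNonEmptyText(metadata[key]));
}

// Completion requires both an accepting judge and a complete claim.
function shouldCompleteSketch(metadata: MetadataSketch, accepted: boolean): boolean {
  return accepted && hasCompleteProofClaimSketch(metadata);
}
```

Note that the reviewer-unavailable test above describes a separate fail-open path, where completion proceeds when the judge cannot run; this sketch covers only the accepted-review path.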
@@ -4,11 +4,10 @@ import { join } from "node:path";
 import { afterEach, beforeEach, describe, expect, it } from "vitest";
 import { TaskStore } from "../src/task-store.js";

-// Helper: create a task and set pending_approval so complete() works
-function createAndApprove(store: TaskStore, subject: string) {
-  const task = store.create(subject, "Desc", "done criterion");
-  store.update(task.id, { pending_approval: true });
-  return task;
+// Helper: create a subtask, which can be ticked off directly.
+function createSubtask(store: TaskStore, subject: string) {
+  const parent = store.create(`${subject} parent`, "Desc", "done criterion");
+  return store.create(subject, "Desc", "done criterion", undefined, undefined, parent.id);
 }

 describe("TaskStore (in-memory)", () => {
@@ -28,7 +27,6 @@ describe("TaskStore (in-memory)", () => {
    expect(t1.subject).toBe("First task");
    expect(t1.description).toBe("Description 1");
    expect(t1.done_criterion).toBe("criterion 1");
-    expect(t1.pending_approval).toBe(false);
  });

  it("creates tasks with optional fields", () => {
@@ -110,7 +108,7 @@ describe("TaskStore (in-memory)", () => {
    expect(task.metadata).toEqual({ a: 1, c: 3, d: 4 });
  });

-  it("sets up bidirectional blocks via addBlocks", () => {
+  it("sets up bidirectional blocks via add_blocks", () => {
    store.create("Blocker", "Desc", "done");
    store.create("Blocked", "Desc", "done");

@@ -122,7 +120,7 @@ describe("TaskStore (in-memory)", () => {
    expect(t2.blockedBy).toContain("1");
  });

-  it("sets up bidirectional blocks via addBlockedBy", () => {
+  it("sets up bidirectional blocks via add_blocked_by", () => {
    store.create("Blocker", "Desc", "done");
    store.create("Blocked", "Desc", "done");

@@ -157,7 +155,7 @@ describe("TaskStore (in-memory)", () => {
  });

  it("clears completed tasks", () => {
-    createAndApprove(store, "Completed");
+    store.create("Completed", "Desc", "done");
    store.create("Pending", "Desc", "done");
    store.complete("1");

@@ -168,36 +166,28 @@ describe("TaskStore (in-memory)", () => {
    expect(store.list()[0].id).toBe("2");
  });

-  it("allows TaskUpdate(status=completed) for trivial tasks (no lgtm evidence)", () => {
-    store.create("Trivial", "Desc", "done");
-    const { task, changedFields } = store.update("1", { status: "completed" });
+  it("allows TaskUpdate(status=completed) for subtasks", () => {
+    createSubtask(store, "Checklist item");
+    const { task, changedFields } = store.update("2", { status: "completed" });
    expect(task!.status).toBe("completed");
    expect(changedFields).toContain("status");
  });

-  it("blocks TaskUpdate(status=completed) when pending_approval=true", () => {
-    store.create("Significant", "Desc", "done");
-    store.update("1", { pending_approval: true });
-    expect(() => store.update("1", { status: "completed" })).toThrow("/lgtm");
+  it("blocks TaskUpdate(status=completed) for top-level tasks", () => {
+    store.create("Goal", "Desc", "done");
+    expect(() => store.update("1", { status: "completed" })).toThrow("Top-level task #1 requires proof");
  });

-  it("blocks TaskUpdate(status=completed) when lgtm evidence is stored (even if review rejected)", () => {
+  it("keeps top-level completion gated even after proof evidence exists", () => {
    store.create("Escalated", "Desc", "done");
-    // lgtm_ask path stores evidence; if robot review rejects, pending_approval becomes false.
-    // The agent must not be able to bypass the gate by self-completing afterwards.
-    store.update("1", { metadata: { lgtm_evidence: "literal output" }, pending_approval: false });
-    expect(() => store.update("1", { status: "completed" })).toThrow("/lgtm");
+    store.update("1", { metadata: { lgtm_evidence: "literal output" } });
+    expect(() => store.update("1", { status: "completed" })).toThrow("TaskClaimDone");
  });

-  it("blocks TaskUpdate(status=completed) after evidence was superseded into history", () => {
-    store.create("Superseded", "Desc", "done");
-    store.update("1", {
-      metadata: {
-        lgtm_history: [{ iteration: 1, supersede_reason: "threshold changed" }],
-      },
-      pending_approval: false,
-    });
-    expect(() => store.update("1", { status: "completed" })).toThrow("completion_mode=lgtm");
+  it("rejects changing parentId after creation", () => {
+    store.create("Parent", "Desc", "done");
+    store.create("Child", "Desc", "done");
+    expect(() => store.update("2", { parentId: "1" })).toThrow("parentId is creation-only");
  });

  it("returns not found for update on non-existent task", () => {
@@ -206,16 +196,15 @@ describe("TaskStore (in-memory)", () => {
    expect(changedFields).toEqual([]);
  });

-  it("complete() works without pending_approval (human override path)", () => {
-    // The /lgtm command layer is the human gate; complete() itself is permissive.
+  it("complete() is the internal proof-review completion path", () => {
    store.create("Test", "Desc", "done");
    const task = store.complete("1");
    expect(task.status).toBe("completed");
  });

-  it("complete() works when pending_approval=true", () => {
-    createAndApprove(store, "Test");
-    const task = store.complete("1");
+  it("complete() also works for subtasks", () => {
+    createSubtask(store, "Test");
+    const task = store.complete("2");
    expect(task.status).toBe("completed");
  });

@@ -307,8 +296,7 @@ describe("TaskStore (in-memory)", () => {
    store.create("Blocker", "Desc", "done");
    store.create("Blocked", "Desc", "done");
    store.update("1", { add_blocks: ["2"] });
-    // Set pending_approval on task 1 so complete() works via /lgtm path
-    store.update("1", { pending_approval: true });
+    // complete() is the internal proof-review completion path.
    store.complete("1");

    store.clearCompleted();
@@ -317,7 +305,7 @@ describe("TaskStore (in-memory)", () => {
    expect(t2.blockedBy).toEqual([]);
  });

-  it("handles multiple addBlocks in one call", () => {
+  it("handles multiple add_blocks in one call", () => {
    store.create("Blocker", "Desc", "done");
    store.create("B1", "Desc", "done");
    store.create("B2", "Desc", "done");
@@ -329,21 +317,21 @@ describe("TaskStore (in-memory)", () => {
    expect(store.get("3")!.blockedBy).toContain("1");
  });

-  it("addBlockedBy warns on self-dependency", () => {
+  it("add_blocked_by warns on self-dependency", () => {
    store.create("Self", "Desc", "done");
    const { warnings } = store.update("1", { add_blocked_by: ["1"] });
    expect(store.get("1")!.blockedBy).toContain("1");
    expect(warnings).toContain("#1 blocks itself");
  });

-  it("addBlockedBy warns on dangling ref", () => {
+  it("add_blocked_by warns on dangling ref", () => {
    store.create("Real", "Desc", "done");
    const { warnings } = store.update("1", { add_blocked_by: ["9999"] });
    expect(store.get("1")!.blockedBy).toContain("9999");
    expect(warnings).toContain("#9999 does not exist");
  });

-  it("addBlockedBy warns on cycle", () => {
+  it("add_blocked_by warns on cycle", () => {
    store.create("A", "Desc", "done");
    store.create("B", "Desc", "done");
    store.update("1", { add_blocks: ["2"] });
@@ -358,7 +346,7 @@ describe("TaskStore (in-memory)", () => {

  it("list sorts pending → in_progress → completed with all three present", () => {
    store.create("Pending task", "Desc", "done");
-    createAndApprove(store, "Completed task");
+    store.create("Completed task", "Desc", "done");
    store.create("In-progress task", "Desc", "done");
    store.create("Another pending", "Desc", "done");

@@ -413,7 +401,6 @@ describe("TaskStore (file-backed)", () => {
    const store1 = new TaskStore(testListId);
    store1.create("Done task", "Desc", "done");
    store1.create("Pending task", "Desc", "done");
-    store1.update("1", { pending_approval: true });
    store1.complete("1");

    const store2 = new TaskStore(testListId);
@@ -429,7 +416,6 @@ describe("TaskStore (file-backed)", () => {
    store1.create("In progress", "Desc", "done");
    store1.create("Done", "Desc", "done");
    store1.update("2", { status: "in_progress" });
-    store1.update("3", { pending_approval: true });
    store1.complete("3");

    const store2 = new TaskStore(testListId);
@@ -473,7 +459,6 @@ describe("TaskStore (absolute path)", () => {
    const store1 = new TaskStore(absFilePath);
    store1.create("Pending", "Desc", "done");
    store1.create("Completed", "Desc", "done");
-    store1.update("2", { pending_approval: true });
    store1.complete("2");

    const raw = JSON.parse(readFileSync(absFilePath, "utf-8"));
@@ -92,7 +92,6 @@ describe("TaskWidget", () => {

  it("renders completed tasks with ✔ icon and strikethrough", () => {
    store.create("Done task", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    widget.update();

@@ -105,7 +104,6 @@ describe("TaskWidget", () => {
    store.create("Done task", "Desc", "done");
    store.update("1", {
      metadata: { robot_review_observations: ["Observed output drift on seed 2"] },
-      pending_approval: true,
    });
    store.complete("1");
    widget.update();
@@ -143,7 +141,6 @@ describe("TaskWidget", () => {
    store.create("Blocker", "Desc", "done");
    store.create("Blocked", "Desc", "done");
    store.update("2", { add_blocked_by: ["1"] });
-    store.update("1", { pending_approval: true });
    store.complete("1");
    widget.update();

@@ -156,7 +153,6 @@ describe("TaskWidget", () => {
    store.create("Task A", "Desc", "done");
    store.create("Task B", "Desc", "done");
    store.create("Task C", "Desc", "done");
-    store.update("1", { pending_approval: true });
    store.complete("1");
    store.update("2", { status: "in_progress" });
    widget.update();
@@ -226,7 +222,6 @@ describe("TaskWidget", () => {
    widget.setActiveTask("1", true);

    // Complete the task externally
-    store.update("1", { pending_approval: true });
    store.complete("1");
    widget.update();