add lgtm evidence history and artifact metadata

2026-06-27 15:31:29 +08:00 · 2026-06-07 19:03:01 +08:00
parent 023a2c2263
commit 25116e399a
8 changed files with 443 additions and 45 deletions
@@ -65,7 +65,7 @@ Lists all tasks. `👀` indicates pending sign-off.

 ### `TaskGet`

-Full task details including `done_criterion`, approval state, and a one-line gate status such as `ready for human sign-off via /lgtm 5` or `blocked: automatic robot review failed: ...`.
+Full task details including `done_criterion`, approval state, `completion mode`, `review state`, a one-line gate status such as `ready for human sign-off via /lgtm 5` or `blocked: automatic robot review failed: ...`, and evidence-iteration history.

 ### `TaskUpdate`

@@ -84,13 +84,22 @@ The epistemic gate. Required fields:
 | `falsification_test` | What you ran and what you got, so both you and the human can sanity-check it. Why that result could not occur if a failure mode were real. |
 | `verification_hints` | Where to look and what to check. Descriptions of evidence locations. |
 | `remaining_uncertainty` | What is NOT tested, deferred edge cases, known limitations |
+| `commands` | Optional structured command records: `{ cmd, exit_code, stdout_path?, stderr_path? }` |
+| `evidence_paths` / `falsification_paths` | Optional local artifact paths. Stored as absolute path + sha256 + byte size |
+| `supersede_reason` | Optional reason when this replaces older evidence on the same task |

-After calling this, the task shows `👀` and is only completable via `/lgtm <id>`. Evidence is stored on the task so the human can review it hours later without scrolling back.
+After calling this, the task shows `👀` and is only completable via `/lgtm <id>`. Evidence is stored on the task so the human can review it hours later without scrolling back. Re-submitting evidence archives the prior package into superseded history instead of silently overwriting it.

 The tool result includes a non-blocking self-check prompt asking whether the evidence directly addresses the `done_criterion` and whether a skeptical reviewer would find it convincing.

 `lgtm_ask` always runs the robot-review stage immediately after storing evidence. A failing or errored robot review clears `pending_approval` until the evidence is strengthened and reviewed again.

+### `lgtm_supersede`
+
+Explicitly retire the current evidence package without completing the task.
+
+Use this when the claim changed or the prior evidence is stale. The tool archives the current evidence, current robot reviews, and reviewer-failure context into history with your reason, then closes the human gate until new evidence is submitted.
+
 ### `robot_review_ask`

 Attach a fresh-perspective robot review to a task.
@@ -143,9 +152,11 @@ Interactive menu: view tasks, create task, clear completed/all.

 ```
 pending -> in_progress -> (lgtm_ask)
-                       -> robot review iteration(s) 🤖
-                       -> pending_approval 👀   if latest robot review passes or no robot review is required
-                       -> strengthen evidence + rerun review   if latest robot review fails
+                       -> current evidence iteration N
+                       -> robot review iteration(s) on current evidence 🤖
+                       -> pending_approval 👀   if latest robot review passes
+                       -> reviewer_failed_to_run | reviewer_rejected
+                       -> lgtm_supersede or newer lgtm_ask -> superseded history + fresh current evidence
                       -> (/lgtm) -> completed
                       -> deleted
 ```
@@ -173,7 +184,7 @@ PI_TASKS_DEBUG=1      # trace to stderr

 ```
 src/
-├── index.ts        # 7 tools + /tasks + /lgtm commands + widget + event handlers
+├── index.ts        # 8 tools + /tasks + /lgtm commands + widget + event handlers
 ├── review-badges.ts # Review badge helpers for tool/robot/human lanes
 ├── robot-review.ts # Robot review iteration storage + compatibility helpers
 ├── types.ts        # Task, TaskStatus types
@@ -3,17 +3,22 @@
 ## Goal
 Fix the subset of reported LGTM-tool friction that seems clearly worth it now.

-Specifically: make auto-review parse failures less opaque, let an accepted manual robot review reopen the human gate, show a compact current gate status, and make stored evidence easier to read during `/lgtm` review.
+First batch done: make auto-review parse failures less opaque, let an accepted manual robot review reopen the human gate, show a compact current gate status, and make stored evidence easier to read during `/lgtm` review.
+
+Second batch: add structured command/artifact metadata, expose completion/review state more explicitly, and track superseded evidence iterations so stale claims stop polluting current review.

 ## Scope
-In: parser/error handling for automatic robot review, manual review gate semantics, gate-status rendering, `/lgtm` evidence formatting, focused tests, README updates.
-Out: evidence path attachments, supersede operations, schema redesign for commands/log artifacts, new completion-mode field, full state-machine overhaul.
+In: parser/error handling for automatic robot review, manual review gate semantics, gate-status rendering, `/lgtm` evidence formatting, structured command/artifact metadata, evidence supersession/iterations, completion/review state summaries, focused tests, README updates.
+Out: full schema/tool-calling reviewer redesign, giant workflow/state-machine rewrite, attachment upload plumbing beyond local path+hash metadata.

 ## Requirements
 - R1: Automatic robot review failures are diagnosable. Done means: parse errors include raw-output context and tolerate common wrapper noise around the JSON object. VERIFY: `npm test -- test/robot-review-runner.test.ts` shows parser fallback tests passing. If silent failure remained, the new malformed-output tests would still throw opaque errors.
 - R2: Accepted manual robot review can reopen human sign-off after a failed auto review. Done means: storing an accepted manual review can set `pending_approval=true` when LGTM evidence exists. VERIFY: focused test covers this transition. If broken, the task would stay blocked after accepted manual review.
 - R3: Task details expose current gate status in one short line. Done means: a helper summarizes states like ready, blocked on rejected review, or blocked on reviewer failure. VERIFY: focused tests assert representative summaries. If broken, TaskGet output would still require reading raw metadata to infer gate state.
 - R4: `/lgtm` review output is easier to scan. Done means: evidence and falsification output are rendered in explicit markdown sections/code fences instead of one flat blob. VERIFY: source shows sectioned formatter used by `/lgtm`. If broken, the old unstructured evidence block remains.
+- R5: LGTM submissions can store first-class command/artifact metadata. Done means: `lgtm_ask` accepts structured commands and artifact paths, stores hashes for artifacts, and renders them in human-facing task views. VERIFY: focused tests assert artifact hashing/formatting and metadata retention. If broken, TaskGet or `/lgtm` cannot show the structured records.
+- R6: Human-facing state is explicit, not inferred from raw booleans. Done means: TaskGet exposes completion mode plus a compact review-state line such as `reviewer_failed_to_run` or `superseded`. VERIFY: focused tests assert representative state strings. If broken, the output still requires reading metadata internals.
+- R7: Old evidence can be superseded cleanly. Done means: there is an explicit supersede path and repeated `lgtm_ask` calls archive prior evidence into history with reasons, rather than overwriting silently. VERIFY: focused tests cover history growth and direct-completion staying blocked after supersede. If broken, stale evidence disappears or the agent can bypass `/lgtm` after superseding.

 ## Tasks
 - [x] T1 (R1): Harden robot-review JSON extraction and error messages.
@@ -37,6 +42,27 @@ Out: evidence path attachments, supersede operations, schema redesign for comman
  - likely_fail: docs still imply only auto review can reopen approval
  - sneaky_fail: docs mention review generally but omit the new gate semantics
  - UAT: "when I read README, I can infer the new gate behavior without reading code"
+- [x] T4 (R5,R6): Add structured command/artifact metadata and explicit review/completion state.
+  - steps: extend `lgtm_ask` schema; hash artifact paths; add review-state and completion-mode helpers; render them in TaskGet and `/lgtm`
+  - verify: `npm test -- test/review-badges.test.ts test/task-store.test.ts`
+  - success: task details show completion mode, review state, commands, and artifact hashes
+  - likely_fail: metadata is stored but never shown to the human
+  - sneaky_fail: review state lies after supersede or reviewer failure; helper tests catch this
+  - UAT: "when I inspect a reviewed task, I observe command/artifact records and one explicit review state"
+- [x] T5 (R7): Add evidence history and supersede flow.
+  - steps: archive current evidence before replacement; add explicit supersede tool/reason; keep direct completion blocked once a task enters LGTM mode
+  - verify: `npm test -- test/task-store.test.ts test/robot-review.test.ts`
+  - success: old evidence appears in history with reason and agent self-completion remains blocked
+  - likely_fail: superseding clears current evidence and accidentally re-enables TaskUpdate(status=completed)
+  - sneaky_fail: robot reviews from prior evidence leak into the new current evidence; focused tests catch iteration reset/history behavior
+  - UAT: "when I replace or supersede evidence, I can still inspect the old claim and the new review starts fresh"
+- [x] T6 (R5-R7): Update README for the second batch.
+  - steps: document commands/artifacts, review/completion state, and supersede behavior
+  - verify: `rg -n "commands|artifact|supersede|completion mode|review state" README.md`
+  - success: README explains the new structured evidence workflow
+  - likely_fail: docs mention artifacts but omit supersede semantics
+  - sneaky_fail: docs imply a full reviewer-tool-calling redesign that does not exist
+  - UAT: "when I read README, I understand how to attach artifacts and retire stale evidence"

 ## Context
 - Keep fail-fast simplicity. Do not add a large state-machine rewrite.
@@ -47,10 +73,11 @@ Out: evidence path attachments, supersede operations, schema redesign for comman
 - 2026-06-07: User explicitly said not to fix everything, only the subset that makes sense. Chosen fixes are 1, 2, 10, plus `/lgtm` evidence readability because it is low-cost and directly reported.
 - 2026-06-07: Implemented best-effort JSON extraction from noisy marker blocks, persisted automatic-review failure context in task metadata, reopened `pending_approval` when any accepted review validates stored evidence, added `Gate status:` summaries, and reformatted human review output into explicit sections/code fences.
 - 2026-06-07: Verification passed with `npm test`, `npm run typecheck`, and `npm run lint`.
+- 2026-06-07: Second batch added structured command/artifact metadata, explicit `completion mode` and `review state`, evidence history with supersede reasons, and a dedicated `lgtm_supersede` tool.
+- 2026-06-07: Second-batch verification passed with `npm test -- test/review-badges.test.ts test/robot-review.test.ts test/task-store.test.ts`, then full `npm run lint && npm run typecheck && npm test`.

 ## TODO
- Evidence-path attachments may be worth adding later if pasted blobs remain too awkward.
- Superseded-evidence support probably wants explicit iteration objects, not another metadata flag.
+- Full reviewer tool-calling/schema enforcement is still probably a separate refactor if parse brittleness returns.

 ## Errors
 | Task | Error | Resolution |
@@ -23,11 +23,23 @@
 */

 import { spawn } from "node:child_process";
+import { createHash } from "node:crypto";
+import { readFileSync, statSync } from "node:fs";
 import { join, resolve } from "node:path";
 import type { ExtensionAPI, ExtensionCommandContext, ExtensionContext } from "@mariozechner/pi-coding-agent";
 import { Type } from "@sinclair/typebox";
 import { AutoClearManager } from "./auto-clear.js";
-import { type DisplayStatus, getDisplayStatus, getGateStatus, getReviewBadges, getStateTag } from "./review-badges.js";
+import {
+  type CompletionMode,
+  type DisplayStatus,
+  getCompletionMode,
+  getDisplayStatus,
+  getGateStatus,
+  getReviewBadges,
+  getReviewState,
+  getStateTag,
+  type ReviewState,
+} from "./review-badges.js";
 import {
  appendRobotReviewMetadata,
  getLatestRobotReview,
@@ -45,7 +57,7 @@ function textResult(msg: string) {
  return { content: [{ type: "text" as const, text: msg }], details: undefined as any };
 }

-const TASK_TOOL_NAMES = new Set(["TaskCreate", "TaskList", "TaskGet", "TaskUpdate", "lgtm_ask", "robot_review_ask", "robot_review_run"]);
+const TASK_TOOL_NAMES = new Set(["TaskCreate", "TaskList", "TaskGet", "TaskUpdate", "lgtm_ask", "lgtm_supersede", "robot_review_ask", "robot_review_run"]);
 const REMINDER_INTERVAL = 4;
 const AUTO_CLEAR_DELAY = 4;
 export const DEFAULT_ROBOT_REVIEW_TIMEOUT_MS = 120_000;
@@ -219,6 +231,37 @@ function extractBalancedJsonObject(text: string): string | undefined {
  return undefined;
 }

+interface EvidenceCommandRecord {
+  cmd: string;
+  exit_code: number;
+  stdout_path?: string;
+  stderr_path?: string;
+}
+
+interface EvidenceArtifactRecord {
+  path: string;
+  sha256: string;
+  bytes: number;
+}
+
+interface EvidenceIterationRecord {
+  iteration: number;
+  submitted_at: string;
+  superseded_at?: string;
+  supersede_reason?: string;
+  evidence: string;
+  failure_likely: string;
+  failure_sneaky: string;
+  falsification_test: string;
+  verification_hints: string[];
+  remaining_uncertainty: string;
+  commands: EvidenceCommandRecord[];
+  evidence_artifacts: EvidenceArtifactRecord[];
+  falsification_artifacts: EvidenceArtifactRecord[];
+  robot_reviews: RobotReviewRecord[];
+  automatic_review_failure?: { message: string; raw_output?: string };
+}
+
 function getAutomaticReviewFailureMetadata(message: string, rawOutput?: string): Record<string, unknown> {
  return {
    robot_review_last_error: message,
@@ -235,32 +278,173 @@ function clearAutomaticReviewFailureMetadata(): Record<string, unknown> {
  };
 }

+function clearRobotReviewMetadata(): Record<string, unknown> {
+  return {
+    robot_reviews: null,
+    robot_review_reviewer: null,
+    robot_review_scope: null,
+    robot_review_observations: null,
+    robot_review_blind_spots: null,
+    robot_review_accepted: null,
+    robot_review_evidence_complete: null,
+    robot_review_evidence_convincing: null,
+    robot_review_missing_evidence: null,
+    robot_review_submitted_at: null,
+    robot_review_mode: null,
+    robot_review_raw_output: null,
+    robot_review_requires_followup: null,
+    robot_review_iteration_count: null,
+  };
+}
+
+function clearCurrentEvidenceMetadata(): Record<string, unknown> {
+  return {
+    lgtm_evidence: null,
+    lgtm_failure_likely: null,
+    lgtm_failure_sneaky: null,
+    lgtm_falsification_test: null,
+    lgtm_verification_hints: null,
+    lgtm_remaining_uncertainty: null,
+    lgtm_submitted_at: null,
+    lgtm_commands: null,
+    lgtm_evidence_artifacts: null,
+    lgtm_falsification_artifacts: null,
+  };
+}
+
+function normalizeCommandRecords(value: unknown): EvidenceCommandRecord[] {
+  return Array.isArray(value)
+    ? value.flatMap((entry) => {
+      if (!entry || typeof entry !== "object") return [];
+      const command = entry as Record<string, unknown>;
+      if (typeof command.cmd !== "string" || typeof command.exit_code !== "number") return [];
+      return [{
+        cmd: command.cmd,
+        exit_code: command.exit_code,
+        stdout_path: typeof command.stdout_path === "string" ? command.stdout_path : undefined,
+        stderr_path: typeof command.stderr_path === "string" ? command.stderr_path : undefined,
+      }];
+    })
+    : [];
+}
+
+function normalizeArtifactRecords(value: unknown): EvidenceArtifactRecord[] {
+  return Array.isArray(value)
+    ? value.flatMap((entry) => {
+      if (!entry || typeof entry !== "object") return [];
+      const artifact = entry as Record<string, unknown>;
+      if (typeof artifact.path !== "string" || typeof artifact.sha256 !== "string" || typeof artifact.bytes !== "number") return [];
+      return [{ path: artifact.path, sha256: artifact.sha256, bytes: artifact.bytes }];
+    })
+    : [];
+}
+
+export function buildArtifactRecords(paths?: string[]): EvidenceArtifactRecord[] {
+  return (paths ?? []).map((path) => {
+    const resolvedPath = resolve(path);
+    const bytes = statSync(resolvedPath).size;
+    const sha256 = createHash("sha256").update(readFileSync(resolvedPath)).digest("hex");
+    return { path: resolvedPath, sha256, bytes };
+  });
+}
+
+export function getEvidenceHistory(task: Task): EvidenceIterationRecord[] {
+  return Array.isArray(task.metadata?.lgtm_history)
+    ? task.metadata.lgtm_history.filter((entry: unknown): entry is EvidenceIterationRecord => !!entry && typeof entry === "object")
+    : [];
+}
+
+export function getCurrentEvidenceIteration(task: Task): EvidenceIterationRecord | undefined {
+  const metadata = task.metadata ?? {};
+  if (typeof metadata.lgtm_evidence !== "string") return undefined;
+  return {
+    iteration: getEvidenceHistory(task).length + 1,
+    submitted_at: typeof metadata.lgtm_submitted_at === "string" ? metadata.lgtm_submitted_at : new Date(0).toISOString(),
+    evidence: metadata.lgtm_evidence,
+    failure_likely: typeof metadata.lgtm_failure_likely === "string" ? metadata.lgtm_failure_likely : "",
+    failure_sneaky: typeof metadata.lgtm_failure_sneaky === "string" ? metadata.lgtm_failure_sneaky : "",
+    falsification_test: typeof metadata.lgtm_falsification_test === "string" ? metadata.lgtm_falsification_test : "",
+    verification_hints: Array.isArray(metadata.lgtm_verification_hints) ? metadata.lgtm_verification_hints.filter((hint: unknown): hint is string => typeof hint === "string") : [],
+    remaining_uncertainty: typeof metadata.lgtm_remaining_uncertainty === "string" ? metadata.lgtm_remaining_uncertainty : "",
+    commands: normalizeCommandRecords(metadata.lgtm_commands),
+    evidence_artifacts: normalizeArtifactRecords(metadata.lgtm_evidence_artifacts),
+    falsification_artifacts: normalizeArtifactRecords(metadata.lgtm_falsification_artifacts),
+    robot_reviews: getRobotReviews(task),
+    automatic_review_failure: typeof metadata.robot_review_last_error === "string"
+      ? {
+        message: metadata.robot_review_last_error,
+        raw_output: typeof metadata.robot_review_last_error_output === "string" ? metadata.robot_review_last_error_output : undefined,
+      }
+      : undefined,
+  };
+}
+
+export function getEvidenceIterationCount(task: Task): number {
+  return getEvidenceHistory(task).length + (getCurrentEvidenceIteration(task) ? 1 : 0);
+}
+
+export function archiveCurrentEvidence(task: Task, reason: string): Record<string, unknown> {
+  const current = getCurrentEvidenceIteration(task);
+  if (!current) return {};
+  return {
+    lgtm_history: [
+      ...getEvidenceHistory(task),
+      {
+        ...current,
+        superseded_at: new Date().toISOString(),
+        supersede_reason: reason,
+      },
+    ],
+  };
+}
+
 function formatReviewTextBlock(title: string, body: string): string {
  return `### ${title}\n\n\`\`\`text\n${body}\n\`\`\``;
 }

+function formatCommandRecords(commands: EvidenceCommandRecord[]): string | undefined {
+  if (commands.length === 0) return undefined;
+  return `### Commands\n${commands.map((command, index) => {
+    const parts = [
+      `${index + 1}. \`${command.cmd}\``,
+      `exit=${command.exit_code}`,
+    ];
+    if (command.stdout_path) parts.push(`stdout=${command.stdout_path}`);
+    if (command.stderr_path) parts.push(`stderr=${command.stderr_path}`);
+    return `- ${parts.join(" | ")}`;
+  }).join("\n")}`;
+}
+
+function formatArtifactRecords(title: string, artifacts: EvidenceArtifactRecord[]): string | undefined {
+  if (artifacts.length === 0) return undefined;
+  return `### ${title}\n${artifacts.map((artifact) => `- ${artifact.path} | sha256=${artifact.sha256} | bytes=${artifact.bytes}`).join("\n")}`;
+}
+
 function formatEvidencePackage(task: Task): string[] {
-  const metadata = task.metadata ?? {};
+  const current = getCurrentEvidenceIteration(task);
  const sections: string[] = [];
-  if (typeof metadata.lgtm_evidence === "string") {
-    sections.push(formatReviewTextBlock("Evidence", metadata.lgtm_evidence));
-    if (typeof metadata.lgtm_failure_likely === "string") sections.push(`### Failure (likely)\n${metadata.lgtm_failure_likely}`);
-    if (typeof metadata.lgtm_failure_sneaky === "string") sections.push(`### Failure (sneaky)\n${metadata.lgtm_failure_sneaky}`);
-    if (typeof metadata.lgtm_falsification_test === "string") {
-      sections.push(formatReviewTextBlock("Falsification test", metadata.lgtm_falsification_test));
+  if (current) {
+    sections.push(`Evidence iteration: ${current.iteration} of ${getEvidenceIterationCount(task)}`);
+    sections.push(formatReviewTextBlock("Evidence", current.evidence));
+    if (current.failure_likely) sections.push(`### Failure (likely)\n${current.failure_likely}`);
+    if (current.failure_sneaky) sections.push(`### Failure (sneaky)\n${current.failure_sneaky}`);
+    if (current.falsification_test) sections.push(formatReviewTextBlock("Falsification test", current.falsification_test));
+    const commands = formatCommandRecords(current.commands);
+    if (commands) sections.push(commands);
+    const evidenceArtifacts = formatArtifactRecords("Evidence artifacts", current.evidence_artifacts);
+    if (evidenceArtifacts) sections.push(evidenceArtifacts);
+    const falsificationArtifacts = formatArtifactRecords("Falsification artifacts", current.falsification_artifacts);
+    if (falsificationArtifacts) sections.push(falsificationArtifacts);
+    if (current.verification_hints.length > 0) {
+      sections.push(`### Verification hints\n${current.verification_hints.map((hint) => `- ${hint}`).join("\n")}`);
    }
-    if (Array.isArray(metadata.lgtm_verification_hints) && metadata.lgtm_verification_hints.length > 0) {
-      sections.push(`### Verification hints\n${metadata.lgtm_verification_hints.map((hint: string) => `- ${hint}`).join("\n")}`);
-    }
-    if (typeof metadata.lgtm_remaining_uncertainty === "string") {
-      sections.push(`### Remaining uncertainty\n${metadata.lgtm_remaining_uncertainty}`);
-    }
-    if (typeof metadata.lgtm_submitted_at === "string") sections.push(`Submitted: ${metadata.lgtm_submitted_at}`);
+    if (current.remaining_uncertainty) sections.push(`### Remaining uncertainty\n${current.remaining_uncertainty}`);
+    sections.push(`Submitted: ${current.submitted_at}`);
  }
-  if (typeof metadata.robot_review_last_error === "string") {
-    sections.push(`### Automatic robot review failure\n${metadata.robot_review_last_error}`);
-    if (typeof metadata.robot_review_last_error_output === "string" && metadata.robot_review_last_error_output.trim()) {
-      sections.push(formatReviewTextBlock("Reviewer raw output", metadata.robot_review_last_error_output));
+  if (typeof task.metadata?.robot_review_last_error === "string") {
+    sections.push(`### Automatic robot review failure\n${task.metadata.robot_review_last_error}`);
+    if (typeof task.metadata?.robot_review_last_error_output === "string" && task.metadata.robot_review_last_error_output.trim()) {
+      sections.push(formatReviewTextBlock("Reviewer raw output", task.metadata.robot_review_last_error_output));
    }
  }
  return sections;
@@ -268,10 +452,18 @@ function formatEvidencePackage(task: Task): string[] {

 function getNonReviewMetadata(task: Task): Record<string, unknown> {
  return Object.fromEntries(
-    Object.entries(task.metadata ?? {}).filter(([key]) => !key.startsWith("lgtm_") && !key.startsWith("robot_review_")),
+    Object.entries(task.metadata ?? {}).filter(([key]) =>
+      !key.startsWith("lgtm_") && !key.startsWith("robot_review_") && key !== "lgtm_history"
+    ),
  );
 }

+function formatHistorySummary(task: Task): string | undefined {
+  const history = getEvidenceHistory(task);
+  if (history.length === 0) return undefined;
+  return `Superseded evidence:\n${history.map((entry) => `- #${entry.iteration} superseded ${entry.superseded_at ?? "?"}: ${entry.supersede_reason ?? "(no reason recorded)"}`).join("\n")}`;
+}
+
 export function extractRobotReviewJson(output: string): Record<string, unknown> {
  const match = output.match(/ROBOT_REVIEW_JSON_START\s*([\s\S]*?)\s*ROBOT_REVIEW_JSON_END/);
  const source = match ? match[1] : output;
@@ -364,6 +556,9 @@ function buildRobotReviewPrompt(task: any): string {
    `Falsification test: ${task.metadata?.lgtm_falsification_test ?? "(missing)"}`,
    `Verification hints: ${Array.isArray(task.metadata?.lgtm_verification_hints) ? task.metadata.lgtm_verification_hints.join(" | ") : "(missing)"}`,
    `Remaining uncertainty: ${task.metadata?.lgtm_remaining_uncertainty ?? "(missing)"}`,
+    `Commands: ${JSON.stringify(normalizeCommandRecords(task.metadata?.lgtm_commands))}`,
+    `Evidence artifacts: ${JSON.stringify(normalizeArtifactRecords(task.metadata?.lgtm_evidence_artifacts))}`,
+    `Falsification artifacts: ${JSON.stringify(normalizeArtifactRecords(task.metadata?.lgtm_falsification_artifacts))}`,
    priorSection,
    "Output format:",
    "ROBOT_REVIEW_JSON_START",
@@ -680,19 +875,29 @@ export default function (pi: ExtensionAPI) {

      const desc = task.description.replace(/\\n/g, "\n");
      const robotReviews = getRobotReviews(task);
+      const completionMode: CompletionMode = getCompletionMode(task);
+      const reviewState: ReviewState = getReviewState(task);
+      const currentEvidence = getCurrentEvidenceIteration(task);
+      const history = getEvidenceHistory(task);
      const lines: string[] = [
        `Task #${task.id}: ${task.subject}`,
        `Status: ${task.status} ${getReviewBadges(task)}${task.pending_approval && task.status !== "completed" ? " (pending human sign-off)" : ""}`,
+        `Completion mode: ${completionMode}`,
+        `Review state: ${reviewState}`,
        `Gate status: ${getGateStatus(task)}`,
        `Done criterion: ${task.done_criterion}`,
        `Description: ${desc}`,
      ];
+      lines.push(`Evidence iterations: total=${getEvidenceIterationCount(task)}, current=${currentEvidence ? currentEvidence.iteration : 0}, superseded=${history.length}`);
+      lines.push(`Human sign-off pending: ${task.pending_approval ? "yes" : "no"}`);
      if (robotReviews.length > 0) {
        const latest = robotReviews[robotReviews.length - 1];
-        lines.push(`Robot reviews: ${robotReviews.length} (latest: accepted=${latest.accepted ? "yes" : "no"}, complete=${latest.evidence_complete ? "yes" : "no"}, convincing=${latest.evidence_convincing ? "yes" : "no"})`);
+        lines.push(`Robot reviews on current evidence: ${robotReviews.length} (latest: accepted=${latest.accepted ? "yes" : "no"}, complete=${latest.evidence_complete ? "yes" : "no"}, convincing=${latest.evidence_convincing ? "yes" : "no"})`);
      }
      const evidenceSections = formatEvidencePackage(task);
      if (evidenceSections.length > 0) lines.push(...evidenceSections);
+      const historySummary = formatHistorySummary(task);
+      if (historySummary) lines.push(historySummary);
      if (task.blockedBy.length > 0) {
        const openBlockers = task.blockedBy.filter(bid => {
          const blocker = store.get(bid);
@@ -794,7 +999,10 @@ Do NOT summarize or interpret. Paste literal command output, exact log lines, ma
 - **failure_sneaky**: Most perverse or sneaky failure -- one that looks like success superficially, corrupts silently, or only breaks under specific conditions (scale, time, edge case). E.g. feature active but wrong mechanism, works in tests but degrades in prod, correct output for wrong reason.
 - **falsification_test**: What you ran and the literal output you got, with reasoning why that output disproves the failure mode
 - **verification_hints**: Where to look and what to check, with specific content quoted (not bare paths or counts)
- **remaining_uncertainty**: What's NOT tested, known limitations, deferred edge cases`,
+- **remaining_uncertainty**: What's NOT tested, known limitations, deferred edge cases
+- **commands**: Optional first-class command records for the evidence package
+- **evidence_paths / falsification_paths**: Optional local artifact paths. The tool stores absolute path, sha256, and byte size for auditability.
+- **supersede_reason**: Optional reason when this submission replaces an older one on the same task`,
    parameters: Type.Object({
      taskId: Type.String({ description: "Task ID to submit for sign-off" }),
      evidence: Type.String({ description: "Verbatim auditable proof: literal command output, exact log lines, markdown block quotes, table rows, URLs. NOT summaries or interpretations. 'I ran X and got Y' is not evidence -- paste the actual output of X. A human must verify from this alone without re-running. (One short paragraph is fine; verbatim matters more than length.)" }),
@@ -803,6 +1011,15 @@ Do NOT summarize or interpret. Paste literal command output, exact log lines, ma
      falsification_test: Type.String({ description: "What you ran and the literal output you got. Include verbatim command + output, not 'it worked'. State why that output could not occur if a failure mode were real. Brevity is fine; the verbatim output is what counts." }),
      verification_hints: Type.Array(Type.String(), { description: "Where to look, with specific content quoted (not bare paths or counts). E.g. 'src/loss.py:45-60 shows grad_norm=0.001'. One or two short hints is enough." }),
      remaining_uncertainty: Type.String({ description: "What's NOT tested, known limitations, deferred edges. One short sentence preferred. If you can't articulate uncertainty, you haven't thought hard enough." }),
+      commands: Type.Optional(Type.Array(Type.Object({
+        cmd: Type.String({ description: "Exact command that was run" }),
+        exit_code: Type.Number({ description: "Process exit code" }),
+        stdout_path: Type.Optional(Type.String({ description: "Optional path to captured stdout" })),
+        stderr_path: Type.Optional(Type.String({ description: "Optional path to captured stderr" })),
+      }))),
+      evidence_paths: Type.Optional(Type.Array(Type.String(), { description: "Optional local artifact paths backing the evidence. Stored as absolute path + sha256 + byte size." })),
+      falsification_paths: Type.Optional(Type.Array(Type.String(), { description: "Optional local artifact paths backing the falsification test. Stored as absolute path + sha256 + byte size." })),
+      supersede_reason: Type.Optional(Type.String({ description: "Why this evidence replaces an older submission on the same task." })),
    }),

    async execute(_toolCallId, params, signal, _onUpdate, _ctx) {
@@ -815,6 +1032,9 @@ Do NOT summarize or interpret. Paste literal command output, exact log lines, ma
      store.update(params.taskId, {
        pending_approval: true,
        metadata: {
+          ...archiveCurrentEvidence(task, params.supersede_reason ?? "replaced by newer lgtm submission"),
+          ...clearCurrentEvidenceMetadata(),
+          ...clearRobotReviewMetadata(),
          lgtm_evidence: params.evidence,
          lgtm_failure_likely: params.failure_likely,
          lgtm_failure_sneaky: params.failure_sneaky,
@@ -822,6 +1042,9 @@ Do NOT summarize or interpret. Paste literal command output, exact log lines, ma
          lgtm_verification_hints: params.verification_hints,
          lgtm_remaining_uncertainty: params.remaining_uncertainty,
          lgtm_submitted_at: new Date().toISOString(),
+          lgtm_commands: params.commands ?? [],
+          lgtm_evidence_artifacts: buildArtifactRecords(params.evidence_paths),
+          lgtm_falsification_artifacts: buildArtifactRecords(params.falsification_paths),
          ...clearAutomaticReviewFailureMetadata(),
        },
      });
@@ -885,6 +1108,46 @@ Do NOT summarize or interpret. Paste literal command output, exact log lines, ma
    },
  });

+  pi.registerTool({
+    name: "lgtm_supersede",
+    label: "lgtm_supersede",
+    description: `Mark the current LGTM evidence package as superseded without completing the task.
+
+Use this when a prior claim is stale or wrong and reviewers should stop treating it as the current evidence. The current evidence, robot reviews, and reviewer-failure context are archived into history with your reason. Human /lgtm remains the only completion path.`,
+    parameters: Type.Object({
+      taskId: Type.String({ description: "Task ID whose current evidence should be superseded" }),
+      reason: Type.String({ description: "Why the current evidence is stale or replaced" }),
+    }),
+
+    execute(_toolCallId, params, _signal, _onUpdate, _ctx) {
+      const task = store.get(params.taskId);
+      if (!task) return Promise.resolve(textResult(`Task #${params.taskId} not found`));
+      if (!getCurrentEvidenceIteration(task)) {
+        return Promise.resolve(textResult(`Task #${params.taskId} has no current evidence to supersede.`));
+      }
+
+      store.update(params.taskId, {
+        pending_approval: false,
+        metadata: {
+          ...archiveCurrentEvidence(task, params.reason),
+          ...clearCurrentEvidenceMetadata(),
+          ...clearRobotReviewMetadata(),
+          ...clearAutomaticReviewFailureMetadata(),
+        },
+      });
+      widget.update();
+
+      const updatedTask = store.get(params.taskId) ?? task;
+      return Promise.resolve(textResult(
+        `## Evidence superseded for task #${task.id}: ${task.subject}\n` +
+        `Reason: ${params.reason}\n\n` +
+        `Review state: ${getReviewState(updatedTask)}\n` +
+        `Gate status: ${getGateStatus(updatedTask)}\n\n` +
+        `${formatHistorySummary(updatedTask) ?? "No evidence history found."}`,
+      ));
+    },
+  });
+
  pi.registerTool({
    name: "robot_review_ask",
    label: "robot_review_ask",
@@ -1166,8 +1429,10 @@ This appends a new robot-review iteration. If the reviewer marks evidence incomp
    if (m.lgtm_evidence) {
      evidenceParts.push(...formatEvidencePackage(task));
    } else {
-      evidenceParts.push(`(No agent-submitted evidence — agent never called lgtm_ask. Human override.)`);
+      evidenceParts.push(`(No current agent-submitted evidence — agent never called lgtm_ask, or the prior evidence was superseded.)`);
    }
+    const historySummary = formatHistorySummary(task);
+    if (historySummary) evidenceParts.push(historySummary);
    if (robotReviews.length > 0) {
      evidenceParts.push(
        `Robot reviews (${robotReviews.length} total):\n${robotReviews.map(formatRobotReview).join("\n\n")}`,
@@ -3,6 +3,14 @@ import type { Task } from "./types.js";

 const STAGES = ["🛠", "🤖", "👀"] as const;

+function hasCurrentEvidence(task: Task): boolean {
+  return typeof task.metadata?.lgtm_evidence === "string" && task.metadata.lgtm_evidence.length > 0;
+}
+
+function hasEvidenceHistory(task: Task): boolean {
+  return Array.isArray(task.metadata?.lgtm_history) && task.metadata.lgtm_history.length > 0;
+}
+
 /** Pipeline stages: `[🛠·🤖·👀]` fills left-to-right as evidence→review→signoff progresses. */
 export function getReviewBadges(task: Task): string {
  const filled = [
@@ -30,18 +38,44 @@ export function getDisplayStatus(task: Task): DisplayStatus {
  return task.status;
 }

+export type CompletionMode = "direct" | "lgtm";
+export type ReviewState =
+  | "no_evidence"
+  | "evidence_submitted"
+  | "reviewer_failed_to_run"
+  | "reviewer_rejected"
+  | "ready_for_human"
+  | "superseded"
+  | "human_signed_off";
 export type StateTag = "READY" | "ACTIVE" | "PENDING" | "DONE";

+export function getCompletionMode(task: Task): CompletionMode {
+  return hasCurrentEvidence(task) || hasEvidenceHistory(task) || getRobotReviews(task).length > 0 || task.pending_approval
+    ? "lgtm"
+    : "direct";
+}
+
+export function getReviewState(task: Task): ReviewState {
+  if (task.status === "completed") return "human_signed_off";
+  if (task.pending_approval && hasCurrentEvidence(task)) return "ready_for_human";
+  if (typeof task.metadata?.robot_review_last_error === "string") return "reviewer_failed_to_run";
+  const latest = getLatestRobotReview(task);
+  if (latest && !latest.accepted) return "reviewer_rejected";
+  if (hasCurrentEvidence(task)) return "evidence_submitted";
+  if (hasEvidenceHistory(task)) return "superseded";
+  return "no_evidence";
+}
+
 export function getGateStatus(task: Task): string {
-  if (task.status === "completed") return "human signed off";
-  if (!task.metadata?.lgtm_evidence) return "no lgtm evidence submitted";
-  if (task.pending_approval) return `ready for human sign-off via /lgtm ${task.id}`;
-  if (typeof task.metadata?.robot_review_last_error === "string") {
+  const state = getReviewState(task);
+  if (state === "human_signed_off") return "human signed off";
+  if (state === "no_evidence") return "no lgtm evidence submitted";
+  if (state === "ready_for_human") return `ready for human sign-off via /lgtm ${task.id}`;
+  if (state === "reviewer_failed_to_run") {
    return `blocked: automatic robot review failed: ${task.metadata.robot_review_last_error}`;
  }
-  const latest = getLatestRobotReview(task);
-  if (latest && !latest.accepted) return "blocked: latest robot review rejected the evidence";
-  if (latest?.accepted) return "accepted robot review present, but the human gate is still closed";
+  if (state === "reviewer_rejected") return "blocked: latest robot review rejected the evidence";
+  if (state === "superseded") return "current evidence superseded, waiting for a new lgtm submission";
  return "blocked: evidence submitted, robot review still required";
 }

@@ -133,8 +133,8 @@ export class TaskStore {
      // Once a task has stored lgtm evidence, completion must go through /lgtm so the
      // human gate + robot review can't be skipped.
      if (fields.status === "completed") {
-        if (task.pending_approval || task.metadata?.lgtm_evidence) {
-          throw new Error(`Use /lgtm ${id} to complete this task — it has lgtm evidence pending review.`);
+        if (task.pending_approval || task.metadata?.lgtm_evidence || (Array.isArray(task.metadata?.lgtm_history) && task.metadata.lgtm_history.length > 0)) {
+          throw new Error(`Use /lgtm ${id} to complete this task — completion_mode=lgtm because evidence was submitted.`);
        }
      }

@@ -1,5 +1,5 @@
 import { describe, expect, it } from "vitest";
-import { getDisplayStatus, getGateStatus, getReviewBadges } from "../src/review-badges.js";
+import { getCompletionMode, getDisplayStatus, getGateStatus, getReviewBadges, getReviewState } from "../src/review-badges.js";
 import type { Task } from "../src/types.js";

 function makeTask(overrides: Partial<Task> = {}): Task {
@@ -60,6 +60,20 @@ describe("getReviewBadges", () => {
  });
 });

+describe("review state helpers", () => {
+  it("reports completion mode as direct before any lgtm evidence", () => {
+    expect(getCompletionMode(makeTask())).toBe("direct");
+  });
+
+  it("reports completion mode as lgtm after evidence history exists", () => {
+    expect(getCompletionMode(makeTask({ metadata: { lgtm_history: [{ iteration: 1 }] } }))).toBe("lgtm");
+  });
+
+  it("reports superseded when only history remains", () => {
+    expect(getReviewState(makeTask({ metadata: { lgtm_history: [{ iteration: 1 }] } }))).toBe("superseded");
+  });
+});
+
 describe("getGateStatus", () => {
  it("reports ready when human sign-off is open", () => {
    expect(getGateStatus(makeTask({
@@ -1,4 +1,8 @@
+import { mkdtempSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
 import { describe, expect, it } from "vitest";
+import { archiveCurrentEvidence, buildArtifactRecords, getCurrentEvidenceIteration, getEvidenceHistory } from "../src/index.js";
 import { appendRobotReviewMetadata, getLatestRobotReview, getRobotReviews, shouldOpenHumanSignoffGate } from "../src/robot-review.js";
 import type { Task } from "../src/types.js";

@@ -45,6 +49,38 @@ describe("robot review helpers", () => {
    expect(reviews[0].accepted).toBe(true);
  });

+  it("builds artifact records with absolute path and sha256", () => {
+    const dir = mkdtempSync(join(tmpdir(), "pi-lgtm-"));
+    const path = join(dir, "evidence.log");
+    writeFileSync(path, "hello\n");
+
+    const [artifact] = buildArtifactRecords([path]);
+    expect(artifact.path).toBe(path);
+    expect(artifact.bytes).toBe(6);
+    expect(artifact.sha256).toHaveLength(64);
+  });
+
+  it("archives current evidence with reason", () => {
+    const task = makeTask({
+      metadata: {
+        lgtm_evidence: "literal output",
+        lgtm_failure_likely: "wrong seed",
+        lgtm_failure_sneaky: "wrong threshold",
+        lgtm_falsification_test: "pytest -k check",
+        lgtm_verification_hints: ["see line 5"],
+        lgtm_remaining_uncertainty: "not load tested",
+        lgtm_submitted_at: "2026-06-07T00:00:00.000Z",
+        lgtm_commands: [{ cmd: "pytest", exit_code: 0 }],
+      },
+    });
+
+    const archived = archiveCurrentEvidence(task, "threshold changed");
+    const taskWithHistory = makeTask({ metadata: archived });
+    expect(getCurrentEvidenceIteration(task)?.iteration).toBe(1);
+    expect(getEvidenceHistory(taskWithHistory)).toHaveLength(1);
+    expect(getEvidenceHistory(taskWithHistory)[0].supersede_reason).toBe("threshold changed");
+  });
+
  it("appends robot reviews as iterations", () => {
    const task = makeTask();
    const metadata1 = appendRobotReviewMetadata(task, {
@@ -189,6 +189,17 @@ describe("TaskStore (in-memory)", () => {
    expect(() => store.update("1", { status: "completed" })).toThrow("/lgtm");
  });

+  it("blocks TaskUpdate(status=completed) after evidence was superseded into history", () => {
+    store.create("Superseded", "Desc", "done");
+    store.update("1", {
+      metadata: {
+        lgtm_history: [{ iteration: 1, supersede_reason: "threshold changed" }],
+      },
+      pending_approval: false,
+    });
+    expect(() => store.update("1", { status: "completed" })).toThrow("completion_mode=lgtm");
+  });
+
  it("returns not found for update on non-existent task", () => {
    const { task, changedFields } = store.update("999", { status: "in_progress" });
    expect(task).toBeUndefined();