8.6 KiB
LGTM friction fixes
Goal
Fix the subset of reported LGTM-tool friction that seems clearly worth it now.
First batch done: make auto-review parse failures less opaque, let an accepted manual robot review reopen the human gate, show a compact current gate status, and make stored evidence easier to read during /lgtm review.
Second batch: add structured command/artifact metadata, expose completion/review state more explicitly, and track superseded evidence iterations so stale claims stop polluting current review.
Scope
In: parser/error handling for automatic robot review, manual review gate semantics, gate-status rendering, /lgtm evidence formatting, structured command/artifact metadata, evidence supersession/iterations, completion/review state summaries, focused tests, README updates.
Out: full schema/tool-calling reviewer redesign, giant workflow/state-machine rewrite, attachment upload plumbing beyond local path+hash metadata.
Requirements
- R1: Automatic robot review failures are diagnosable. Done means: parse errors include raw-output context and tolerate common wrapper noise around the JSON object. VERIFY:
npm test -- test/robot-review-runner.test.tsshows parser fallback tests passing. If silent failure remained, the new malformed-output tests would still throw opaque errors. - R2: Accepted manual robot review can reopen human sign-off after a failed auto review. Done means: storing an accepted manual review can set
pending_approval=truewhen LGTM evidence exists. VERIFY: focused test covers this transition. If broken, the task would stay blocked after accepted manual review. - R3: Task details expose current gate status in one short line. Done means: a helper summarizes states like ready, blocked on rejected review, or blocked on reviewer failure. VERIFY: focused tests assert representative summaries. If broken, TaskGet output would still require reading raw metadata to infer gate state.
- R4:
/lgtmreview output is easier to scan. Done means: evidence and falsification output are rendered in explicit markdown sections/code fences instead of one flat blob. VERIFY: source shows sectioned formatter used by/lgtm. If broken, the old unstructured evidence block remains. - R5: LGTM submissions can store first-class command/artifact metadata. Done means:
lgtm_askaccepts structured commands and artifact paths, stores hashes for artifacts, and renders them in human-facing task views. VERIFY: focused tests assert artifact hashing/formatting and metadata retention. If broken, TaskGet or/lgtmcannot show the structured records. - R6: Human-facing state is explicit, not inferred from raw booleans. Done means: TaskGet exposes completion mode plus a compact review-state line such as
reviewer_failed_to_runorsuperseded. VERIFY: focused tests assert representative state strings. If broken, the output still requires reading metadata internals. - R7: Old evidence can be superseded cleanly. Done means: there is an explicit supersede path and repeated
lgtm_askcalls archive prior evidence into history with reasons, rather than overwriting silently. VERIFY: focused tests cover history growth and direct-completion staying blocked after supersede. If broken, stale evidence disappears or the agent can bypass/lgtmafter superseding.
Tasks
- T1 (R1): Harden robot-review JSON extraction and error messages.
- steps: add parser helpers; tolerate fenced/extra prose around a JSON object inside markers; include output snippet on failure
- verify:
npm test -- test/robot-review-runner.test.ts - success: parser tests pass, including noisy-marker cases
- likely_fail: fallback never runs because markers are missing or parsing still uses raw
JSON.parse - sneaky_fail: fallback grabs the wrong braces and accepts junk; targeted tests catch nested/noisy cases
- UAT: "when robot review returns markers plus noise, I observe it still parses or at least shows the raw failure context"
- T2 (R2,R3,R4): Fix gate semantics and human-facing formatting.
- steps: add gate-status helper; use it in TaskGet and lgtm_ask result; let accepted manual review set pending approval when evidence exists; format
/lgtmevidence sections - verify:
npm test -- test/robot-review.test.ts test/task-store.test.ts - success: manual-review gating tests pass and task summaries expose gate status
- likely_fail: accepted manual review still preserves old false gate state
- sneaky_fail: gate summary says ready while latest review actually failed; summary tests catch mismatches
- UAT: "when I inspect a task after review, I observe one clear gate-status line and readable evidence sections"
- steps: add gate-status helper; use it in TaskGet and lgtm_ask result; let accepted manual review set pending approval when evidence exists; format
- T3 (R1-R4): Update README for new semantics.
- steps: mention accepted manual review reopening the gate, clearer TaskGet state, and structured
/lgtmevidence display - verify:
rg -n "manual review|gate status|/lgtm" README.md - success: README mentions the changed behavior
- likely_fail: docs still imply only auto review can reopen approval
- sneaky_fail: docs mention review generally but omit the new gate semantics
- UAT: "when I read README, I can infer the new gate behavior without reading code"
- steps: mention accepted manual review reopening the gate, clearer TaskGet state, and structured
- T4 (R5,R6): Add structured command/artifact metadata and explicit review/completion state.
- steps: extend
lgtm_askschema; hash artifact paths; add review-state and completion-mode helpers; render them in TaskGet and/lgtm - verify:
npm test -- test/review-badges.test.ts test/task-store.test.ts - success: task details show completion mode, review state, commands, and artifact hashes
- likely_fail: metadata is stored but never shown to the human
- sneaky_fail: review state lies after supersede or reviewer failure; helper tests catch this
- UAT: "when I inspect a reviewed task, I observe command/artifact records and one explicit review state"
- steps: extend
- T5 (R7): Add evidence history and supersede flow.
- steps: archive current evidence before replacement; add explicit supersede tool/reason; keep direct completion blocked once a task enters LGTM mode
- verify:
npm test -- test/task-store.test.ts test/robot-review.test.ts - success: old evidence appears in history with reason and agent self-completion remains blocked
- likely_fail: superseding clears current evidence and accidentally re-enables TaskUpdate(status=completed)
- sneaky_fail: robot reviews from prior evidence leak into the new current evidence; focused tests catch iteration reset/history behavior
- UAT: "when I replace or supersede evidence, I can still inspect the old claim and the new review starts fresh"
- T6 (R5-R7): Update README for the second batch.
- steps: document commands/artifacts, review/completion state, and supersede behavior
- verify:
rg -n "commands|artifact|supersede|completion mode|review state" README.md - success: README explains the new structured evidence workflow
- likely_fail: docs mention artifacts but omit supersede semantics
- sneaky_fail: docs imply a full reviewer-tool-calling redesign that does not exist
- UAT: "when I read README, I understand how to attach artifacts and retire stale evidence"
Context
- Keep fail-fast simplicity. Do not add a large state-machine rewrite.
- Preserve the human
/lgtmoverride path. - Manual review should not complete tasks, only reopen the human gate when it accepts existing evidence.
Log
- 2026-06-07: User explicitly said not to fix everything, only the subset that makes sense. Chosen fixes are 1, 2, 10, plus
/lgtmevidence readability because it is low-cost and directly reported. - 2026-06-07: Implemented best-effort JSON extraction from noisy marker blocks, persisted automatic-review failure context in task metadata, reopened
pending_approvalwhen any accepted review validates stored evidence, addedGate status:summaries, and reformatted human review output into explicit sections/code fences. - 2026-06-07: Verification passed with
npm test,npm run typecheck, andnpm run lint. - 2026-06-07: Second batch added structured command/artifact metadata, explicit
completion modeandreview state, evidence history with supersede reasons, and a dedicatedlgtm_supersedetool. - 2026-06-07: Second-batch verification passed with
npm test -- test/review-badges.test.ts test/robot-review.test.ts test/task-store.test.ts, then fullnpm run lint && npm run typecheck && npm test. - 2026-06-07: Skipped the bigger reviewer-output redesign. Did a smaller cleanup pass instead: consolidated metadata-clearing helpers and simplified artifact hashing without changing behavior.
TODO
- Full reviewer tool-calling/schema enforcement is still probably a separate refactor if parse brittleness returns.
Errors
| Task | Error | Resolution |
|---|