Agentic CLI Tools

The agent that uses the CLI is the first customer. If it can't discover commands, validate inputs, or parse outputs without guessing, the tool is human-grade with agent hopes attached.

Agent-first CLIs need predictable contracts, runtime introspection, context discipline, input hardening, dry-run safety, and explicit operating invariants. This page gives a copy-paste checklist and a design-review template with review prompts and a scorecard.

Current Tools

These are the CLI surfaces currently relevant to our stack or adjacent enough to evaluate now.

| Tool | Category | What it does | Relevance |
| --- | --- | --- | --- |
| Firecrawl CLI | Web data | Scrape, crawl, search, browser automation, and agent workflows from the terminal | High — live web extraction, competitor analysis, and research workflows |
| Google Workspace CLI | Workspace ops | Google Workspace administration and automation from the terminal | Medium — useful when email, calendar, or Drive workflows move out of the UI |
| GitHub CLI | Code + repos | Issues, PRs, releases, and repo operations | High — already overlaps with GitHub MCP and remains a lean default |
| Gemini CLI | Coding agent | Terminal-native coding and agent workflows with Gemini models | Medium — useful comparison surface for large-context or low-cost coding tasks |
| llm | LLM evaluation + logging | Universal LLM CLI — run prompts, chain outputs, log to SQLite, query history via Datasette. Plugin ecosystem covers Claude, Gemini, local models. | High — fills the prompt logging and evaluation gap in our stack |

llm is the most recent addition. It fills the evaluation and logging gap: every prompt run is logged to SQLite and queryable via Datasette. Direct complement to the plan-cli stack for measuring agent output quality. Firecrawl already appears elsewhere in the repo as an MCP/server and Hermes-backed web capability, but the CLI catalogue did not name the CLI itself.
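To make the logging gap concrete, here is a minimal sketch of the kind of query an evaluation harness would run against llm's SQLite log. The table and column names below are assumptions, and the in-memory database is a stand-in: the real database path comes from `llm logs path`, and you should inspect the actual schema (for example via Datasette) before querying it.

```python
import sqlite3

# Stand-in for llm's log database. The `responses` table and its columns
# are assumptions here -- check your local schema before relying on them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses (id INTEGER PRIMARY KEY, model TEXT, prompt TEXT, response TEXT)"
)
conn.execute(
    "INSERT INTO responses (model, prompt, response) VALUES (?, ?, ?)",
    ("claude-sonnet", "Summarise the retrofit order", "1. JSON output ..."),
)

# Pull the most recent runs for a given model: the shape of query an
# evaluation pass would issue when scoring agent output quality.
rows = conn.execute(
    "SELECT prompt, response FROM responses WHERE model = ? ORDER BY id DESC LIMIT 5",
    ("claude-sonnet",),
).fetchall()
print(rows[0][0])
```

The point is that every prompt run becomes a queryable row rather than a scroll-back buffer, which is what makes after-the-fact evaluation possible at all.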

Retrofit Order

When upgrading a human-first CLI to agent-grade, this sequence minimises risk and maximises early value:

| Step | What | Why first |
| --- | --- | --- |
| 1 | `--output json` on every command | Agents can't parse prose |
| 2 | Validate all inputs aggressively | Agents fail differently: humans typo, agents hallucinate plausible-looking inputs |
| 3 | `schema` / `--describe` / `--help --json` | Runtime discoverability replaces stale docs |
| 4 | Field masking (`--fields`) | Context window discipline |
| 5 | `--dry-run` on mutations | Rehearse before committing side effects |
| 6 | Document invariants in CONTEXT.md / SKILL.md | What the agent can't infer from `--help` |
| 7 | Expose via MCP / tool protocol | One capability model, many surfaces |
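Step 1 is the keystone for everything after it. A sketch of what it buys the agent side, assuming a hypothetical `mytool list --output json` command (the subprocess here shells out to Python printing a JSON payload as a stand-in, so the example runs anywhere):

```python
import json
import subprocess
import sys

def run_tool(args: list[str]) -> dict:
    """Run a CLI that supports `--output json` and parse its stdout.

    With a JSON contract the agent does one json.loads() instead of
    regex-guessing at prose output that may change between versions.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

# Stand-in invocation; in practice this would be something like
# ["mytool", "list", "--output", "json"] (mytool is hypothetical).
payload = run_tool([
    sys.executable, "-c",
    'import json; print(json.dumps({"items": [{"id": 1, "name": "demo"}]}))',
])
print(payload["items"][0]["name"])
```

Note that `check=True` also matters: a non-zero exit becomes an exception the agent can observe, rather than a half-parsed error message mistaken for data.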

Raw JSON payloads (--params '{"fields": "id,name"}') beat bespoke flags. Ten individual flags is ten hallucination opportunities. One JSON object is one contract.
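What "one JSON object is one contract" looks like in practice: a single validation gate that rejects hallucinated keys outright instead of silently ignoring them. This is a minimal sketch; the field names in `ALLOWED` are illustrative, not any real tool's API.

```python
import json

# Hypothetical contract for a --params JSON payload: one schema in one
# place, instead of ten bespoke flags each parsed separately.
ALLOWED = {"fields": str, "limit": int, "dry_run": bool}

def parse_params(raw: str) -> dict:
    """Parse and aggressively validate a --params JSON payload."""
    params = json.loads(raw)
    unknown = set(params) - set(ALLOWED)
    if unknown:
        # Fail loudly on hallucinated keys -- silent acceptance teaches
        # the agent that the invented flag worked.
        raise ValueError(f"unknown params: {sorted(unknown)}")
    for key, value in params.items():
        if not isinstance(value, ALLOWED[key]):
            raise ValueError(f"{key}: expected {ALLOWED[key].__name__}")
    return params

print(parse_params('{"fields": "id,name", "limit": 10}'))
```

The error messages name the offending key, which gives the agent something concrete to correct on retry instead of a generic usage string.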

Context

Questions

  • Which dimension would you relax first under schedule pressure — and what would that cost the first time an agent hallucinates a flag?
  • Where does your CLI sit today on “one model, many surfaces” versus parallel logic trees?
  • What must always be true for your tool that you have not yet written down?
  • If the primary operator is already an agent, what does your CONTEXT.md or SKILL.md assume that the model cannot infer?