Agentic CLI Tools
The agent that uses the CLI is the first customer. If it can't discover commands, validate inputs, or parse outputs without guessing, the tool is human-grade with agent hopes attached.
Agent-first CLIs need predictable contracts, runtime introspection, context discipline, input hardening, dry-run safety, and explicit operating invariants. This page gives a copy-paste checklist and a design-review template with review prompts and a scorecard.
Current Tools
These are the CLI surfaces currently relevant to our stack or adjacent enough to evaluate now.
| Tool | Category | What it does | Relevance |
|---|---|---|---|
| Firecrawl CLI | Web data | Scrape, crawl, search, browser automation, and agent workflows from the terminal | High — live web extraction, competitor analysis, and research workflows |
| Google Workspace CLI | Workspace ops | Google Workspace administration and automation from the terminal | Medium — useful when email, calendar, or Drive workflows move out of the UI |
| GitHub CLI | Code + repos | Issues, PRs, releases, and repo operations | High — already overlaps with GitHub MCP and remains a lean default |
| Gemini CLI | Coding agent | Terminal-native coding and agent workflows with Gemini models | Medium — useful comparison surface for large-context or low-cost coding tasks |
| llm | LLM evaluation + logging | Universal LLM CLI — run prompts, chain outputs, log to SQLite, query history via Datasette. Plugin ecosystem covers Claude, Gemini, local models. | High — fills the prompt logging and evaluation gap in our stack |
llm is the most recent addition. It fills the evaluation and logging gap: every prompt run is logged to SQLite and queryable via Datasette, which makes it a direct complement to the plan-cli stack for measuring agent output quality. Firecrawl already appears elsewhere in the repo as an MCP server and Hermes-backed web capability, but the CLI catalogue did not previously name the CLI itself.
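Once runs are in SQLite, querying them needs nothing beyond the standard library. A minimal sketch: the real database location comes from `llm logs path`, and the `responses` table with `model`, `prompt`, and `response` columns is an assumption about llm's logging schema — the demo builds a throwaway database with that assumed shape rather than touching a real log.

```python
import os
import sqlite3
import tempfile

def recent_runs(db_path, limit=5):
    """Return the newest (model, prompt, response) rows from an llm-style log DB.

    Table and column names are assumptions about llm's SQLite schema;
    check the real database at `llm logs path` before relying on them.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT model, prompt, response FROM responses "
        "ORDER BY rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()
    con.close()
    return rows

# Demo against a throwaway database using the assumed schema.
path = os.path.join(tempfile.mkdtemp(), "logs.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE responses (model TEXT, prompt TEXT, response TEXT)")
con.execute("INSERT INTO responses VALUES ('claude-sonnet', 'ping', 'pong')")
con.commit()
con.close()

print(recent_runs(path))  # [('claude-sonnet', 'ping', 'pong')]
```

The same query works through Datasette's UI; the point is that the log is a plain database, not a proprietary trace format.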
Retrofit Order
When upgrading a human-first CLI to agent-grade, this sequence minimises risk and maximises early value:
| Step | What | Why First |
|---|---|---|
| 1 | --output json on every command | Agents can't parse prose |
| 2 | Validate all inputs aggressively | Agents hallucinate inputs in ways human typos never would |
| 3 | schema / --describe / --help --json | Runtime discoverability replaces stale docs |
| 4 | Field masking (--fields) | Context window discipline |
| 5 | --dry-run on mutations | Rehearse before committing side effects |
| 6 | Document invariants in CONTEXT.md / SKILL.md | What the agent can't infer from --help |
| 7 | Expose via MCP / tool protocol | One capability model, many surfaces |
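Steps 1, 3, and 5 can live in one argument parser. A minimal sketch using a hypothetical `notes` CLI (the command name, schema shape, and flag names are illustrative, not a prescribed standard): every command emits JSON by default, `describe` returns the machine-readable contract, and the mutating command rehearses under `--dry-run`.

```python
import argparse
import json
import sys

# Assumed contract for a hypothetical `notes` CLI; the schema an agent
# would fetch at runtime instead of reading stale docs.
SCHEMA = {
    "commands": {
        "delete": {"args": ["--id"], "mutating": True, "flags": ["--dry-run"]},
    }
}

def main(argv):
    parser = argparse.ArgumentParser(prog="notes")
    parser.add_argument("--output", choices=["json", "text"], default="json")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("describe")  # step 3: runtime introspection
    delete = sub.add_parser("delete")
    delete.add_argument("--id", required=True)
    delete.add_argument("--dry-run", action="store_true")  # step 5: rehearsal
    args = parser.parse_args(argv)

    if args.command == "describe":
        result = SCHEMA
    elif args.dry_run:
        # Report what WOULD happen without committing the side effect.
        result = {"would_delete": args.id, "committed": False}
    else:
        result = {"deleted": args.id, "committed": True}

    if args.output == "json":  # step 1: output agents can parse
        print(json.dumps(result))
    else:
        print(result)
    return result

if __name__ == "__main__":
    main(sys.argv[1:])
```

An agent's first call is `notes describe`; its second is a `--dry-run` rehearsal; only then does it commit.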
Raw JSON payloads (--params '{"fields": "id,name"}') beat bespoke flags. Ten individual flags are ten hallucination opportunities; one JSON object is one contract.
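Enforcing that one contract is a few lines. A sketch, assuming a hypothetical allow-list of two params (`fields`, `limit`) — the key set and types are illustrative, not from any real tool:

```python
import json

# The whole input surface the agent must learn: allowed keys and their types.
ALLOWED = {"fields": str, "limit": int}

def parse_params(raw):
    """Validate a --params JSON payload aggressively; reject unknowns early."""
    params = json.loads(raw)
    for key, value in params.items():
        if key not in ALLOWED:
            raise ValueError(f"unknown param: {key}")
        if not isinstance(value, ALLOWED[key]):
            raise ValueError(f"{key} must be {ALLOWED[key].__name__}")
    return params

print(parse_params('{"fields": "id,name", "limit": 10}'))
# {'fields': 'id,name', 'limit': 10}
```

A hallucinated key fails loudly at the boundary instead of being silently ignored — which is exactly the behaviour step 2 of the retrofit order asks for.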
Context
- Platform Architecture — Where CLIs fit in the stack
- Protocols — Agent communication standards
- Flow Engineering — How tools serve the development loop
Links
- Perplexity
- Justin Poehnelt
- Playbooks CLI guidelines
- Building secure AI agents (MCP)
- Gemini CLI skills
- InfoQ AI agent CLI
- Salesforce Agent DX test-run
Questions
- Which dimension would you relax first under schedule pressure — and what would that cost the first time an agent hallucinates a flag?
- Where does your CLI sit today on “one model, many surfaces” versus parallel logic trees?
- What must always be true for your tool that you have not yet written down?
- If the primary operator is already an agent, what does your CONTEXT.md or SKILL.md assume that the model cannot infer?