Agentic CLI Tools
The agent that uses the CLI is the first customer. If it can't discover commands, validate inputs, or parse outputs without guessing, the tool is human-grade with agent hopes attached.
Agent-first CLIs need predictable contracts, runtime introspection, context discipline, input hardening, dry-run safety, and explicit operating invariants. This page gives a copy-paste checklist and a design-review template with review prompts and a scorecard.
Current Tools
These are the CLI surfaces currently relevant to our stack or adjacent enough to evaluate now.
| Tool | Category | What it does | Relevance |
|---|---|---|---|
| Firecrawl CLI | Web data | Scrape, crawl, search, browser automation, and agent workflows from the terminal | High — live web extraction, competitor analysis, and research workflows |
| Google Workspace CLI | Workspace ops | Google Workspace administration and automation from the terminal | Medium — useful when email, calendar, or Drive workflows move out of the UI |
| GitHub CLI | Code + repos | Issues, PRs, releases, and repo operations | High — already overlaps with GitHub MCP and remains a lean default |
| Gemini CLI | Coding agent | Terminal-native coding and agent workflows with Gemini models | Medium — useful comparison surface for large-context or low-cost coding tasks |
| llm | LLM evaluation + logging | Universal LLM CLI — run prompts, chain outputs, log to SQLite, query history via Datasette. Plugin ecosystem covers Claude, Gemini, local models. | High — fills the prompt logging and evaluation gap in our stack |
llm is the most recent addition. It fills the evaluation and logging gap: every prompt run is logged to SQLite and queryable via Datasette, which makes it a direct complement to the plan-cli stack for measuring agent output quality. Firecrawl already appears elsewhere in the repo as an MCP server and Hermes-backed web capability, but the CLI catalogue did not previously name the CLI itself.
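Once runs are in SQLite, querying them needs nothing beyond the standard library. A minimal sketch: the real database location comes from `llm logs path`, and the `responses` table with `model`, `prompt`, and `response` columns is an assumption about llm's logging schema — the demo builds a throwaway database with that assumed shape rather than touching a real log.

```python
import os
import sqlite3
import tempfile

def recent_runs(db_path, limit=5):
    """Return the newest (model, prompt, response) rows from an llm-style log DB.

    Table and column names are assumptions about llm's SQLite schema;
    check the real database at `llm logs path` before relying on them.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT model, prompt, response FROM responses "
        "ORDER BY rowid DESC LIMIT ?",
        (limit,),
    ).fetchall()
    con.close()
    return rows

# Demo against a throwaway database using the assumed schema.
path = os.path.join(tempfile.mkdtemp(), "logs.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE responses (model TEXT, prompt TEXT, response TEXT)")
con.execute("INSERT INTO responses VALUES ('claude-sonnet', 'ping', 'pong')")
con.commit()
con.close()

print(recent_runs(path))  # [('claude-sonnet', 'ping', 'pong')]
```

The same query works through Datasette's UI; the point is that the log is a plain database, not a proprietary trace format.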
Retrofit Order
When upgrading a human-first CLI to agent-grade, this sequence minimises risk and maximises early value:
| Step | What | Why First |
|---|---|---|
| 1 | --output json on every command | Agents can't parse prose |
| 2 | Validate all inputs aggressively | Agents hallucinate inputs in ways human typos never would |
| 3 | schema / --describe / --help --json | Runtime discoverability replaces stale docs |
| 4 | Field masking (--fields) | Context window discipline |
| 5 | --dry-run on mutations | Rehearse before committing side effects |
| 6 | Document invariants in CONTEXT.md / SKILL.md | What the agent can't infer from --help |
| 7 | Expose via MCP / tool protocol | One capability model, many surfaces |
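Steps 1, 3, and 5 can live in one argument parser. A minimal sketch using a hypothetical `notes` CLI (the command name, schema shape, and flag names are illustrative, not a prescribed standard): every command emits JSON by default, `describe` returns the machine-readable contract, and the mutating command rehearses under `--dry-run`.

```python
import argparse
import json
import sys

# Assumed contract for a hypothetical `notes` CLI; the schema an agent
# would fetch at runtime instead of reading stale docs.
SCHEMA = {
    "commands": {
        "delete": {"args": ["--id"], "mutating": True, "flags": ["--dry-run"]},
    }
}

def main(argv):
    parser = argparse.ArgumentParser(prog="notes")
    parser.add_argument("--output", choices=["json", "text"], default="json")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("describe")  # step 3: runtime introspection
    delete = sub.add_parser("delete")
    delete.add_argument("--id", required=True)
    delete.add_argument("--dry-run", action="store_true")  # step 5: rehearsal
    args = parser.parse_args(argv)

    if args.command == "describe":
        result = SCHEMA
    elif args.dry_run:
        # Report what WOULD happen without committing the side effect.
        result = {"would_delete": args.id, "committed": False}
    else:
        result = {"deleted": args.id, "committed": True}

    if args.output == "json":  # step 1: output agents can parse
        print(json.dumps(result))
    else:
        print(result)
    return result

if __name__ == "__main__":
    main(sys.argv[1:])
```

An agent's first call is `notes describe`; its second is a `--dry-run` rehearsal; only then does it commit.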
Raw JSON payloads (--params '{"fields": "id,name"}') beat bespoke flags. Ten individual flags are ten hallucination opportunities; one JSON object is one contract.
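Enforcing that one contract is a few lines. A sketch, assuming a hypothetical allow-list of two params (`fields`, `limit`) — the key set and types are illustrative, not from any real tool:

```python
import json

# The whole input surface the agent must learn: allowed keys and their types.
ALLOWED = {"fields": str, "limit": int}

def parse_params(raw):
    """Validate a --params JSON payload aggressively; reject unknowns early."""
    params = json.loads(raw)
    for key, value in params.items():
        if key not in ALLOWED:
            raise ValueError(f"unknown param: {key}")
        if not isinstance(value, ALLOWED[key]):
            raise ValueError(f"{key} must be {ALLOWED[key].__name__}")
    return params

print(parse_params('{"fields": "id,name", "limit": 10}'))
# {'fields': 'id,name', 'limit': 10}
```

A hallucinated key fails loudly at the boundary instead of being silently ignored — which is exactly the behaviour step 2 of the retrofit order asks for.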
Context
- Platform Architecture — Where CLIs fit in the stack
- Protocols — Agent communication standards
- Flow Engineering — How tools serve the development loop
Links
- Perplexity
- Justin Poehnelt
- Playbooks CLI guidelines
- Building secure AI agents (MCP)
- Gemini CLI skills
- InfoQ AI agent CLI
- Salesforce Agent DX test-run
Questions
- Which dimension would you relax first under schedule pressure — and what would that cost the first time an agent hallucinates a flag?
- Where does your CLI sit today on “one model, many surfaces” versus parallel logic trees?
- What must always be true for your tool that you have not yet written down?
- If the primary operator is already an agent, what does your CONTEXT.md or SKILL.md assume that the model cannot infer?