Agentic CLI Tools
The agent that uses the CLI is the first customer. If it can't discover commands, validate inputs, or parse outputs without guessing, the tool is human-grade with agent hopes attached.
Agent-first CLIs need predictable contracts, runtime introspection, context discipline, input hardening, dry-run safety, and explicit operating invariants. This page gives a copy-paste checklist and a design-review template with review prompts and a scorecard.
Current Tools
These are the CLI surfaces currently relevant to our stack or adjacent enough to evaluate now.
| Tool | Category | What it does | Relevance |
|---|---|---|---|
| Firecrawl CLI | Web data | Scrape, crawl, search, browser automation, and agent workflows from the terminal | High — live web extraction, competitor analysis, and research workflows |
| Google Workspace CLI | Workspace ops | Google Workspace administration and automation from the terminal | Medium — useful when email, calendar, or Drive workflows move out of the UI |
| GitHub CLI | Code + repos | Issues, PRs, releases, and repo operations | High — already overlaps with GitHub MCP and remains a lean default |
| Gemini CLI | Coding agent | Terminal-native coding and agent workflows with Gemini models | Medium — useful comparison surface for large-context or low-cost coding tasks |
| llm | LLM evaluation + logging | Universal LLM CLI: run prompts, chain outputs, log to SQLite, query history via Datasette. Plugin ecosystem covers Claude, Gemini, local models. | High — fills the prompt logging and evaluation gap in our stack |
llm is the most recent addition. It fills the evaluation and logging gap: every prompt run is logged to SQLite and queryable via Datasette, which makes it a direct complement to the DRMG CLI stack for measuring agent output quality. Firecrawl already appears elsewhere in the repo as an MCP/server and Hermes-backed web capability, but the CLI catalogue did not previously name the CLI itself.
Retrofit Order
When upgrading a human-first CLI to agent-grade, this sequence minimises risk and maximises early value:
| Step | What | Why First |
|---|---|---|
| 1 | --output json on every command | Agents can't parse prose |
| 2 | Validate all inputs aggressively | Agents hallucinate differently than humans typo |
| 3 | schema / --describe / --help --json | Runtime discoverability replaces stale docs |
| 4 | Field masking (--fields) | Context window discipline |
| 5 | --dry-run on mutations | Rehearse before committing side effects |
| 6 | Document invariants in CONTEXT.md / SKILL.md | What the agent can't infer from --help |
| 7 | Expose via MCP / tool protocol | One capability model, many surfaces |
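Several of the steps above can be sketched in a few dozen lines. The example below is a minimal illustration, not a reference implementation: the `items` tool, its `list`, `delete`, and `describe` commands, the `--id` flag, and the in-memory `ITEMS` data are all hypothetical. It shows JSON output by default (step 1), a `describe` command generated from the same spec as the parser (step 3), field masking with `--fields` (step 4), and `--dry-run` on the one mutating command (step 5).

```python
import argparse
import json

# Hypothetical in-memory records standing in for a real backend.
ITEMS = {
    1: {"id": 1, "name": "alpha", "owner": "ops", "notes": "long free text"},
    2: {"id": 2, "name": "beta", "owner": "web", "notes": "more free text"},
}
FIELDS = ("id", "name", "owner", "notes")

# One spec drives both the parser and `describe`, so the two cannot drift apart.
SPEC = {
    "list": {"help": "List items", "mutates": False,
             "flags": {"--fields": "Comma-separated fields to return"}},
    "delete": {"help": "Delete an item by id", "mutates": True,
               "flags": {"--id": "Numeric item id",
                         "--dry-run": "Report the change without applying it"}},
    "describe": {"help": "Print the command schema", "mutates": False, "flags": {}},
}
BOOLEAN_FLAGS = {"--dry-run"}


def build_parser():
    parser = argparse.ArgumentParser(prog="items")
    parser.add_argument("--output", choices=["json", "text"], default="json")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, spec in SPEC.items():
        cmd = sub.add_parser(name, help=spec["help"])
        for flag, help_text in spec["flags"].items():
            if flag in BOOLEAN_FLAGS:
                cmd.add_argument(flag, action="store_true", help=help_text)
            else:
                cmd.add_argument(flag, help=help_text)
    return parser


def execute(args):
    """Run one parsed command and return the result as a plain dict."""
    if args.command == "describe":
        return {"commands": SPEC}
    if args.command == "list":
        wanted = args.fields.split(",") if args.fields else list(FIELDS)
        unknown = [f for f in wanted if f not in FIELDS]
        if unknown:
            # Structured error: the agent sees what was wrong and what is allowed.
            return {"error": "unknown_fields", "unknown": unknown, "allowed": list(FIELDS)}
        return {"items": [{f: row[f] for f in wanted} for row in ITEMS.values()]}
    # Remaining command: delete.
    if args.id is None or not args.id.isdecimal() or int(args.id) not in ITEMS:
        return {"error": "unknown_id", "id": args.id}
    item_id = int(args.id)
    if args.dry_run:
        # Rehearse the mutation: report what would happen, change nothing.
        return {"dry_run": True, "would_delete": ITEMS[item_id]}
    del ITEMS[item_id]
    return {"deleted": item_id}


def main(argv):
    """Parse argv, run the command, and return the rendered output string."""
    args = build_parser().parse_args(argv)
    result = execute(args)
    if args.output == "json":
        return json.dumps(result)
    return "\n".join(f"{key}: {value}" for key, value in result.items())


print(main(["list", "--fields", "id,name"]))
print(main(["delete", "--id", "2", "--dry-run"]))
```

Keeping the command spec in one dictionary is a design choice for this sketch: the schema an agent reads at runtime is the same object the parser is built from, so it cannot go stale independently of the code.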
Raw JSON payloads (--params '{"fields": "id,name"}') beat bespoke flags. Ten individual flags are ten hallucination opportunities; one JSON object is one contract.
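A single JSON payload is only one contract if it is validated as one. The sketch below shows one way to harden a `--params` value before it reaches any business logic. The `ALLOWED` key set (`fields`, `limit`, `path`), the `ParamError` class, and the specific checks are assumptions chosen for illustration; a real tool would derive them from its own schema.

```python
import json

# Hypothetical contract for one command: allowed keys and their expected types.
ALLOWED = {"fields": str, "limit": int, "path": str}


class ParamError(ValueError):
    """Raised when a --params payload violates the contract."""


def parse_params(raw: str) -> dict:
    """Parse and validate a --params JSON string, rejecting anything unexpected."""
    try:
        params = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ParamError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(params, dict):
        raise ParamError("payload must be a JSON object")

    # Unknown keys are the JSON equivalent of a hallucinated flag: fail loudly.
    unknown = sorted(set(params) - set(ALLOWED))
    if unknown:
        raise ParamError(f"unknown keys: {unknown}; allowed: {sorted(ALLOWED)}")

    for key, value in params.items():
        expected = ALLOWED[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ParamError(f"{key} must be {expected.__name__}")
        if isinstance(value, str) and any(ord(ch) < 32 for ch in value):
            raise ParamError(f"{key} contains control characters")

    # Reject path traversal segments before the value reaches the filesystem.
    if ".." in params.get("path", "").split("/"):
        raise ParamError("path must not contain '..' segments")
    return params


print(parse_params('{"fields": "id,name", "limit": 10}'))
```

The error messages name the allowed keys and expected types so that an agent can correct its next call from the failure alone, without consulting external documentation.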
Context
- Platform Architecture — Where CLIs fit in the stack
- Protocols — Agent communication standards
- Flow Engineering — How tools serve the development loop
Links
- Perplexity
- Justin Poehnelt
- Playbooks CLI guidelines
- Building secure AI agents (MCP)
- Gemini CLI skills
- InfoQ AI agent CLI
- Salesforce Agent DX test-run
Questions
- Which dimension would you relax first under schedule pressure, and what would that cost the first time an agent hallucinates a flag?
- Where does your CLI sit today on “one model, many surfaces” versus parallel logic trees?
- What must always be true for your tool that you have not yet written down?
- If the primary operator is already an agent, what does your CONTEXT.md or SKILL.md assume that the model cannot infer?