# Agent CLI Tools
Agent-first CLIs need predictable contracts, runtime introspection, context discipline, input hardening, dry-run safety, and explicit operating invariants. This page gives a copy-paste checklist and a design-review template with review prompts and a scorecard.
Engineering spec: CLI Improvement Spec — Plan, Comms, ETL — acceptance criteria to bring the Plan, Comms, and Agent ETL CLIs to agent-grade.
## Agent CLI checklist
# Agent CLI Review Checklist
## 1. Identity
- [ ] The CLI’s purpose is clearly stated
- [ ] The system(s) it controls are clearly named
- [ ] The primary operator is identified as human, agent, or hybrid
- [ ] The trust boundary is explicit, for example files, APIs, credentials, or state mutation
- [ ] The source of truth for capabilities is identified
## 2. Contract
- [ ] The command accepts structured input, such as JSON, params, stdin, or file input
- [ ] The command returns machine-readable output, such as JSON or NDJSON
- [ ] Human-friendly output is separated from machine-friendly output
- [ ] Exit codes are documented and stable
- [ ] The output shape is treated as a contract, not an implementation detail
- [ ] Backward compatibility expectations are documented
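A minimal sketch of the output-contract split above, assuming a hypothetical `emit` helper: the machine contract goes to stdout as compact JSON, and any human commentary goes to stderr so it never pollutes the parseable channel.

```python
import json
import sys


def emit(result: dict, human_note: str = "") -> None:
    """Write the machine contract to stdout; route human-facing notes to stderr."""
    json.dump(result, sys.stdout, separators=(",", ":"))
    sys.stdout.write("\n")
    if human_note:
        print(human_note, file=sys.stderr)


emit({"ok": True, "items": [{"id": "plan-1"}]}, human_note="1 plan found")
```

An agent can then pipe stdout straight into a JSON parser while a human still sees the note on the terminal.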
## 3. Runtime introspection
- [ ] The CLI exposes `schema`, `describe`, or equivalent runtime introspection
- [ ] An agent can discover accepted parameters at runtime
- [ ] An agent can discover request body shape at runtime
- [ ] An agent can discover response shape at runtime
- [ ] Constraints and required fields are discoverable without external docs
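One way to satisfy the discoverability items above is a `describe` subcommand backed by a static capability table. The command and field names below are illustrative, not a prescribed schema format.

```python
import json

# Hypothetical capability model for a "get" command; an agent reads this
# at runtime instead of scraping prose documentation.
CAPABILITIES = {
    "commands": {
        "get": {
            "params": {
                "id": {"type": "string", "required": True},
                "fields": {"type": "string", "required": False},
            },
            "returns": {"type": "object", "properties": {"id": {"type": "string"}}},
        }
    }
}


def describe() -> str:
    """Return the capability model as JSON, e.g. for `mycli describe`."""
    return json.dumps(CAPABILITIES, indent=2)
```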
## 4. Context discipline
- [ ] The CLI supports field masks, projections, or `--fields`
- [ ] The CLI supports pagination for large result sets
- [ ] The CLI supports streaming or NDJSON where appropriate
- [ ] Defaults avoid oversized responses
- [ ] High-volume commands include guidance for minimal-output usage
- [ ] Bulk workflows avoid loading unnecessary data into memory or context
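A `--fields` projection can be as simple as filtering each record before serialization, as in this sketch (the record shape is illustrative):

```python
def project(record: dict, fields: str = "") -> dict:
    """Return only the comma-separated keys named in `fields`; empty means all."""
    if not fields:
        return record
    wanted = {f.strip() for f in fields.split(",")}
    return {k: v for k, v in record.items() if k in wanted}
```

Applied on every list/get path, this keeps large bodies and metadata out of the agent's context unless explicitly requested.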
## 5. Input hardening
- [ ] Inputs are treated as untrusted by default
- [ ] File paths are validated against traversal and unsafe destinations
- [ ] Resource IDs are validated and reject malformed embedded query fragments
- [ ] Control characters are rejected or normalized safely
- [ ] Double-encoding and malformed encoding cases are handled
- [ ] Error messages are explicit enough for agents to recover cleanly
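For the path-traversal item, one common pattern is to resolve the candidate path and refuse anything that escapes an allowed root. A sketch, assuming a hypothetical `safe_output_path` helper:

```python
from pathlib import Path


def safe_output_path(root: str, user_path: str) -> Path:
    """Resolve `user_path` under `root`, rejecting traversal outside it."""
    base = Path(root).resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        # Explicit error text helps an agent recover instead of retrying blindly.
        raise ValueError(f"refusing path outside {base}: {user_path!r}")
    return candidate
```

Resolving before comparison defeats `../` sequences and symlinked segments that a simple prefix check would miss.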
## 6. Safety rails
- [ ] All mutating operations support `--dry-run` or equivalent validation mode
- [ ] Destructive operations require explicit confirmation or an explicit override
- [ ] Read-only paths are clearly separated from write paths
- [ ] File output locations can be sandboxed or constrained
- [ ] Retry behavior is safe and documented
- [ ] Idempotency expectations are documented for repeated calls
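A dry-run guard for a mutating command can report the planned change without applying it. In this sketch, `apply_delete` is a stand-in for the real side effect:

```python
def delete_resource(resource_id: str, *, dry_run: bool, apply_delete) -> dict:
    """Rehearse or apply a delete; the returned plan is the machine contract."""
    plan = {"action": "delete", "id": resource_id}
    if dry_run:
        return {**plan, "applied": False}
    apply_delete(resource_id)
    return {**plan, "applied": True}
```

Because the dry-run and real paths return the same shape, an agent can validate the plan once and then re-issue the identical call without `--dry-run`.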
## 7. Response safety
- [ ] Returned content is treated as untrusted
- [ ] The CLI supports sanitization, filtering, or safe rendering for returned content
- [ ] Prompt-injection risk from returned data has been considered
- [ ] Unsafe raw content is not silently mixed into trusted instruction channels
- [ ] The design documents what happens when unsafe content is detected
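One minimal sketch of response hygiene: strip control characters and return fetched text under an explicitly labeled untrusted key, rather than splicing it into an instruction channel. The field name is illustrative.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Drop control/format characters except newline and tab."""
    return "".join(
        c for c in text if unicodedata.category(c)[0] != "C" or c in "\n\t"
    )


def wrap_untrusted(text: str) -> dict:
    """Keep external content in a clearly marked untrusted field."""
    return {"untrusted_content": sanitize(text)}
```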
## 8. Agent guidance
- [ ] The repo includes `CONTEXT.md`, `AGENTS.md`, `SKILL.md`, or equivalent
- [ ] Non-obvious operating invariants are explicitly written down
- [ ] Guidance covers confirmation-before-delete or equivalent write safety
- [ ] Guidance covers using fields/projections on list and get operations
- [ ] Guidance includes safe examples for common workflows
- [ ] Guidance is maintained close to the tool, not only in external docs
## 9. Multi-surface consistency
- [ ] The CLI surface and tool surface are derived from the same capability model
- [ ] MCP or other tool protocols do not introduce drift from core behavior
- [ ] Shell usage and typed-tool usage produce consistent semantics
- [ ] Environment variables are documented and behave predictably
- [ ] There is one source of truth for commands, schemas, and auth behavior
## 10. Auth and operations
- [ ] Headless authentication is supported
- [ ] Browser-only auth flows are not required for automated use
- [ ] Secrets can be injected safely through files or environment variables
- [ ] Credential scope is minimized
- [ ] Auditability is considered, for example logs, request IDs, or action traces
- [ ] Operational failure modes are documented
## 11. Testing and failure modes
- [ ] Tests cover malformed agent-style inputs
- [ ] Tests cover destructive-path safety behavior
- [ ] Tests cover schema drift and output regressions
- [ ] Tests cover prompt-injection or unsafe returned-content scenarios
- [ ] Tests verify stable machine-readable output
- [ ] Observability is sufficient to diagnose failures in unattended runs
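Malformed-input coverage can look like the pytest-style sketch below, assuming a hypothetical `validate_id` helper that rejects IDs carrying embedded query fragments:

```python
import re


def validate_id(resource_id: str) -> str:
    """Accept only plain alphanumeric/underscore/hyphen IDs up to 64 chars."""
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,64}", resource_id):
        raise ValueError(f"malformed resource id: {resource_id!r}")
    return resource_id


def test_rejects_embedded_query_fragment():
    try:
        validate_id("plan-1?fields=*")
        assert False, "expected rejection"
    except ValueError:
        pass


def test_accepts_well_formed_id():
    assert validate_id("plan-1") == "plan-1"
```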
## 12. Release decision
- [ ] Structured input and output are production-ready
- [ ] Input validation is production-ready
- [ ] Dry-run support for mutating commands is production-ready
- [ ] The stable machine-readable contract is production-ready
- [ ] Known gaps are listed with owners
- [ ] Review outcome is recorded: reject, pilot, or approve
## Scoring
- [ ] Score each section 0, 1, or 2
- [ ] Total score recorded out of 24
- [ ] No zero scores in Contract, Input hardening, Safety rails, or Testing
- [ ] Launch threshold met
## Design review template
Use the checklist above for coverage; use the prompts below to stress-test each dimension. The point of the review is decision quality, not document completion — is this truly agent-grade, or human-grade with agent hopes attached?
1. Identity
- CLI name, system it controls, primary jobs-to-be-done
- Primary operator: human, agent, or hybrid
- Trust boundary crossed: local files, remote API, credentials, state mutation, external content
- Source of truth for capability model: API schema, discovery doc, internal spec, other
Review prompt: What is this tool for, who is the real operator, and what real-world system does it have power over? A strong agent CLI starts by naming the boundary clearly; agents are fast and confident but can still be wrong in new ways.
2. Contract
- Canonical command shape; structured input (e.g. `--json`, `--params`, stdin, file); structured output (JSON, NDJSON, or both)
- Stable exit codes documented; human-friendly output isolated from machine output
- Runtime introspection: `schema`, `describe`, `help --json`, or equivalent
Review prompt: Can an agent discover how to call this tool and parse the result without scraping prose or inventing structure? The contract should be treated like an API, not a convenience layer.
3. Context discipline
- Output minimization: `--fields`, field masks, projection; pagination; streaming
- Defaults optimized for small payloads; bulk operations batched safely
- Known high-token commands documented with safe usage guidance
Review prompt: Does the CLI help the agent preserve reasoning capacity by returning only what is needed right now? Context-window discipline should be designed in, not left to prompt cleverness.
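Streaming is one concrete form of that discipline: emitting NDJSON lets the caller parse and stop early rather than buffering one oversized payload. A sketch, with an illustrative record source:

```python
import json
import sys
from typing import Iterable, TextIO


def stream_ndjson(records: Iterable[dict], out: TextIO = sys.stdout) -> None:
    """Write one compact JSON object per line (NDJSON)."""
    for record in records:
        out.write(json.dumps(record, separators=(",", ":")) + "\n")
```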
4. Threat model
- Assumed failure modes: hallucinated flags, malformed IDs, path traversal, double encoding, hidden control chars, prompt injection in returned data
- Untrusted input and response classes identified; validation rules documented
- Rejection behavior explicit and testable
Review prompt: If the agent is wrong, vague, overconfident, or manipulated by hostile data, what bad thing can happen? Assume adversarial input and tainted output even when the caller is “helpful.”
5. Safety rails
- `--dry-run` for every mutating action; confirmation model for destructive actions
- Read-only mode; sandboxing for file outputs; sanitization/filtering for returned content
- Idempotency and retry semantics documented
Review prompt: What are the built-in brakes? A safe agent CLI lets the agent validate, narrow, and rehearse before it commits side effects.
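One simple confirmation model: require the caller to echo the target back through a confirm token, so a hallucinated or mistyped ID cannot slip through a destructive path. The `--confirm` flag name here is an assumption, not an established convention.

```python
def confirm_destroy(target_id: str, confirm: str) -> None:
    """Refuse a destructive call unless the confirm token matches the target."""
    if confirm != target_id:
        raise PermissionError(
            f"destructive call refused: pass --confirm {target_id} to proceed"
        )
```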
6. Guidance
- `CONTEXT.md`, `AGENTS.md`, or `SKILL.md` present; non-obvious invariants encoded
- Required workflow rules (e.g. confirm-before-delete, always-use-fields); example calls for high-value workflows; agent-specific gotchas documented
Review prompt: What must the agent know that it cannot infer from `--help` alone? Package that knowledge close to the tool instead of assuming the model will “just know.”
7. Surfaces
- CLI, MCP/tool, extension/plugin, headless automation surfaces
- Shared source of truth across surfaces; surface-specific drift risks identified
Review prompt: Are all invocation surfaces generated from one capability model, or drifting into separate implementations? One core binary with multiple surfaces is safer than parallel logic trees.
8. Auth and identity
- Auth methods; headless auth; service account or delegated auth
- Secret injection path; secret exposure risks and credential scope minimization reviewed
Review prompt: Can this tool run unattended without unsafe browser flows or leaky secret handling? Agent-grade auth should work cleanly in automation and be scoped tightly.
9. Failure modes
- Common bad inputs tested; destructive-path tests; prompt-injection response tests; schema drift tests
- Backward compatibility guarantees; observability (logs, stderr, audit trail, request IDs)
Review prompt: What happens when things go wrong, and how will you know? Good failure design is part of the product, not just a QA task.
10. Decision
- Score each standard 0, 1, or 2
- Non-negotiables passed: structured I/O, input validation, dry-run for writes, stable output contract
- Launch recommendation: no, pilot only, production ready; required fixes before next stage; owner; review date
Review prompt: Is this truly agent-grade? The review exists for decision quality.
## Scoring sheet
Use this scorecard in reviews. Score each dimension 0, 1, or 2.
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Structured I/O | No machine contract | Partial | Complete and stable |
| Runtime introspection | None | Basic help only | Schema/describe is usable |
| Context discipline | Verbose by default | Some filtering | Selective and streamable |
| Input hardening | Trusts caller | Some validation | Adversarial by design |
| Safety rails | No rehearsal | Partial dry-run | Full dry-run + guards |
| Response safety | None | Manual warning only | Sanitized or filtered |
| Packaged guidance | None | Sparse docs | Agent-facing invariants shipped |
| Multi-surface coherence | Fragmented | Partial reuse | One model, many surfaces |
| Headless auth | Human-only | Works awkwardly | Automation-native |
| Failure design | Untested | Some tests | Regression-tested and observable |
Threshold: 16 out of 20 for production readiness, with no zeros on structured I/O, input hardening, safety rails, or contract stability. That keeps the standard aligned with safe execution rather than superficial feature completeness.
## Intent Validation
Frame the review as a five-part truth check:
- Who are we serving? The operator is often an agent, not a person.
- What boundary are we crossing? Files, APIs, credentials, and external data all change the risk profile.
- What must always be true? Deterministic contract, bounded output, validated input, safe mutation path.
- How can it fail? Hallucination, injection, schema drift, secret leakage, destructive action.
- What standard compounds? One source of truth, explicit invariants, stable contracts, testable safety rails.
## Quality Thresholds
Use 0 for absent, 1 for partial, and 2 for solid.
Production bar: 16 out of 20 on the scoring sheet above. Block release if structured I/O, input hardening, safety rails, or contract stability scores zero. The 12-section checklist expands coverage; the 10-dimension scorecard drives the go/no-go decision.
## Links
- Perplexity
- Justin Poehnelt
- Playbooks CLI guidelines
- Building secure AI agents (MCP)
- Gemini CLI skills
- InfoQ AI agent CLI
- Salesforce Agent DX test-run
## Questions
- Which dimension would you relax first under schedule pressure — and what would that cost the first time an agent hallucinates a flag?
- Where does your CLI sit today on “one model, many surfaces” versus parallel logic trees?
- What must always be true for your tool that you have not yet written down?
- If the primary operator is already an agent, what does your CONTEXT.md or SKILL.md assume that the model cannot infer?