# Agent CLI Tools
Agent-first CLIs need predictable contracts, runtime introspection, context discipline, input hardening, dry-run safety, and explicit operating invariants. This page gives a copy-paste checklist and a design-review template with review prompts and a scorecard.
Engineering spec: CLI Improvement Spec — Plan, Comms, ETL — acceptance criteria to bring the Plan, Comms, and Agent ETL CLIs to agent-grade.
## Agent CLI checklist
# Agent CLI Review Checklist
## 1. Identity
- [ ] The CLI’s purpose is clearly stated
- [ ] The system(s) it controls are clearly named
- [ ] The primary operator is identified as human, agent, or hybrid
- [ ] The trust boundary is explicit, for example files, APIs, credentials, or state mutation
- [ ] The source of truth for capabilities is identified
## 2. Contract
- [ ] The command accepts structured input, such as JSON, params, stdin, or file input
- [ ] The command returns machine-readable output, such as JSON or NDJSON
- [ ] Human-friendly output is separated from machine-friendly output
- [ ] Exit codes are documented and stable
- [ ] The output shape is treated as a contract, not an implementation detail
- [ ] Backward compatibility expectations are documented
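A minimal sketch of the output-contract split above, assuming a hypothetical `emit` helper: the machine contract goes to stdout as compact JSON, and any human commentary goes to stderr so it never pollutes the parseable channel.

```python
import json
import sys


def emit(result: dict, human_note: str = "") -> None:
    """Write the machine contract to stdout; route human-facing notes to stderr."""
    json.dump(result, sys.stdout, separators=(",", ":"))
    sys.stdout.write("\n")
    if human_note:
        print(human_note, file=sys.stderr)


emit({"ok": True, "items": [{"id": "plan-1"}]}, human_note="1 plan found")
```

An agent can then pipe stdout straight into a JSON parser while a human still sees the note on the terminal.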
## 3. Runtime introspection
- [ ] The CLI exposes `schema`, `describe`, or equivalent runtime introspection
- [ ] An agent can discover accepted parameters at runtime
- [ ] An agent can discover request body shape at runtime
- [ ] An agent can discover response shape at runtime
- [ ] Constraints and required fields are discoverable without external docs
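One way to satisfy the discoverability items above is a `describe` subcommand backed by a static capability table. The command and field names below are illustrative, not a prescribed schema format.

```python
import json

# Hypothetical capability model for a "get" command; an agent reads this
# at runtime instead of scraping prose documentation.
CAPABILITIES = {
    "commands": {
        "get": {
            "params": {
                "id": {"type": "string", "required": True},
                "fields": {"type": "string", "required": False},
            },
            "returns": {"type": "object", "properties": {"id": {"type": "string"}}},
        }
    }
}


def describe() -> str:
    """Return the capability model as JSON, e.g. for `mycli describe`."""
    return json.dumps(CAPABILITIES, indent=2)
```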
## 4. Context discipline
- [ ] The CLI supports field masks, projections, or `--fields`
- [ ] The CLI supports pagination for large result sets
- [ ] The CLI supports streaming or NDJSON where appropriate
- [ ] Defaults avoid oversized responses
- [ ] High-volume commands include guidance for minimal-output usage
- [ ] Bulk workflows avoid loading unnecessary data into memory or context
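A `--fields` projection can be as simple as filtering each record before serialization, as in this sketch (the record shape is illustrative):

```python
def project(record: dict, fields: str = "") -> dict:
    """Return only the comma-separated keys named in `fields`; empty means all."""
    if not fields:
        return record
    wanted = {f.strip() for f in fields.split(",")}
    return {k: v for k, v in record.items() if k in wanted}
```

Applied on every list/get path, this keeps large bodies and metadata out of the agent's context unless explicitly requested.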
## 5. Input hardening
- [ ] Inputs are treated as untrusted by default
- [ ] File paths are validated against traversal and unsafe destinations
- [ ] Resource IDs are validated and reject malformed embedded query fragments
- [ ] Control characters are rejected or normalized safely
- [ ] Double-encoding and malformed encoding cases are handled
- [ ] Error messages are explicit enough for agents to recover cleanly
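For the path-traversal item, one common pattern is to resolve the candidate path and refuse anything that escapes an allowed root. A sketch, assuming a hypothetical `safe_output_path` helper:

```python
from pathlib import Path


def safe_output_path(root: str, user_path: str) -> Path:
    """Resolve `user_path` under `root`, rejecting traversal outside it."""
    base = Path(root).resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        # Explicit error text helps an agent recover instead of retrying blindly.
        raise ValueError(f"refusing path outside {base}: {user_path!r}")
    return candidate
```

Resolving before comparison defeats `../` sequences and symlinked segments that a simple prefix check would miss.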
## 6. Safety rails
- [ ] All mutating operations support `--dry-run` or equivalent validation mode
- [ ] Destructive operations require explicit confirmation or an explicit override
- [ ] Read-only paths are clearly separated from write paths
- [ ] File output locations can be sandboxed or constrained
- [ ] Retry behavior is safe and documented
- [ ] Idempotency expectations are documented for repeated calls
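A dry-run guard for a mutating command can report the planned change without applying it. In this sketch, `apply_delete` is a stand-in for the real side effect:

```python
def delete_resource(resource_id: str, *, dry_run: bool, apply_delete) -> dict:
    """Rehearse or apply a delete; the returned plan is the machine contract."""
    plan = {"action": "delete", "id": resource_id}
    if dry_run:
        return {**plan, "applied": False}
    apply_delete(resource_id)
    return {**plan, "applied": True}
```

Because the dry-run and real paths return the same shape, an agent can validate the plan once and then re-issue the identical call without `--dry-run`.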
## 7. Response safety
- [ ] Returned content is treated as untrusted
- [ ] The CLI supports sanitization, filtering, or safe rendering for returned content
- [ ] Prompt-injection risk from returned data has been considered
- [ ] Unsafe raw content is not silently mixed into trusted instruction channels
- [ ] The design documents what happens when unsafe content is detected
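One minimal sketch of response hygiene: strip control characters and return fetched text under an explicitly labeled untrusted key, rather than splicing it into an instruction channel. The field name is illustrative.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Drop control/format characters except newline and tab."""
    return "".join(
        c for c in text if unicodedata.category(c)[0] != "C" or c in "\n\t"
    )


def wrap_untrusted(text: str) -> dict:
    """Keep external content in a clearly marked untrusted field."""
    return {"untrusted_content": sanitize(text)}
```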
## 8. Agent guidance
- [ ] The repo includes `CONTEXT.md`, `AGENTS.md`, `SKILL.md`, or equivalent
- [ ] Non-obvious operating invariants are explicitly written down
- [ ] Guidance covers confirmation-before-delete or equivalent write safety
- [ ] Guidance covers using fields/projections on list and get operations
- [ ] Guidance includes safe examples for common workflows
- [ ] Guidance is maintained close to the tool, not only in external docs
## 9. Multi-surface consistency
- [ ] The CLI surface and tool surface are derived from the same capability model
- [ ] MCP or other tool protocols do not introduce drift from core behavior
- [ ] Shell usage and typed-tool usage produce consistent semantics
- [ ] Environment variables are documented and behave predictably
- [ ] There is one source of truth for commands, schemas, and auth behavior
## 10. Auth and operations
- [ ] Headless authentication is supported
- [ ] Browser-only auth flows are not required for automated use
- [ ] Secrets can be injected safely through files or environment variables
- [ ] Credential scope is minimized
- [ ] Auditability is considered, for example logs, request IDs, or action traces
- [ ] Operational failure modes are documented
## 11. Testing and failure modes
- [ ] Tests cover malformed agent-style inputs
- [ ] Tests cover destructive-path safety behavior
- [ ] Tests cover schema drift and output regressions
- [ ] Tests cover prompt-injection or unsafe returned-content scenarios
- [ ] Tests verify stable machine-readable output
- [ ] Observability is sufficient to diagnose failures in unattended runs
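Malformed-input coverage can look like the pytest-style sketch below, assuming a hypothetical `validate_id` helper that rejects IDs carrying embedded query fragments:

```python
import re


def validate_id(resource_id: str) -> str:
    """Accept only plain alphanumeric/underscore/hyphen IDs up to 64 chars."""
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,64}", resource_id):
        raise ValueError(f"malformed resource id: {resource_id!r}")
    return resource_id


def test_rejects_embedded_query_fragment():
    try:
        validate_id("plan-1?fields=*")
        assert False, "expected rejection"
    except ValueError:
        pass


def test_accepts_well_formed_id():
    assert validate_id("plan-1") == "plan-1"
```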
## 12. Release decision
- [ ] Structured input and output are production-ready
- [ ] Input validation is production-ready
- [ ] Dry-run support for mutating commands is production-ready
- [ ] The stable machine-readable contract is production-ready
- [ ] Known gaps are listed with owners
- [ ] Review outcome is recorded: reject, pilot, or approve
## Scoring
- [ ] Score each section 0, 1, or 2
- [ ] Total score recorded out of 24
- [ ] No zero scores in Contract, Input hardening, Safety rails, or Testing
- [ ] Launch threshold met
## Design review template
Use the checklist above for coverage; use the prompts below to stress-test each dimension. The point of the review is decision quality, not document completion — is this truly agent-grade, or human-grade with agent hopes attached?
1. Identity
- CLI name, system it controls, primary jobs-to-be-done
- Primary operator: human, agent, or hybrid
- Trust boundary crossed: local files, remote API, credentials, state mutation, external content
- Source of truth for capability model: API schema, discovery doc, internal spec, other
Review prompt: What is this tool for, who is the real operator, and what real-world system does it have power over? A strong agent CLI starts by naming the boundary clearly; agents are fast and confident but can still be wrong in new ways.
2. Contract
- Canonical command shape; structured input (e.g. `--json`, `--params`, stdin, file); structured output (JSON, NDJSON, or both)
- Stable exit codes documented; human-friendly output isolated from machine output
- Runtime introspection: `schema`, `describe`, `help --json`, or equivalent
Review prompt: Can an agent discover how to call this tool and parse the result without scraping prose or inventing structure? The contract should be treated like an API, not a convenience layer.
3. Context discipline
- Output minimization: `--fields`, field masks, projection; pagination; streaming
- Defaults optimized for small payloads; bulk operations batched safely
- Known high-token commands documented with safe usage guidance
Review prompt: Does the CLI help the agent preserve reasoning capacity by returning only what is needed right now? Context-window discipline should be designed in, not left to prompt cleverness.
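Streaming is one concrete form of that discipline: emitting NDJSON lets the caller parse and stop early rather than buffering one oversized payload. A sketch, with an illustrative record source:

```python
import json
import sys
from typing import Iterable, TextIO


def stream_ndjson(records: Iterable[dict], out: TextIO = sys.stdout) -> None:
    """Write one compact JSON object per line (NDJSON)."""
    for record in records:
        out.write(json.dumps(record, separators=(",", ":")) + "\n")
```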
4. Threat model
- Assumed failure modes: hallucinated flags, malformed IDs, path traversal, double encoding, hidden control chars, prompt injection in returned data
- Untrusted input and response classes identified; validation rules documented
- Rejection behavior explicit and testable
Review prompt: If the agent is wrong, vague, overconfident, or manipulated by hostile data, what bad thing can happen? Assume adversarial input and tainted output even when the caller is “helpful.”
5. Safety rails
- `--dry-run` for every mutating action; confirmation model for destructive actions
- Read-only mode; sandboxing for file outputs; sanitization/filtering for returned content
- Idempotency and retry semantics documented
Review prompt: What are the built-in brakes? A safe agent CLI lets the agent validate, narrow, and rehearse before it commits side effects.
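One simple confirmation model: require the caller to echo the target back through a confirm token, so a hallucinated or mistyped ID cannot slip through a destructive path. The `--confirm` flag name here is an assumption, not an established convention.

```python
def confirm_destroy(target_id: str, confirm: str) -> None:
    """Refuse a destructive call unless the confirm token matches the target."""
    if confirm != target_id:
        raise PermissionError(
            f"destructive call refused: pass --confirm {target_id} to proceed"
        )
```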
6. Guidance
- `CONTEXT.md`, `AGENTS.md`, or `SKILL.md` present; non-obvious invariants encoded
- Required workflow rules (e.g. confirm-before-delete, always-use-fields); example calls for high-value workflows; agent-specific gotchas documented
Review prompt: What must the agent know that it cannot infer from `--help` alone? Package that knowledge close to the tool instead of assuming the model will “just know.”
7. Surfaces
- CLI, MCP/tool, extension/plugin, headless automation surfaces
- Shared source of truth across surfaces; surface-specific drift risks identified
Review prompt: Are all invocation surfaces generated from one capability model, or drifting into separate implementations? One core binary with multiple surfaces is safer than parallel logic trees.
8. Auth and identity
- Auth methods; headless auth; service account or delegated auth
- Secret injection path; secret exposure risks and credential scope minimization reviewed
Review prompt: Can this tool run unattended without unsafe browser flows or leaky secret handling? Agent-grade auth should work cleanly in automation and be scoped tightly.
9. Failure modes
- Common bad inputs tested; destructive-path tests; prompt-injection response tests; schema drift tests
- Backward compatibility guarantees; observability (logs, stderr, audit trail, request IDs)
Review prompt: What happens when things go wrong, and how will you know? Good failure design is part of the product, not just a QA task.
10. Decision
- Score each standard 0, 1, or 2
- Non-negotiables passed: structured I/O, input validation, dry-run for writes, stable output contract
- Launch recommendation: no, pilot only, production ready; required fixes before next stage; owner; review date
Review prompt: Is this truly agent-grade? The review exists for decision quality.
## Scoring sheet
Use this scorecard in reviews. Score each dimension 0, 1, or 2.
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Structured I/O | No machine contract | Partial | Complete and stable |
| Runtime introspection | None | Basic help only | Schema/describe is usable |
| Context discipline | Verbose by default | Some filtering | Selective and streamable |
| Input hardening | Trusts caller | Some validation | Adversarial by design |
| Safety rails | No rehearsal | Partial dry-run | Full dry-run + guards |
| Response safety | None | Manual warning only | Sanitized or filtered |
| Packaged guidance | None | Sparse docs | Agent-facing invariants shipped |
| Multi-surface coherence | Fragmented | Partial reuse | One model, many surfaces |
| Headless auth | Human-only | Works awkwardly | Automation-native |
| Failure design | Untested | Some tests | Regression-tested and observable |
Threshold: 16 out of 20 for production readiness, with no zeros on structured I/O, input hardening, safety rails, or contract stability. That keeps the standard aligned with safe execution rather than superficial feature completeness.
## Intent Validation
Frame the review as a five-part truth check:
- Who are we serving? The operator is often an agent, not a person.
- What boundary are we crossing? Files, APIs, credentials, and external data all change the risk profile.
- What must always be true? Deterministic contract, bounded output, validated input, safe mutation path.
- How can it fail? Hallucination, injection, schema drift, secret leakage, destructive action.
- What standard compounds? One source of truth, explicit invariants, stable contracts, testable safety rails.
## Quality Thresholds
Use 0 for absent, 1 for partial, and 2 for solid.
Production bar: 16 out of 20 on the scoring sheet above. Block release if structured I/O, input hardening, safety rails, or contract stability scores zero. The 12-section checklist expands coverage; the 10-dimension scorecard drives the go/no-go decision.
## Links
- Perplexity
- Justin Poehnelt
- Playbooks CLI guidelines
- Building secure AI agents (MCP)
- Gemini CLI skills
- InfoQ AI agent CLI
- Salesforce Agent DX test-run
## Questions
- Which dimension would you relax first under schedule pressure — and what would that cost the first time an agent hallucinates a flag?
- Where does your CLI sit today on “one model, many surfaces” versus parallel logic trees?
- What must always be true for your tool that you have not yet written down?
- If the primary operator is already an agent, what does your CONTEXT.md or SKILL.md assume that the model cannot infer?