Voice Agents

Voice is not just another input field. It is a higher-bandwidth way to gather intent, urgency, emotion, objections, and constraints before an agent acts.

The product question is not "can the app talk?" The question is: which workflow becomes more valuable when the system can listen, reason, act, and leave an audit trail?

Who Needs This

Product builders use this page to decide whether voice is a feature, a workflow wedge, or a full product surface.

Engineers use it to place voice inside the architecture without binding the domain model to a voice vendor.

Operators use it to identify the first voice workflow worth automating and the point where a human must take over.

Core Pattern

A voice agent becomes valuable when it owns a job from first utterance to final action.

The weak version:

A user talks.
The app transcribes.
The app summarizes.
A human still does the work.

The strong version:

A user talks.
The agent captures intent and constraints.
The agent verifies identity and permission.
The agent calls the right tools.
The agent completes the workflow or escalates with context preserved.
The system records what happened, why, and whether the outcome was good.

That last line is the moat. Every recorded outcome feeds the eval loop, and the eval loop compounds.

Architecture

Use a ports-and-adapters shape.

Voice port
  -> speech-to-text adapter
  -> intent and workflow core
  -> tool adapters
  -> system-of-record adapters
  -> audit and eval trail
  -> speech response or human handoff

The core workflow should not know whether the user came through Vapi, LiveKit, Twilio, a browser microphone, or a future speech-native model. Voice is a port. Telephony, speech-to-text, text-to-speech, and orchestration tools are adapters.

This matters because voice infrastructure will change quickly. The workflow logic, permissions, evals, and system integrations are the durable product.
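One way to picture the port boundary is the minimal Python sketch below. Every name in it (Utterance, VoicePort, ScriptedVoiceAdapter, handle_turn) is hypothetical and chosen for illustration; none is taken from Vapi, LiveKit, Twilio, or any other vendor SDK. The keyword match stands in for real intent extraction.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Utterance:
    """Vendor-neutral representation of one caller turn."""
    session_id: str
    text: str


class VoicePort(Protocol):
    """What the workflow core needs from any voice channel (hypothetical)."""

    def listen(self) -> Utterance: ...
    def say(self, session_id: str, text: str) -> None: ...
    def hand_off(self, session_id: str, summary: str) -> None: ...


class ScriptedVoiceAdapter:
    """In-memory stand-in for a real telephony or browser adapter."""

    def __init__(self, session_id: str, turns: list[str]) -> None:
        self.session_id = session_id
        self._turns = list(turns)
        self.spoken: list[str] = []
        self.handoffs: list[str] = []

    def listen(self) -> Utterance:
        # Return the next scripted caller turn.
        return Utterance(self.session_id, self._turns.pop(0))

    def say(self, session_id: str, text: str) -> None:
        self.spoken.append(text)

    def hand_off(self, session_id: str, summary: str) -> None:
        self.handoffs.append(summary)


def handle_turn(voice: VoicePort) -> str:
    """Workflow core: depends only on the port, never on a vendor SDK."""
    utterance = voice.listen()
    # Placeholder intent check; a real core would call an intent model here.
    if "reschedule" in utterance.text.lower():
        voice.say(utterance.session_id, "I can help you reschedule.")
        return "reschedule"
    voice.hand_off(utterance.session_id, f"Unrecognized request: {utterance.text}")
    return "handoff"
```

Swapping the scripted adapter for a telephony or browser adapter should leave handle_turn unchanged. That is the property the port is meant to protect.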

Stack Decisions

Build the parts that compound into trust:

Workflow logic: how the job gets done.
Vertical integration: CRM, EMR, ERP, scheduling, billing, documents.
Eval logic: how calls are judged, scored, replayed, and improved.
Permission model: what the agent can see, say, change, and escalate.
Handoff state: how the human receives the call context without making the caller repeat themselves.

Buy or abstract the parts that are likely to commoditize:

Speech-to-text.
Text-to-speech.
Base language models.
Telephony.
Commodity voice orchestration.

The strategic mistake is spending energy on the voice wrapper while the workflow core remains generic.

Eval Loop

Voice agents fail in ways ordinary monitoring will not catch. They may stay online, return HTTP 200 responses, and still damage trust by misunderstanding intent, looping, sounding wrong, or escalating too late.

Track these signals from day one:

Self-serve resolution rate: did the call finish without a human?
Escalation rate by scenario: where does the agent hit its boundary?
Call termination rate: did the caller abandon the interaction?
Per-turn latency: did the conversation rhythm break?
Function-call accuracy: did the agent call the right tool with the right fields?
CSAT versus human baseline: is the agent good enough for this workflow?
Audit completeness: can a reviewer reconstruct what happened?

The first product milestone is not "voice works." It is "we can tell when voice worked."
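Several of the signals above reduce to simple ratios over call records. The sketch below assumes a hypothetical CallRecord shape and assumes a reviewer has already judged which tool calls were correct; latency, CSAT, and audit completeness are left out because they need data this sketch does not model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical per-call record; field names are illustrative."""
    scenario: str
    escalated: bool          # handed to a human
    abandoned: bool          # caller left before resolution
    tool_calls: int          # tool calls the agent attempted
    correct_tool_calls: int  # tool calls a reviewer judged correct


def summarize(calls: list[CallRecord]) -> dict:
    """Compute a few eval-loop signals from a batch of calls."""
    total = len(calls)
    if total == 0:
        raise ValueError("no calls to summarize")
    resolved = sum(1 for c in calls if not c.escalated and not c.abandoned)
    attempted = sum(c.tool_calls for c in calls)
    correct = sum(c.correct_tool_calls for c in calls)
    per_scenario = Counter(c.scenario for c in calls)
    escalated = Counter(c.scenario for c in calls if c.escalated)
    return {
        "self_serve_resolution_rate": resolved / total,
        "termination_rate": sum(1 for c in calls if c.abandoned) / total,
        # None when no tool calls were attempted, to avoid dividing by zero.
        "function_call_accuracy": correct / attempted if attempted else None,
        "escalation_rate_by_scenario": {
            scenario: escalated[scenario] / count
            for scenario, count in per_scenario.items()
        },
    }
```

Keeping escalation broken out by scenario, rather than as one global number, is what shows where the agent hits its boundary.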

Trust Gates

Voice increases context, but it also increases risk. A buyer will only let an agent act when controls are visible.

Every production voice workflow needs:

Identity verification before sensitive information is released.
Permission gating before records are changed.
Reversible actions where possible.
Human handoff as a first-class state.
Call transcript and action audit: transcripts are the source trail; reviewed interpretations drive the eval loop.
Scenario-level evals before each prompt, model, or tool change ships.
Clear refusal and escalation rules for low-confidence turns.

In regulated workflows, the controls are not a compliance appendix. They are the product.
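The identity, permission, and low-confidence gates can be sketched as one function that every action passes through and that always writes an audit entry. The action names, the 0.8 threshold, and the choice to refuse rather than escalate on a failed identity or permission check are assumptions made for illustration, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    REFUSE = "refuse"


@dataclass(frozen=True)
class ActionRequest:
    """Hypothetical request the agent submits before acting."""
    action: str
    identity_verified: bool
    granted_permissions: frozenset[str]
    confidence: float  # assumed score in [0, 1] for the current turn


# Illustrative values only.
SENSITIVE_ACTIONS = frozenset({"read_record", "update_record"})
MIN_CONFIDENCE = 0.8


def gate(request: ActionRequest, audit_log: list[dict]) -> Decision:
    """Apply the gates in order and record the decision with its reason."""
    if request.action in SENSITIVE_ACTIONS and not request.identity_verified:
        decision, reason = Decision.REFUSE, "identity not verified"
    elif request.action not in request.granted_permissions:
        decision, reason = Decision.REFUSE, "permission not granted"
    elif request.confidence < MIN_CONFIDENCE:
        decision, reason = Decision.ESCALATE, "low confidence"
    else:
        decision, reason = Decision.ALLOW, "all gates passed"
    # Every decision is logged, including refusals, so a reviewer can replay it.
    audit_log.append(
        {"action": request.action, "decision": decision.value, "reason": reason}
    )
    return decision
```

Logging the reason alongside the decision is what lets a refusal or escalation later become an eval fixture.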

Best Wedges

Start where conversation is already the work.

Good wedges:

Legal intake where the first conversation determines matter quality.
Insurance prior authorization where voice, documents, and policy rules collide.
Expert knowledge capture where tacit knowledge is trapped in senior people's heads.
Field inspection where a worker narrates while capturing images or video.
Industrial operations where radio, phone, paper, and time pressure still dominate.
Complex financial workflows where a caller needs guided decision support.

Weak wedges:

Generic FAQ bots.
Basic appointment reminders.
Simple outbound scheduling.
Any flow where the system only summarizes and a human still owns the action.

The strongest wedge has a measurable outcome, high labor cost, messy context, and a narrow enough first workflow to evaluate.

Prioritization Test

Use these five questions before adding voice to the app:

Does voice capture context that text or forms miss?
Can the agent complete a real job, not just summarize one?
Is the outcome measurable within the system?
Does the workflow justify trust architecture and audit overhead?
Can failures become eval fixtures that improve the next release?

If the answer is no to two or more, voice is probably a demo surface. If the answer is yes to all five, voice may be a product layer.
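The decision rule is small enough to write down. In the sketch below the function name and labels are hypothetical, and the "undecided" result for exactly one "no" is an assumption: the rule above gives no verdict for that case.

```python
QUESTIONS = (
    "Does voice capture context that text or forms miss?",
    "Can the agent complete a real job, not just summarize one?",
    "Is the outcome measurable within the system?",
    "Does the workflow justify trust architecture and audit overhead?",
    "Can failures become eval fixtures that improve the next release?",
)


def classify(answers: list[bool]) -> str:
    """Map five yes/no answers (True means yes) to a verdict."""
    if len(answers) != len(QUESTIONS):
        raise ValueError(f"expected {len(QUESTIONS)} answers")
    no_count = answers.count(False)
    if no_count >= 2:
        return "demo surface"
    if no_count == 0:
        return "product layer candidate"
    # Exactly one "no": not covered by the stated rule, so this label is assumed.
    return "undecided"
```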

App Use Cases

For Stackmates, the highest-value voice use cases are not generic assistants. They are context capture and workflow ownership.

Voice intake for work creation: a founder speaks a messy idea; the app extracts intent, constraints, desired outcome, candidate workchart, and unresolved questions.

Voice eval for customer calls: a call becomes a traceable object with transcript, decisions, score, objections, follow-up actions, and quality gates.

Voice handoff for agent workflows: when an agent lacks confidence, it transfers to a human with summary, transcript, attempted actions, and recommended next move.

Voice knowledge capture: senior operators narrate how work really happens; the app structures it into process, workchart, playbook, and demand stories.

Voice outcome readback: the app reads back what it understood before execution, captures correction, and stores the final intent record.

These use cases share the same substrate: voice port, intent extraction, context graph, workflow routing, permission gate, eval trace, and outcome ledger.
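As one concrete piece of that substrate, the voice handoff use case implies a record that travels with the call. The sketch below is a hypothetical shape for it, using the four items named in that use case (summary, transcript, attempted actions, recommended next move); the class and method names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffPacket:
    """Hypothetical context handed to a human when the agent escalates."""
    session_id: str
    summary: str
    transcript: list[str]
    attempted_actions: list[str] = field(default_factory=list)
    recommended_next_move: str = ""

    def briefing(self) -> str:
        """Short text a human can read before picking up the call."""
        actions = ", ".join(self.attempted_actions) or "none"
        next_move = self.recommended_next_move or "not provided"
        return (
            f"Summary: {self.summary}\n"
            f"Attempted actions: {actions}\n"
            f"Recommended next move: {next_move}\n"
            f"Transcript turns: {len(self.transcript)}"
        )
```

The point of the briefing is that the caller does not have to repeat themselves; the full transcript stays attached for review.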

Context

Voice & Speech - modality-level stack reference
Agentic Workflows - workflow design principles
Trust Architecture - permission, audit, and intent verification
Hexagonal Architecture - ports and adapters
Context Graphs - durable memory and state

Questions

Which workflow in the app becomes more valuable when the user can speak a messy, emotional, incomplete version of the truth?

What is the first voice workflow where a measurable outcome can be reached without a human?
What must the agent prove before it earns permission to act?
Which failures should become eval fixtures before the next voice release?
