Voice & Speech
Text-to-Speech (TTS), Speech-to-Text (STT), voice cloning, and conversational AI.
The Voice Stack
INPUT                  PROCESSING                  OUTPUT
┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│  Human Voice  │ ──→ │  STT (Transcribe) │ ──→ │    Text     │
└───────────────┘     └───────────────────┘     └─────────────┘
┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│     Text      │ ──→ │  TTS (Synthesize) │ ──→ │   Speech    │
└───────────────┘     └───────────────────┘     └─────────────┘
┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│  Human Voice  │ ──→ │  Voice-to-Voice   │ ──→ │   Speech    │
└───────────────┘     └───────────────────┘     └─────────────┘
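The three modes above can be sketched as a minimal set of interfaces. These names are illustrative, not any particular SDK; note that voice-to-voice can either be a native speech-to-speech model or a composition of the other two stages:

```python
from typing import Callable, Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def compose_voice_to_voice(
    stt: STT, respond: Callable[[str], str], tts: TTS
) -> Callable[[bytes], bytes]:
    """Build a voice-to-voice function from STT + a text responder + TTS."""
    def convert(audio: bytes) -> bytes:
        return tts.synthesize(respond(stt.transcribe(audio)))
    return convert
```

The Protocols make the point that providers are interchangeable: any object with a matching `transcribe` or `synthesize` method slots into the same pipeline.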
Capability Matrix
Text-to-Speech (TTS)
| Provider | Model | Latency | Quality | Cloning | Open Source | Notes |
|---|---|---|---|---|---|---|
| ElevenLabs | Turbo v2.5 | ~150ms | Excellent | Yes | No | Market leader |
| OpenAI | TTS-1-HD | ~200ms | Very Good | No | No | Simple API |
| Qwen | Qwen3-TTS | Variable | Good | Yes | Yes | Open weights |
| Cartesia | Sonic | ~90ms | Very Good | Yes | No | Ultra low latency |
| PlayHT | PlayHT 2.0 | ~150ms | Very Good | Yes | No | |
| Coqui | XTTS v2 | Variable | Good | Yes | Yes | Self-hostable |
| Fish Audio | Fish Speech | Variable | Good | Yes | Yes | |
| Parler TTS | Parler Mini | Variable | Good | No | Yes | Prompt-based control |
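As a concrete example of the "Simple API" note for OpenAI: its TTS endpoint (`/v1/audio/speech`) takes a JSON body with the model, input text, and a stock voice name. The sketch below only builds that payload; actually sending it requires an API key, and the `"alloy"` voice is one assumed default among the stock voices:

```python
import json


def build_tts_payload(text: str, model: str = "tts-1-hd",
                      voice: str = "alloy", response_format: str = "mp3") -> str:
    """Build the JSON request body for OpenAI's /v1/audio/speech endpoint."""
    return json.dumps({
        "model": model,             # tts-1 (faster) or tts-1-hd (higher quality)
        "input": text,              # the text to synthesize
        "voice": voice,             # one of the provider's stock voices
        "response_format": response_format,  # mp3, opus, aac, flac, ...
    })
```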
Speech-to-Text (STT)
| Provider | Model | Latency | Accuracy | Languages | Open Source |
|---|---|---|---|---|---|
| OpenAI | Whisper | Variable (self-hosted) | Excellent | 100+ | Yes |
| Deepgram | Nova-2 | ~100ms | Excellent | 30+ | No |
| AssemblyAI | Universal | ~200ms | Excellent | 30+ | No |
| Google | Chirp | ~150ms | Very Good | 125+ | No |
| Groq | Whisper | ~50ms | Excellent | 100+ | Yes (hosted) |
Voice Cloning
| Provider | Sample Required | Quality | Real-time | Notes |
|---|---|---|---|---|
| ElevenLabs | 30 sec - 3 min | Excellent | Yes | Best quality |
| OpenAI | Not available | — | — | Policy restriction |
| Cartesia | 10 sec | Very Good | Yes | |
| PlayHT | 30 sec | Good | Yes | |
| Coqui XTTS | 6 sec | Good | Yes | Open source |
Use Cases
| Application | Stack | Latency Requirement |
|---|---|---|
| Voice Assistants | STT → LLM → TTS | < 500ms total |
| Audiobooks | TTS (long-form) | Batch OK |
| Podcasts | TTS + voice cloning | Batch OK |
| Call Centers | Real-time STT + TTS | < 200ms |
| Gaming NPCs | TTS + emotion control | < 100ms |
| Accessibility | Screen reader TTS | < 50ms |
| Translation | STT → Translate → TTS | < 1s |
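The latency requirements above can be checked mechanically against a candidate stack: sum the per-component latencies from the capability matrix, add a fixed allowance for network and buffering overhead (the 50 ms default here is an illustrative assumption), and compare against the budget:

```python
def fits_budget(component_latencies_ms, budget_ms, overhead_ms=50):
    """Sum per-stage latencies plus a fixed network/buffering overhead,
    then compare against the application's latency budget."""
    total = sum(component_latencies_ms) + overhead_ms
    return total, total <= budget_ms


# Voice assistant: Deepgram STT (~100ms) + LLM time-to-first-token
# (~250ms, assumed) + Cartesia TTS (~90ms) against the < 500ms target.
total, ok = fits_budget([100, 250, 90], 500)
```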
Quality Dimensions
| Dimension | What It Measures | Best For |
|---|---|---|
| Naturalness | Sounds human? | Assistants, audiobooks |
| Latency | Time to first byte | Real-time conversation |
| Expressiveness | Emotion, tone control | Gaming, media |
| Consistency | Same voice across calls | Brand voice |
| Languages | Multi-lingual support | Global products |
| Cost | Per character/second | Scale applications |
Architecture Patterns
Pattern 1: Voice Assistant
User speaks
↓
STT (Whisper/Deepgram)
↓
LLM (Claude/GPT)
↓
TTS (ElevenLabs/Cartesia)
↓
User hears response
Latency budget: ~300-500ms end-to-end for natural conversation.
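One common way to hit that budget is to stream: start synthesizing as soon as the LLM emits a complete sentence instead of waiting for the full reply. A stub sketch of that pattern, with all provider calls replaced by placeholder callables:

```python
import re


def sentences(token_stream):
    """Group a stream of LLM tokens into sentences so TTS can start early."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            m = re.search(r"[.!?]\s+", buf)   # sentence boundary found so far?
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()                      # flush the trailing fragment


def assistant_turn(audio, stt, llm_stream, tts):
    """STT → streaming LLM → sentence-chunked TTS, yielding audio as it's ready."""
    text = stt(audio)
    for sentence in sentences(llm_stream(text)):
        yield tts(sentence)
```

Because the first audio chunk goes out after the first sentence rather than the full response, perceived latency is dominated by STT plus the LLM's time to first sentence.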
Pattern 2: Voice Cloning Pipeline
Sample audio (30 sec)
↓
Voice embedding extraction
↓
Model fine-tuning (or prompt)
↓
TTS with cloned voice
↓
Quality verification
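The steps above can be sketched as a staged pipeline with the quality check as a gate. Every callable here is a placeholder for a real provider or model, and the similarity threshold is an illustrative assumption:

```python
def clone_pipeline(sample_audio, extract_embedding, synthesize, score,
                   min_similarity=0.85):
    """Embed the reference sample, synthesize with the cloned voice,
    and gate each result on a speaker-similarity score."""
    embedding = extract_embedding(sample_audio)

    def cloned_tts(text):
        audio = synthesize(text, embedding)
        similarity = score(audio, sample_audio)
        if similarity < min_similarity:
            raise ValueError(f"clone similarity {similarity:.2f} below threshold")
        return audio

    return cloned_tts
```

Running the verification on every synthesis call (rather than once at setup) catches drift when the model or prompt changes later.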
Pattern 3: Real-time Translation
Source language audio
↓
STT (source)
↓
Translation LLM
↓
TTS (target language)
↓
Translated speech
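For the real-time variant, the key difference from a batch pipeline is working per finalized segment: translate and re-synthesize each chunk of transcript as it arrives instead of waiting for the full audio. A stub sketch under that assumption:

```python
def streaming_translate(audio_chunks, stt_stream, translate, tts):
    """Per-segment STT → translation → TTS, yielding translated speech
    as each source-language segment is finalized. All callables are stubs
    for a real streaming STT, translation LLM, and TTS provider."""
    for segment in stt_stream(audio_chunks):
        yield tts(translate(segment))
```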
Open Source Options
For self-hosting and privacy:
| Tool | Function | License | Notes |
|---|---|---|---|
| Whisper | STT | MIT | OpenAI, excellent quality |
| Coqui TTS | TTS | MPL 2.0 | XTTS for cloning |
| Piper | TTS | MIT | Fast, lightweight |
| Fish Speech | TTS + cloning | Apache 2.0 | |
| Parler TTS | TTS | Apache 2.0 | Prompt-controlled style |
| Qwen TTS | TTS | Apache 2.0 | Emerging quality |
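Self-hosted Whisper is a few lines with the `openai-whisper` package. Model sizes trade accuracy for speed and VRAM; the thresholds below are ballpark figures from common guidance, not official requirements:

```python
def pick_whisper_size(vram_gb: float) -> str:
    """Rough Whisper model-size picker by available GPU memory (ballpark
    thresholds: large ~10GB, medium ~5GB, small ~2GB, base ~1GB)."""
    for size, need_gb in [("large", 10), ("medium", 5), ("small", 2), ("base", 1)]:
        if vram_gb >= need_gb:
            return size
    return "tiny"


def transcribe(path: str, vram_gb: float = 4.0) -> str:
    """Transcribe an audio file locally (pip install openai-whisper)."""
    import whisper                      # heavy import; weights download on first use
    model = whisper.load_model(pick_whisper_size(vram_gb))
    return model.transcribe(path)["text"]
```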
The DePIN Angle
Voice services as decentralized infrastructure:
| Traditional | DePIN Potential |
|---|---|
| Centralized API | Distributed inference nodes |
| Per-request pricing | Token-based access |
| Single provider | Community operators |
| Privacy concerns | Local processing |
Opportunity: Voice inference as edge compute. Low latency requires proximity. Community operators deploy local nodes, earn tokens for serving requests.
Selection Guide
| If You Need | Consider |
|---|---|
| Best quality, budget exists | ElevenLabs |
| Lowest latency | Cartesia |
| Simple integration | OpenAI TTS |
| Self-hosting | Coqui XTTS, Piper |
| Voice cloning, open source | Fish Speech, Qwen TTS |
| Budget conscious | Qwen TTS, Piper |
| Enterprise scale STT | Deepgram, AssemblyAI |
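The guide above can also be applied programmatically by filtering the capability matrix. The entries below are a subset of the TTS table, with latencies treated as nominal figures:

```python
TTS_MODELS = [
    {"provider": "ElevenLabs", "latency_ms": 150,  "cloning": True,  "open": False},
    {"provider": "OpenAI",     "latency_ms": 200,  "cloning": False, "open": False},
    {"provider": "Cartesia",   "latency_ms": 90,   "cloning": True,  "open": False},
    {"provider": "Coqui XTTS", "latency_ms": None, "cloning": True,  "open": True},
]


def shortlist(models, max_latency_ms=None, need_cloning=False, need_open=False):
    """Return providers meeting all given constraints; unknown (None)
    latency fails any latency requirement."""
    out = []
    for m in models:
        if need_cloning and not m["cloning"]:
            continue
        if need_open and not m["open"]:
            continue
        if max_latency_ms is not None and (
            m["latency_ms"] is None or m["latency_ms"] > max_latency_ms
        ):
            continue
        out.append(m["provider"])
    return out
```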
Context
- AI Modalities — All AI capability types
- LLM Models — Text models for voice pipelines
- Agent Frameworks — Building voice agents