Voice & Speech
Text-to-Speech (TTS), Speech-to-Text (STT), voice cloning, and conversational AI.
The Voice Stack
INPUT                  PROCESSING                  OUTPUT
┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│  Human Voice  │ ──→ │  STT (Transcribe) │ ──→ │    Text     │
└───────────────┘     └───────────────────┘     └─────────────┘
┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│     Text      │ ──→ │  TTS (Synthesize) │ ──→ │   Speech    │
└───────────────┘     └───────────────────┘     └─────────────┘
┌───────────────┐     ┌───────────────────┐     ┌─────────────┐
│  Human Voice  │ ──→ │  Voice-to-Voice   │ ──→ │   Speech    │
└───────────────┘     └───────────────────┘     └─────────────┘
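The three modes above can be sketched as a minimal set of interfaces. These names are illustrative, not any particular SDK; note that voice-to-voice can either be a native speech-to-speech model or a composition of the other two stages:

```python
from typing import Callable, Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


def compose_voice_to_voice(
    stt: STT, respond: Callable[[str], str], tts: TTS
) -> Callable[[bytes], bytes]:
    """Build a voice-to-voice function from STT + a text responder + TTS."""
    def convert(audio: bytes) -> bytes:
        return tts.synthesize(respond(stt.transcribe(audio)))
    return convert
```

The Protocols make the point that providers are interchangeable: any object with a matching `transcribe` or `synthesize` method slots into the same pipeline.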
Capability Matrix
Text-to-Speech (TTS)
| Provider | Model | Latency | Quality | Cloning | Open Source | Notes |
|---|---|---|---|---|---|---|
| ElevenLabs | Turbo v2.5 | ~150ms | Excellent | Yes | No | Market leader |
| OpenAI | TTS-1-HD | ~200ms | Very Good | No | No | Simple API |
| Qwen | Qwen3-TTS | Variable | Good | Yes | Yes | Open weights |
| Cartesia | Sonic | ~90ms | Very Good | Yes | No | Ultra low latency |
| PlayHT | PlayHT 2.0 | ~150ms | Very Good | Yes | No | |
| Coqui | XTTS v2 | Variable | Good | Yes | Yes | Self-hostable |
| Fish Audio | Fish Speech | Variable | Good | Yes | Yes | |
| Parler TTS | Parler Mini | Variable | Good | No | Yes | Prompt-based control |
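As a concrete example of the "Simple API" note for OpenAI: its TTS endpoint (`/v1/audio/speech`) takes a JSON body with the model, input text, and a stock voice name. The sketch below only builds that payload; actually sending it requires an API key, and the `"alloy"` voice is one assumed default among the stock voices:

```python
import json


def build_tts_payload(text: str, model: str = "tts-1-hd",
                      voice: str = "alloy", response_format: str = "mp3") -> str:
    """Build the JSON request body for OpenAI's /v1/audio/speech endpoint."""
    return json.dumps({
        "model": model,             # tts-1 (faster) or tts-1-hd (higher quality)
        "input": text,              # the text to synthesize
        "voice": voice,             # one of the provider's stock voices
        "response_format": response_format,  # mp3, opus, aac, flac, ...
    })
```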
Speech-to-Text (STT)
| Provider | Model | Latency | Accuracy | Languages | Open Source |
|---|---|---|---|---|---|
| OpenAI | Whisper | Variable (self-hosted) | Excellent | 100+ | Yes |
| Deepgram | Nova-2 | ~100ms | Excellent | 30+ | No |
| AssemblyAI | Universal | ~200ms | Excellent | 30+ | No |
| Google | Chirp | ~150ms | Very Good | 125+ | No |
| Groq | Whisper | ~50ms | Excellent | 100+ | Yes (hosted) |
Voice Cloning
| Provider | Sample Required | Quality | Real-time | Notes |
|---|---|---|---|---|
| ElevenLabs | 30 sec - 3 min | Excellent | Yes | Best quality |
| OpenAI | Not available | — | — | Policy restriction |
| Cartesia | 10 sec | Very Good | Yes | |
| PlayHT | 30 sec | Good | Yes | |
| Coqui XTTS | 6 sec | Good | Yes | Open source |
Use Cases
| Application | Stack | Latency Requirement |
|---|---|---|
| Voice Assistants | STT → LLM → TTS | < 500ms total |
| Audiobooks | TTS (long-form) | Batch OK |
| Podcasts | TTS + voice cloning | Batch OK |
| Call Centers | Real-time STT + TTS | < 200ms |
| Gaming NPCs | TTS + emotion control | < 100ms |
| Accessibility | Screen reader TTS | < 50ms |
| Translation | STT → Translate → TTS | < 1s |
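The latency requirements above can be checked mechanically against a candidate stack: sum the per-component latencies from the capability matrix, add a fixed allowance for network and buffering overhead (the 50 ms default here is an illustrative assumption), and compare against the budget:

```python
def fits_budget(component_latencies_ms, budget_ms, overhead_ms=50):
    """Sum per-stage latencies plus a fixed network/buffering overhead,
    then compare against the application's latency budget."""
    total = sum(component_latencies_ms) + overhead_ms
    return total, total <= budget_ms


# Voice assistant: Deepgram STT (~100ms) + LLM time-to-first-token
# (~250ms, assumed) + Cartesia TTS (~90ms) against the < 500ms target.
total, ok = fits_budget([100, 250, 90], 500)
```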
Quality Dimensions
| Dimension | What It Measures | Best For |
|---|---|---|
| Naturalness | Sounds human? | Assistants, audiobooks |
| Latency | Time to first byte | Real-time conversation |
| Expressiveness | Emotion, tone control | Gaming, media |
| Consistency | Same voice across calls | Brand voice |
| Languages | Multi-lingual support | Global products |
| Cost | Per character/second | Scale applications |
Architecture Patterns
Pattern 1: Voice Assistant
User speaks
↓
STT (Whisper/Deepgram)
↓
LLM (Claude/GPT)
↓
TTS (ElevenLabs/Cartesia)
↓
User hears response
Latency budget: ~300-500ms end-to-end for natural conversation.
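One common way to hit that budget is to stream: start synthesizing as soon as the LLM emits a complete sentence instead of waiting for the full reply. A stub sketch of that pattern, with all provider calls replaced by placeholder callables:

```python
import re


def sentences(token_stream):
    """Group a stream of LLM tokens into sentences so TTS can start early."""
    buf = ""
    for token in token_stream:
        buf += token
        while True:
            m = re.search(r"[.!?]\s+", buf)   # sentence boundary found so far?
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()                      # flush the trailing fragment


def assistant_turn(audio, stt, llm_stream, tts):
    """STT → streaming LLM → sentence-chunked TTS, yielding audio as it's ready."""
    text = stt(audio)
    for sentence in sentences(llm_stream(text)):
        yield tts(sentence)
```

Because the first audio chunk goes out after the first sentence rather than the full response, perceived latency is dominated by STT plus the LLM's time to first sentence.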
Pattern 2: Voice Cloning Pipeline
Sample audio (30 sec)
↓
Voice embedding extraction
↓
Model fine-tuning (or prompt)
↓
TTS with cloned voice
↓
Quality verification
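The steps above can be sketched as a staged pipeline with the quality check as a gate. Every callable here is a placeholder for a real provider or model, and the similarity threshold is an illustrative assumption:

```python
def clone_pipeline(sample_audio, extract_embedding, synthesize, score,
                   min_similarity=0.85):
    """Embed the reference sample, synthesize with the cloned voice,
    and gate each result on a speaker-similarity score."""
    embedding = extract_embedding(sample_audio)

    def cloned_tts(text):
        audio = synthesize(text, embedding)
        similarity = score(audio, sample_audio)
        if similarity < min_similarity:
            raise ValueError(f"clone similarity {similarity:.2f} below threshold")
        return audio

    return cloned_tts
```

Running the verification on every synthesis call (rather than once at setup) catches drift when the model or prompt changes later.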
Pattern 3: Real-time Translation
Source language audio
↓
STT (source)
↓
Translation LLM
↓
TTS (target language)
↓
Translated speech
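For the real-time variant, the key difference from a batch pipeline is working per finalized segment: translate and re-synthesize each chunk of transcript as it arrives instead of waiting for the full audio. A stub sketch under that assumption:

```python
def streaming_translate(audio_chunks, stt_stream, translate, tts):
    """Per-segment STT → translation → TTS, yielding translated speech
    as each source-language segment is finalized. All callables are stubs
    for a real streaming STT, translation LLM, and TTS provider."""
    for segment in stt_stream(audio_chunks):
        yield tts(translate(segment))
```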
Open Source Options
For self-hosting and privacy:
| Tool | Function | License | Notes |
|---|---|---|---|
| Whisper | STT | MIT | OpenAI, excellent quality |
| Coqui TTS | TTS | MPL 2.0 | XTTS for cloning |
| Piper | TTS | MIT | Fast, lightweight |
| Fish Speech | TTS + cloning | Apache 2.0 | |
| Parler TTS | TTS | Apache 2.0 | Prompt-controlled style |
| Qwen TTS | TTS | Apache 2.0 | Emerging quality |
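Self-hosted Whisper is a few lines with the `openai-whisper` package. Model sizes trade accuracy for speed and VRAM; the thresholds below are ballpark figures from common guidance, not official requirements:

```python
def pick_whisper_size(vram_gb: float) -> str:
    """Rough Whisper model-size picker by available GPU memory (ballpark
    thresholds: large ~10GB, medium ~5GB, small ~2GB, base ~1GB)."""
    for size, need_gb in [("large", 10), ("medium", 5), ("small", 2), ("base", 1)]:
        if vram_gb >= need_gb:
            return size
    return "tiny"


def transcribe(path: str, vram_gb: float = 4.0) -> str:
    """Transcribe an audio file locally (pip install openai-whisper)."""
    import whisper                      # heavy import; weights download on first use
    model = whisper.load_model(pick_whisper_size(vram_gb))
    return model.transcribe(path)["text"]
```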
The DePIN Angle
Voice services as decentralized infrastructure:
| Traditional | DePIN Potential |
|---|---|
| Centralized API | Distributed inference nodes |
| Per-request pricing | Token-based access |
| Single provider | Community operators |
| Privacy concerns | Local processing |
Opportunity: Voice inference as edge compute. Low latency requires proximity. Community operators deploy local nodes, earn tokens for serving requests.
Selection Guide
| If You Need | Consider |
|---|---|
| Best quality, budget exists | ElevenLabs |
| Lowest latency | Cartesia |
| Simple integration | OpenAI TTS |
| Self-hosting | Coqui XTTS, Piper |
| Voice cloning, open source | Fish Speech, Qwen TTS |
| Budget conscious | Qwen TTS, Piper |
| Enterprise scale STT | Deepgram, AssemblyAI |
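The guide above can also be applied programmatically by filtering the capability matrix. The entries below are a subset of the TTS table, with latencies treated as nominal figures:

```python
TTS_MODELS = [
    {"provider": "ElevenLabs", "latency_ms": 150,  "cloning": True,  "open": False},
    {"provider": "OpenAI",     "latency_ms": 200,  "cloning": False, "open": False},
    {"provider": "Cartesia",   "latency_ms": 90,   "cloning": True,  "open": False},
    {"provider": "Coqui XTTS", "latency_ms": None, "cloning": True,  "open": True},
]


def shortlist(models, max_latency_ms=None, need_cloning=False, need_open=False):
    """Return providers meeting all given constraints; unknown (None)
    latency fails any latency requirement."""
    out = []
    for m in models:
        if need_cloning and not m["cloning"]:
            continue
        if need_open and not m["open"]:
            continue
        if max_latency_ms is not None and (
            m["latency_ms"] is None or m["latency_ms"] > max_latency_ms
        ):
            continue
        out.append(m["provider"])
    return out
```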
Context
- AI Modalities — All AI capability types
- LLM Models — Text models for voice pipelines
- Agent Frameworks — Building voice agents