
Voice & Speech

Text-to-Speech (TTS), Speech-to-Text (STT), voice cloning, and conversational AI.

The Voice Stack

     INPUT                 PROCESSING             OUTPUT
┌─────────────┐       ┌──────────────────┐       ┌────────┐
│ Human Voice │  ──→  │ STT (Transcribe) │  ──→  │  Text  │
└─────────────┘       └──────────────────┘       └────────┘

┌─────────────┐       ┌──────────────────┐       ┌────────┐
│    Text     │  ──→  │ TTS (Synthesize) │  ──→  │ Speech │
└─────────────┘       └──────────────────┘       └────────┘

┌─────────────┐       ┌──────────────────┐       ┌────────┐
│ Human Voice │  ──→  │  Voice-to-Voice  │  ──→  │ Speech │
└─────────────┘       └──────────────────┘       └────────┘

Capability Matrix

Text-to-Speech (TTS)

| Provider | Model | Latency | Quality | Cloning | Open Source | Notes |
|---|---|---|---|---|---|---|
| ElevenLabs | Turbo v2.5 | ~150ms | Excellent | Yes | No | Market leader |
| OpenAI | TTS-1-HD | ~200ms | Very Good | No | No | Simple API |
| Qwen | Qwen3-TTS | Variable | Good | Yes | Yes | Open weights |
| Cartesia | Sonic | ~90ms | Very Good | Yes | No | Ultra low latency |
| PlayHT | PlayHT 2.0 | ~150ms | Very Good | Yes | No | |
| Coqui | XTTS v2 | Variable | Good | Yes | Yes | Self-hostable |
| Fish Audio | Fish Speech | Variable | Good | Yes | Yes | |
| Parler TTS | Parler Mini | Variable | Good | No | Yes | Prompt-based control |

Speech-to-Text (STT)

| Provider | Model | Latency | Accuracy | Languages | Open Source |
|---|---|---|---|---|---|
| OpenAI | Whisper | Real-time | Excellent | 100+ | Yes |
| Deepgram | Nova-2 | ~100ms | Excellent | 30+ | No |
| AssemblyAI | Universal | ~200ms | Excellent | 30+ | No |
| Google | Chirp | ~150ms | Very Good | 125+ | No |
| Groq | Whisper | ~50ms | Excellent | 100+ | Via API |

Voice Cloning

| Provider | Sample Required | Quality | Real-time | Notes |
|---|---|---|---|---|
| ElevenLabs | 30 sec - 3 min | Excellent | Yes | Best quality |
| OpenAI | Not available | — | — | Policy restriction |
| Cartesia | 10 sec | Very Good | Yes | |
| PlayHT | 30 sec | Good | Yes | |
| Coqui XTTS | 6 sec | Good | Yes | Open source |

Use Cases

| Application | Stack | Latency Requirement |
|---|---|---|
| Voice Assistants | STT → LLM → TTS | < 500ms total |
| Audiobooks | TTS (long-form) | Batch OK |
| Podcasts | TTS + voice cloning | Batch OK |
| Call Centers | Real-time STT + TTS | < 200ms |
| Gaming NPCs | TTS + emotion control | < 100ms |
| Accessibility | Screen reader TTS | < 50ms |
| Translation | STT → Translate → TTS | < 1s |
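The latency targets above can be sanity-checked against a per-stage budget. A minimal sketch in Python; the stage latencies below are illustrative placeholders, not measured numbers:

```python
# Per-stage latency budget for an STT -> LLM -> TTS voice assistant.
# Numbers are illustrative placeholders, not benchmarks.
STAGE_LATENCY_MS = {
    "stt": 100,       # streaming STT, time to final transcript
    "llm_ttft": 250,  # LLM time-to-first-token
    "tts_ttfb": 150,  # TTS time-to-first-byte of audio
}

def total_latency_ms(stages: dict) -> int:
    """Sum per-stage latencies into an end-to-end estimate."""
    return sum(stages.values())

def within_budget(budget_ms: int, stages: dict) -> bool:
    """True if the pipeline fits the use case's latency requirement."""
    return total_latency_ms(stages) <= budget_ms

print(total_latency_ms(STAGE_LATENCY_MS))    # 500
print(within_budget(500, STAGE_LATENCY_MS))  # True  (voice assistant)
print(within_budget(200, STAGE_LATENCY_MS))  # False (call center)
```

With these placeholder numbers the pipeline fits a voice-assistant budget but not a call-center one, which is why real-time stacks lean on streaming at every stage.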

Quality Dimensions

| Dimension | What It Measures | Best For |
|---|---|---|
| Naturalness | Sounds human? | Assistants, audiobooks |
| Latency | Time to first byte | Real-time conversation |
| Expressiveness | Emotion, tone control | Gaming, media |
| Consistency | Same voice across calls | Brand voice |
| Languages | Multilingual support | Global products |
| Cost | Per character/second | Scale applications |
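Because hosted TTS is typically billed per character, the cost dimension is easy to model up front. A quick sketch; the $0.015 rate is a hypothetical placeholder, not any provider's actual price:

```python
def tts_cost_usd(text: str, usd_per_1k_chars: float) -> float:
    """Estimate TTS cost under per-character billing."""
    return len(text) / 1000 * usd_per_1k_chars

# A 200,000-character audiobook at a hypothetical $0.015 / 1k characters:
print(round(tts_cost_usd("x" * 200_000, 0.015), 2))  # 3.0
```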

Architecture Patterns

Pattern 1: Voice Assistant

User speaks
    ↓
STT (Whisper/Deepgram)
    ↓
LLM (Claude/GPT)
    ↓
TTS (ElevenLabs/Cartesia)
    ↓
User hears response

Latency budget: ~300-500ms end-to-end for natural conversation.
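The loop above can be sketched with stubbed stages. Each stub stands in for a provider call (the providers named in the diagram); a real deployment would stream audio and tokens between stages rather than pass whole buffers:

```python
# Stubbed STT -> LLM -> TTS turn. Each function is a placeholder for
# a provider SDK/API call (e.g. Whisper/Deepgram, Claude/GPT,
# ElevenLabs/Cartesia).
def stt(audio: bytes) -> str:
    # Stub: pretend the audio decodes to this utterance.
    return "what's the weather today"

def llm(transcript: str) -> str:
    # Stub: a real system streams tokens to cut time-to-first-byte.
    return f"You asked: {transcript}. It's sunny."

def tts(text: str) -> bytes:
    # Stub: a real system streams synthesized audio frames.
    return text.encode("utf-8")

def assistant_turn(audio_in: bytes) -> bytes:
    """One conversational turn: transcribe, respond, synthesize."""
    return tts(llm(stt(audio_in)))

print(assistant_turn(b"<pcm frames>"))
```

The composition `tts(llm(stt(...)))` is the whole pattern; the engineering work is in overlapping the three stages so their latencies don't simply add up.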

Pattern 2: Voice Cloning Pipeline

Sample audio (30 sec)
    ↓
Voice embedding extraction
    ↓
Model fine-tuning (or prompt)
    ↓
TTS with cloned voice
    ↓
Quality verification
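The final verification step is commonly implemented by comparing speaker embeddings of the reference audio and the cloned output. A minimal sketch; in practice the embeddings come from a speaker-encoder model, and the 0.85 threshold is an illustrative assumption:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def clone_passes(ref_emb, synth_emb, threshold: float = 0.85) -> bool:
    """Accept the clone if reference and synthetic voices are close enough."""
    return cosine_similarity(ref_emb, synth_emb) >= threshold

# Identical embeddings pass; orthogonal ones (different voices) fail.
print(clone_passes([0.2, 0.9, 0.1], [0.2, 0.9, 0.1]))  # True
print(clone_passes([1.0, 0.0], [0.0, 1.0]))            # False
```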

Pattern 3: Real-time Translation

Source language audio
    ↓
STT (source)
    ↓
Translation LLM
    ↓
TTS (target language)
    ↓
Translated speech
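The same stub pattern applies here, with a translation stage in the middle. The tiny phrase table below is purely illustrative; a real system would call a translation-capable LLM:

```python
# Stubbed speech-to-speech translation: STT -> translate -> TTS.
PHRASE_TABLE = {"hola": "hello", "gracias": "thank you"}  # illustrative

def stt_source(audio: bytes) -> str:
    return "hola"  # stub: source-language transcript

def translate(text: str) -> str:
    # Stub: stands in for a translation LLM call.
    return PHRASE_TABLE.get(text, text)

def tts_target(text: str) -> bytes:
    return text.encode("utf-8")  # stub: target-language audio

def translate_speech(audio_in: bytes) -> bytes:
    return tts_target(translate(stt_source(audio_in)))

print(translate_speech(b"<spanish audio>"))  # b'hello'
```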

Open Source Options

For self-hosting and privacy:

| Tool | Function | License | Notes |
|---|---|---|---|
| Whisper | STT | MIT | OpenAI, excellent quality |
| Coqui TTS | TTS | MPL 2.0 | XTTS for cloning |
| Piper | TTS | MIT | Fast, lightweight |
| Fish Speech | TTS + cloning | Apache 2.0 | |
| Parler TTS | TTS | Apache 2.0 | Prompt-controlled style |
| Qwen TTS | TTS | Apache 2.0 | Emerging quality |

The DePIN Angle

Voice services as decentralized infrastructure:

| Traditional | DePIN Potential |
|---|---|
| Centralized API | Distributed inference nodes |
| Per-request pricing | Token-based access |
| Single provider | Community operators |
| Privacy concerns | Local processing |

Opportunity: voice inference as edge compute. Low latency requires proximity to the user, so community operators can deploy local nodes and earn tokens for serving requests.


Selection Guide

| If You Need | Consider |
|---|---|
| Best quality, budget exists | ElevenLabs |
| Lowest latency | Cartesia |
| Simple integration | OpenAI TTS |
| Self-hosting | Coqui XTTS, Piper |
| Voice cloning, open source | Fish Speech, Qwen TTS |
| Budget conscious | Qwen TTS, Piper |
| Enterprise scale STT | Deepgram, AssemblyAI |
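The guide above can double as a lookup in application code. The keys and mapping mirror the table; everything else about this helper is an assumption for illustration:

```python
# Lookup mirroring the selection guide table above.
SELECTION_GUIDE = {
    "best_quality": ["ElevenLabs"],
    "lowest_latency": ["Cartesia"],
    "simple_integration": ["OpenAI TTS"],
    "self_hosting": ["Coqui XTTS", "Piper"],
    "open_source_cloning": ["Fish Speech", "Qwen TTS"],
    "budget": ["Qwen TTS", "Piper"],
    "enterprise_stt": ["Deepgram", "AssemblyAI"],
}

def recommend(need: str) -> list[str]:
    """Return candidate providers for a need, or an empty list."""
    return SELECTION_GUIDE.get(need, [])

print(recommend("lowest_latency"))  # ['Cartesia']
```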

Context