AI Modalities
What AI can do. The interfaces between intelligence and the world.
The Modality Matrix
| Modality | Input | Output | Key Players | Stage |
|---|---|---|---|---|
| Text | Text | Text | OpenAI, Anthropic, Google | Mature |
| Voice | Text/Audio | Speech | ElevenLabs, OpenAI, Qwen | Maturing |
| Vision | Image/Video | Text/Analysis | GPT-4V, Claude, Gemini | Maturing |
| Image Gen | Text | Image | Midjourney, DALL-E, Flux | Mature |
| Video | Text/Image | Video | Sora, Runway, Kling | Early |
| Audio | Text | Music/Sound | Suno, Udio, Stable Audio | Early |
| 3D | Text/Image | 3D Models | Tripo, Meshy, Rodin | Emerging |
| Code | Text | Code | Claude, Cursor, Copilot | Mature |
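The matrix above can also be treated as data. The following is a minimal sketch, assuming a hypothetical `Modality` record and `find` helper (neither comes from any library), that encodes the rows and filters them by accepted input and stage.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Modality:
    """One row of the modality matrix (hypothetical structure for illustration)."""
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    stage: str


# Rows transcribed from the table; input/output labels are lowercased.
MATRIX = [
    Modality("Text", frozenset({"text"}), frozenset({"text"}), "Mature"),
    Modality("Voice", frozenset({"text", "audio"}), frozenset({"speech"}), "Maturing"),
    Modality("Vision", frozenset({"image", "video"}), frozenset({"text", "analysis"}), "Maturing"),
    Modality("Image Gen", frozenset({"text"}), frozenset({"image"}), "Mature"),
    Modality("Video", frozenset({"text", "image"}), frozenset({"video"}), "Early"),
    Modality("Audio", frozenset({"text"}), frozenset({"music", "sound"}), "Early"),
    Modality("3D", frozenset({"text", "image"}), frozenset({"3d model"}), "Emerging"),
    Modality("Code", frozenset({"text"}), frozenset({"code"}), "Mature"),
]


def find(accepts: str, stage: Optional[str] = None) -> List[str]:
    """Return names of modalities that accept `accepts`, optionally filtered by stage."""
    return [
        m.name
        for m in MATRIX
        if accepts in m.inputs and (stage is None or m.stage == stage)
    ]


print(find("image"))                 # modalities that take an image as input
print(find("text", stage="Mature"))  # mature modalities that take text
```

Keeping the rows as plain records makes it easy to update the Stage column as a modality matures.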
Modality Convergence
The trend: multimodal models that handle every input and output type natively, instead of chaining one specialized model per modality.
| Year | Stage | How it works |
|---|---|---|
| 2023 | Text only | Separate models for each task |
| 2025 | Multimodal (chained) | Models + APIs piped together |
| 2027 | Native omni (unified) | Single model, any in → any out |
The 2027 thesis: if native omnimodal models arrive as expected, they would eliminate the need for specialized modality APIs. Input anything, output anything.
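To make the chained stage concrete, here is a minimal sketch of a voice agent built by piping three stages together. All three stage functions are hypothetical stubs that only rewrite a string prefix; in a real system each would be a call to a separate model or API.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def speech_to_text(payload: str) -> str:
    """Hypothetical stub: stands in for a speech recognition model."""
    return payload.replace("audio:", "text:", 1)


def generate_reply(payload: str) -> str:
    """Hypothetical stub: stands in for a text model."""
    return payload + " -> reply"


def text_to_speech(payload: str) -> str:
    """Hypothetical stub: stands in for a speech synthesis model."""
    return payload.replace("text:", "audio:", 1)


def chain(stages: List[Stage]) -> Stage:
    """Compose stages left to right; each hop adds latency and a failure point."""
    def run(payload: str) -> str:
        for stage in stages:
            payload = stage(payload)
        return payload
    return run


# Chained: three models piped together.
chained_voice_agent = chain([speech_to_text, generate_reply, text_to_speech])

print(chained_voice_agent("audio:hello"))
```

Under the unified thesis, the list passed to `chain` would shrink to a single stage that accepts audio and returns audio directly, which removes the intermediate text hops.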
Selection Criteria
When choosing modality tools:
| Criterion | Question | Why It Matters |
|---|---|---|
| Latency | Real-time or batch? | Live voice/video interaction needs <200ms |
| Quality | Good enough vs perfect? | Production vs prototype |
| Cost | Per token/second/image? | Scale economics |
| Control | Fine-tuning available? | Brand voice, style |
| Privacy | Where does the data stay? | Enterprise requirements |
| Integration | API quality? | Developer experience |
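One way to apply these criteria is a weighted score. The sketch below is an assumption for illustration, not a prescribed method: the tool names, the 1 to 5 ratings, and the weights are all made up, and the weights would change with the use case (here latency is weighted highest, as for a real-time voice product).

```python
from typing import Dict

CRITERIA = ("latency", "quality", "cost", "control", "privacy", "integration")


def score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion ratings; missing entries count as 0."""
    total_weight = sum(weights.get(c, 0) for c in CRITERIA)
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    weighted = sum(ratings.get(c, 0) * weights.get(c, 0) for c in CRITERIA)
    return weighted / total_weight


# Hypothetical weights for a real-time voice use case.
weights = {"latency": 3, "quality": 2, "cost": 1, "control": 1, "privacy": 1, "integration": 1}

# Hypothetical candidates rated 1 (poor) to 5 (excellent).
candidates = {
    "tool_a": {"latency": 5, "quality": 3, "cost": 3, "control": 2, "privacy": 3, "integration": 4},
    "tool_b": {"latency": 2, "quality": 5, "cost": 2, "control": 4, "privacy": 4, "integration": 3},
}

best = max(candidates, key=lambda name: score(candidates[name], weights))
print(best, round(score(candidates[best], weights), 2))
```

Hard requirements such as data residency are usually better handled as filters before scoring, since a high average cannot compensate for a failed compliance check.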
The VVFL View
Each modality is a feedback loop interface:
| Modality | Captures | Generates | Loop Closure |
|---|---|---|---|
| Voice | Human speech | Synthetic speech | Conversation |
| Vision | Visual world | Understanding | See → Act |
| Image | Text intent | Visual output | Imagine → Create |
| Video | Narrative | Moving images | Story → Film |
| Audio | Musical intent | Sound | Compose → Listen |
| 3D | Spatial intent | Physical models | Design → Build |
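The capture/generate/closure pattern in the table can be sketched as a small interface. This is a hypothetical illustration only: `ModalityLoop` and `run_loop` are invented names, and the capture and generate functions are trivial string stubs standing in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModalityLoop:
    """A modality viewed as a loop: capture a signal, then generate a response."""
    name: str
    capture: Callable[[str], str]
    generate: Callable[[str], str]

    def step(self, signal: str) -> str:
        """One pass through the loop: capture, then generate."""
        return self.generate(self.capture(signal))


def run_loop(loop: ModalityLoop, signal: str, turns: int) -> List[str]:
    """Close the loop by feeding each generated output back in as the next input."""
    history = []
    for _ in range(turns):
        signal = loop.step(signal)
        history.append(signal)
    return history


# Stub voice loop: "capture" normalizes the input, "generate" appends a marker.
voice = ModalityLoop(
    name="Voice",
    capture=lambda s: s.strip().lower(),
    generate=lambda t: t + "!",
)

print(run_loop(voice, " Hello ", turns=2))
```

Each row of the table would supply its own `capture` and `generate` pair; the loop closure column names what the repeated `step` amounts to for that modality.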
Context
- LLM Models — Text modality providers
- Agent Frameworks — Orchestrating modalities
- AI Principles — How it all works