# AI Modalities
What can become what? Every modality is both input and output. The matrix of possibilities — not a list of tools.
## The Modality Matrix
Every cell is a transformation. Filled cells have tools. `?` cells are gaps waiting to be filled.
| Input \ Output | Text | Voice | Image | Video | Audio | 3D | Code |
|---|---|---|---|---|---|---|---|
| Text | LLMs (Claude, GPT) | TTS (ElevenLabs, Cartesia) | Image gen (Midjourney, Flux) | Video gen (Sora, Kling) | Music gen (Suno, Udio) | 3D gen (Tripo, Meshy) | Code gen (Claude, Cursor) |
| Voice | STT (Whisper, Deepgram) | Voice-to-voice (translation) | ? | ? | Voice isolation | ? | Voice coding (Copilot Voice) |
| Image | Vision (GPT-4V, Claude) | Image description → TTS | Style transfer, upscale | Image-to-video (Runway, Kling) | ? | Image-to-3D (Tripo, Rodin) | Screenshot-to-code (v0) |
| Video | Video understanding (Gemini) | Extract dialogue | Frame extraction | Video editing, re-style | Extract audio, score | ? | ? |
| Audio | Transcription, tagging | Stems → voice | Album art gen | Music video | Remix, stem split | ? | ? |
| 3D | Scene description | ? | Rendering | Animation | Spatial audio | 3D-to-3D (retopo) | ? |
| Code | Explanation, docs | ? | UI preview | Demo gen | ? | ? | Refactor, translate |
Seven modalities. 49 cells. 36 filled, 13 marked `?`. Each `?` is an opportunity, or a pipeline waiting to be chained.
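The matrix is also a data structure. A minimal sketch, with the cells as a dict keyed by (input, output) pairs and `None` marking a `?` cell; only a few entries from the table above are shown, the rest follow the same pattern:

```python
# Each cell maps (input, output) to a tool category, or None for a gap.
# Entries mirror the table above; abbreviated for brevity.
MODALITIES = ["text", "voice", "image", "video", "audio", "3d", "code"]

MATRIX = {
    ("text", "image"): "image gen (Midjourney, Flux)",
    ("voice", "text"): "STT (Whisper, Deepgram)",
    ("voice", "image"): None,   # a "?" cell: no direct voice-to-image tool
    ("image", "3d"): "image-to-3D (Tripo, Rodin)",
    ("code", "voice"): None,    # a "?" cell
}

def gaps(matrix):
    """Return the empty cells -- the opportunities."""
    return [cell for cell, tool in matrix.items() if tool is None]

print(gaps(MATRIX))  # [('voice', 'image'), ('code', 'voice')]
```

Enumerating `gaps()` over the full 49-cell dict is the programmatic version of scanning the table for `?`.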
## Convergence
```
2023                  2025                  2027
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│    TEXT     │       │    MULTI    │       │   NATIVE    │
│    ONLY     │  →    │    MODAL    │  →    │    OMNI     │
│             │       │  (chained)  │       │  (unified)  │
└─────────────┘       └─────────────┘       └─────────────┘
Separate models       Models + APIs         Single model
for each task         piped together        any in → any out
```
2023: one model per cell. 2025: chain models to fill cells (Voice → STT → LLM → Image gen). 2027: one model fills the whole matrix natively. The empty cells disappear — not because tools fill them, but because the model IS the matrix.
## Chaining
Today's ? cells are tomorrow's pipelines. Any transformation chains through text as the universal bridge:
```
Voice → [STT]    → Text  → [Image Gen] → Image   (voice-to-image)
Voice → [STT]    → Text  → [Video Gen] → Video   (voice-to-video)
Image → [Vision] → Text  → [TTS]       → Voice   (image narration)
Image → [Vision] → Text  → [Music Gen] → Audio   (image-to-soundtrack)
3D    → [Render] → Image → [Video Gen] → Video   (3D-to-video)
```
Text is the routing layer. Every cross-modal transformation passes through it — until omnimodal models skip the hop.
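The chaining pattern above is plain function composition with text as the intermediate type. A minimal sketch; the transform functions are hypothetical stubs standing in for real API calls (an STT request, an image-gen request), not any particular vendor's SDK:

```python
from functools import reduce

def chain(*steps):
    """Compose transforms left-to-right: each step's output feeds the next."""
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)

# Stub transforms -- swap in real API calls (Whisper, Flux, ...).
def stt(voice):       return f"transcript of {voice}"
def image_gen(text):  return f"image from '{text}'"

voice_to_image = chain(stt, image_gen)  # Voice -> Text -> Image
print(voice_to_image("clip.wav"))
# image from 'transcript of clip.wav'
```

Every `?` cell that touches text on either side can be filled this way today; an omnimodal model collapses `chain(...)` into a single call.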
## Selection Criteria
| Criterion | Question | Why It Matters |
|---|---|---|
| Latency | Real-time or batch? | Voice/video need <200ms |
| Quality | Good enough vs perfect? | Production vs prototype |
| Cost | Per token/second/image? | Scale economics |
| Control | Fine-tuning available? | Brand voice, style |
| Privacy | Data stays where? | Enterprise requirements |
| Integration | API quality? | Developer experience |
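The criteria can be applied as a weighted score per use case. A hedged sketch: the tool names, scores, and weights below are illustrative placeholders, not benchmarks:

```python
# Score candidate tools against the criteria above; higher is better.
CRITERIA = ["latency", "quality", "cost", "control", "privacy", "integration"]

def score(tool_scores, weights):
    """Weighted sum over criteria; missing weights count as zero."""
    return sum(tool_scores[c] * weights.get(c, 0) for c in CRITERIA)

# Example: a real-time voice agent weights latency heavily.
weights = {"latency": 5, "quality": 3, "cost": 2, "integration": 2}
candidates = {  # 1-5 ratings, purely illustrative
    "tool_a": {"latency": 4, "quality": 3, "cost": 4,
               "control": 2, "privacy": 3, "integration": 5},
    "tool_b": {"latency": 2, "quality": 5, "cost": 2,
               "control": 4, "privacy": 4, "integration": 3},
}
best = max(candidates, key=lambda t: score(candidates[t], weights))
print(best)  # tool_a
```

Changing the weights (say, privacy-heavy for an enterprise deployment) flips the winner, which is the point: there is no best tool, only a best tool per weighting.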
## Technique
Tools tell you WHAT to use. Technique tells you HOW.
- Visual Prompting — Structure prompts for image and video generation
## Context
- Matrix Thinking — The method: empty cells are prompts for your subconscious
- Prompts — Technique for every modality
- Voice & Speech — TTS, STT, voice agents, DePIN angle
- JTBD Function Superset — Voice as platform capability (AI-009 to AI-012)
- LLM Models — Text modality providers
- Agent Frameworks — Orchestrating cross-modal pipelines
- AI Principles — How it all works
## Questions
- When the model IS the matrix — any input to any output natively — what happens to the tools that currently fill individual cells?
- Which empty cells in the matrix above have the highest demand but no tooling yet?
- If text is the universal routing layer today, what replaces it when omnimodal models skip the hop?
- Which cross-modal chains matter most for the ventures — and are they chained or gapped?