# AI Modalities
What can become what? Every modality is both input and output. The matrix of possibilities — not a list of tools.
## The Modality Matrix
Every cell is a transformation. Filled cells have tools. `?` cells are gaps waiting to be filled.
| Input \ Output | Text | Voice | Image | Video | Audio | 3D | Code |
|---|---|---|---|---|---|---|---|
| Text | LLMs (Claude, GPT) | TTS (ElevenLabs, Cartesia) | Image gen (Midjourney, Flux) | Video gen (Sora, Kling) | Music gen (Suno, Udio) | 3D gen (Tripo, Meshy) | Code gen (Claude, Cursor) |
| Voice | STT (Whisper, Deepgram) | Voice-to-voice (translation) | ? | ? | Voice isolation | ? | Voice coding (Copilot Voice) |
| Image | Vision (GPT-4V, Claude) | Image description → TTS | Style transfer, upscale | Image-to-video (Runway, Kling) | ? | Image-to-3D (Tripo, Rodin) | Screenshot-to-code (v0) |
| Video | Video understanding (Gemini) | Extract dialogue | Frame extraction | Video editing, re-style | Extract audio, score | ? | ? |
| Audio | Transcription, tagging | Stems → voice | Album art gen | Music video | Remix, stem split | ? | ? |
| 3D | Scene description | ? | Rendering | Animation | Spatial audio | 3D-to-3D (retopo) | ? |
| Code | Explanation, docs | ? | UI preview | Demo gen | ? | ? | Refactor, translate |
Seven modalities. 49 cells. 36 filled, 13 marked `?`. Each `?` is an opportunity, or a pipeline waiting to be chained.
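The matrix is also a data structure. A minimal sketch, with the cells as a dict keyed by (input, output) pairs and `None` marking a `?` cell; only a few entries from the table above are shown, the rest follow the same pattern:

```python
# Each cell maps (input, output) to a tool category, or None for a gap.
# Entries mirror the table above; abbreviated for brevity.
MODALITIES = ["text", "voice", "image", "video", "audio", "3d", "code"]

MATRIX = {
    ("text", "image"): "image gen (Midjourney, Flux)",
    ("voice", "text"): "STT (Whisper, Deepgram)",
    ("voice", "image"): None,   # a "?" cell: no direct voice-to-image tool
    ("image", "3d"): "image-to-3D (Tripo, Rodin)",
    ("code", "voice"): None,    # a "?" cell
}

def gaps(matrix):
    """Return the empty cells -- the opportunities."""
    return [cell for cell, tool in matrix.items() if tool is None]

print(gaps(MATRIX))  # [('voice', 'image'), ('code', 'voice')]
```

Enumerating `gaps()` over the full 49-cell dict is the programmatic version of scanning the table for `?`.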
## Convergence
```
2023                  2025                  2027
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│    TEXT     │       │    MULTI    │       │   NATIVE    │
│    ONLY     │  →    │    MODAL    │  →    │    OMNI     │
│             │       │  (chained)  │       │  (unified)  │
└─────────────┘       └─────────────┘       └─────────────┘
Separate models       Models + APIs         Single model
for each task         piped together        any in → any out
```
2023: one model per cell. 2025: chain models to fill cells (Voice → STT → LLM → Image gen). 2027: one model fills the whole matrix natively. The empty cells disappear — not because tools fill them, but because the model IS the matrix.
## Chaining
Today's ? cells are tomorrow's pipelines. Any transformation chains through text as the universal bridge:
```
Voice → [STT]    → Text  → [Image Gen] → Image   (voice-to-image)
Voice → [STT]    → Text  → [Video Gen] → Video   (voice-to-video)
Image → [Vision] → Text  → [TTS]       → Voice   (image narration)
Image → [Vision] → Text  → [Music Gen] → Audio   (image-to-soundtrack)
3D    → [Render] → Image → [Video Gen] → Video   (3D-to-video)
```
Text is the routing layer. Every cross-modal transformation passes through it — until omnimodal models skip the hop.
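The chaining pattern above is plain function composition with text as the intermediate type. A minimal sketch; the transform functions are hypothetical stubs standing in for real API calls (an STT request, an image-gen request), not any particular vendor's SDK:

```python
from functools import reduce

def chain(*steps):
    """Compose transforms left-to-right: each step's output feeds the next."""
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)

# Stub transforms -- swap in real API calls (Whisper, Flux, ...).
def stt(voice):       return f"transcript of {voice}"
def image_gen(text):  return f"image from '{text}'"

voice_to_image = chain(stt, image_gen)  # Voice -> Text -> Image
print(voice_to_image("clip.wav"))
# image from 'transcript of clip.wav'
```

Every `?` cell that touches text on either side can be filled this way today; an omnimodal model collapses `chain(...)` into a single call.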
## Selection Criteria
| Criterion | Question | Why It Matters |
|---|---|---|
| Latency | Real-time or batch? | Voice/video need <200ms |
| Quality | Good enough vs perfect? | Production vs prototype |
| Cost | Per token/second/image? | Scale economics |
| Control | Fine-tuning available? | Brand voice, style |
| Privacy | Data stays where? | Enterprise requirements |
| Integration | API quality? | Developer experience |
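The criteria can be applied as a weighted score per use case. A hedged sketch: the tool names, scores, and weights below are illustrative placeholders, not benchmarks:

```python
# Score candidate tools against the criteria above; higher is better.
CRITERIA = ["latency", "quality", "cost", "control", "privacy", "integration"]

def score(tool_scores, weights):
    """Weighted sum over criteria; missing weights count as zero."""
    return sum(tool_scores[c] * weights.get(c, 0) for c in CRITERIA)

# Example: a real-time voice agent weights latency heavily.
weights = {"latency": 5, "quality": 3, "cost": 2, "integration": 2}
candidates = {  # 1-5 ratings, purely illustrative
    "tool_a": {"latency": 4, "quality": 3, "cost": 4,
               "control": 2, "privacy": 3, "integration": 5},
    "tool_b": {"latency": 2, "quality": 5, "cost": 2,
               "control": 4, "privacy": 4, "integration": 3},
}
best = max(candidates, key=lambda t: score(candidates[t], weights))
print(best)  # tool_a
```

Changing the weights (say, privacy-heavy for an enterprise deployment) flips the winner, which is the point: there is no best tool, only a best tool per weighting.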
## Technique
Tools tell you WHAT to use. Technique tells you HOW.
- Visual Prompting — Structure prompts for image and video generation
## Context
- Matrix Thinking — The method: empty cells are prompts for your subconscious
- Prompts — Technique for every modality
- Voice & Speech — TTS, STT, voice agents, DePIN angle
- JTBD Function Superset — Voice as platform capability (AI-009 to AI-012)
- LLM Models — Text modality providers
- Agent Frameworks — Orchestrating cross-modal pipelines
- AI Principles — How it all works
## Questions
- When the model IS the matrix — any input to any output natively — what happens to the tools that currently fill individual cells?
- Which empty cells in the matrix above have the highest demand but no tooling yet?
- If text is the universal routing layer today, what replaces it when omnimodal models skip the hop?
- Which cross-modal chains matter most for the ventures — and are they chained or gapped?