AI Modalities

What can become what? Every modality is both input and output. The matrix of possibilities — not a list of tools.

The Modality Matrix

Every cell is a transformation. Filled cells have tools. ? cells are gaps waiting to be filled.

| Input \ Output | Text | Voice | Image | Video | Audio | 3D | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text | LLMs (Claude, GPT) | TTS (ElevenLabs, Cartesia) | Image gen (Midjourney, Flux) | Video gen (Sora, Kling) | Music gen (Suno, Udio) | 3D gen (Tripo, Meshy) | Code gen (Claude, Cursor) |
| Voice | STT (Whisper, Deepgram) | Voice-to-voice (translation) | ? | ? | Voice isolation | ? | Voice coding (Copilot Voice) |
| Image | Vision (GPT-4V, Claude) | Image description → TTS | Style transfer, upscale | Image-to-video (Runway, Kling) | ? | Image-to-3D (Tripo, Rodin) | Screenshot-to-code (v0) |
| Video | Video understanding (Gemini) | Extract dialogue | Frame extraction | Video editing, re-style | Extract audio, score | ? | ? |
| Audio | Transcription, tagging | Stems → voice | Album art gen | Music video | Remix, stem split | ? | ? |
| 3D | Scene description | ? | Rendering | Animation | Spatial audio | 3D-to-3D (retopo) | ? |
| Code | Explanation, docs | ? | UI preview | Demo gen | ? | ? | Refactor, translate |

Seven modalities. 49 cells. 36 filled. 13 gaps. Each ? is either an opportunity or a pipeline waiting to be chained.
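The matrix can be treated as data rather than a picture. A minimal Python sketch, where the filled/gap split is transcribed from the table above (the modality labels and structure are assumptions of this sketch, not an API):

```python
# Hypothetical sketch: the modality matrix as adjacency data.
# Filled cells are transcribed from the table above; tool names omitted.
MODALITIES = ["text", "voice", "image", "video", "audio", "3d", "code"]

# For each input modality, the set of output modalities with existing tooling.
FILLED = {
    "text":  {"text", "voice", "image", "video", "audio", "3d", "code"},
    "voice": {"text", "voice", "audio", "code"},
    "image": {"text", "voice", "image", "video", "3d", "code"},
    "video": {"text", "voice", "image", "video", "audio"},
    "audio": {"text", "voice", "image", "video", "audio"},
    "3d":    {"text", "image", "video", "audio", "3d"},
    "code":  {"text", "image", "video", "code"},
}

# Every (input, output) pair not covered by a filled cell is a gap.
gaps = [(i, o) for i in MODALITIES for o in MODALITIES if o not in FILLED[i]]
filled_count = sum(len(outs) for outs in FILLED.values())

print(filled_count, len(gaps))  # 36 13
```

Encoding the matrix this way makes the gap count a computation instead of a claim, and the same structure feeds directly into chain-finding over the filled cells.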

Convergence

     2023                2025                2027
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    TEXT     │     │    MULTI    │     │   NATIVE    │
│    ONLY     │  →  │    MODAL    │  →  │    OMNI     │
│             │     │  (chained)  │     │  (unified)  │
└─────────────┘     └─────────────┘     └─────────────┘

Separate models     Models + APIs       Single model
for each task       piped together      any in → any out

2023: one model per cell. 2025: chain models to fill cells (Voice → STT → LLM → Image gen). 2027: one model fills the whole matrix natively. The empty cells disappear — not because tools fill them, but because the model IS the matrix.

Chaining

Today's ? cells are tomorrow's pipelines. Any transformation chains through text as the universal bridge:

Voice → [STT] → Text → [Image Gen] → Image      (voice-to-image)
Voice → [STT] → Text → [Video Gen] → Video      (voice-to-video)
Image → [Vision] → Text → [TTS] → Voice         (image narration)
Image → [Vision] → Text → [Music Gen] → Audio   (image-to-soundtrack)
3D → [Render] → Image → [Video Gen] → Video     (3D-to-video)

Text is the routing layer. Every cross-modal transformation passes through it — until omnimodal models skip the hop.

Selection Criteria

| Criterion | Question | Why It Matters |
| --- | --- | --- |
| Latency | Real-time or batch? | Voice/video need <200ms |
| Quality | Good enough vs. perfect? | Production vs. prototype |
| Cost | Per token/second/image? | Scale economics |
| Control | Fine-tuning available? | Brand voice, style |
| Privacy | Data stays where? | Enterprise requirements |
| Integration | API quality? | Developer experience |
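One way to operationalize these criteria is a weighted score per candidate tool. A minimal sketch; the weights and candidate scores below are made-up illustrations for a latency-sensitive voice product, not measured benchmarks:

```python
CRITERIA = ("latency", "quality", "cost", "control", "privacy", "integration")

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over the six criteria; all values are on a 0-1 scale."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[c] * scores[c] for c in CRITERIA)

# A real-time voice use case weights latency heavily (the <200ms requirement).
weights = {"latency": 0.4, "quality": 0.2, "cost": 0.1,
           "control": 0.1, "privacy": 0.1, "integration": 0.1}

# Hypothetical scores for one TTS candidate.
candidate = {"latency": 0.9, "quality": 0.7, "cost": 0.6,
             "control": 0.5, "privacy": 0.8, "integration": 0.9}

print(round(weighted_score(candidate, weights), 2))  # 0.78
```

The useful part is not the number but the forcing function: re-weighting for a batch pipeline (quality and cost up, latency down) usually reorders the candidate list, which is the prototype-versus-production distinction in the table.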

Technique

Tools tell you WHAT to use. Technique tells you HOW.

Questions

When the model IS the matrix — any input to any output natively — what happens to the tools that currently fill individual cells?

  • Which empty cells in the matrix above have the highest demand but no tooling yet?
  • If text is the universal routing layer today, what replaces it when omnimodal models skip the hop?
  • Which cross-modal chains matter most for the ventures — and are they chained or gapped?