AI Modalities

What AI can do. The interfaces between intelligence and the world.

The Modality Matrix

Modality    Input          Output           Key Players                  Stage
Text        Text           Text             OpenAI, Anthropic, Google    Mature
Voice       Text/Audio     Speech           ElevenLabs, OpenAI, Qwen     Maturing
Vision      Image/Video    Text/Analysis    GPT-4V, Claude, Gemini       Maturing
Image Gen   Text           Image            Midjourney, DALL-E, Flux     Mature
Video       Text/Image     Video            Sora, Runway, Kling          Early
Audio       Text           Music/Sound      Suno, Udio, Stable Audio     Early
3D          Text/Image     3D Models        Tripo, Meshy, Rodin          Emerging
Code        Text           Code             Claude, Cursor, Copilot      Mature
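
The matrix is also a small data structure. Below is a minimal sketch (the Modality dataclass and its field names are illustrative, not an existing API) that encodes the rows above so they can be queried programmatically:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Modality:
    name: str        # e.g. "Voice"
    inputs: tuple    # accepted input types
    outputs: tuple   # produced output types
    players: tuple   # representative tools/models
    stage: str       # "Mature", "Maturing", "Early", "Emerging"

# The matrix above, encoded for filtering and lookup.
MATRIX = [
    Modality("Text", ("text",), ("text",), ("OpenAI", "Anthropic", "Google"), "Mature"),
    Modality("Voice", ("text", "audio"), ("speech",), ("ElevenLabs", "OpenAI", "Qwen"), "Maturing"),
    Modality("Vision", ("image", "video"), ("text", "analysis"), ("GPT-4V", "Claude", "Gemini"), "Maturing"),
    Modality("Image Gen", ("text",), ("image",), ("Midjourney", "DALL-E", "Flux"), "Mature"),
    Modality("Video", ("text", "image"), ("video",), ("Sora", "Runway", "Kling"), "Early"),
    Modality("Audio", ("text",), ("music", "sound"), ("Suno", "Udio", "Stable Audio"), "Early"),
    Modality("3D", ("text", "image"), ("3d_model",), ("Tripo", "Meshy", "Rodin"), "Emerging"),
    Modality("Code", ("text",), ("code",), ("Claude", "Cursor", "Copilot"), "Mature"),
]

# Example query: which modalities are production-ready today?
mature = [m.name for m in MATRIX if m.stage == "Mature"]
print(mature)  # ['Text', 'Image Gen', 'Code']
```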

Modality Convergence

The trend: multimodal models that handle all inputs/outputs natively.

     2023                 2025                 2027
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    TEXT     │      │    MULTI    │      │   NATIVE    │
│    ONLY     │  →   │    MODAL    │  →   │    OMNI     │
│             │      │  (chained)  │      │  (unified)  │
└─────────────┘      └─────────────┘      └─────────────┘

Separate models       Models + APIs         Single model
 for each task       piped together       any in → any out

The 2027 thesis: Native omnimodal models eliminate the need for specialized modality APIs. Input anything, output anything.
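
To make the chained-vs-unified distinction concrete, here is a minimal Python sketch. The function names are stand-ins, not a real SDK: the chained version pipes separate ASR, LLM, and TTS models through text hand-offs, while the hypothetical omnimodal version is a single call.

```python
# Illustrative sketch only: the functions below are stand-ins, not a real SDK.

def speech_to_text(audio: bytes) -> str:      # stand-in for an ASR model call
    return "what's the weather?"

def llm_complete(prompt: str) -> str:         # stand-in for a text-only LLM call
    return "Sunny, 22°C."

def text_to_speech(text: str) -> bytes:       # stand-in for a TTS model call
    return text.encode()

# 2023–2025 pattern: separate models chained through text hand-offs.
# Each hop adds latency and can drop information (tone, prosody, timing).
def voice_assistant_chained(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)
    reply = llm_complete(transcript)
    return text_to_speech(reply)

# 2027 thesis: one native omnimodal model, any input to any output,
# with no intermediate text representation between modalities.
def voice_assistant_omni(audio_in: bytes, omni_model) -> bytes:
    return omni_model.generate(inputs=[audio_in], output_modality="audio")

print(voice_assistant_chained(b"<audio frames>"))
```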


Selection Criteria

When choosing modality tools:

Criterion     Question                   Why It Matters
Latency       Real-time or batch?        Voice/video need <200ms
Quality       Good enough vs perfect?    Production vs prototype
Cost          Per token/second/image?    Scale economics
Control       Fine-tuning available?     Brand voice, style
Privacy       Data stays where?          Enterprise requirements
Integration   API quality?               Developer experience
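
One way to apply these criteria is a simple weighted score per candidate tool, with weights set by the use case: a real-time voice product weights latency and integration heavily, while a batch image pipeline would weight cost and quality instead. The numbers below are made up to show the shape of the calculation, not to rank real products.

```python
# Illustrative only: weights and per-tool scores are invented for the example.

CRITERIA = ("latency", "quality", "cost", "control", "privacy", "integration")

def score(tool_scores: dict, weights: dict) -> float:
    """Weighted sum over the six criteria; scores and weights are on a 0-1 scale."""
    return sum(weights[c] * tool_scores.get(c, 0.0) for c in CRITERIA)

# Weights for a real-time voice product (sum to 1.0).
voice_weights = {"latency": 0.30, "quality": 0.20, "cost": 0.10,
                 "control": 0.10, "privacy": 0.10, "integration": 0.20}

candidate_a = {"latency": 0.9, "quality": 0.7, "cost": 0.6,
               "control": 0.4, "privacy": 0.5, "integration": 0.8}
candidate_b = {"latency": 0.5, "quality": 0.9, "cost": 0.8,
               "control": 0.8, "privacy": 0.9, "integration": 0.6}

print(score(candidate_a, voice_weights))  # ≈ 0.72 — wins on latency
print(score(candidate_b, voice_weights))  # ≈ 0.70 — stronger on quality/privacy
```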

The VVFL View

Each modality is a feedback loop interface:

Modality   Captures         Generates          Loop Closure
Voice      Human speech     Synthetic speech   Conversation
Vision     Visual world     Understanding      See → Act
Image      Text intent      Visual output      Imagine → Create
Video      Narrative        Moving images      Story → Film
Audio      Musical intent   Sound              Compose → Listen
3D         Spatial intent   Physical models    Design → Build
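
A minimal sketch of that loop shape in code, assuming a generic capture → generate → closure interface (illustrative only, not an existing framework):

```python
from abc import ABC, abstractmethod

class ModalityLoop(ABC):
    """Each modality as a feedback loop: capture a signal from the world,
    generate a response in kind, and close the loop when a human reacts."""

    @abstractmethod
    def capture(self, signal):            # e.g. record speech, ingest an image
        ...

    @abstractmethod
    def generate(self, representation):   # e.g. synthesize speech, render an image
        ...

    def close_loop(self, signal):
        # The "Loop Closure" column above: conversation, see → act, imagine → create.
        return self.generate(self.capture(signal))

class VoiceLoop(ModalityLoop):
    def capture(self, signal):
        return f"transcript of {signal!r}"            # stand-in for ASR

    def generate(self, representation):
        return f"spoken reply to {representation!r}"  # stand-in for TTS

print(VoiceLoop().close_loop("hello"))
```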

Context