AI Modalities

What AI can do. The interfaces between intelligence and the world.

The Modality Matrix

Modality    Input          Output           Key Players                  Stage
Text        Text           Text             OpenAI, Anthropic, Google    Mature
Voice       Text/Audio     Speech           ElevenLabs, OpenAI, Qwen     Maturing
Vision      Image/Video    Text/Analysis    GPT-4V, Claude, Gemini       Maturing
Image Gen   Text           Image            Midjourney, DALL-E, Flux     Mature
Video       Text/Image     Video            Sora, Runway, Kling          Early
Audio       Text           Music/Sound      Suno, Udio, Stable Audio     Early
3D          Text/Image     3D Models        Tripo, Meshy, Rodin          Emerging
Code        Text           Code             Claude, Cursor, Copilot      Mature
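
The matrix is also a small data structure. Below is a minimal sketch (the Modality dataclass and its field names are illustrative, not an existing API) that encodes the rows above so they can be queried programmatically:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Modality:
    name: str        # e.g. "Voice"
    inputs: tuple    # accepted input types
    outputs: tuple   # produced output types
    players: tuple   # representative tools/models
    stage: str       # "Mature", "Maturing", "Early", "Emerging"

# The matrix above, encoded for filtering and lookup.
MATRIX = [
    Modality("Text", ("text",), ("text",), ("OpenAI", "Anthropic", "Google"), "Mature"),
    Modality("Voice", ("text", "audio"), ("speech",), ("ElevenLabs", "OpenAI", "Qwen"), "Maturing"),
    Modality("Vision", ("image", "video"), ("text", "analysis"), ("GPT-4V", "Claude", "Gemini"), "Maturing"),
    Modality("Image Gen", ("text",), ("image",), ("Midjourney", "DALL-E", "Flux"), "Mature"),
    Modality("Video", ("text", "image"), ("video",), ("Sora", "Runway", "Kling"), "Early"),
    Modality("Audio", ("text",), ("music", "sound"), ("Suno", "Udio", "Stable Audio"), "Early"),
    Modality("3D", ("text", "image"), ("3d_model",), ("Tripo", "Meshy", "Rodin"), "Emerging"),
    Modality("Code", ("text",), ("code",), ("Claude", "Cursor", "Copilot"), "Mature"),
]

# Example query: which modalities are production-ready today?
mature = [m.name for m in MATRIX if m.stage == "Mature"]
print(mature)  # ['Text', 'Image Gen', 'Code']
```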

Modality Convergence

The trend: multimodal models that handle all inputs/outputs natively.

     2023                 2025                 2027
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│    TEXT     │      │    MULTI    │      │   NATIVE    │
│    ONLY     │  →   │    MODAL    │  →   │    OMNI     │
│             │      │  (chained)  │      │  (unified)  │
└─────────────┘      └─────────────┘      └─────────────┘

Separate models       Models + APIs         Single model
 for each task       piped together       any in → any out

The 2027 thesis: Native omnimodal models eliminate the need for specialized modality APIs. Input anything, output anything.
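
To make the chained-vs-unified distinction concrete, here is a minimal Python sketch. The function names are stand-ins, not a real SDK: the chained version pipes separate ASR, LLM, and TTS models through text hand-offs, while the hypothetical omnimodal version is a single call.

```python
# Illustrative sketch only: the functions below are stand-ins, not a real SDK.

def speech_to_text(audio: bytes) -> str:      # stand-in for an ASR model call
    return "what's the weather?"

def llm_complete(prompt: str) -> str:         # stand-in for a text-only LLM call
    return "Sunny, 22°C."

def text_to_speech(text: str) -> bytes:       # stand-in for a TTS model call
    return text.encode()

# 2023–2025 pattern: separate models chained through text hand-offs.
# Each hop adds latency and can drop information (tone, prosody, timing).
def voice_assistant_chained(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)
    reply = llm_complete(transcript)
    return text_to_speech(reply)

# 2027 thesis: one native omnimodal model, any input to any output,
# with no intermediate text representation between modalities.
def voice_assistant_omni(audio_in: bytes, omni_model) -> bytes:
    return omni_model.generate(inputs=[audio_in], output_modality="audio")

print(voice_assistant_chained(b"<audio frames>"))
```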


Selection Criteria

When choosing modality tools:

Criterion     Question                   Why It Matters
Latency       Real-time or batch?        Voice/video need <200ms
Quality       Good enough vs perfect?    Production vs prototype
Cost          Per token/second/image?    Scale economics
Control       Fine-tuning available?     Brand voice, style
Privacy       Data stays where?          Enterprise requirements
Integration   API quality?               Developer experience
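
One way to apply these criteria is a simple weighted score per candidate tool, with weights set by the use case: a real-time voice product weights latency and integration heavily, while a batch image pipeline would weight cost and quality instead. The numbers below are made up to show the shape of the calculation, not to rank real products.

```python
# Illustrative only: weights and per-tool scores are invented for the example.

CRITERIA = ("latency", "quality", "cost", "control", "privacy", "integration")

def score(tool_scores: dict, weights: dict) -> float:
    """Weighted sum over the six criteria; scores and weights are on a 0-1 scale."""
    return sum(weights[c] * tool_scores.get(c, 0.0) for c in CRITERIA)

# Weights for a real-time voice product (sum to 1.0).
voice_weights = {"latency": 0.30, "quality": 0.20, "cost": 0.10,
                 "control": 0.10, "privacy": 0.10, "integration": 0.20}

candidate_a = {"latency": 0.9, "quality": 0.7, "cost": 0.6,
               "control": 0.4, "privacy": 0.5, "integration": 0.8}
candidate_b = {"latency": 0.5, "quality": 0.9, "cost": 0.8,
               "control": 0.8, "privacy": 0.9, "integration": 0.6}

print(score(candidate_a, voice_weights))  # ≈ 0.72 — wins on latency
print(score(candidate_b, voice_weights))  # ≈ 0.70 — stronger on quality/privacy
```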

The VVFL View

Each modality is a feedback loop interface:

Modality   Captures         Generates          Loop Closure
Voice      Human speech     Synthetic speech   Conversation
Vision     Visual world     Understanding      See → Act
Image      Text intent      Visual output      Imagine → Create
Video      Narrative        Moving images      Story → Film
Audio      Musical intent   Sound              Compose → Listen
3D         Spatial intent   Physical models    Design → Build
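
A minimal sketch of that loop shape in code, assuming a generic capture → generate → closure interface (illustrative only, not an existing framework):

```python
from abc import ABC, abstractmethod

class ModalityLoop(ABC):
    """Each modality as a feedback loop: capture a signal from the world,
    generate a response in kind, and close the loop when a human reacts."""

    @abstractmethod
    def capture(self, signal):            # e.g. record speech, ingest an image
        ...

    @abstractmethod
    def generate(self, representation):   # e.g. synthesize speech, render an image
        ...

    def close_loop(self, signal):
        # The "Loop Closure" column above: conversation, see → act, imagine → create.
        return self.generate(self.capture(signal))

class VoiceLoop(ModalityLoop):
    def capture(self, signal):
        return f"transcript of {signal!r}"            # stand-in for ASR

    def generate(self, representation):
        return f"spoken reply to {representation!r}"  # stand-in for TTS

print(VoiceLoop().close_loop("hello"))
```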

Context