Vision Language Models
Key to robotics
- VLM Use Cases
- Vision Transformers
- OpenAI's CLIP Model (contrastive objective sketched after this list)
- DeepMind's Flamingo
- Instruction Tuning with LLaVA
- MMMU Benchmark
- Pre-training with Qwen-VL
- InternVL Model Series
- Cross-Attention vs. Self-Attention (compared in the sketch after this list)
- Hybrid Architectures
- Early vs. Late Fusion
- VQA and DocVQA Benchmarks
- The BLINK Benchmark
- Generative Pre-training
- Multimodal Generation
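
To make the CLIP item concrete, the sketch below walks through the contrastive image-text alignment that CLIP popularized: both modalities are projected into a shared embedding space, L2-normalized, and scored so that matched image-caption pairs beat mismatched ones under a symmetric cross-entropy. The encoder outputs, projection matrices, dimensions, and temperature here are illustrative stand-ins, not CLIP's actual weights or sizes.

```python
# Minimal NumPy sketch of CLIP-style contrastive alignment (illustrative only).
# Encoders are stubbed with random features; in the real model they are a
# vision transformer and a text transformer.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, d_img, d_txt, d_joint = 4, 768, 512, 256

# Stand-ins for encoder outputs on a batch of matched (image, caption) pairs.
image_features = rng.normal(size=(batch, d_img))
text_features = rng.normal(size=(batch, d_txt))

# Learned projections into the shared embedding space (random here).
W_img = rng.normal(size=(d_img, d_joint)) / np.sqrt(d_img)
W_txt = rng.normal(size=(d_txt, d_joint)) / np.sqrt(d_txt)

img_emb = l2_normalize(image_features @ W_img)
txt_emb = l2_normalize(text_features @ W_txt)

# Cosine-similarity logits, scaled by a temperature (0.07 is a common choice).
temperature = 0.07
logits = img_emb @ txt_emb.T / temperature

# Symmetric cross-entropy: each image should match its own caption and vice versa.
labels = np.arange(batch)
loss_i2t = -np.log(softmax(logits, axis=1)[labels, labels]).mean()
loss_t2i = -np.log(softmax(logits.T, axis=1)[labels, labels]).mean()
loss = (loss_i2t + loss_t2i) / 2
print(f"contrastive loss: {loss:.3f}")
```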
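
The cross-attention versus self-attention item, and the early versus late fusion item, can likewise be illustrated in a few lines. The sketch below, using invented token counts and dimensions, contrasts the two wiring patterns: concatenating image and text tokens so everything attends to everything (the self-attention, early-fusion style), versus letting text queries read image keys and values (the cross-attention style used by models such as Flamingo).

```python
# Minimal NumPy sketch contrasting the two fusion patterns named in the list.
# Shapes and dimensions are illustrative, not those of any specific model.
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 64
text_tokens = rng.normal(size=(8, d))    # e.g. a tokenized instruction
image_tokens = rng.normal(size=(16, d))  # e.g. patch embeddings from a ViT

# Self-attention fusion (early fusion): image and text tokens share one
# sequence and every token attends to every other token.
joint = np.concatenate([image_tokens, text_tokens], axis=0)
fused_self = attention(joint, joint, joint)

# Cross-attention fusion: text tokens are queries, image tokens provide keys
# and values, so the language stream "reads" the visual stream.
fused_cross = attention(text_tokens, image_tokens, image_tokens)

print(fused_self.shape)   # (24, 64): one fused joint sequence
print(fused_cross.shape)  # (8, 64): text sequence enriched with visual context
```

In practice this wiring choice affects how much of the language model must be retrained and how visual context scales as more images are added, which is one reason the fusion decision tends to carry the kind of switching cost raised in the first question below.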
Questions
- Which engineering decision related to this topic has the highest switching cost once made, and how do you make it well with incomplete information?
- At what scale or complexity level does the right answer to this topic change significantly?
- How does the introduction of AI-native workflows change the conventional wisdom about this technology?
- Which anti-pattern in this area is most commonly introduced by developers who know enough to be dangerous but not enough to know what they don't know?