Vision Language Models
Key to robotics
- VLM Use Cases
- Vision Transformers
- OpenAI's CLIP Model (contrastive objective sketched after this list)
- DeepMind's Flamingo
- Instruction Tuning with LLaVA
- MMMU Benchmark
- Pre-training with Qwen-VL
- InternVL Model Series
- Cross-Attention vs. Self-Attention (compared in the sketch after this list)
- Hybrid Architectures
- Early vs. Late Fusion
- VQA and DocVQA Benchmarks
- The BLINK Benchmark
- Generative Pre-training
- Multimodal Generation
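
To make the CLIP item concrete, the sketch below walks through the contrastive image-text alignment that CLIP popularized: both modalities are projected into a shared embedding space, L2-normalized, and scored so that matched image-caption pairs beat mismatched ones under a symmetric cross-entropy. The encoder outputs, projection matrices, dimensions, and temperature here are illustrative stand-ins, not CLIP's actual weights or sizes.

```python
# Minimal NumPy sketch of CLIP-style contrastive alignment (illustrative only).
# Encoders are stubbed with random features; in the real model they are a
# vision transformer and a text transformer.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, d_img, d_txt, d_joint = 4, 768, 512, 256

# Stand-ins for encoder outputs on a batch of matched (image, caption) pairs.
image_features = rng.normal(size=(batch, d_img))
text_features = rng.normal(size=(batch, d_txt))

# Learned projections into the shared embedding space (random here).
W_img = rng.normal(size=(d_img, d_joint)) / np.sqrt(d_img)
W_txt = rng.normal(size=(d_txt, d_joint)) / np.sqrt(d_txt)

img_emb = l2_normalize(image_features @ W_img)
txt_emb = l2_normalize(text_features @ W_txt)

# Cosine-similarity logits, scaled by a temperature (0.07 is a common choice).
temperature = 0.07
logits = img_emb @ txt_emb.T / temperature

# Symmetric cross-entropy: each image should match its own caption and vice versa.
labels = np.arange(batch)
loss_i2t = -np.log(softmax(logits, axis=1)[labels, labels]).mean()
loss_t2i = -np.log(softmax(logits.T, axis=1)[labels, labels]).mean()
loss = (loss_i2t + loss_t2i) / 2
print(f"contrastive loss: {loss:.3f}")
```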
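
The cross-attention versus self-attention item, and the early versus late fusion item, can likewise be illustrated in a few lines. The sketch below, using invented token counts and dimensions, contrasts the two wiring patterns: concatenating image and text tokens so everything attends to everything (the self-attention, early-fusion style), versus letting text queries read image keys and values (the cross-attention style used by models such as Flamingo).

```python
# Minimal NumPy sketch contrasting the two fusion patterns named in the list.
# Shapes and dimensions are illustrative, not those of any specific model.
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 64
text_tokens = rng.normal(size=(8, d))    # e.g. a tokenized instruction
image_tokens = rng.normal(size=(16, d))  # e.g. patch embeddings from a ViT

# Self-attention fusion (early fusion): image and text tokens share one
# sequence and every token attends to every other token.
joint = np.concatenate([image_tokens, text_tokens], axis=0)
fused_self = attention(joint, joint, joint)

# Cross-attention fusion: text tokens are queries, image tokens provide keys
# and values, so the language stream "reads" the visual stream.
fused_cross = attention(text_tokens, image_tokens, image_tokens)

print(fused_self.shape)   # (24, 64): one fused joint sequence
print(fused_cross.shape)  # (8, 64): text sequence enriched with visual context
```

In practice this wiring choice affects how much of the language model must be retrained and how visual context scales as more images are added, which is one reason the fusion decision tends to carry the kind of switching cost raised in the first question below.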
Questions
- Which engineering decision related to this topic has the highest switching cost once made, and how do you make it well with incomplete information?
- At what scale or complexity level does the right answer to this topic change significantly?
- How does the introduction of AI-native workflows change the conventional wisdom about this technology?
- Which anti-pattern in this area is most commonly introduced by developers who know enough to be dangerous but not enough to know what they don't know?