Vision Language Models

Key to robotics

  • VLM Use Cases
  • Vision Transformers
  • OpenAI's CLIP Model
  • DeepMind's Flamingo
  • Instruction Tuning with LLaVA
  • MMMU Benchmark
  • Pre-training with Qwen-VL
  • InternVL Model Series
  • Cross-Attention vs. Self-Attention
  • Hybrid Architectures
  • Early vs. Late Fusion
  • VQA and DocVQA Benchmarks
  • The BLINK Benchmark
  • Generative Pre-training
  • Multimodal Generation

Questions

  • Which engineering decision related to this topic has the highest switching cost once made, and how do you make it well with incomplete information?
  • At what scale or complexity level does the right answer to this topic change significantly?
  • How does the introduction of AI-native workflows change the conventional wisdom about this technology?
  • Which anti-pattern in this area is most commonly introduced by developers who know enough to be dangerous but not enough to know what they don't know?