Vision Language Models

Key to robotics

  • VLM Use Cases
  • Vision Transformers
  • OpenAI's CLIP Model
  • DeepMind's Flamingo
  • Instruction Tuning with LLaVA
  • MMMU Benchmark
  • Pre-training with Qwen-VL
  • InternVL Model Series
  • Cross-Attention vs. Self-Attention
  • Hybrid Architectures
  • Early vs. Late Fusion
  • VQA and DocVQA Benchmarks
  • The BLINK Benchmark
  • Generative Pre-training
  • Multimodal Generation