Vision Language Models
Key to robotics
- VLM Use Cases
- Vision Transformers
- OpenAI's CLIP Model (see the sketch after this list)
- DeepMind's Flamingo
- Instruction Tuning with LLaVA
- MMMU Benchmark
- Pre-training with Qwen-VL
- InternVL Model Series
- Cross-Attention vs. Self-Attention
- Hybrid Architectures
- Early vs. Late Fusion
- VQA and DocVQA Benchmarks
- The BLINK Benchmark
- Generative Pre-training
- Multimodal Generation
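As a concrete taste of the CLIP item above, here is a minimal zero-shot classification sketch. It assumes the Hugging Face `transformers` library, the public `openai/clip-vit-base-patch32` checkpoint, and an example COCO image URL; none of these are specified in the outline, they are just one common way to run CLIP.

```python
# Minimal CLIP zero-shot classification sketch (assumes the Hugging Face
# `transformers` library and the "openai/clip-vit-base-patch32" checkpoint).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image (a standard COCO validation photo of two cats).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions; CLIP scores each text against the image embedding.
texts = ["a photo of a cat", "a photo of a dog", "a photo of a robot arm"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, softmaxed into zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0]):
    print(f"{p:.3f}  {text}")
```

The same contrastively trained image and text encoders are the starting point for several of the later topics (Flamingo-style cross-attention, LLaVA-style instruction tuning), which is why CLIP appears early in the list.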