Flamingo
text+image+video
research
Flamingo (NeurIPS 2022) is DeepMind's visual-language model that processes interleaved images, videos, and text together. It bridges powerful pretrained vision-only and language-only models, keeping both frozen and connecting them with newly trained cross-attention layers.
Version: 1.0
Released: April 29, 2022
Architecture
- parameters: 80B (80,000,000,000)
- context_length: 2048
- training_data: Large-scale multimodal web data (interleaved images/videos and text) with cross-attention layers
- inference: few-shot visual-language generation
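The cross-attention layers mentioned above inject visual features into the frozen language model via tanh gating, so the LM's original behavior is preserved at initialization. Below is a minimal single-head sketch of that idea; the function name, shapes, and the single-head simplification are illustrative assumptions, not the model's actual multi-head implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, visual, Wq, Wk, Wv, gate):
    """One tanh-gated cross-attention step: text tokens attend to visual tokens.

    text:   (T, d) frozen-LM token features
    visual: (V, d) visual features (e.g. from a vision encoder)
    gate:   scalar; initialized to 0 so tanh(gate) = 0 and the layer is an
            identity, leaving the frozen LM's outputs unchanged at the start.
    """
    q = text @ Wq
    k = visual @ Wk
    v = visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return text + np.tanh(gate) * (attn @ v)
```

With `gate=0.0` the output equals the text input exactly; as training moves the gate away from zero, visual information is gradually blended in.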
Capabilities
- Few-shot image and video understanding
- Accepts interleaved images or video frames and text to perform tasks like visual question answering and captioning
- State-of-the-art few-shot results across many vision-language benchmarks
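The few-shot capability above works by interleaving support examples (image plus answer) with a final query image that the model completes. A toy sketch of how such a prompt might be assembled; the `<image>` tag and `Output:` format are illustrative placeholders, not Flamingo's actual tokens:

```python
def build_few_shot_prompt(shots):
    """Assemble an interleaved few-shot prompt.

    shots: list of answer strings, one per support image (each "<image>"
    placeholder stands in for the visual features of one example).
    The final "<image> Output:" is the query the model completes.
    """
    parts = [f"<image> Output: {answer}" for answer in shots]
    parts.append("<image> Output:")  # query image; answer left for the model
    return "\n".join(parts)
```

For example, `build_few_shot_prompt(["a cat", "a dog"])` yields three interleaved image slots, with the last answer left blank for the model to generate.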
Benchmarks
- VQA: SOTA (few-shot)
- Image captioning: SOTA (few-shot)
Safety
- Prone to visual biases
- No public alignment details available
Deployment
- regions: private
- hosting: No public API
- integrations: used in research only
Tags
vision-language, multimodal, few-shot, research