Flamingo

Flamingo (NeurIPS 2022) is DeepMind's visual-language model that processes images, videos, and text together. It bridges...
Version: 1.0
Released: 04/29/2022

Architecture

  • parameters: 80 billion (80B)
  • context_length: 2048
  • training_data: Large-scale multimodal web data (interleaved images/videos and text)
  • design: frozen pretrained vision encoder and language model bridged by cross-attention layers
  • inference: few-shot visual-language generation from interleaved prompts
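
The cross-attention bridging mentioned above can be sketched as follows. This is a minimal, illustrative single-head sketch in NumPy, not DeepMind's implementation: the real model uses multi-head attention, a Perceiver Resampler, and feed-forward sublayers, but the key idea shown here is real — visual features enter the frozen language model through gated cross-attention whose tanh gate starts at zero, so training begins from the unmodified language model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, visual, Wq, Wk, Wv, alpha):
    """One gated cross-attention step (illustrative sketch).

    text:   (T, d) language-model hidden states (queries)
    visual: (V, d) visual features from the vision encoder (keys/values)
    alpha:  learnable gate; initialised to 0 so tanh(alpha) == 0 and the
            frozen language model is unchanged at the start of training
    """
    q, k, v = text @ Wq, visual @ Wk, visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return text + np.tanh(alpha) * (attn @ v)  # gated residual connection

rng = np.random.default_rng(0)
d, T, V = 16, 4, 8
text = rng.standard_normal((T, d))     # stand-in LM hidden states
visual = rng.standard_normal((V, d))   # stand-in visual features
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = gated_cross_attention(text, visual, Wq, Wk, Wv, alpha=0.0)
```

With `alpha=0.0` the output equals the text input exactly, which is why the frozen language model's behaviour is preserved at initialisation.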

Capabilities

  • Few-shot image and video understanding
  • Accepts interleaved images or video frames and text to perform tasks such as visual question answering and captioning
  • State-of-the-art few-shot results
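
The few-shot interface works by interleaving support examples with a final query in a single prompt. The helper below is a hypothetical sketch of that format: the `<image>` marker stands in for visual features (the images themselves are fed through the vision encoder separately), and the `"Output:"` prefix and `build_few_shot_prompt` name are illustrative assumptions, not the model's actual API.

```python
def build_few_shot_prompt(shot_answers):
    """Build an interleaved few-shot prompt (illustrative sketch).

    shot_answers: target texts for the support images, in order;
    each "<image>" marker is paired with one image passed to the
    vision encoder separately.
    """
    body = "".join(f"<image>Output: {ans}\n" for ans in shot_answers)
    # Final <image> is the query; the model generates its continuation.
    return body + "<image>Output:"

prompt = build_few_shot_prompt(
    ["A cat sleeping on a sofa.", "A dog catching a frisbee."]
)
```

For two support examples the prompt contains three `<image>` markers, and generation continues from the trailing `Output:`.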

Benchmarks

  • VQA: SOTA (few-shot)
  • Image captioning: SOTA (few-shot)

Safety

  • Prone to visual biases
  • No public alignment details available

Deployment

  • regions: private
  • hosting: No public API
  • integrations: research use only

Tags

vision-language, multimodal, few-shot, research
