CLIP

multimodal
open-weight
OpenAI CLIP (2021) is a multimodal model trained to associate images with text captions. It was trained on 400 million image-caption pairs collected from the internet.
Version: 1.0
Released: 02/05/2021

Architecture

  • encoders: ResNet-50 or Vision Transformer (ViT) image encoder paired with a Transformer text encoder
  • context_length: Image: 224×224 pixels; Text: ≤77 tokens
  • training_data: 400 million image-caption pairs from the internet
  • inference: Dual-encoder contrastive model; images and texts are embedded separately and compared by cosine similarity (see the sketch after this list)
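As a rough sketch of the dual-encoder scoring step, the snippet below normalizes image and text embeddings and computes scaled cosine similarities, following the contrastive setup described above. The tensor shapes and the logit scale of 100 are illustrative assumptions, not values read from a released checkpoint.

    import torch
    import torch.nn.functional as F

    def clip_similarity(image_features: torch.Tensor,
                        text_features: torch.Tensor,
                        logit_scale: float = 100.0) -> torch.Tensor:
        """Scaled cosine similarity between every image and every caption."""
        # L2-normalize so the dot product equals cosine similarity
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # rows: images, columns: candidate captions
        return logit_scale * image_features @ text_features.t()

    # toy example: 2 image embeddings, 3 caption embeddings, 512-dim shared space
    image_features = torch.randn(2, 512)
    text_features = torch.randn(3, 512)
    probs = clip_similarity(image_features, text_features).softmax(dim=-1)
    print(probs)  # each row sums to 1: how well each caption matches that image

Taking a softmax over the caption axis turns these similarities into the zero-shot classification probabilities used in the capabilities below.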

Capabilities

  • Zero-shot image classification and retrieval using text prompts
  • Understands visual concepts through natural language
  • Can rank images against textual descriptions without fine-tuning (see the zero-shot classification example after this list)
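A minimal zero-shot classification sketch, assuming the Hugging Face transformers implementation of CLIP and the openai/clip-vit-base-patch32 checkpoint; the image path and candidate labels are placeholders:

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("example.jpg")  # placeholder path
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # logits_per_image holds image-to-caption similarities; softmax gives class probabilities
    probs = outputs.logits_per_image.softmax(dim=-1)
    for label, p in zip(labels, probs[0].tolist()):
        print(f"{label}: {p:.3f}")

Wrapping class names in a prompt template such as "a photo of a {label}" generally improves zero-shot accuracy, as noted in the original paper.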

Benchmarks

  • ImageNet Zero-shot: ~76% top-1 accuracy (reported in original paper)

Safety

  • Open model
  • Users should be aware of potential biases in the web-scraped training data that can affect outputs

Deployment

  • regions: not applicable (self-hosted; weights are downloaded and run locally)
  • hosting: HuggingFace, GitHub
  • integrations: used in a range of vision-language applications (see the loading example after this list)
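For self-hosted use, a minimal text-to-image retrieval sketch assuming the openai/CLIP GitHub package (installed via pip install git+https://github.com/openai/CLIP.git); the image file names are placeholders:

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # embed one query caption and a small gallery of images into the shared space
    texts = clip.tokenize(["a photo of a red bicycle"]).to(device)
    images = torch.stack([preprocess(Image.open(p)) for p in ["a.jpg", "b.jpg"]]).to(device)

    with torch.no_grad():
        text_emb = model.encode_text(texts)
        image_emb = model.encode_image(images)

    # normalize, then rank gallery images by cosine similarity to the query
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(0)
    print(scores.argmax().item())  # index of the best-matching image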

Tags

vision-language, contrastive learning, open-source, zero-shot
