CLIP
multimodal
open-weight
OpenAI CLIP (2021) is a multimodal model trained to associate images with text captions. It was trained on 400 million image-caption pairs collected from the internet.
Version: 1.0
Released: 02/05/2021
Architecture
- encoders: ResNet-50 or ViT image encoder; Transformer text encoder; jointly pre-trained on 400M image-text pairs
- context_length: Image: 224×224 pixels; Text: ≤77 tokens
- training_data: 400 million image-caption pairs from the internet
- inference: Dual-encoder contrastive model
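The dual-encoder setup trains the image and text encoders so that matched image-caption pairs get high cosine similarity and mismatched pairs get low similarity. Below is a minimal sketch of that symmetric contrastive objective; the embeddings, batch size, and temperature value are illustrative stand-ins, not the model's actual tensors or hyperparameters.

```python
# Minimal sketch of a CLIP-style symmetric contrastive objective.
# Random embeddings stand in for real encoder outputs; shapes and the
# temperature value are illustrative assumptions.
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text cosine-similarity matrix."""
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(logits))                 # i-th image matches i-th caption

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    loss = clip_contrastive_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512)))
    print(f"contrastive loss: {loss:.3f}")
```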
Capabilities
- Zero-shot image classification and retrieval using text prompts (see the example after this list)
- Understands visual concepts through natural language
- Can rank images based on textual descriptions without fine-tuning
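As an illustration of zero-shot classification, the sketch below uses the Hugging Face transformers API with the publicly hosted openai/clip-vit-base-patch32 checkpoint; the image path and candidate labels are placeholders.

```python
# Zero-shot classification sketch via Hugging Face transformers; the image
# path and label prompts are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity scores -> probabilities

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2%}")
```

Because the class names are supplied as free-form text at inference time, the same weights can be pointed at a new label set without any fine-tuning.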
Benchmarks
- ImageNet zero-shot: 76.2% top-1 accuracy with the largest ViT-L/14@336px variant (reported in the original paper)
Safety
- Open model
- Users should be aware that biases in the web-scraped training data can affect outputs.
Deployment
- regions: private
- hosting: HuggingFace, GitHub (retrieval sketch below)
- integrations: integrated into various vision-language applications
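For the GitHub distribution, a text-to-image retrieval sketch using the original openai/CLIP package is shown below; the image file names and the query string are placeholders.

```python
# Text-based image ranking with the original openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git).
# File names and the query string are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
text = clip.tokenize(["a diagram of a neural network"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(1)  # cosine similarity per image

for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```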
Tags
vision-language, contrastive learning, open-source, zero-shot