Ilya Gusev, Nov 30, 2022

In this post, I highlight influential papers on matching texts and images. OpenAI published the original CLIP model in early 2021, and many things have changed since then. I will try to answer the following questions:

  1. What is so special about CLIP?
  2. What alternatives to the CLIP model can we use right now?
  3. What loss and model architecture choices are possible?
  4. What are potential applications for CLIP besides image classification?

"one robot standing back to camera, staring into two screens, left screen displays colorful image, right screen displays some document, futuristic", Midjourney V4****

"one robot standing back to camera, staring into two screens, left screen displays colorful image, right screen displays some document, futuristic", Midjourney V4****

https://embed.notionlytics.com/wt/ZXlKd1lXZGxTV1FpT2lKaU5ETmxaVFUwWW1ZM09UYzBNVFExWVRSaU1ERTBaR0ZsTURSaVlqRXlaU0lzSW5kdmNtdHpjR0ZqWlZSeVlXTnJaWEpKWkNJNklsUnRjR293VTA1RVpFSmxPSEZtU0VKMFdHdzJJbjA9


Original CLIP: Contrastive Language-Image Pre-training

Paper: Radford et al., 2021

Date: February 2021

PR post: https://openai.com/blog/clip/

Organization: OpenAI

Availability: Models and code are open, and the dataset is closed

Model: https://huggingface.co/openai/clip-vit-base-patch32

Code: https://github.com/openai/CLIP
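
The checkpoint linked above can be tried out directly for zero-shot image classification. Below is a minimal sketch using the Hugging Face transformers API; it assumes the transformers and Pillow packages are installed, and the image path and candidate labels are placeholders.

```python
# Minimal zero-shot classification sketch with the openai/clip-vit-base-patch32 checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
labels = ["a photo of a cat", "a photo of a dog"]  # candidate class descriptions

# Tokenize the texts and preprocess the image in one call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```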

Main idea