CLIP
Like the variational autoencoder (VAE), the vision model CLIP (Contrastive Language-Image Pre-training) is largely unknown to the general public. Like the VAE, it is used as a component of the image generation pipeline to encode input into a latent representation. Its role is to transform prompts into embeddings, representations that can be operated upon in the latent space.
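As a minimal sketch of this role, the snippet below uses the Hugging Face transformers library and the publicly released "openai/clip-vit-large-patch14" checkpoint (the text encoder used by early Stable Diffusion versions) to turn a prompt into the per-token embeddings that the rest of the pipeline conditions on; the example prompt and printed shape are illustrative, not taken from the text.

```python
from transformers import CLIPTokenizer, CLIPTextModel
import torch

# Load the tokenizer and the text encoder from a public CLIP checkpoint.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# An illustrative prompt, tokenized to CLIP's fixed context length (77 tokens).
prompt = "a watercolour painting of a lighthouse at dusk"
tokens = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)

# Encode the prompt; no gradients are needed for inference.
with torch.no_grad():
    output = text_encoder(**tokens)

# Per-token embeddings that a denoising model can attend to during generation.
embeddings = output.last_hidden_state
print(embeddings.shape)  # e.g. torch.Size([1, 77, 768]) for this checkpoint
```

The prompt never reaches the image model as words: it circulates through the pipeline only in this embedded form.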
The presence of CLIP in the pipeline illustrates the complexity of the relations between the various ecosystems of image generation. CLIP was first released in 2021 by OpenAI under an open source license, just before the company moved away from its policy of openness. Subsequent products such as DALL-E are governed by proprietary licenses. CLIP is a foundational model in its own right and serves multiple purposes, such as image retrieval and classification. Its use as a secondary component in the image generation pipeline shows the composite nature of these architectures, where existing elements are borrowed from different sources and repurposed according to need. If, technically, CLIP bridges prompts and the latent space, politically it bridges proprietary and open source ecosystems.
It also shows how elements that appear to be mere off-the-shelf components carry the weight of their own making: CLIP has been trained on four hundred million text–image pairs.