Clip

From CTPwiki

Like the variational autoencoder (VAE), the vision model CLIP (Contrastive Language-Image Pre-training) is largely unknown to the general public. Like the VAE, it is used in the image generation pipeline as a component that encodes input into embeddings, statistical representations that can be operated upon in the latent space.[1]
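The matching logic this paragraph describes can be sketched in a few lines. The following is a toy illustration, not the real model: the vectors are made-up stand-ins for what CLIP's trained text and image encoders would produce, and only the principle, that captions and images live in one shared space where similarity can be scored, is kept.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-in embeddings. Real CLIP vectors have hundreds of
# dimensions and come from trained text and image encoders.
text_embedding = np.array([0.9, 0.1, 0.2])  # e.g. the prompt "a photo of a dog"
image_embeddings = {
    "dog.jpg": np.array([0.8, 0.2, 0.1]),
    "cat.jpg": np.array([0.1, 0.9, 0.3]),
}

# Retrieval/classification: pick the image whose embedding lies
# closest to the text embedding in the shared latent space.
scores = {name: cosine_similarity(text_embedding, emb)
          for name, emb in image_embeddings.items()}
best_match = max(scores, key=scores.get)
```

In an image generation pipeline the same text embedding is not used for retrieval but passed onward to steer the generator, which is what lets CLIP bridge the prompt and the latent space.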

The presence of CLIP in the pipeline illustrates the complexity of the relations between the various ecosystems of image generation. CLIP was first released in 2021 by OpenAI under an open source license,[2] just before the company turned away from its earlier commitment to openness. Subsequent products such as DALL-E are governed by proprietary licenses.[3] CLIP is a foundation model in its own right and serves multiple purposes, such as image retrieval and classification. Its use as a secondary component in the image generation pipeline shows the composite nature of these architectures, in which existing elements are borrowed from different sources and repurposed as needed. If, technically, CLIP bridges prompts and the latent space, politically it travels between proprietary and open source ecosystems.

Comparing CLIP to the VAE also shows how elements that perform similar technical functions allow for strikingly different forms of social appropriation. Amateurs train and retrain VAEs to improve image realism, whereas CLIP, which was trained on four hundred million image-text pairs,[4] cannot be retrained without incurring exorbitant costs. The presence of CLIP in the pipeline is thus owed to its open licensing, while the sheer cost of its production makes it a black box even for advanced users, putting its inspection and customization out of reach.

[1] Offert, Fabian. "On the Concept of History (in Foundation Models)." IMAGE. Zeitschrift für interdisziplinäre Bildwissenschaft 19, no. 1 (2023): 121-134. http://dx.doi.org/10.25969/mediarep/22316.

[2] OpenAI. "CLIP (Contrastive Language-Image Pretraining): Predict the Most Relevant Text Snippet Given an Image." GitHub repository, n.d. Accessed August 22, 2025. https://github.com/openai/CLIP.

[3] Xiang, Chloe. “OpenAI Is Now Everything It Promised Not to Be: Corporate, Closed-Source, and For-Profit.” Vice, February 28, 2023. https://www.vice.com/en/article/openai-is-now-everything-it-promised-not-to-be-corporate-closed-source-and-for-profit/.

[4] Malevé, Nicolas, and Katrina Sluis. "The Photographic Pipeline of Machine Vision; or, Machine Vision's Latent Photographic Theory." Critical AI 1, no. 1-2 (October 2023). https://doi.org/10.1215/2834703X-10734066.