Clip

From CTPwiki

Revision as of 10:39, 20 August 2025 by NicolasMaleve (talk | contribs) (Clip)


Like the variational autoencoder (VAE), the vision model CLIP (Contrastive Language-Image Pre-training) is largely unknown to the general public. Like the VAE, it is used in the image generation pipeline as a component that encodes input into latents. Its role is to transform input, such as a text prompt, into embeddings: representations that can be operated upon in the latent space.[1]
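To make the notion of an embedding more concrete, here is a minimal sketch in Python. It does not run CLIP itself: the three-dimensional vectors and file names below are invented for illustration, standing in for the much larger vectors that CLIP's trained text and image encoders would produce. The sketch only shows the general idea that, once a prompt and some images are represented as vectors in a shared space, they can be compared numerically, here with cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
# A real encoder would output vectors with hundreds of dimensions.
text_embedding = [0.9, 0.1, 0.2]  # stands in for the prompt "a photo of a dog"
image_embeddings = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.3],
}

def best_match(query, candidates):
    """Return the name of the candidate whose embedding is closest to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))

print(best_match(text_embedding, image_embeddings))
```

With these made-up values, the prompt vector points in nearly the same direction as the vector for "dog.jpg", so that image is returned as the closest match. Retrieval and classification with an embedding model rest on comparisons of this kind.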

The presence of CLIP in the pipeline illustrates the complexity of the relations between the various ecosystems of image generation. CLIP was first released in 2021 by OpenAI under an open-source license, just before the company changed its policy on openness. Subsequent products such as DALL-E are governed by a proprietary license. CLIP is a foundation model in its own right and serves multiple purposes, such as image retrieval and classification. Its use as a secondary component in the image generation pipeline shows the composite nature of these architectures, in which existing elements are borrowed from different sources and repurposed according to need. If, technically, CLIP bridges prompts and the latent space, politically it travels between proprietary and open-source ecosystems.

Comparing CLIP to the VAE also shows how elements that perform similar technical functions allow for strikingly different forms of social appropriation. Amateurs train and retrain VAEs to improve image realism, whereas CLIP, which was trained on four hundred million image-text pairs [2], cannot be retrained without incurring exorbitant costs. So although CLIP is present in the pipeline thanks to its open licensing, the sheer cost of its production makes it a black box even for advanced users and puts its inspection and customization out of reach.

[1] Offert, Fabian: On the Concept of History (in Foundation Models). In: IMAGE. Zeitschrift für interdisziplinäre Bildwissenschaft, vol. 19 (2023), no. 1, pp. 121-134. http://dx.doi.org/10.25969/mediarep/22316

[2] Nicolas Malevé, Katrina Sluis; The Photographic Pipeline of Machine Vision; or, Machine Vision's Latent Photographic Theory. Critical AI 1 October 2023; 1 (1-2): No Pagination Specified. doi: https://doi.org/10.1215/2834703X-10734066