== Clip ==
Like the [[Variational Autoencoder, VAE|variational autoencoder (VAE)]], the vision model CLIP (contrastive language-image pre-training) is largely unknown to the general public. Like the VAE, it is used in the image generation pipeline as a component that encodes input into embeddings, statistical representations that can be operated upon in the [[Latent space|latent space]].[1]
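A minimal sketch of this encoding step, using the open source <code>clip</code> package released by OpenAI;[2] the checkpoint name, image path and prompt are illustrative placeholders:

<syntaxhighlight lang="python">
# Sketch: encoding an image and a prompt into CLIP embeddings.
# Assumes the open source "clip" package (github.com/openai/CLIP) and PyTorch;
# the checkpoint name, image path and prompt are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # downloads pretrained weights

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photograph of a cat"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)  # shape (1, 512) for ViT-B/32
    text_embedding = model.encode_text(text)     # lives in the same embedding space

# Both embeddings can be operated upon in the shared latent space,
# for instance compared with cosine similarity.
similarity = torch.nn.functional.cosine_similarity(image_embedding, text_embedding)
</syntaxhighlight>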
The presence of CLIP in the pipeline illustrates the complexity of the relations between the various ecosystems of image generation. CLIP was first released in 2021 by OpenAI under an open source license,[2] just before the company changed its politics of openness. Subsequent products such as DALL-E are governed by a proprietary license.[3] CLIP is a foundation model in its own right and serves multiple purposes, such as image retrieval and classification. Its use as a secondary component in the image generation pipeline shows the composite nature of these architectures, where existing elements are borrowed from different sources and repurposed according to need. If, technically, CLIP bridges prompts and the latent space, politically it travels between proprietary and open source ecosystems.
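This borrowing is visible in open source generation toolkits. A sketch assuming the Hugging Face <code>diffusers</code> library and a public Stable Diffusion v1.5 checkpoint (both illustrative choices, not part of CLIP itself):

<syntaxhighlight lang="python">
# Sketch: a text-to-image pipeline that reuses CLIP as its text encoder.
# Assumes the Hugging Face "diffusers" library; the checkpoint identifier
# is illustrative and may differ depending on where the weights are hosted.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The pipeline's text encoder is a CLIP text model borrowed from OpenAI's
# release: it turns the prompt into embeddings that condition the diffusion
# process, bridging the prompt and the latent space.
print(type(pipe.text_encoder))  # transformers CLIPTextModel
print(type(pipe.tokenizer))     # transformers CLIPTokenizer

image = pipe("a photograph of a cat").images[0]
image.save("cat.png")
</syntaxhighlight>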
Comparing CLIP to the VAE also shows how elements that perform similar technical functions allow for strikingly different forms of social appropriation. Amateurs train and retrain VAEs to improve image realism, whereas CLIP, which was trained on four hundred million image-text pairs,[4] cannot be retrained without incurring exorbitant costs. The presence of CLIP in the pipeline is therefore due to its open licensing. The sheer cost of its production makes it a black box even for advanced users and puts its inspection and customization out of reach.
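The asymmetry can be made concrete in the same <code>diffusers</code> setup: a community fine-tuned VAE is simply swapped in, while the CLIP encoder is taken as given. A sketch, with illustrative checkpoint identifiers:

<syntaxhighlight lang="python">
# Sketch of the asymmetry described above, assuming the diffusers library.
# A community fine-tuned VAE is swapped into the pipeline; the CLIP text
# encoder is left untouched. Checkpoint identifiers are illustrative.
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Amateurs routinely retrain VAEs and publish checkpoints like this one ...
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# ... and plug them into an existing pipeline to improve image realism.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    vae=vae,
)

# The CLIP components, by contrast, are reused as-is: retraining them would
# require data and compute on the scale of the original 400 million pairs.
print(type(pipe.text_encoder))  # still the stock CLIP text model
</syntaxhighlight>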
[1] Offert, Fabian. "On the Concept of History (in Foundation Models)." ''IMAGE. Zeitschrift für interdisziplinäre Bildwissenschaft'' 19, no. 1 (2023): 121–134. http://dx.doi.org/10.25969/mediarep/22316.

[2] OpenAI. "CLIP (Contrastive Language-Image Pretraining): Predict the Most Relevant Text Snippet given an Image." ''GitHub'', n.d. Accessed August 22, 2025. https://github.com/openai/CLIP.

[3] Xiang, Chloe. "OpenAI Is Now Everything It Promised Not to Be: Corporate, Closed-Source, and For-Profit." ''Vice'', February 28, 2023. https://www.vice.com/en/article/openai-is-now-everything-it-promised-not-to-be-corporate-closed-source-and-for-profit/.

[4] Malevé, Nicolas, and Katrina Sluis. "The Photographic Pipeline of Machine Vision; or, Machine Vision's Latent Photographic Theory." ''Critical AI'' 1, no. 1–2 (October 2023). https://doi.org/10.1215/2834703X-10734066.
[[Category:Objects of Interest and Necessity]]