== Clip ==
Like the [[Variational Autoencoder, VAE|variational autoencoder (VAE)]], the vision model CLIP (contrastive language-image pre-training) is largely unknown to the general public. Like the VAE, it is used in the image generation pipeline as a component that encodes input into embeddings, statistical representations that can be operated upon in the [[Latent space|latent space]].[1]
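A minimal sketch of this encoding step, using the open source <code>clip</code> package released by OpenAI;[2] the checkpoint name, image path and prompt are illustrative placeholders:

<syntaxhighlight lang="python">
# Sketch: encoding an image and a prompt into CLIP embeddings.
# Assumes the open source "clip" package (github.com/openai/CLIP) and PyTorch;
# the checkpoint name, image path and prompt are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # downloads pretrained weights

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a photograph of a cat"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)  # shape (1, 512) for ViT-B/32
    text_embedding = model.encode_text(text)     # lives in the same embedding space

# Both embeddings can be operated upon in the shared latent space,
# for instance compared with cosine similarity.
similarity = torch.nn.functional.cosine_similarity(image_embedding, text_embedding)
</syntaxhighlight>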
The presence of CLIP in the pipeline illustrates the complexity of the relations between the various ecosystems of image generation. CLIP was first released in 2021 by OpenAI under an open source license,[2] just before the company changed its politics of openness. Subsequent products such as DALL-E are governed by a proprietary license.[3] CLIP is a foundation model in its own right and serves multiple purposes, such as image retrieval and classification. Its use as a secondary component in the image generation pipeline shows the composite nature of these architectures, where existing elements are borrowed from different sources and repurposed according to need. If, technically, CLIP bridges prompts and the latent space, politically it travels between proprietary and open source ecosystems.
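This borrowing is visible in open source generation toolkits. A sketch assuming the Hugging Face <code>diffusers</code> library and a public Stable Diffusion v1.5 checkpoint (both illustrative choices, not part of CLIP itself):

<syntaxhighlight lang="python">
# Sketch: a text-to-image pipeline that reuses CLIP as its text encoder.
# Assumes the Hugging Face "diffusers" library; the checkpoint identifier
# is illustrative and may differ depending on where the weights are hosted.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The pipeline's text encoder is a CLIP text model borrowed from OpenAI's
# release: it turns the prompt into embeddings that condition the diffusion
# process, bridging the prompt and the latent space.
print(type(pipe.text_encoder))  # transformers CLIPTextModel
print(type(pipe.tokenizer))     # transformers CLIPTokenizer

image = pipe("a photograph of a cat").images[0]
image.save("cat.png")
</syntaxhighlight>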
Comparing CLIP to the VAE also shows how elements that perform similar technical functions allow for strikingly different forms of social appropriation. Amateurs train and retrain VAEs to improve image realism, whereas CLIP, which was trained on four hundred million image-text pairs,[4] cannot be retrained without incurring exorbitant costs. The presence of CLIP in the pipeline is therefore due to its open licensing. The sheer cost of its production makes it a black box even for advanced users and puts its inspection and customization out of reach.
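The asymmetry can be made concrete in the same <code>diffusers</code> setup: a community fine-tuned VAE is simply swapped in, while the CLIP encoder is taken as given. A sketch, with illustrative checkpoint identifiers:

<syntaxhighlight lang="python">
# Sketch of the asymmetry described above, assuming the diffusers library.
# A community fine-tuned VAE is swapped into the pipeline; the CLIP text
# encoder is left untouched. Checkpoint identifiers are illustrative.
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Amateurs routinely retrain VAEs and publish checkpoints like this one ...
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# ... and plug them into an existing pipeline to improve image realism.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    vae=vae,
)

# The CLIP components, by contrast, are reused as-is: retraining them would
# require data and compute on the scale of the original 400 million pairs.
print(type(pipe.text_encoder))  # still the stock CLIP text model
</syntaxhighlight>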
[1] Offert, Fabian. "On the Concept of History (in Foundation Models)." ''IMAGE. Zeitschrift für interdisziplinäre Bildwissenschaft'' 19, no. 1 (2023): 121–134. http://dx.doi.org/10.25969/mediarep/22316.

[2] OpenAI. "CLIP (Contrastive Language-Image Pretraining): Predict the Most Relevant Text Snippet given an Image." ''GitHub'', n.d. Accessed August 22, 2025. https://github.com/openai/CLIP.

[3] Xiang, Chloe. "OpenAI Is Now Everything It Promised Not to Be: Corporate, Closed-Source, and For-Profit." ''Vice'', February 28, 2023. https://www.vice.com/en/article/openai-is-now-everything-it-promised-not-to-be-corporate-closed-source-and-for-profit/.

[4] Malevé, Nicolas, and Katrina Sluis. "The Photographic Pipeline of Machine Vision; or, Machine Vision's Latent Photographic Theory." ''Critical AI'' 1, no. 1–2 (October 2023). https://doi.org/10.1215/2834703X-10734066.
[[Category:Objects of Interest and Necessity]]